A Survey of Generative Adversarial Neural Networks (GAN) for Text-to-Image Synthesis

Mirsaeid Abolghasemi
6 min read · Apr 18, 2020

Text-to-image synthesis refers to methods that convert written text, in the form of keywords, key phrases, or full sentences, into pictures whose semantic meaning matches the text. Earlier work on image synthesis relied primarily on word-to-image correlation analysis combined with supervised learning to find the best alignment between the text and matching visual content. Advances in deep learning (DL) have brought a new range of unsupervised approaches, especially deep generative models that can produce meaningful visual images using adequately trained neural network models. The transition from computer-vision-based methods to artificial intelligence (AI) based technologies has attracted strong industry interest in areas like virtual reality, visual gaming, and computer-aided modeling, where convincing pictures are created automatically from natural-language descriptions. This survey is based on a paper [1] which first discusses image synthesis and its challenges, and then examines key ideas such as generative adversarial networks (GANs) and deep convolutional decoder networks. The survey shows how GANs and deep convolutional decoders can produce compelling results in categories including human faces, animals, flowers, house interiors, and object reconstruction from edge maps (games). It concludes with a summary of the proposed approaches, the remaining open problems, and potential innovations in the field of text-to-image synthesis.

1 INTRODUCTION

Generating images from textual descriptions, i.e. text-to-image synthesis, is a challenging problem in computer vision and machine learning that has seen significant progress in recent years. Automatic image generation from natural language lets users specify visual elements through visually rich text descriptions. The ability to do this effectively is highly desirable because it can be used in artificial intelligence applications such as computer-aided design, image processing, game engines for next-generation video games, and the creation of visual art.

1.1 Traditional Learning-Based Text-to-image Synthesis

In the early stages of this line of research, text-to-image synthesis was performed predominantly through a combination of search and supervised learning (Zhu et al., 2007), as seen in Figure 1.

Figure 1. Early research on text-to-image conversion (Zhu et al., 2007). The system uses the similarity between keywords (or key phrases) and images to identify descriptive and “picturable” text units, then searches for the most likely text-conditioned image parts, and finally optimizes the text-conditioned image layout together with the image parts.

To link text descriptions to pictures, one can use the similarity between keywords (or key phrases) and images to identify descriptive and “picturable” text units; these units are then used to search for the most probable text-conditioned image parts, and finally the text-conditioned image layout is optimized together with the image parts. These approaches combine several core areas of artificial intelligence: natural language processing, computer vision, computer graphics, and machine learning. The main drawback of such correlation-based conventional text-to-image synthesis methods is that they cannot produce new image content; they can only recombine characteristics of the given/training images. In contrast, work on generative models has progressed dramatically and offers ways to learn from training images and create new visual content. For example, a layered generative model with disentangled latent variables can be trained as a variational autoencoder to produce visual data. Since learning is customized/conditioned on given attributes, the resulting model can generate images with respect to various attributes, such as gender, hair color, and age, as seen in Figure 2.

Figure 2. Supervised learning based text-to-image synthesis (Yan et al., 2016a).
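
As a rough illustration of this attribute-conditioned idea, the sketch below shows a minimal conditional variational autoencoder in PyTorch. It is a hypothetical toy example (the layer sizes and the three-element attribute vector are made up), not the model of Yan et al. (2016a).

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    """Minimal attribute-conditioned VAE sketch (illustrative only)."""
    def __init__(self, img_dim=64 * 64, attr_dim=3, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(img_dim + attr_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + attr_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Sigmoid(),
        )

    def forward(self, x, attrs):
        h = self.encoder(torch.cat([x, attrs], dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(torch.cat([z, attrs], dim=1)), mu, logvar

# Sampling a new image with chosen attributes (hypothetical attribute encoding):
model = ConditionalVAE()
attrs = torch.tensor([[1.0, 0.0, 0.3]])            # e.g. gender / hair color / age
z = torch.randn(1, 32)                             # latent sample
img = model.decoder(torch.cat([z, attrs], dim=1))  # flattened 64x64 image
```

Because the attribute vector is fed to both the encoder and the decoder, sampling different attribute values with the same latent code changes only the conditioned properties of the output.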

1.2 GAN Based Text-to-image Synthesis

Generative adversarial networks and deep convolutional decoder networks are the key drivers of multimodal learning techniques for generating images from text. Generative adversarial networks (GANs), first proposed by Ian Goodfellow et al., consist of two paired neural networks, a generator and a discriminator. The generator attempts to produce synthetic/fake samples that will fool the discriminator, while the discriminator tries to distinguish between real (genuine) and synthetic samples. Figure 3 shows the generator and discriminator networks.

Figure 3. Text-to-image synthesis based on a generative adversarial network (GAN) (Huang et al., 2018). GAN-based text-to-image synthesis combines discriminative and generative learning to train neural networks whose output pictures are semantically similar to the training samples or conditioned on a subset of training images (i.e. conditioned outputs).

Researchers define multimodal learning as a framework that integrates features from multiple approaches, algorithms, and ideas. It may combine ideas from two or more learning methods into a functional framework that can solve an unusual problem or produce a better solution. A visual overview of the GAN-based text-to-image synthesis process and the surveyed GAN structures is shown in Figure 4.

Figure 4. A graphic overview of the GAN-based text-to-image (T2I) synthesis process and the survey description of GAN-based frameworks/methods.

2 FRAMEWORKS

2.1 Generative Adversarial Neural Network

GANs were introduced in 2014 by Ian Goodfellow et al. (Goodfellow et al., 2014) and consist of two deep neural networks, a generator and a discriminator, trained with opposing goals: the generator attempts to produce samples close to the real data distribution, while the discriminator attempts to distinguish samples produced by the generator from samples drawn from the real data, by estimating the probability that a given sample comes from the real data rather than the generator. A functional overview of the generative adversarial network (GAN) framework is shown in Figure 5.

Figure 5. A functional overview of the Generative Adversarial Network (GAN) framework. The Generator G(z) is trained to produce, from a random noise distribution, synthetic/fake samples that resemble real samples. The fake samples are fed, together with real samples, to the Discriminator D(x). The Discriminator is trained to distinguish the fake samples from the real data.
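
A minimal sketch of this adversarial training loop, written as illustrative PyTorch code with made-up layer sizes rather than the original implementation of Goodfellow et al., might look like this:

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 100, 28 * 28

# Generator G(z): noise -> fake sample; Discriminator D(x): sample -> P(real)
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    n = real_batch.size(0)
    real_lbl, fake_lbl = torch.ones(n, 1), torch.zeros(n, 1)

    # 1) Discriminator step: push D(x) toward 1 for real data, 0 for fakes
    fake = G(torch.randn(n, latent_dim)).detach()
    loss_d = bce(D(real_batch), real_lbl) + bce(D(fake), fake_lbl)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator step: push D(G(z)) toward 1 (non-saturating objective)
    loss_g = bce(D(G(torch.randn(n, latent_dim))), real_lbl)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# e.g. train_step(torch.rand(64, img_dim) * 2 - 1)  # dummy "real" batch in [-1, 1]
```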

2.2 cGAN: Conditional GAN

Conditional Generative Adversarial Networks (cGANs) are an extension of GANs proposed by Mirza and Osindero (2014a) shortly after Goodfellow et al. (2014) introduced GANs. As shown in Figure 6, the condition vector here is the “red bird” class label, which is given as input to both the generator and the discriminator.

Figure 6. A functional overview of the conditional GAN. The Generator G(z) produces samples from a random noise distribution together with a condition vector (in this case, text). The fake samples are passed to the Discriminator D(x) together with real data and the corresponding condition vector, and the Discriminator estimates the probability that its input came from the real data distribution.
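
The only structural change from the plain GAN above is that both networks also receive the condition. Below is a minimal PyTorch sketch, assuming the condition is a class label (a hypothetical class index standing in for “red bird”) that is embedded and concatenated with the noise and with the image:

```python
import torch
import torch.nn as nn

latent_dim, cond_dim, img_dim, n_classes = 100, 16, 28 * 28, 10

embed = nn.Embedding(n_classes, cond_dim)  # e.g. class 3 might mean "red bird" (hypothetical)

# Generator sees [noise, condition]; Discriminator sees [sample, condition]
G = nn.Sequential(nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
                  nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + cond_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

def generate(labels):
    z = torch.randn(labels.size(0), latent_dim)
    return G(torch.cat([z, embed(labels)], dim=1))

def discriminate(images, labels):
    return D(torch.cat([images, embed(labels)], dim=1))  # P(real | image, condition)

fake = generate(torch.tensor([3]))             # fake image conditioned on class 3
p_real = discriminate(fake, torch.tensor([3]))
```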

2.3 Simple GAN Frameworks for Text-to-Image Synthesis

A basic approach to creating a picture from text is to use conditional GAN (cGAN) models: conditions are attached to the training examples, and the GAN is trained with respect to the conditions that influence them.
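
A minimal sketch of such a simple text-conditioned GAN is shown below. It assumes some pretrained text encoder (not shown, and hypothetical here) has already turned the caption into a fixed-size embedding, which is projected and concatenated with the noise vector:

```python
import torch
import torch.nn as nn

text_dim, proj_dim, latent_dim, img_dim = 1024, 128, 100, 64 * 64 * 3

# Project the (assumed pretrained) sentence embedding to a compact condition vector
project = nn.Sequential(nn.Linear(text_dim, proj_dim), nn.LeakyReLU(0.2))

G = nn.Sequential(nn.Linear(latent_dim + proj_dim, 512), nn.ReLU(),
                  nn.Linear(512, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + proj_dim, 512), nn.LeakyReLU(0.2),
                  nn.Linear(512, 1), nn.Sigmoid())

def synthesize(text_embedding):
    """text_embedding: output of some pretrained text encoder (assumption)."""
    c = project(text_embedding)
    z = torch.randn(text_embedding.size(0), latent_dim)
    return G(torch.cat([z, c], dim=1))

fake_img = synthesize(torch.randn(1, text_dim))  # stand-in for an encoded caption
# A real training loop would also show D mismatched (image, text) pairs so that it
# learns text-image consistency, not just realism.
```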

2.4 Advanced GAN Frameworks for Text-to-Image Synthesis

Building on the GAN and conditional GAN (cGAN) concepts, several GAN architectures have been proposed to generate images in different styles and structures, for example by using multiple discriminators, progressively trained discriminators, or hierarchical discriminators. Figure 7 summarizes several mature GAN frameworks. In addition to these systems, other new projects with very sophisticated designs are being developed to push the field forward.

Figure 7. A top-level comparison of text-to-image synthesis with several specialized GAN frameworks. All structures take text (shown in red) as their input and generate images as output. (b) uses multiple staged GANs, in which the output of one GAN is fed as input to the generator of the next GAN (Zhang et al., 2017b; Denton et al., 2015); (c) progressively trains symmetric discriminators and generators (Huang et al., 2017); and (d) uses a single-stage generator with hierarchical or multiple discriminators and one generator (Nguyen et al., 2017).
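
To make the multi-stage idea in (b) concrete, the sketch below shows a coarse first-stage image being passed, together with the text condition, into a second-stage generator that refines it. This is hypothetical toy code loosely inspired by the StackGAN pipeline, not the actual StackGAN implementation:

```python
import torch
import torch.nn as nn

cond_dim, latent_dim = 128, 100
lowres, highres = 64 * 64 * 3, 256 * 256 * 3

# Stage-I: text condition + noise -> coarse low-resolution image
stage1_G = nn.Sequential(nn.Linear(latent_dim + cond_dim, 512), nn.ReLU(),
                         nn.Linear(512, lowres), nn.Tanh())

# Stage-II: coarse image + text condition -> refined high-resolution image
stage2_G = nn.Sequential(nn.Linear(lowres + cond_dim, 1024), nn.ReLU(),
                         nn.Linear(1024, highres), nn.Tanh())

def two_stage_synthesis(text_cond):
    z = torch.randn(text_cond.size(0), latent_dim)
    coarse = stage1_G(torch.cat([z, text_cond], dim=1))        # e.g. 64x64 draft
    refined = stage2_G(torch.cat([coarse, text_cond], dim=1))  # e.g. 256x256 result
    return coarse, refined

coarse, refined = two_stage_synthesis(torch.randn(1, cond_dim))
# During training, each stage would have its own discriminator (omitted for brevity).
```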

3 CATEGORIZATION of TEXT-TO-IMAGE SYNTHESIS

The GAN frameworks are categorized into four major groups: Semantic Enhancement GANs, Resolution Enhancement GANs, Diversity Enhancement GANs, and Motion Enhancement GANs, as shown in Figure 8.

Figure 8. Advanced GAN frameworks taxonomy and categorization for Text-to-Image Synthesis.

4 GAN Based Text-to-image Synthesis Results Comparison

Figure 9 shows the performance comparison of 14 GANs regarding their Inception Scores (IS).

Figure 9. Performance comparison between 14 GANs with respect to their Inception Scores (IS).
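
The Inception Score feeds generated images through a pretrained Inception classifier and rewards images that are individually classified with high confidence while being collectively diverse, i.e. IS = exp(E_x[KL(p(y|x) || p(y))]). A small illustrative sketch of that computation from precomputed softmax outputs (assumed to come from such a classifier) is shown below:

```python
import torch

def inception_score(probs, eps=1e-12):
    """probs: (N, num_classes) softmax outputs of a pretrained Inception
    network on N generated images (assumed precomputed)."""
    marginal = probs.mean(dim=0, keepdim=True)  # p(y)
    kl = (probs * (torch.log(probs + eps) - torch.log(marginal + eps))).sum(dim=1)
    return torch.exp(kl.mean()).item()          # exp(E_x[KL(p(y|x) || p(y))])

# Example with random (meaningless) predictions for 1000 fake images:
fake_probs = torch.softmax(torch.randn(1000, 1000), dim=1)
print(inception_score(fake_probs))
```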

Compared with the other GANs, HDGAN produces better visual results on the CUB and Oxford datasets, while AttnGAN produces much more compelling output on the more complex COCO dataset. This is evidence that AttnGAN’s attention model and DAMSM are very effective at generating high-quality images. Examples of the best bird images and plates of vegetables produced by each model are shown in Figures 10 and 11.

Figure 10. Some best images of “birds” generated by GAN-INT-CLS, StackGAN, StackGAN++, AttnGAN, and HDGAN.
Figure 11. Some best images of “a plate of vegetables” generated by GAN-INT-CLS, StackGAN, StackGAN++, AttnGAN, and HDGAN.

5 CONCLUSION

Recent progress in text-to-image synthesis research has produced a variety of persuasive techniques and algorithms. Initially, the main goal of text-to-image synthesis was to generate images from simple texts; that goal later shifted to full natural language. This survey described new techniques that can create realistic, high-quality pictures from text written in natural language. These pictures are usually created using generative adversarial networks (GANs), deep convolutional decoder networks, and multimodal learning methods. These techniques are likely to advance considerably in the near future. Reducing the amount of human interaction required and increasing the scale of the generated images would be impressive improvements.

6 REFERENCE

This article is a short summary of the following paper:

  1. Jorge Agnese, Jonathan Herrera, Haicheng Tao, and Xingquan Zhu, “A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis,” arXiv, 2019.
