A Survey of Generative Adversarial Neural Networks (GAN) for Text-to-Image Synthesis
Text-to-image synthesis refers to computational methods that translate written textual content, in the form of keywords, phrases, or sentences, into images whose semantic meaning matches the text. Earlier work on image synthesis relied primarily on word-to-image correlation analysis combined with supervised learning to find the visual content that best matches the text. Advances in deep learning (DL) have brought a new range of unsupervised approaches, especially deep generative models that can produce meaningful visual images using suitably trained neural networks. The transition from computer-vision-based methods to artificial intelligence (AI) based technologies has generated strong market interest in applications such as virtual reality, visual gaming, and computer-aided design that automatically create convincing pictures from natural-language descriptions. This survey is based on a paper [1] which first discusses image synthesis and its challenges, and then examines key ideas such as generative adversarial networks (GANs) and deep convolutional decoder networks. The survey shows how GANs and deep convolutional decoders can produce compelling results in categories including human faces, animals, flowers, house interiors, and object reconstruction from edge maps (games). It concludes with a summary of the surveyed approaches, the remaining unresolved problems, and potential innovations in the area of text-to-image synthesis.
1 INTRODUCTION
Generating images from textual descriptions, i.e. text-to-image synthesis, is a challenging problem in computer vision and machine learning that has seen significant progress in recent years. Automatic image generation from natural language lets users describe visual elements through visually rich textual descriptions. The ability to do this successfully is extremely valuable because it can be used in artificial intelligence applications such as computer-aided design, image editing, game engines for next-generation computer games, and the production of visual art.
1.1 Traditional Learning-Based Text-to-image Synthesis
In the early stage of this line of research, text-to-image synthesis was performed predominantly by a combination of search and supervised learning (Zhu et al., 2007), as seen in Figure 1.
To link text descriptions to pictures, one may use the similarity between keywords (or keyphrases) and pictures to identify descriptive and "picturable" text units; these units are then used to search for the most likely image parts conditioned on the text, which are finally assembled into the text-conditioned image. These approaches combined several core areas of artificial intelligence, such as natural language processing, computer vision, computer graphics, and machine learning. The main drawback of such conventional, search-based text-to-image synthesis methods is that they lack the capacity to produce new image content; they can only change the characteristics of the given/trained images. In contrast, research on generative models has progressed dramatically and offers strategies for learning image distributions from training data and creating new visual content. For example, to produce visual data, a layered generative model with disentangled latent variables can be trained as a variational autoencoder. Since learning is conditioned on given attributes, the resulting models can produce images with respect to various attributes, such as gender, hair color, and age, as seen in Figure 2.
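The search-and-assemble pipeline described above can be illustrated with a minimal retrieval sketch. The image names, tags, and the Jaccard similarity measure here are hypothetical choices for illustration, not the actual method of Zhu et al.:

```python
# Hypothetical image "database": each image is annotated with descriptive tags.
image_tags = {
    "img_001.jpg": {"red", "bird", "branch"},
    "img_002.jpg": {"yellow", "flower", "field"},
    "img_003.jpg": {"red", "flower", "vase"},
}

def jaccard(a, b):
    """Overlap between two keyword sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def retrieve(query_keywords, db):
    """Return the image whose tags best match the query keywords."""
    return max(db, key=lambda img: jaccard(query_keywords, db[img]))

best = retrieve({"red", "bird"}, image_tags)  # matches img_001.jpg best
```

Such retrieval can only recombine existing images, which is exactly the limitation that motivates the generative approaches below.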
1.2 GAN Based Text-to-image Synthesis
Multimodal learning techniques, with generative adversarial networks and deep convolutional decoder networks as their key drivers, are used to generate text-conditioned images. A generative adversarial network (GAN), first proposed by Ian Goodfellow et al., consists of two neural networks, a generator and a discriminator. The generator attempts to produce synthetic/fake samples that will deceive the discriminator, while the discriminator tries to distinguish between real (genuine) and synthetic samples. Figure 3 shows the generator and discriminator networks.
Researchers also define multimodal learning as a system integrating features from multiple approaches, algorithms, and ideas. Such systems may combine two or more learning methods into a functional framework that can resolve an uncommon issue or develop a new solution. A visual overview of GAN structures is shown in Figure 4.
2 FRAMEWORKS
2.1 Generative Adversarial Neural Network
GANs were introduced in 2014 by Ian Goodfellow et al. (Goodfellow et al., 2014) and contain two deep neural networks, a generator and a discriminator, that are trained with opposing goals: the generator attempts to produce samples that closely follow the original data distribution, while the discriminator attempts to differentiate between samples from the generator and samples from the real data distribution by estimating the probability that a sample comes from each source. A functional overview of the generative adversarial network (GAN) framework is shown in Figure 5.
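The adversarial dynamic can be sketched with a deliberately tiny example: a linear generator and a logistic discriminator playing the GAN game over a one-dimensional Gaussian. Everything here (the data distribution, the linear/logistic model forms, the learning rate) is an illustrative assumption, not an architecture from the surveyed papers:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

real_mean, real_std = 2.0, 0.5   # real data: x ~ N(2, 0.5)
a, b = 1.0, 0.0                  # generator G(z) = a*z + b, z ~ N(0, 1)
w, c = 0.1, 0.0                  # discriminator D(x) = sigmoid(w*x + c)
lr, batch = 0.02, 128

for step in range(800):
    # Discriminator ascends log D(x_real) + log(1 - D(G(z)))
    x_real = rng.normal(real_mean, real_std, batch)
    x_fake = a * rng.normal(size=batch) + b
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += lr * np.mean((1 - d_real) * x_real - d_fake * x_fake)
    c += lr * np.mean((1 - d_real) - d_fake)

    # Generator ascends the non-saturating objective log D(G(z))
    z = rng.normal(size=batch)
    d_fake = sigmoid(w * (a * z + b) + c)
    a += lr * np.mean((1 - d_fake) * w * z)
    b += lr * np.mean((1 - d_fake) * w)

# After training, the generator's output mean b has drifted toward the real mean.
```

In practice both players are deep networks trained by backpropagation, but the two alternating gradient updates above capture the essential minimax structure.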
2.2 cGAN: Conditional GAN
Conditional Generative Adversarial Networks (cGANs) are an extension of GANs proposed by Mirza and Osindero (2014a) shortly after Goodfellow et al. (2014) introduced GANs. As shown in Figure 6, the condition vector is the "Red bird" class label, which is fed to both the generator and the discriminator.
2.3 Simple GAN Frameworks for Text-to-Image Synthesis
One basic approach to creating a picture from text is to use conditional GAN (cGAN) models: conditions are attached to the training examples, and the GAN is trained with respect to the conditions that influence them.
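Conditioning in a cGAN commonly amounts to feeding the same condition vector to both networks, typically by concatenating it with their inputs. A minimal sketch (the dimensions and the one-hot class encoding are illustrative assumptions; text-to-image models would use a learned text embedding instead):

```python
import numpy as np

def one_hot(label, num_classes):
    """Encode a class label as a one-hot condition vector."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def generator_input(z, label, num_classes):
    # The generator receives noise z concatenated with the condition vector.
    return np.concatenate([z, one_hot(label, num_classes)])

def discriminator_input(x, label, num_classes):
    # The discriminator sees a (real or fake) sample with the same condition,
    # so it judges both realism and consistency with the condition.
    return np.concatenate([x, one_hot(label, num_classes)])

rng = np.random.default_rng(0)
z = rng.normal(size=100)                             # 100-dim latent noise
g_in = generator_input(z, label=3, num_classes=10)   # shape (110,)
```

Because the discriminator is also conditioned, the generator is penalized not only for unrealistic samples but also for samples that do not match the given condition.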
2.4 Advanced GAN Frameworks for Text-to-Image Synthesis
Building on the GAN and conditional GAN (cGAN) concepts, several GAN architectures have been proposed to create pictures of various styles within different frameworks, such as using multiple discriminators, using progressively trained discriminators, or using hierarchical discriminators. Figure 7 summarizes many mature GAN systems for text-to-image synthesis. In addition to these systems, other new projects with very complex designs are being developed to expand the field.
3 CATEGORIZATION OF TEXT-TO-IMAGE SYNTHESIS
The GAN frameworks are categorized into four major groups: Semantic Enhancement GANs, Resolution Enhancement GANs, Diversity Enhancement GANs, and Motion Enhancement GANs, as shown in Figure 8.
4 GAN-BASED TEXT-TO-IMAGE SYNTHESIS RESULTS COMPARISON
Figure 9 shows the performance comparison of 14 GANs regarding their Inception Scores (IS).
Compared to the other GANs, HDGAN generates better visual results on the CUB and Oxford datasets, while AttnGAN produces far more interesting output than the others on the more challenging COCO dataset. This is evidence that AttnGAN's attention model and DAMSM are quite effective at generating high-quality images. Examples of the best bird images and plates of vegetables produced by each model are shown in Figures 10 and 11.
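The Inception Score behind this comparison is computed from a classifier's predicted class probabilities over the generated images, as IS = exp(E_x[KL(p(y|x) || p(y))]). A small sketch of the formula (the probability matrices below are synthetic stand-ins, not real Inception network outputs):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array of per-image class probabilities p(y|x).
    Returns exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Two extreme cases with K = 10 classes:
uniform = np.full((8, 10), 0.1)   # every image equally ambiguous -> IS = 1
confident = np.eye(10)            # confident and diverse predictions -> IS = K = 10
```

Intuitively, the score rewards images that are individually recognizable (peaked p(y|x)) and collectively diverse (flat marginal p(y)), which is why it ranges from 1 up to the number of classes.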
5 CONCLUSION
The latest progress in the study of text-to-image synthesis provides various persuasive techniques and algorithms. At first, the primary goal of text-to-image synthesis was to generate images from simple texts; this goal was later extended to full natural language. This survey explained new techniques that can create highly visual, photo-realistic pictures from text in natural language. The pictures are usually created using generative adversarial networks (GANs), deep convolutional decoder networks, and multimodal learning methods. These techniques are expected to expand substantially in the near future. Reducing the need for human interaction and increasing the resolution of the generated images are promising directions for future improvement.
6 REFERENCE
This article is a short summary of the following paper:
- Jorge Agnese, Jonathan Herrera, Haicheng Tao, and Xingquan Zhu, "A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis", arXiv, 2019.