DALL·E^[1][1] is a 12-billion parameter version of [2]GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.
__________________________________________________________________

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog
[AI-generated images]

Text prompt: an armchair in the shape of an avocado. . . .
[AI-generated images]

Text prompt: a store front that has the word ‘openai’ written on it. . . .
[AI-generated images]

Text & image prompt: the exact same cat on the top as a sketch on the bottom
[AI-generated images]
__________________________________________________________________

GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. [3]Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.

Overview

Like GPT-3, DALL·E is a transformer language model. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another.^[4][2] A token is any symbol from a discrete vocabulary; for humans, each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192. The images are preprocessed to 256x256 resolution during training.
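The token layout described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (up to 256 caption tokens plus 1024 image tokens gives at most 1280), not the model's actual preprocessing code. The function name `build_stream` and the choice to shift image ids past the text vocabulary so that both share one id space are assumptions made for this example.

```python
# Sizes quoted in the text.
TEXT_MAX_TOKENS = 256      # maximum caption length in BPE tokens
TEXT_VOCAB_SIZE = 16384    # text vocabulary size
IMAGE_TOKENS = 1024        # number of tokens representing one image
IMAGE_VOCAB_SIZE = 8192    # image vocabulary size
MAX_STREAM_LENGTH = TEXT_MAX_TOKENS + IMAGE_TOKENS  # 1280


def build_stream(text_ids, image_ids):
    """Concatenate caption tokens and image tokens into one sequence."""
    if len(text_ids) > TEXT_MAX_TOKENS:
        raise ValueError("caption exceeds 256 tokens")
    if len(image_ids) != IMAGE_TOKENS:
        raise ValueError("expected exactly 1024 image tokens")
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_ids):
        raise ValueError("text token id out of range")
    if any(not 0 <= i < IMAGE_VOCAB_SIZE for i in image_ids):
        raise ValueError("image token id out of range")
    # Assumption (hypothetical): image ids are shifted past the text
    # vocabulary so both modalities share a single id space.
    return list(text_ids) + [TEXT_VOCAB_SIZE + i for i in image_ids]


# A three-token caption followed by a dummy all-zero image.
stream = build_stream([5, 17, 42], [0] * IMAGE_TOKENS)
print(len(stream))  # 1027
```

An autoregressive model trained with maximum likelihood would then predict each entry of such a stream from the entries before it.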
Similar to VQVAE,^[5]14^[6]15 each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE^[7]10^[8]11 that we pretrained using a continuous relaxation.^[9]12^[10]13 We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes.

This training procedure allows DALL·E to not only generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt.

We recognize that work involving generative models has the potential for significant, broad societal impacts. In the future, we plan to analyze how models like DALL·E relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer-term ethical challenges implied by this technology.

Capabilities

We find that DALL·E is able to create plausible images for a great variety of sentences that explore the compositional structure of language. We illustrate this using a series of interactive visuals in the next section. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 after reranking with [11]CLIP, but we do not use any manual cherry-picking, aside from the thumbnails and standalone images that appear outside.^[12][3] Further details are provided in [13]a later section.

Controlling Attributes

We test DALL·E’s ability to modify several of an object’s attributes, as well as the number of times that it appears.

Text prompt: a pentagonal green clock. a green clock in the shape of a pentagon.
[AI-generated image]

Text prompt: a cube made of porcupine. a cube with the texture of a porcupine.
[AI-generated image]

Text prompt: a collection of glasses is sitting on a table
[AI-generated image]

Drawing Multiple Objects

Simultaneously controlling multiple objects, their attributes, and their spatial relationships presents a new challenge. For example, consider the phrase “a hedgehog wearing a red hat, yellow gloves, blue shirt, and green pants.” To correctly interpret this sentence, DALL·E must not only correctly compose each piece of apparel with the animal, but also form the associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, green) without mixing them up.^[14][4] This task is called variable binding, and has been extensively studied in the literature.^[15]17^[16]18^[17]19^[18]20 We test DALL·E’s ability to do this for relative positioning, stacking objects, and controlling multiple attributes.

Text prompt: a small red block sitting on a large green block
[AI-generated image]

Text prompt: a stack of 3 cubes. a red cube is on the top, sitting on a green cube. the green cube is in the middle, sitting on a blue cube. the blue cube is on the bottom.
[AI-generated image]

Text prompt: an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants
[AI-generated image]

While DALL·E does offer some level of controllability over the attributes and positions of a small number of objects, the success rate can depend on how the caption is phrased.
As more objects are introduced, DALL·E is prone to confusing the associations between the objects and their colors, and the success rate decreases sharply. We also note that DALL·E is brittle with respect to rephrasing of the caption in these scenarios: alternative, semantically equivalent captions often yield no correct interpretations.

Visualizing Perspective and Three-Dimensionality

We find that DALL·E also allows for control over the viewpoint of a scene and the 3D style in which a scene is rendered.

Text prompt: an extreme close-up view of a capybara sitting in a field
[AI-generated image]

Text prompt: a capybara made of voxels sitting in a field
[AI-generated image]

To push this further, we test DALL·E’s ability to repeatedly draw the head of a well-known figure at each angle from a sequence of equally spaced angles, and find that we can recover a smooth animation of the rotating head.

Text prompt: a photograph of a bust of homer
[AI-generated image]

DALL·E appears to be able to apply some types of optical distortions to scenes, as we see with the options “fisheye lens view” and “a spherical panorama.” This motivated us to explore its ability to generate reflections.

Text prompt: a plain white cube looking at its own reflection in a mirror. a plain white cube gazing at itself in a mirror.
[AI-generated image]

Visualizing Internal and External Structure

The samples from the “extreme close-up view” and “x-ray” style led us to further explore DALL·E’s ability to render internal structure with cross-sectional views, and external structure with macro photographs.
Text prompt: a cross-section view of a walnut
[AI-generated image]

Text prompt: a macro photograph of brain coral
[AI-generated image]

Inferring Contextual Details

The task of translating text to images is underspecified: a single caption generally corresponds to an infinitude of plausible images, so the image is not uniquely determined. For instance, consider the caption “a painting of a capybara sitting on a field at sunrise.” Depending on the orientation of the capybara, it may be necessary to draw a shadow, though this detail is never mentioned explicitly. We explore DALL·E’s ability to resolve underspecification in three cases: changing style, setting, and time; drawing the same object in a variety of different situations; and generating an image of an object with specific text written on it.

Text prompt: a painting of a capybara sitting in a field at sunrise
[AI-generated image]

Text prompt: a stained glass window with an image of a blue strawberry
[AI-generated image]

Text prompt: a store front that has the word ‘openai’ written on it. a store front that has the word ‘openai’ written on it. a store front that has the word ‘openai’ written on it. ‘openai’ store front.
[AI-generated image]

With varying degrees of reliability, DALL·E provides access to a subset of the capabilities of a 3D rendering engine via natural language. It can independently control the attributes of a small number of objects, and to a limited extent, how many there are, and how they are arranged with respect to one another. It can also control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions.
Unlike a 3D rendering engine, whose inputs must be specified unambiguously and in complete detail, DALL·E is often able to “fill in the blanks” when the caption implies that the image must contain a certain detail that is not explicitly stated.

Applications of Preceding Capabilities

Next, we explore the use of the preceding capabilities for fashion and interior design.

Text prompt: a male mannequin dressed in an orange and black flannel shirt
[AI-generated image]

Text prompt: a female mannequin dressed in a black leather jacket and gold pleated skirt
[AI-generated image]

Text prompt: a living room with two white armchairs and a painting of the colosseum. the painting is mounted above a modern fireplace.
[AI-generated image]

Text prompt: a loft bedroom with a white bed next to a nightstand. there is a fish tank beside the bed.
[AI-generated image]

Combining Unrelated Concepts

The compositional nature of language allows us to put together concepts to describe both real and imaginary things. We find that DALL·E also has the ability to combine disparate ideas to synthesize objects, some of which are unlikely to exist in the real world. We explore this ability in two instances: transferring qualities from various concepts to animals, and designing products by taking inspiration from unrelated concepts.

Text prompt: a snail made of harp. a snail with the texture of a harp.
[AI-generated image]

Text prompt: an armchair in the shape of an avocado. an armchair imitating an avocado.
[AI-generated image]

Animal Illustrations

In the previous section, we explored DALL·E’s ability to combine unrelated concepts when generating images of real-world objects.
Here, we explore this ability in the context of art, for three kinds of illustrations: anthropomorphized versions of animals and objects, animal chimeras, and emojis.

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog
[AI-generated image]

Text prompt: a professional high quality illustration of a giraffe turtle chimera. a giraffe imitating a turtle. a giraffe made of turtle.
[AI-generated image]

Text prompt: a professional high quality emoji of a lovestruck cup of boba
[AI-generated image]

Zero-Shot Visual Reasoning

GPT-3 can be instructed to perform many kinds of tasks solely from a description and a cue to generate the answer supplied in its prompt, without any additional training. For example, when prompted with the phrase “here is the sentence ‘a person walking his dog in the park’ translated into French:”, GPT-3 answers “un homme qui promène son chien dans le parc.” This capability is called zero-shot reasoning. We find that DALL·E extends this capability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way.

Text & image prompt: the exact same cat on the top as a sketch on the bottom
[AI-generated image]

Text & image prompt: the exact same teapot on the top with ‘gpt’ written on it on the bottom
[AI-generated image]

We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it. Motivated by these results, we measure DALL·E’s aptitude for analogical reasoning problems by testing it on Raven’s progressive matrices, a visual IQ test that saw widespread use in the 20th century.
Text & image prompt: a sequence of geometric shapes.
[AI-generated image]

Geographic Knowledge

We find that DALL·E has learned about geographic facts, landmarks, and neighborhoods. Its knowledge of these concepts is surprisingly precise in some ways and flawed in others.

Text prompt: a photo of the food of china
[AI-generated image]

Text prompt: a photo of alamo square, san francisco, from a street at night
[AI-generated image]

Text prompt: a photo of san francisco’s golden gate bridge
[AI-generated image]

Temporal Knowledge

In addition to exploring DALL·E’s knowledge of concepts that vary over space, we also explore its knowledge of concepts that vary over time.

Text prompt: a photo of a phone from the 20s
[AI-generated image]

Summary of Approach and Prior Work

DALL·E is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens (256 for the text and 1024 for the image) and models all of them autoregressively. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. We provide more details about the architecture and training procedure in our [19]paper.

Text-to-image synthesis has been an active area of research since the pioneering work of Reed et al.,^[20]1 whose approach uses a GAN conditioned on text embeddings. The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN^[21]3 and StackGAN++^[22]4 use multi-scale GANs to scale up the image resolution and improve visual fidelity. AttnGAN^[23]5 incorporates attention between the text and image features, and proposes a contrastive text-image feature matching loss as an auxiliary objective. This is interesting to compare to our reranking with CLIP, which is done offline.
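The offline reranking mentioned above (keeping the top 32 of 512 samples per caption, as stated under Capabilities) amounts to scoring every candidate and keeping the best few. The sketch below shows only that selection step. The `rerank` helper and the random scores are stand-ins invented for this example; in the post, the score would come from CLIP's judgment of how well an image matches its caption, and no CLIP API is shown or implied here.

```python
import heapq
import random


def rerank(candidates, score_fn, keep):
    """Return the `keep` highest-scoring candidates, best first."""
    return heapq.nlargest(keep, candidates, key=score_fn)


# Stand-in data: 512 fake "samples", each carrying a made-up score in
# place of a real caption-image similarity.
rng = random.Random(0)
samples = [{"id": i, "score": rng.random()} for i in range(512)]

top = rerank(samples, score_fn=lambda s: s["score"], keep=32)
print(len(top))  # 32
```

Because the selection happens after sampling, it leaves the generator itself unchanged and only filters its outputs.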
Other work^[24]2^[25]6^[26]7 incorporates additional sources of supervision during training to improve image quality. Finally, work by Nguyen et al.^[27]8 and Cho et al.^[28]9 explores sampling-based strategies for image generation that leverage pretrained multimodal discriminative models.

Similar to the rejection sampling used in [29]VQVAE-2, we use [30]CLIP to rerank the top 32 of 512 samples for each caption in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search^[31]16, and can have a dramatic impact on sample quality.

Text prompt: an illustration of a baby daikon radish in a tutu walking a dog
[AI-generated images: caption 1, best 8 of 2048]
__________________________________________________________________

Footnotes

1. We decided to name our model using a portmanteau of the artist Salvador Dalí and Pixar’s WALL·E. [32]↩︎
2. A token is any symbol from a discrete vocabulary; for humans, each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192. The images are preprocessed to 256x256 resolution during training. Similar to VQVAE,^[33]14^[34]15 each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE^[35]10^[36]11 that we pretrained using a continuous relaxation.^[37]12^[38]13 We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes. [39]↩︎
3. Further details are provided in [40]a later section. [41]↩︎
4.
This task is called variable binding, and has been extensively studied in the literature.^[42]17^[43]18^[44]19^[45]20 [46]↩︎
__________________________________________________________________

References

1. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H. (2016). “[47]Generative adversarial text to image synthesis”. In ICML 2016. [48]↩︎
2. Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H. (2016). “[49]Learning what and where to draw”. In NIPS 2016. [50]↩︎
3. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D. (2016). “[51]StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks”. In ICCV 2017. [52]↩︎
4. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D. (2017). “[53]StackGAN++: realistic image synthesis with stacked generative adversarial networks”. In IEEE TPAMI 2018. [54]↩︎
5. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X. (2017). “[55]AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks”. [56]↩︎
6. Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., Gao, J. (2019). “[57]Object-driven text-to-image synthesis via adversarial training”. In CVPR 2019. [58]↩︎
7. Koh, J. Y., Baldridge, J., Lee, H., Yang, Y. (2020). “[59]Text-to-image generation grounded by fine-grained user attention”. In WACV 2021. [60]↩︎
8. Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J. (2016). “[61]Plug & play generative networks: conditional iterative generation of images in latent space”. [62]↩︎
9. Cho, J., Lu, J., Schwen, D., Hajishirzi, H., Kembhavi, A. (2020). “[63]X-LXMERT: Paint, caption, and answer questions with multi-modal transformers”. EMNLP 2020. [64]↩︎
10. Kingma, Diederik P., and Max Welling. “[65]Auto-encoding variational bayes.” arXiv preprint (2013). [66]↩︎ [67]↩︎
11. Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra.
“[68]Stochastic backpropagation and approximate inference in deep generative models.” arXiv preprint (2014). [69]↩︎ [70]↩︎
12. Jang, E., Gu, S., Poole, B. (2016). “[71]Categorical reparametrization with Gumbel-softmax”. [72]↩︎ [73]↩︎
13. Maddison, C., Mnih, A., Teh, Y. W. (2016). “[74]The Concrete distribution: a continuous relaxation of discrete random variables”. [75]↩︎ [76]↩︎
14. van den Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). “[77]Neural discrete representation learning”. [78]↩︎ [79]↩︎
15. Razavi, A., van den Oord, A., Vinyals, O. (2019). “[80]Generating diverse high-fidelity images with VQ-VAE-2”. [81]↩︎ [82]↩︎
16. Andreas, J., Klein, D., Levine, S. (2017). “[83]Learning with Latent Language”. [84]↩︎
17. Smolensky, P. (1990). “[85]Tensor product variable binding and the representation of symbolic structures in connectionist systems”. [86]↩︎ [87]↩︎
18. Plate, T. (1995). “[88]Holographic reduced representations: convolution algebra for compositional distributed representations”. [89]↩︎ [90]↩︎
19. Gayler, R. (1998). “[91]Multiplicative binding, representation operators & analogy”. [92]↩︎ [93]↩︎
20. Kanerva, P. (1997). “[94]Fully distributed representations”.
[95]↩︎ [96]↩︎
__________________________________________________________________

Authors

[97]Aditya Ramesh, [98]Mikhail Pavlov, [99]Gabriel Goh, [100]Scott Gray (Primary Authors)
[101]Mark Chen, [102]Rewon Child, [103]Vedant Misra, [104]Pamela Mishkin, [105]Gretchen Krueger, [106]Sandhini Agarwal, [107]Ilya Sutskever (Supporting Authors)
__________________________________________________________________

Filed Under

[108]Research, [109]Milestones, [110]Multimodal
__________________________________________________________________

Cover Artwork

Justin Jay Wang
__________________________________________________________________

Acknowledgments

Thanks to the following for their feedback on this work and contributions to this release: Alec Radford, Andrew Mayne, Jeff Clune, Ashley Pilipiszyn, Steve Dowling, Jong Wook Kim, Lei Pan, Heewoo Jun, John Schulman, Michael Tabatowski, Preetum Nakkiran, Jack Clark, Fraser Kelton, Jacob Jackson, Greg Brockman, Wojciech Zaremba, Justin Mao-Jones, David Luan, Shantanu Jain, Prafulla Dhariwal, Sam Altman, Pranav Shyam, Miles Brundage, Jakub Pachocki, and Ryan Lowe.
__________________________________________________________________

Contributions

Aditya Ramesh was the project lead: he developed the approach, trained the models, and wrote most of the blog copy.

Aditya Ramesh, Mikhail Pavlov, and Scott Gray worked together to scale up the model to 12 billion parameters, and designed the infrastructure used to draw samples from the model.

Aditya Ramesh, Gabriel Goh, and Justin Jay Wang worked together to create the interactive visuals for the blog.

Mark Chen and Aditya Ramesh created the images for Raven’s Progressive Matrices.

Rewon Child and Vedant Misra assisted in writing the blog.

Pamela Mishkin, Gretchen Krueger, and Sandhini Agarwal advised on broader impacts of the work and assisted in writing the blog.

Ilya Sutskever oversaw the project and assisted in writing the blog.

References

1. https://openai.com/blog/dall-e/#fn1
2.
https://arxiv.org/abs/2005.14165 3. https://openai.com/blog/image-gpt 4. https://openai.com/blog/dall-e/#fn2 5. https://openai.com/blog/dall-e/#rf14 6. https://openai.com/blog/dall-e/#rf15 7. https://openai.com/blog/dall-e/#rf10 8. https://openai.com/blog/dall-e/#rf11 9. https://openai.com/blog/dall-e/#rf12 10. https://openai.com/blog/dall-e/#rf13 11. https://openai.com/blog/clip/ 12. https://openai.com/blog/dall-e/#fn3 13. https://openai.com/blog/dall-e/#summary 14. https://openai.com/blog/dall-e/#fn4 15. https://openai.com/blog/dall-e/#rf17 16. https://openai.com/blog/dall-e/#rf18 17. https://openai.com/blog/dall-e/#rf19 18. https://openai.com/blog/dall-e/#rf20 19. https://arxiv.org/abs/2102.12092 20. https://openai.com/blog/dall-e/#rf1 21. https://openai.com/blog/dall-e/#rf3 22. https://openai.com/blog/dall-e/#rf4 23. https://openai.com/blog/dall-e/#rf5 24. https://openai.com/blog/dall-e/#rf2 25. https://openai.com/blog/dall-e/#rf6 26. https://openai.com/blog/dall-e/#rf7 27. https://openai.com/blog/dall-e/#rf8 28. https://openai.com/blog/dall-e/#rf9 29. https://arxiv.org/abs/1906.00446 30. https://openai.com/blog/clip/ 31. https://openai.com/blog/dall-e/#rf16 32. https://openai.com/blog/dall-e/#fnref1 33. https://openai.com/blog/dall-e/#rf14 34. https://openai.com/blog/dall-e/#rf15 35. https://openai.com/blog/dall-e/#rf10 36. https://openai.com/blog/dall-e/#rf11 37. https://openai.com/blog/dall-e/#rf12 38. https://openai.com/blog/dall-e/#rf13 39. https://openai.com/blog/dall-e/#fnref2 40. https://openai.com/blog/dall-e/#summary 41. https://openai.com/blog/dall-e/#fnref3 42. https://openai.com/blog/dall-e/#rf17 43. https://openai.com/blog/dall-e/#rf18 44. https://openai.com/blog/dall-e/#rf19 45. https://openai.com/blog/dall-e/#rf20 46. https://openai.com/blog/dall-e/#fnref4 47. https://arxiv.org/abs/1605.05396 48. https://openai.com/blog/dall-e/#rfref1 49. https://arxiv.org/abs/1610.02454 50. https://openai.com/blog/dall-e/#rfref2 51. 
https://arxiv.org/abs/1612.03242 52. https://openai.com/blog/dall-e/#rfref3 53. https://arxiv.org/abs/1710.10916 54. https://openai.com/blog/dall-e/#rfref4 55. https://arxiv.org/abs/1711.10485 56. https://openai.com/blog/dall-e/#rfref5 57. https://arxiv.org/abs/1902.10740 58. https://openai.com/blog/dall-e/#rfref6 59. https://arxiv.org/abs/2011.03775 60. https://openai.com/blog/dall-e/#rfref7 61. https://arxiv.org/abs/1612.00005 62. https://openai.com/blog/dall-e/#rfref8 63. https://arxiv.org/abs/2009.11278 64. https://openai.com/blog/dall-e/#rfref9 65. https://arxiv.org/abs/1312.6114 66. https://openai.com/blog/dall-e/#rfref10a 67. https://openai.com/blog/dall-e/#rfref10b 68. https://arxiv.org/abs/1401.4082 69. https://openai.com/blog/dall-e/#rfref11a 70. https://openai.com/blog/dall-e/#rfref11b 71. https://arxiv.org/abs/1611.01144 72. https://openai.com/blog/dall-e/#rfref12a 73. https://openai.com/blog/dall-e/#rfref12b 74. https://arxiv.org/abs/1611.00712 75. https://openai.com/blog/dall-e/#rfref13a 76. https://openai.com/blog/dall-e/#rfref13b 77. https://arxiv.org/abs/1711.00937 78. https://openai.com/blog/dall-e/#rfref14a 79. https://openai.com/blog/dall-e/#rfref14b 80. https://arxiv.org/abs/1906.00446 81. https://openai.com/blog/dall-e/#rfref15a 82. https://openai.com/blog/dall-e/#rfref15b 83. https://arxiv.org/abs/1711.00482 84. https://openai.com/blog/dall-e/#rfref16 85. http://www.lscp.net/persons/dupoux/teaching/AT1_2014/papers/Smolensky_1990_TensorProductVariableBinding.AI.pdf 86. https://openai.com/blog/dall-e/#rfref17a 87. https://openai.com/blog/dall-e/#rfref17b 88. https://www.ijcai.org/Proceedings/91-1/Papers/006.pdf 89. https://openai.com/blog/dall-e/#rfref18a 90. https://openai.com/blog/dall-e/#rfref18b 91. http://cogprints.org/502/ 92. https://openai.com/blog/dall-e/#rfref19a 93. https://openai.com/blog/dall-e/#rfref19b 94. http://www.cap-lore.com/RWC97-kanerva.pdf 95. https://openai.com/blog/dall-e/#rfref20a 96. 
https://openai.com/blog/dall-e/#rfref20b 97. https://openai.com/blog/authors/aditya/ 98. https://openai.com/blog/authors/mikhail/ 99. https://openai.com/blog/authors/gabriel/ 100. https://openai.com/blog/authors/scott/ 101. https://openai.com/blog/authors/mark/ 102. https://openai.com/blog/authors/rewon/ 103. https://openai.com/blog/authors/vedant/ 104. https://openai.com/blog/authors/pamela/ 105. https://openai.com/blog/authors/gretchen/ 106. https://openai.com/blog/authors/sandhini/ 107. https://openai.com/blog/authors/ilya/ 108. https://openai.com/blog/tags/research/ 109. https://openai.com/blog/tags/milestones/ 110. https://openai.com/blog/tags/multimodal/