   DALL·E^[1][1] is a 12-billion parameter version of [2]GPT-3 trained to
   generate images from text descriptions, using a dataset of text–image
   pairs. We’ve found that it has a diverse set of capabilities, including
   creating anthropomorphized versions of animals and objects, combining
   unrelated concepts in plausible ways, rendering text, and applying
   transformations to existing images.
     __________________________________________________________________

   Text prompt: an illustration of a baby daikon radish in a tutu walking a
   dog
   [AI-generated images]

   Text prompt: an armchair in the shape of an avocado. . . .
   [AI-generated images]

   Text prompt: a store front that has the word ‘openai’ written on it. . . .
   [AI-generated images]

   Text & image prompt: the exact same cat on the top as a sketch on the
   bottom
   [AI-generated images]
     __________________________________________________________________

   GPT-3 showed that language can be used to instruct a large neural
   network to perform a variety of text generation tasks. [3]Image GPT
   showed that the same type of neural network can also be used to
   generate images with high fidelity. We extend these findings to show
   that manipulating visual concepts through language is now within reach.

Overview

   Like GPT-3, DALL·E is a transformer language model. It receives both
   the text and the image as a single stream of data containing up to 1280
   tokens, and is trained using maximum likelihood to generate all of the
   tokens, one after another.^[4][2]
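
   As a minimal sketch of what training with maximum likelihood means for
   a token stream, the snippet below computes the negative log-likelihood
   of one sequence from the probabilities a model assigned to each token
   that actually occurred. The function name and the example
   probabilities are illustrative assumptions, not values from DALL·E.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of one token sequence.

    token_probs[i] is the probability the model assigned to the token that
    actually occurred at position i, given all earlier tokens. Maximizing
    likelihood is the same as minimizing this sum.
    """
    for p in token_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("each probability must be in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities for a 4-token stream.
probs = [0.5, 0.25, 0.5, 0.125]
nll = sequence_nll(probs)
```

   Because the tokens are generated one after another, the likelihood of
   the whole stream factorizes into these per-token terms.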

   A token is any symbol from a discrete vocabulary; for humans, each
   English letter is a token from a 26-letter alphabet. DALL·E’s
   vocabulary has tokens for both text and image concepts. Specifically,
   each image caption is represented using a maximum of 256 BPE-encoded
   tokens with a vocabulary size of 16384, and the image is represented
   using 1024 tokens with a vocabulary size of 8192.
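
   The token budget above can be sketched as follows. The sizes come from
   the text; placing the image codes after the text ids in one shared id
   space (by adding an offset) is an assumption made for illustration,
   and build_stream is a hypothetical helper.

```python
TEXT_VOCAB = 16384      # BPE text vocabulary size (from the text)
IMAGE_VOCAB = 8192      # image-code vocabulary size (from the text)
MAX_TEXT_TOKENS = 256   # maximum caption length in tokens (from the text)
IMAGE_TOKENS = 1024     # image tokens per image (from the text)


def build_stream(text_ids, image_ids):
    """Concatenate caption tokens and image tokens into a single stream.

    Assumption: image codes are shifted by TEXT_VOCAB so that text and
    image tokens never share an id.
    """
    if len(text_ids) > MAX_TEXT_TOKENS:
        raise ValueError("caption is longer than 256 tokens")
    if len(image_ids) != IMAGE_TOKENS:
        raise ValueError("expected exactly 1024 image tokens")
    if any(not 0 <= t < TEXT_VOCAB for t in text_ids):
        raise ValueError("text id out of range")
    if any(not 0 <= t < IMAGE_VOCAB for t in image_ids):
        raise ValueError("image id out of range")
    return list(text_ids) + [TEXT_VOCAB + t for t in image_ids]


# A 3-token caption followed by 1024 image codes.
stream = build_stream([5, 7, 9], [0] * 1023 + [8191])
```

   With a full-length caption the stream reaches 256 + 1024 = 1280
   tokens, which matches the upper bound stated earlier.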

   The images are preprocessed to 256x256 resolution during training.
   Similar to VQVAE,^[5]14^[6]15 each image is compressed to a 32x32 grid
   of discrete latent codes using a discrete VAE^[7]10^[8]11 that we
   pretrained using a continuous relaxation.^[9]12^[10]13 We found that
   training using the relaxation obviates the need for an explicit
   codebook, EMA loss, or tricks like dead code revival, and can scale up
   to large vocabulary sizes.
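
   To make the numbers and the idea of a continuous relaxation concrete,
   here is a small sketch in plain Python. The grid arithmetic follows
   the text. The gumbel_softmax function is a generic illustration of the
   Gumbel-softmax relaxation cited above, not the training code used for
   DALL·E; the logits, temperature, and seed are arbitrary assumptions.

```python
import math
import random

IMAGE_SIDE = 256                            # training resolution (from the text)
GRID_SIDE = 32                              # latent grid side (from the text)
DOWNSAMPLE = IMAGE_SIDE // GRID_SIDE        # pixels per latent cell along each side
NUM_IMAGE_TOKENS = GRID_SIDE * GRID_SIDE    # one token per grid cell


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed (soft) one-hot sample over the categories.

    Adds Gumbel(0, 1) noise to each logit and applies a softmax with the
    given temperature. Lower temperatures push the result toward a hard
    one-hot vector.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                        # subtract the max for stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
weights = gumbel_softmax([2.0, 0.5, -1.0, 0.0], temperature=0.5, rng=rng)
```

   The output is a probability vector rather than a single index, which
   is what lets gradients flow through the choice of code.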

   This training procedure allows DALL·E to not only generate an image
   from scratch, but also to regenerate any rectangular region of an
   existing image that extends to the bottom-right corner, in a way that
   is consistent with the text prompt.
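
   One way to picture this is a mask over the 32x32 token grid. The
   sketch below assumes the image tokens are laid out in raster order
   (left to right, top to bottom), which the text does not state
   explicitly; regenerate_mask, complete, and sample_fn are hypothetical
   names, and sample_fn stands in for drawing a token from the model.

```python
GRID = 32  # latent grid side (from the text)


def regenerate_mask(top, left, grid=GRID):
    """Mark the rectangle from (top, left) to the bottom-right corner.

    mask[r][c] is True where the token should be regenerated.
    """
    if not (0 <= top < grid and 0 <= left < grid):
        raise ValueError("top and left must lie inside the grid")
    return [[r >= top and c >= left for c in range(grid)] for r in range(grid)]


def complete(image_tokens, mask, sample_fn):
    """Walk the grid in raster order, keeping known tokens and sampling the rest.

    sample_fn receives the tokens produced so far and returns the next one.
    """
    grid = len(mask)
    out = []
    for r in range(grid):
        for c in range(grid):
            if mask[r][c]:
                out.append(sample_fn(out))
            else:
                out.append(image_tokens[r * grid + c])
    return out


# Keep the top half of an image and regenerate the bottom half.
original = list(range(GRID * GRID))
mask = regenerate_mask(top=16, left=0)
result = complete(original, mask, sample_fn=lambda prefix: -1)
```

   When left is 0 the kept tokens form a pure prefix of the stream, so
   completion reduces to ordinary left-to-right generation.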

   We recognize that work involving generative models has the potential
   for significant, broad societal impacts. In the future, we plan to
   analyze how models like DALL·E relate to societal issues like economic
   impact on certain work processes and professions, the potential for
   bias in the model outputs, and the longer term ethical challenges
   implied by this technology.

Capabilities

   We find that DALL·E is able to create plausible images for a great
   variety of sentences that explore the compositional structure of
   language. We illustrate this using a series of interactive visuals in
   the next section. The samples shown for each caption in the visuals are
   obtained by taking the top 32 of 512 after reranking with [11]CLIP, but
   we do not use any manual cherry-picking, aside from the thumbnails and
   standalone images that appear outside.^[12][3]

   Further details are provided in [13]a later section.

    Controlling Attributes

   We test DALL·E’s ability to modify several of an object’s attributes,
   as well as the number of times that it appears.

   Text prompt: a pentagonal green clock. a green clock in the shape of a
   pentagon.

   Text prompt: a cube made of porcupine. a cube with the texture of a
   porcupine.

   Text prompt: a collection of glasses is sitting on a table

    Drawing Multiple Objects

   Simultaneously controlling multiple objects, their attributes, and
   their spatial relationships presents a new challenge. For example,
   consider the phrase “a hedgehog wearing a red hat, yellow gloves, blue
   shirt, and green pants.” To correctly interpret this sentence, DALL·E
   must not only correctly compose each piece of apparel with the animal,
   but also form the associations (hat, red), (gloves, yellow), (shirt,
   blue), and (pants, green) without mixing them up.^[14][4]

   This task is called variable binding, and has been extensively studied
   in the literature.^[15]17^[16]18^[17]19^[18]20

   We test DALL·E’s ability to do this for relative positioning, stacking
   objects, and controlling multiple attributes.

   Text prompt: a small red block sitting on a large green block

   Text prompt: a stack of 3 cubes. a red cube is on the top, sitting on a
   green cube. the green cube is in the middle, sitting on a blue cube. the
   blue cube is on the bottom.

   Text prompt: an emoji of a baby penguin wearing a blue hat, red gloves,
   green shirt, and yellow pants

   While DALL·E does offer some level of controllability over the
   attributes and positions of a small number of objects, the success rate
   can depend on how the caption is phrased. As more objects are
   introduced, DALL·E is prone to confusing the associations between the
   objects and their colors, and the success rate decreases sharply. We
   also note that DALL·E is brittle with respect to rephrasing of the
   caption in these scenarios: alternative, semantically equivalent
   captions often yield no correct interpretations.

    Visualizing Perspective and Three-Dimensionality

   We find that DALL·E also allows for control over the viewpoint of a
   scene and the 3D style in which a scene is rendered.

   Text prompt: an extreme close-up view of a capybara sitting in a field

   Text prompt: a capybara made of voxels sitting in a field

   To push this further, we test DALL·E’s ability to repeatedly draw the
   head of a well-known figure at each angle from a sequence of equally
   spaced angles, and find that we can recover a smooth animation of the
   rotating head.

   Text prompt: a photograph of a bust of homer

   DALL·E appears to be able to apply some types of optical distortions to
   scenes, as we see with the options “fisheye lens view” and “a spherical
   panorama.” This motivated us to explore its ability to
   generate reflections.

   Text prompt: a plain white cube looking at its own reflection in a
   mirror. a plain white cube gazing at itself in a mirror.

    Visualizing Internal and External Structure

   The samples from the “extreme close-up view” and “x-ray” style led us
   to further explore DALL·E’s ability to render internal structure with
   cross-sectional views, and external structure with macro photographs.

   Text prompt: a cross-section view of a walnut

   Text prompt: a macro photograph of brain coral

    Inferring Contextual Details

   The task of translating text to images is underspecified: a single
   caption generally corresponds to an infinitude of plausible images, so
   the image is not uniquely determined. For instance, consider the
   caption “a painting of a capybara sitting on a field at sunrise.”
   Depending on the orientation of the capybara, it may be necessary to
   draw a shadow, though this detail is never mentioned explicitly. We
   explore DALL·E’s ability to resolve underspecification in three cases:
   changing style, setting, and time; drawing the same object in a variety
   of different situations; and generating an image of an object with
   specific text written on it.

   Text prompt: a painting of a capybara sitting in a field at sunrise

   Text prompt: a stained glass window with an image of a blue strawberry

   Text prompt: a store front that has the word ‘openai’ written on it. a
   store front that has the word ‘openai’ written on it. a store front that
   has the word ‘openai’ written on it. ‘openai’ store front.

   With varying degrees of reliability, DALL·E provides access to a subset
   of the capabilities of a 3D rendering engine via natural language. It
   can independently control the attributes of a small number of objects,
   and to a limited extent, how many there are, and how they are arranged
   with respect to one another. It can also control the location and angle
   from which a scene is rendered, and can generate known objects in
   compliance with precise specifications of angle and
   lighting conditions.

   Unlike a 3D rendering engine, whose inputs must be specified
   unambiguously and in complete detail, DALL·E is often able to “fill in
   the blanks” when the caption implies that the image must contain a
   certain detail that is not explicitly stated.

    Applications of Preceding Capabilities

   Next, we explore the use of the preceding capabilities for fashion and
   interior design.

   Text prompt: a male mannequin dressed in an orange and black flannel
   shirt

   Text prompt: a female mannequin dressed in a black leather jacket and
   gold pleated skirt

   Text prompt: a living room with two white armchairs and a painting of
   the colosseum. the painting is mounted above a modern fireplace.

   Text prompt: a loft bedroom with a white bed next to a nightstand. there
   is a fish tank beside the bed.

    Combining Unrelated Concepts

   The compositional nature of language allows us to put together concepts
   to describe both real and imaginary things. We find that DALL·E also
   has the ability to combine disparate ideas to synthesize objects, some
   of which are unlikely to exist in the real world. We explore this
   ability in two instances: transferring qualities from various concepts
   to animals, and designing products by taking inspiration from
   unrelated concepts.

   Text prompt: a snail made of harp. a snail with the texture of a harp.

   Text prompt: an armchair in the shape of an avocado. an armchair
   imitating an avocado.

    Animal Illustrations

   In the previous section, we explored DALL·E’s ability to combine
   unrelated concepts when generating images of real-world objects. Here,
   we explore this ability in the context of art, for three kinds of
   illustrations: anthropomorphized versions of animals and objects,
   animal chimeras, and emojis.

   Text prompt: an illustration of a baby daikon radish in a tutu walking a
   dog

   Text prompt: a professional high quality illustration of a giraffe
   turtle chimera. a giraffe imitating a turtle. a giraffe made of turtle.

   Text prompt: a professional high quality emoji of a lovestruck cup of
   boba

    Zero-Shot Visual Reasoning

   GPT-3 can be instructed to perform many kinds of tasks solely from a
   description and a cue to generate the answer supplied in its prompt,
   without any additional training. For example, when prompted with the
   phrase “here is the sentence ‘a person walking his dog in the park’
   translated into French:”, GPT-3 answers “un homme qui promène son chien
   dans le parc.” This capability is called zero-shot reasoning. We find
   that DALL·E extends this capability to the visual domain, and is able
   to perform several kinds of image-to-image translation tasks when
   prompted in the right way.

   Text & image prompt: the exact same cat on the top as a sketch on the
   bottom

   Text & image prompt: the exact same teapot on the top with ‘gpt’ written
   on it on the bottom

   We did not anticipate that this capability would emerge, and made no
   modifications to the neural network or training procedure to encourage
   it. Motivated by these results, we measure DALL·E’s aptitude for
   analogical reasoning problems by testing it on Raven’s progressive
   matrices, a visual IQ test that saw widespread use in the 20th century.

   Text & image prompt: a sequence of geometric shapes.

    Geographic Knowledge

   We find that DALL·E has learned about geographic facts, landmarks, and
   neighborhoods. Its knowledge of these concepts is surprisingly precise
   in some ways and flawed in others.

   Text prompt: a photo of the food of china

   Text prompt: a photo of alamo square, san francisco, from a street at
   night

   Text prompt: a photo of san francisco’s golden gate bridge

    Temporal Knowledge

   In addition to exploring DALL·E’s knowledge of concepts that vary over
   space, we also explore its knowledge of concepts that vary over time.

   Text prompt: a photo of a phone from the 20s

Summary of Approach and Prior Work

   DALL·E is a simple decoder-only transformer that receives both the text
   and the image as a single stream of 1280 tokens—256 for the text and
   1024 for the image—and models all of them autoregressively. The
   attention mask at each of its 64 self-attention layers allows each
   image token to attend to all text tokens. DALL·E uses the standard
   causal mask for the text tokens, and sparse attention for the image
   tokens with either a row, column, or convolutional attention pattern,
   depending on the layer. We provide more details about the architecture
   and training procedure in our [19]paper.
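
   The masking rule described above can be sketched on a toy scale. The
   version below keeps the causal mask for text and lets every image
   token see all text tokens, as stated in the text. The image-to-image
   rule used here ("earlier tokens in the same grid row") is a deliberate
   simplification of a row pattern and an assumption of this sketch; the
   actual row, column, and convolutional patterns are specified in the
   paper. build_mask is a hypothetical helper.

```python
def build_mask(n_text, grid):
    """Build a boolean attention mask for a text-then-image stream.

    mask[q][k] is True when query position q may attend to key position k.
    Positions 0..n_text-1 are text; the rest are image tokens in raster
    order on a grid x grid layout.
    """
    n = n_text + grid * grid
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):          # never attend to future positions
            if k < n_text:
                # Text keys: visible to later text tokens (causal mask)
                # and to every image token.
                mask[q][k] = True
            else:
                # Both q and k are image tokens here.
                q_row = (q - n_text) // grid
                k_row = (k - n_text) // grid
                mask[q][k] = q_row == k_row   # simplified row pattern
    return mask


# Toy sizes: 4 text tokens and a 3x3 image grid (the real model uses 256
# text tokens and a 32x32 grid).
mask = build_mask(n_text=4, grid=3)
```

   Restricting image-to-image attention this way is what makes the
   pattern sparse, while the text remains fully visible to every image
   position.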

   Text-to-image synthesis has been an active area of research since the
   pioneering work of Reed et al.,^[20]1 whose approach uses a GAN
   conditioned on text embeddings. The embeddings are produced by an
   encoder pretrained using a contrastive loss, not unlike CLIP.
   StackGAN^[21]3 and StackGAN++^[22]4 use multi-scale GANs to scale up
   the image resolution and improve visual fidelity. AttnGAN^[23]5
   incorporates attention between the text and image features, and
   proposes a contrastive text-image feature matching loss as an auxiliary
   objective. This is interesting to compare to our reranking with CLIP,
   which is done offline. Other work^[24]2^[25]6^[26]7 incorporates
   additional sources of supervision during training to improve image
   quality. Finally, work by Nguyen et al.^[27]8 and Cho et al.^[28]9
   explores sampling-based strategies for image generation that leverage
   pretrained multimodal discriminative models.

   Similar to the rejection sampling used in [29]VQVAE-2, we use [30]CLIP
   to rerank the 512 samples drawn for each caption and keep the top 32 in
   all of the interactive visuals. This procedure can also be seen as a
   kind of language-guided search,^[31]16 and can have a dramatic impact
   on sample quality.
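
   The reranking step can be sketched as a plain top-k selection. The
   scores below are a hypothetical stand-in for a caption-image
   similarity such as the one CLIP provides; no CLIP API is called, and
   rerank is an illustrative helper rather than the code used for the
   visuals.

```python
def rerank(samples, scores, keep):
    """Return the `keep` samples with the highest scores, best first."""
    if len(samples) != len(scores):
        raise ValueError("need exactly one score per sample")
    if not 0 <= keep <= len(samples):
        raise ValueError("keep must be between 0 and the number of samples")
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in order[:keep]]


# 512 candidate samples, identified here by index, with made-up scores.
samples = list(range(512))
scores = [(i * 37) % 512 for i in samples]   # a permutation of 0..511
top = rerank(samples, scores, keep=32)
```

   The generator itself is unchanged by this step; only the choice of
   which samples to show depends on the scores.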

   Text prompt: an illustration of a baby daikon radish in a tutu walking a
   dog
   [caption 1, best 8 of 2048]
     __________________________________________________________________

   Footnotes
    1. We decided to name our model using a portmanteau of the artist
       Salvador Dalí and Pixar’s WALL·E. [32]↩︎
    2. A token is any symbol from a discrete vocabulary; for humans, each
       English letter is a token from a 26-letter alphabet. DALL·E’s
       vocabulary has tokens for both text and image concepts.
       Specifically, each image caption is represented using a maximum of
       256 BPE-encoded tokens with a vocabulary size of 16384, and the
       image is represented using 1024 tokens with a vocabulary size of
       8192.
       The images are preprocessed to 256x256 resolution during training.
       Similar to VQVAE,^[33]14^[34]15 each image is compressed to a 32x32
       grid of discrete latent codes using a discrete VAE^[35]10^[36]11
       that we pretrained using a continuous relaxation.^[37]12^[38]13 We
       found that training using the relaxation obviates the need for an
       explicit codebook, EMA loss, or tricks like dead code revival, and
       can scale up to large vocabulary sizes. [39]↩︎
    3. Further details are provided in [40]a later section. [41]↩︎
    4. This task is called variable binding, and has been extensively
       studied in the literature.^[42]17^[43]18^[44]19^[45]20 [46]↩︎
     __________________________________________________________________

   References
    1. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.
       (2016). “[47]Generative adversarial text to image synthesis”. In
       ICML 2016. [48]↩︎
    2. Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H.
       (2016). “[49]Learning what and where to draw”. In NIPS 2016. [50]↩︎
    3. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas,
       D. (2016). “[51]StackGAN: Text to photo-realistic image synthesis
       with stacked generative adversarial networks”. In ICCV 2017. [52]↩︎
    4. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas,
       D. (2017). “[53]StackGAN++: realistic image synthesis with stacked
       generative adversarial networks”. In IEEE TPAMI 2018. [54]↩︎
    5. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.
       (2017). “[55]AttnGAN: Fine-grained text to image generation with
       attentional generative adversarial networks”. [56]↩︎
    6. Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., Gao, J.
       (2019). “[57]Object-driven text-to-image synthesis via adversarial
       training”. In CVPR 2019. [58]↩︎
    7. Koh, J. Y., Baldridge, J., Lee, H., Yang, Y. (2020).
       “[59]Text-to-image generation grounded by fine-grained user
       attention”. In WACV 2021. [60]↩︎
    8. Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J.
       (2016). “[61]Plug & play generative networks: conditional iterative
       generation of images in latent space”. [62]↩︎
    9. Cho, J., Lu, J., Schwen, D., Hajishirzi, H., Kembhavi, A. (2020).
       “[63]X-LXMERT: Paint, caption, and answer questions with
       multi-modal transformers”. In EMNLP 2020. [64]↩︎
   10. Kingma, D. P., Welling, M. (2013). “[65]Auto-encoding variational
       Bayes”. arXiv preprint. [66]↩︎ [67]↩︎
   11. Rezende, D. J., Mohamed, S., Wierstra, D. (2014). “[68]Stochastic
       backpropagation and approximate inference in deep generative
       models”. arXiv preprint. [69]↩︎ [70]↩︎
   12. Jang, E., Gu, S., Poole, B. (2016). “[71]Categorical
       reparametrization with Gumbel-softmax”. [72]↩︎ [73]↩︎
   13. Maddison, C., Mnih, A., Teh, Y. W. (2016). “[74]The Concrete
       distribution: a continuous relaxation of discrete
       random variables”. [75]↩︎ [76]↩︎
   14. van den Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). “[77]Neural
       discrete representation learning”. [78]↩︎ [79]↩︎
   15. Razavi, A., van den Oord, A., Vinyals, O. (2019). “[80]Generating
       diverse high-fidelity images with VQ-VAE-2”. [81]↩︎ [82]↩︎
   16. Andreas, J., Klein, D., Levine, S. (2017). “[83]Learning with
       Latent Language”. [84]↩︎
   17. Smolensky, P. (1990). “[85]Tensor product variable binding and the
       representation of symbolic structures in connectionist systems”.
       [86]↩︎ [87]↩︎
   18. Plate, T. (1995). “[88]Holographic reduced representations:
       convolution algebra for compositional distributed representations”.
       [89]↩︎ [90]↩︎
   19. Gayler, R. (1998). “[91]Multiplicative binding, representation
       operators & analogy”. [92]↩︎ [93]↩︎
   20. Kanerva, P. (1997). “[94]Fully distributed representations”. [95]↩︎
       [96]↩︎
     __________________________________________________________________

   Authors
   [97]Aditya Ramesh, [98]Mikhail Pavlov, [99]Gabriel Goh, [100]Scott Gray
   (Primary Authors)
   [101]Mark Chen, [102]Rewon Child, [103]Vedant Misra, [104]Pamela
   Mishkin, [105]Gretchen Krueger, [106]Sandhini Agarwal, [107]Ilya
   Sutskever
   (Supporting Authors)
     __________________________________________________________________

   Filed Under
   [108]Research, [109]Milestones, [110]Multimodal
     __________________________________________________________________

   Cover Artwork

   Justin Jay Wang
     __________________________________________________________________

   Acknowledgments

   Thanks to the following for their feedback on this work and
   contributions to this release: Alec Radford, Andrew Mayne, Jeff Clune,
   Ashley Pilipiszyn, Steve Dowling, Jong Wook Kim, Lei Pan, Heewoo Jun,
   John Schulman, Michael Tabatowski, Preetum Nakkiran, Jack Clark, Fraser
   Kelton, Jacob Jackson, Greg Brockman, Wojciech Zaremba, Justin
   Mao-Jones, David Luan, Shantanu Jain, Prafulla Dhariwal, Sam Altman,
   Pranav Shyam, Miles Brundage, Jakub Pachocki, and Ryan Lowe.
     __________________________________________________________________

   Contributions

   Aditya Ramesh was the project lead: he developed the approach, trained
   the models, and wrote most of the blog copy.

   Aditya Ramesh, Mikhail Pavlov, and Scott Gray worked together to scale
   up the model to 12 billion parameters, and designed the infrastructure
   used to draw samples from the model.

   Aditya Ramesh, Gabriel Goh, and Justin Jay Wang worked together to
   create the interactive visuals for the blog.

   Mark Chen and Aditya Ramesh created the images for Raven’s
   Progressive Matrices.

   Rewon Child and Vedant Misra assisted in writing the blog.

   Pamela Mishkin, Gretchen Krueger, and Sandhini Agarwal advised on
   broader impacts of the work and assisted in writing the blog.

   Ilya Sutskever oversaw the project and assisted in writing the blog.

References

   1. https://openai.com/blog/dall-e/#fn1
   2. https://arxiv.org/abs/2005.14165
   3. https://openai.com/blog/image-gpt
   4. https://openai.com/blog/dall-e/#fn2
   5. https://openai.com/blog/dall-e/#rf14
   6. https://openai.com/blog/dall-e/#rf15
   7. https://openai.com/blog/dall-e/#rf10
   8. https://openai.com/blog/dall-e/#rf11
   9. https://openai.com/blog/dall-e/#rf12
  10. https://openai.com/blog/dall-e/#rf13
  11. https://openai.com/blog/clip/
  12. https://openai.com/blog/dall-e/#fn3
  13. https://openai.com/blog/dall-e/#summary
  14. https://openai.com/blog/dall-e/#fn4
  15. https://openai.com/blog/dall-e/#rf17
  16. https://openai.com/blog/dall-e/#rf18
  17. https://openai.com/blog/dall-e/#rf19
  18. https://openai.com/blog/dall-e/#rf20
  19. https://arxiv.org/abs/2102.12092
  20. https://openai.com/blog/dall-e/#rf1
  21. https://openai.com/blog/dall-e/#rf3
  22. https://openai.com/blog/dall-e/#rf4
  23. https://openai.com/blog/dall-e/#rf5
  24. https://openai.com/blog/dall-e/#rf2
  25. https://openai.com/blog/dall-e/#rf6
  26. https://openai.com/blog/dall-e/#rf7
  27. https://openai.com/blog/dall-e/#rf8
  28. https://openai.com/blog/dall-e/#rf9
  29. https://arxiv.org/abs/1906.00446
  30. https://openai.com/blog/clip/
  31. https://openai.com/blog/dall-e/#rf16
  32. https://openai.com/blog/dall-e/#fnref1
  33. https://openai.com/blog/dall-e/#rf14
  34. https://openai.com/blog/dall-e/#rf15
  35. https://openai.com/blog/dall-e/#rf10
  36. https://openai.com/blog/dall-e/#rf11
  37. https://openai.com/blog/dall-e/#rf12
  38. https://openai.com/blog/dall-e/#rf13
  39. https://openai.com/blog/dall-e/#fnref2
  40. https://openai.com/blog/dall-e/#summary
  41. https://openai.com/blog/dall-e/#fnref3
  42. https://openai.com/blog/dall-e/#rf17
  43. https://openai.com/blog/dall-e/#rf18
  44. https://openai.com/blog/dall-e/#rf19
  45. https://openai.com/blog/dall-e/#rf20
  46. https://openai.com/blog/dall-e/#fnref4
  47. https://arxiv.org/abs/1605.05396
  48. https://openai.com/blog/dall-e/#rfref1
  49. https://arxiv.org/abs/1610.02454
  50. https://openai.com/blog/dall-e/#rfref2
  51. https://arxiv.org/abs/1612.03242
  52. https://openai.com/blog/dall-e/#rfref3
  53. https://arxiv.org/abs/1710.10916
  54. https://openai.com/blog/dall-e/#rfref4
  55. https://arxiv.org/abs/1711.10485
  56. https://openai.com/blog/dall-e/#rfref5
  57. https://arxiv.org/abs/1902.10740
  58. https://openai.com/blog/dall-e/#rfref6
  59. https://arxiv.org/abs/2011.03775
  60. https://openai.com/blog/dall-e/#rfref7
  61. https://arxiv.org/abs/1612.00005
  62. https://openai.com/blog/dall-e/#rfref8
  63. https://arxiv.org/abs/2009.11278
  64. https://openai.com/blog/dall-e/#rfref9
  65. https://arxiv.org/abs/1312.6114
  66. https://openai.com/blog/dall-e/#rfref10a
  67. https://openai.com/blog/dall-e/#rfref10b
  68. https://arxiv.org/abs/1401.4082
  69. https://openai.com/blog/dall-e/#rfref11a
  70. https://openai.com/blog/dall-e/#rfref11b
  71. https://arxiv.org/abs/1611.01144
  72. https://openai.com/blog/dall-e/#rfref12a
  73. https://openai.com/blog/dall-e/#rfref12b
  74. https://arxiv.org/abs/1611.00712
  75. https://openai.com/blog/dall-e/#rfref13a
  76. https://openai.com/blog/dall-e/#rfref13b
  77. https://arxiv.org/abs/1711.00937
  78. https://openai.com/blog/dall-e/#rfref14a
  79. https://openai.com/blog/dall-e/#rfref14b
  80. https://arxiv.org/abs/1906.00446
  81. https://openai.com/blog/dall-e/#rfref15a
  82. https://openai.com/blog/dall-e/#rfref15b
  83. https://arxiv.org/abs/1711.00482
  84. https://openai.com/blog/dall-e/#rfref16
  85. http://www.lscp.net/persons/dupoux/teaching/AT1_2014/papers/Smolensky_1990_TensorProductVariableBinding.AI.pdf
  86. https://openai.com/blog/dall-e/#rfref17a
  87. https://openai.com/blog/dall-e/#rfref17b
  88. https://www.ijcai.org/Proceedings/91-1/Papers/006.pdf
  89. https://openai.com/blog/dall-e/#rfref18a
  90. https://openai.com/blog/dall-e/#rfref18b
  91. http://cogprints.org/502/
  92. https://openai.com/blog/dall-e/#rfref19a
  93. https://openai.com/blog/dall-e/#rfref19b
  94. http://www.cap-lore.com/RWC97-kanerva.pdf
  95. https://openai.com/blog/dall-e/#rfref20a
  96. https://openai.com/blog/dall-e/#rfref20b
  97. https://openai.com/blog/authors/aditya/
  98. https://openai.com/blog/authors/mikhail/
  99. https://openai.com/blog/authors/gabriel/
 100. https://openai.com/blog/authors/scott/
 101. https://openai.com/blog/authors/mark/
 102. https://openai.com/blog/authors/rewon/
 103. https://openai.com/blog/authors/vedant/
 104. https://openai.com/blog/authors/pamela/
 105. https://openai.com/blog/authors/gretchen/
 106. https://openai.com/blog/authors/sandhini/
 107. https://openai.com/blog/authors/ilya/
 108. https://openai.com/blog/tags/research/
 109. https://openai.com/blog/tags/milestones/
 110. https://openai.com/blog/tags/multimodal/