This is old technology of OpenAI's that preceded e.g. the Russian image generator. The linked code demonstrates the encoder and decoder that compress an image down to a small grid of numbers (discrete tokens) while preserving most of its content; those tokens are what get used to train models that convert between images and other modalities (there's a rough sketch of the round trip at the end of this post). The language-to-image part does not, at first glance, appear to be included at all, although I could be wrong. Here it is as a colab gist that can be run in a browser; I just added code at the top to clone and link the model: https://colab.research.google.com/gist/xloem/0290832de95d97abfed242acb2e398e...

DALL-E was a precursor to the public open-source and then closed Russian image generation models I randomly posted some time ago. DALL-E 2 is presently in waitlist private beta, which is news. You can sign up at https://openai.com/dall-e-2/ , which links to https://labs.openai.com/waitlist .

If you want to play with image generation now, there are a bunch of bots on discord channels; one site is https://replicate.com/pixray/text2image , another is https://huggingface.co/spaces/flax-community/dalle-mini . I'm afraid I'm not up to speed on what the latest free, open-source, community solution for text2image is; I haven't been watching it recently.

On chat venues you can see people playing with image generation bots nonstop, for days and days, just generating images. It's one of the most common pretrained transformer model bots. Some of these people are addicted, others are paid. I've experienced a little addiction myself. It's still somewhat hard to find something that will make exactly what you want without training a model, though.
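
In case it helps to see the shape of it, here is a minimal sketch of the encode/decode round trip the notebook performs, written from memory against the dall_e package in OpenAI's repo; treat the weight URLs, the 256-pixel target size, and the preprocessing details as assumptions to check against the gist, and "your_image.jpg" is just a placeholder filename.

import PIL.Image
import torch
import torch.nn.functional as F
import torchvision.transforms as T
import torchvision.transforms.functional as TF
from dall_e import map_pixels, unmap_pixels, load_model

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Pretrained discrete VAE weights (URLs assumed from the openai/DALL-E README)
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", dev)
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", dev)

def preprocess(img, target=256):
    # Resize and center-crop to 256x256, then map pixels into the range the dVAE expects
    s = min(img.size)
    img = TF.resize(img, [round(target / s * img.size[1]), round(target / s * img.size[0])])
    img = TF.center_crop(img, output_size=[target, target])
    return map_pixels(torch.unsqueeze(T.ToTensor()(img), 0))

x = preprocess(PIL.Image.open("your_image.jpg").convert("RGB")).to(dev)

with torch.no_grad():
    # Encode: 256x256x3 pixels -> a 32x32 grid of integer token ids from an 8192-entry codebook
    z = torch.argmax(enc(x), dim=1)
    print(z.shape)  # expected: torch.Size([1, 32, 32]) -- this grid is the "set of numbers"

    # Decode: tokens -> reconstructed image; most content survives, fine detail is lossy
    z_onehot = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()
    x_rec = unmap_pixels(torch.sigmoid(dec(z_onehot).float()[:, :3]))
    TF.to_pil_image(x_rec[0]).save("reconstruction.png")

It's that 32x32 token grid, not raw pixels, that a text-to-image transformer like DALL-E is trained to predict from a caption.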