April 2022 - cypherpunks - lists.cpunks.org

Julian and Don Jnr - a love story
by professor rat 09 Apr '22

09 Apr '22

WikiLeaks encouraged Donald Trump's campaign to contest the results of the 2016 election if he lost to Hillary Clinton, according to private messages sent to the then-candidate's eldest son, Donald Trump, Jr. https://www.businessinsider.com/wikileaks-urged-trump-to-contest-the-2016-e… WikiLeaks told Donald Trump Jr. on Election Day that it would be "much more interesting" if his father challenged the election results if he lost to Hillary Clinton, instead of conceding victory

1 0

Communist Party affiliated All-Workers Militant Front attack against anarchists
by professor rat 09 Apr '22

09 Apr '22

Greek Communist Party affiliated All-Workers Militant Front attack against anarchists https://anarchistnews.org/content/brief-update-greek-communist-party-affili… First organized CHEKIST operations were against anarchists - 104 years ago. http://www.wsm.ie/c/cheka-during-russian-revolution

1 0

Adam Backstabber plans to destroy planet
by professor rat 09 Apr '22

09 Apr '22

Tesla, Blockstream and Block Break Ground on All-Solar Bitcoin Mining Facility in Texas. However worthy the ostensible cause any collaboration with Elon Drax, sorry, Musk ( or the NASA, for that matter ) increases the risk to wilderness planet Mars. This fascist aggression will not stand, man.

1 0

DALI was a fascist
by professor rat 09 Apr '22

09 Apr '22

Coderman, sorry, Dali was allegedly an admirer of Hitler. Later, when the Spanish dictator, Francisco Franco, ruled Spain, despite his ruthlessness towards the common man, Dali maintained affable relations with him. Dali constantly affirmed his apolitical stance but his paintings and actions did not match with his statements. George Orwell wrote an essay on him and ruminated on a question and called Dali “a disgusting human being” In 1975, when General Franco executed many people, hundreds and thousands of fascists gathered in support of Franco, chanting his name and making fascists salutes. When the whole world condemned this appalling act, Dali praised Franco and made him the “greatest hero of Spain.” Over and over again, following high-profile rape scandals and domestic abuse, intellectual thievery and explicit racism, people have asked, hesitant yet hopeful, if it's possible to separate the art from the artist. The subtext of this question, usually outwardly expressed as a kind of philosophical fluffing, is: Can we please just purely enjoy our favorite catchy songs, cool-looking paintings, and well-written sentences without having to think about the suffering their creators engendered? With Dalí—an openly obnoxious man who willfully claimed necrophilia, cruelty to animals and people, fascism, self-obsession, and greed—to do this seems particularly egregious. When Dalí collaborated with Philippe Halsman (who also made a book about Dalí's mustache) to make the iconic Dalí Atomicus photo, the process required 28 attempts, which would have been fine except for the fact that each of those attempts involved throwing three cats into the air and flinging buckets of water at them. I hope Coderman's happy now.

1 0

Ukraine Receives $4 Million In Crypto Donations Within Hours
by professor rat 09 Apr '22

09 Apr '22

Ukraine Receives $4 Million In Crypto Donations Within Hours—Including $1.9 Million Tied To Pak, Julian Assange NFT Collection a single donation worth nearly $1.9 million, Elliptic Chief Scientist Tom Robinson wrote in a Saturday email, adding that wallet addresses indicate the donation was associated with an auction of non-fungible tokens that helped raise funds for Australian WikiLeaks founder Julian Assange https://www.forbes.com/sites/jonathanponciano/2022/02/26/ukraine-receives-4… I think its great if Camp Assmange are tilting towards Ukraine - however much it embarrasses and wrong-foots their rusted-on pro-Chekist supporters who are all extreme morons anyway.

1 0

Ukrainian Nazi Army? - very Assange
by professor rat 09 Apr '22

09 Apr '22

Israel Shamir is the worst Nazi propagandist since Julius Streicher https://www.unz.com/ishamir/the-long-captivity-of-julian-assange/ WikiLeaks's tweet - "We contacted Julian Assange in prison ... We contacted Julian Assange in prison. He says that everyone in #Ukraine should install @BriarApp NOW before the internet goes down.

1 0

A damning indictment of the hack journalist-politician, Julian Israel Shamir Assmange
by professor rat 09 Apr '22

09 Apr '22

A damning indictment that nearly twenty years on, virtually no one responsible for alleged US war crimes committed in the course of the Afghanistan and Iraq wars has been held accountable So what is Wikileaks position on this? JULIAN " Being guilty of aggravated rape has nothing to do with our party" A. Broinowski. Wikileaks Party NSW candidate for au Senate. 2013. "... Mr Assange agreed that some level of privacy was necessary for the successful operation of the military ..." http://www.abc.net.au/radionational/programs/drive/assange-i-can-rule-from-… JULIAN " ‘[The military] protects the sovereignty of Australia. It protects the independence of Australia.' ASSANGE July, 2013

1 0

Re: cypherpunks Digest, Vol 106, Issue 95
by Gunnar Larson 09 Apr '22

09 Apr '22

Oh yes he did, he did do it. Gunnar Larson raped him. He had to ... On Fri, Apr 8, 2022, 8:32 PM <cypherpunks-request(a)lists.cpunks.org> wrote: > Send cypherpunks mailing list submissions to > cypherpunks(a)lists.cpunks.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.cpunks.org/mailman/listinfo/cypherpunks > or, via email, send a message with subject or body 'help' to > cypherpunks-request(a)lists.cpunks.org > > You can reach the person managing the list at > cypherpunks-owner(a)lists.cpunks.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of cypherpunks digest..." > > > Today's Topics: > > 1. Re: cypherpunks Digest, Vol 106, Issue 94 (Gunnar Larson) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 8 Apr 2022 20:29:43 -0400 > From: Gunnar Larson <g(a)xny.io> > To: cypherpunks <cypherpunks(a)lists.cpunks.org> > Subject: Re: cypherpunks Digest, Vol 106, Issue 94 > Message-ID: > < > CAPc8xwO4+uLbaR52tWvdyRckPtLWO49uxSHk-boT0HwUkMmUVw(a)mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > Did Gunnar Larson, rape Mr. Mark Zuckerburg? Or, was it fair game? > > Finders keepers? > > On Fri, Apr 8, 2022, 7:56 PM <cypherpunks-request(a)lists.cpunks.org> wrote: > > > Send cypherpunks mailing list submissions to > > cypherpunks(a)lists.cpunks.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > https://lists.cpunks.org/mailman/listinfo/cypherpunks > > or, via email, send a message with subject or body 'help' to > > cypherpunks-request(a)lists.cpunks.org > > > > You can reach the person managing the list at > > cypherpunks-owner(a)lists.cpunks.org > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of cypherpunks digest..." > > > > > > Today's Topics: > > > > 1. Re: cypherpunks Digest, Vol 106, Issue 93 (Gunnar Larson) > > > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > Date: Fri, 8 Apr 2022 19:54:15 -0400 > > From: Gunnar Larson <g(a)xny.io> > > To: cypherpunks <cypherpunks(a)lists.cpunks.org> > > Subject: Re: cypherpunks Digest, Vol 106, Issue 93 > > Message-ID: > > <CAPc8xwPsCK2cA3tT1U-wjuV09T5kc= > > TBMcrzLz46uyNHJXV9cg(a)mail.gmail.com> > > Content-Type: text/plain; charset="utf-8" > > > > At first glance, this was a great article. > > > > On Fri, Apr 8, 2022, 7:52 PM <cypherpunks-request(a)lists.cpunks.org> > wrote: > > > > > Send cypherpunks mailing list submissions to > > > cypherpunks(a)lists.cpunks.org > > > > > > To subscribe or unsubscribe via the World Wide Web, visit > > > https://lists.cpunks.org/mailman/listinfo/cypherpunks > > > or, via email, send a message with subject or body 'help' to > > > cypherpunks-request(a)lists.cpunks.org > > > > > > You can reach the person managing the list at > > > cypherpunks-owner(a)lists.cpunks.org > > > > > > When replying, please edit your Subject line so it is more specific > > > than "Re: Contents of cypherpunks digest..." > > > > > > > > > Today's Topics: > > > > > > 1. Re: DALL-E (coderman) > > > > > > > > > ---------------------------------------------------------------------- > > > > > > Message: 1 > > > Date: Fri, 08 Apr 2022 23:50:53 +0000 > > > From: coderman <coderman(a)protonmail.com> > > > To: coderman <coderman(a)protonmail.com> > > > Cc: "cy\"Cypherpunks" <cypherpunks(a)cpunks.org> > > > Subject: Re: DALL-E > > > Message-ID: > > > > > > > > > <a9WeFGpr9g422W0Uym9aQZyxT6mqWNzNwLsG6yKqqlD4BLpH6NxuARXLOMvBY8IdZF9HMetBKZYGjdH--qJRFZDIWnXdMRQVqr3pmMYVo5I=@ > > > protonmail.com> > > > > > > Content-Type: text/plain; charset="utf-8" > > > > > > DALL·E[1](https://openai.com/blog/dall-e/#fn1) > > > > > > We decided to name our model using a portmanteau of the artist Salvador > > > Dalí and Pixar’s WALL·E. > > > > > > is a 12-billion parameter version of[GPT-3]( > > > https://arxiv.org/abs/2005.14165) trained to generate images from text > > > descriptions, using a dataset of text–image pairs. We’ve found that it > > has > > > a diverse set of capabilities, including creating anthropomorphized > > > versions of animals and objects, combining unrelated concepts in > > plausible > > > ways, rendering text, and applying transformations to existing images. > > > > > > --------------------------------------------------------------- > > > > > > Text prompt > > > an illustration of a baby daikon radish in a tutu walking a dog > > > AI-generated > > > images > > > > > > Edit prompt or view more images > > > Text prompt > > > an armchair in the shape of an avocado. . . . > > > AI-generated > > > images > > > > > > Edit prompt or view more images > > > Text prompt > > > a store front that has the word ‘openai’ written on it. . . . > > > AI-generated > > > images > > > > > > Edit prompt or view more images > > > Text & image > > > prompt > > > the exact same cat on the top as a sketch on the bottom > > > AI-generated > > > images > > > > > > Edit prompt or view more images > > > --------------------------------------------------------------- > > > > > > GPT-3 showed that language can be used to instruct a large neural > network > > > to perform a variety of text generation tasks. [Image GPT]( > > > https://openai.com/blog/image-gpt) showed that the same type of neural > > > network can also be used to generate images with high fidelity. We > extend > > > these findings to show that manipulating visual concepts through > language > > > is now within reach. > > > > > > Overview > > > > > > Like GPT-3, DALL·E is a transformer language model. It receives both > the > > > text and the image as a single stream of data containing up to 1280 > > tokens, > > > and is trained using maximum likelihood to generate all of the tokens, > > one > > > after another.[2](https://openai.com/blog/dall-e/#fn2) > > > > > > A token is any symbol from a discrete vocabulary; for humans, each > > English > > > letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has > > tokens > > > for both text and image concepts. Specifically, each image caption is > > > represented using a maximum of 256 BPE-encoded tokens with a vocabulary > > > size of 16384, and the image is represented using 1024 tokens with a > > > vocabulary size of 8192. > > > > > > The images are preprocessed to 256x256 resolution during training. > > Similar > > > to VQVAE,[14]( > > > > > > https://openai.com/blog/dall-e/#rf14)[15](https://openai.com/blog/dall-e/#r… > > ) > > > each image is compressed to a 32x32 grid of discrete latent codes > using a > > > discrete VAE[10]( > > > > > > https://openai.com/blog/dall-e/#rf10)[11](https://openai.com/blog/dall-e/#r… > > ) > > > that we pretrained using a continuous relaxation.[12]( > > > > > > https://openai.com/blog/dall-e/#rf12)[13](https://openai.com/blog/dall-e/#r… > > ) > > > We found that training using the relaxation obviates the need for an > > > explicit codebook, EMA loss, or tricks like dead code revival, and can > > > scale up to large vocabulary sizes. > > > > > > This training procedure allows DALL·E to not only generate an image > from > > > scratch, but also to regenerate any rectangular region of an existing > > image > > > that extends to the bottom-right corner, in a way that is consistent > with > > > the text prompt. > > > > > > We recognize that work involving generative models has the potential > for > > > significant, broad societal impacts. In the future, we plan to analyze > > how > > > models like DALL·E relate to societal issues like economic impact on > > > certain work processes and professions, the potential for bias in the > > model > > > outputs, and the longer term ethical challenges implied by this > > technology. > > > > > > Capabilities > > > > > > We find that DALL·E is able to create plausible images for a great > > variety > > > of sentences that explore the compositional structure of language. We > > > illustrate this using a series of interactive visuals in the next > > section. > > > The samples shown for each caption in the visuals are obtained by > taking > > > the top 32 of 512 after reranking with [CLIP]( > > > https://openai.com/blog/clip/) but we do not use any manual > > > cherry-picking, aside from the thumbnails and standalone images that > > appear > > > outside.[3](https://openai.com/blog/dall-e/#fn3) > > > > > > Further details provided in [a later section]( > > > https://openai.com/blog/dall-e/#summary) > > > > > > Controlling Attributes > > > > > > We test DALL·E’s ability to modify several of an object’s attributes, > as > > > well as the number of times that it appears. > > > > > > Click to edit text prompt or view more AI-generated images > > > a pentagonal green clock. a green clock in the shape of a pentagon. > > > > > > navigatedownwide > > > a cube made of porcupine. a cube with the texture of a porcupine. > > > > > > navigatedownwide > > > a collection of glasses is sitting on a table > > > > > > navigatedownwide > > > > > > Drawing Multiple Objects > > > > > > Simultaneously controlling multiple objects, their attributes, and > their > > > spatial relationships presents a new challenge. For example, consider > the > > > phrase “a hedgehog wearing a red hat, yellow gloves, blue shirt, and > > green > > > pants.” To correctly interpret this sentence, DALL·E must not only > > > correctly compose each piece of apparel with the animal, but also form > > the > > > associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, > > > green) without mixing them up.[4](https://openai.com/blog/dall-e/#fn4) > > > > > > This task is called variable binding, and has been extensively studied > in > > > the literature.[17]( > > > > > > https://openai.com/blog/dall-e/#rf17)[18](https://openai.com/blog/dall-e/#r… > > > ) > > > > > > We test DALL·E’s ability to do this for relative positioning, stacking > > > objects, and controlling multiple attributes. > > > > > > a small red block sitting on a large green block > > > > > > navigatedownwide > > > a stack of 3 cubes. a red cube is on the top, sitting on a green cube. > > the > > > green cube is in the middle, sitting on a blue cube. the blue cube is > on > > > the bottom. > > > > > > navigatedownwide > > > an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, > > > and yellow pants > > > > > > navigatedownwide > > > > > > While DALL·E does offer some level of controllability over the > attributes > > > and positions of a small number of objects, the success rate can depend > > on > > > how the caption is phrased. As more objects are introduced, DALL·E is > > prone > > > to confusing the associations between the objects and their colors, and > > the > > > success rate decreases sharply. We also note that DALL·E is brittle > with > > > respect to rephrasing of the caption in these scenarios: alternative, > > > semantically equivalent captions often yield no correct > interpretations. > > > > > > Visualizing Perspective and Three-Dimensionality > > > > > > We find that DALL·E also allows for control over the viewpoint of a > scene > > > and the 3D style in which a scene is rendered. > > > > > > an extreme close-up view of a capybara sitting in a field > > > > > > navigatedownwide > > > a capybara made of voxels sitting in a field > > > > > > navigatedownwide > > > > > > To push this further, we test DALL·E’s ability to repeatedly draw the > > head > > > of a well-known figure at each angle from a sequence of equally spaced > > > angles, and find that we can recover a smooth animation of the rotating > > > head. > > > > > > a photograph of a bust of homer > > > > > > navigatedownwide > > > > > > DALL·E appears to be able to apply some types of optical distortions to > > > scenes, as we see with the options “fisheye lens view” and “a spherical > > > panorama.” This motivated us to explore its ability to generate > > reflections. > > > > > > a plain white cube looking at its own reflection in a mirror. a plain > > > white cube gazing at itself in a mirror. > > > > > > navigatedownwide > > > > > > Visualizing Internal and External Structure > > > > > > The samples from the “extreme close-up view” and “x-ray” style led us > to > > > further explore DALL·E’s ability to render internal structure with > > > cross-sectional views, and external structure with macro photographs. > > > > > > a cross-section view of a walnut > > > > > > navigatedownwide > > > a macro photograph of brain coral > > > > > > navigatedownwide > > > > > > Inferring Contextual Details > > > > > > The task of translating text to images is underspecified: a single > > caption > > > generally corresponds to an infinitude of plausible images, so the > image > > is > > > not uniquely determined. For instance, consider the caption “a painting > > of > > > a capybara sitting on a field at sunrise.” Depending on the orientation > > of > > > the capybara, it may be necessary to draw a shadow, though this detail > is > > > never mentioned explicitly. We explore DALL·E’s ability to resolve > > > underspecification in three cases: changing style, setting, and time; > > > drawing the same object in a variety of different situations; and > > > generating an image of an object with specific text written on it. > > > > > > a painting of a capybara sitting in a field at sunrise > > > > > > navigatedownwide > > > a stained glass window with an image of a blue strawberry > > > > > > navigatedownwide > > > a store front that has the word ‘openai’ written on it. a store front > > that > > > has the word ‘openai’ written on it. a store front that has the word > > > ‘openai’ written on it. ‘openai’ store front. > > > > > > navigatedownwide > > > > > > With varying degrees of reliability, DALL·E provides access to a subset > > of > > > the capabilities of a 3D rendering engine via natural language. It can > > > independently control the attributes of a small number of objects, and > > to a > > > limited extent, how many there are, and how they are arranged with > > respect > > > to one another. It can also control the location and angle from which a > > > scene is rendered, and can generate known objects in compliance with > > > precise specifications of angle and lighting conditions. > > > > > > Unlike a 3D rendering engine, whose inputs must be specified > > unambiguously > > > and in complete detail, DALL·E is often able to “fill in the blanks” > when > > > the caption implies that the image must contain a certain detail that > is > > > not explicitly stated. > > > > > > Applications of Preceding Capabilities > > > > > > Next, we explore the use of the preceding capabilities for fashion and > > > interior design. > > > > > > a male mannequin dressed in an orange and black flannel shirt > > > > > > navigatedownwide > > > a female mannequin dressed in a black leather jacket and gold pleated > > skirt > > > > > > navigatedownwide > > > a living room with two white armchairs and a painting of the colosseum. > > > the painting is mounted above a modern fireplace. > > > > > > navigatedownwide > > > a loft bedroom with a white bed next to a nightstand. there is a fish > > tank > > > beside the bed. > > > > > > navigatedownwide > > > > > > Combining Unrelated Concepts > > > > > > The compositional nature of language allows us to put together concepts > > to > > > describe both real and imaginary things. We find that DALL·E also has > the > > > ability to combine disparate ideas to synthesize objects, some of which > > are > > > unlikely to exist in the real world. We explore this ability in two > > > instances: transferring qualities from various concepts to animals, and > > > designing products by taking inspiration from unrelated concepts. > > > > > > a snail made of harp. a snail with the texture of a harp. > > > > > > navigatedownwide > > > an armchair in the shape of an avocado. an armchair imitating an > avocado. > > > > > > navigatedownwide > > > > > > Animal Illustrations > > > > > > In the previous section, we explored DALL·E’s ability to combine > > unrelated > > > concepts when generating images of real-world objects. Here, we explore > > > this ability in the context of art, for three kinds of illustrations: > > > anthropomorphized versions of animals and objects, animal chimeras, and > > > emojis. > > > > > > an illustration of a baby daikon radish in a tutu walking a dog > > > > > > navigatedownwide > > > a professional high quality illustration of a giraffe turtle chimera. a > > > giraffe imitating a turtle. a giraffe made of turtle. > > > > > > navigatedownwide > > > a professional high quality emoji of a lovestruck cup of boba > > > > > > navigatedownwide > > > > > > Zero-Shot Visual Reasoning > > > > > > GPT-3 can be instructed to perform many kinds of tasks solely from a > > > description and a cue to generate the answer supplied in its prompt, > > > without any additional training. For example, when prompted with the > > phrase > > > “here is the sentence ‘a person walking his dog in the park’ translated > > > into French:”, GPT-3 answers “un homme qui promène son chien dans le > > parc.” > > > This capability is called zero-shot reasoning. We find that DALL·E > > extends > > > this capability to the visual domain, and is able to perform several > > kinds > > > of image-to-image translation tasks when prompted in the right way. > > > > > > the exact same cat on the top as a sketch on the bottom > > > > > > navigatedownwide > > > the exact same teapot on the top with ’gpt’ written on it on the bottom > > > > > > navigatedownwide > > > > > > We did not anticipate that this capability would emerge, and made no > > > modifications to the neural network or training procedure to encourage > > it. > > > Motivated by these results, we measure DALL·E’s aptitude for analogical > > > reasoning problems by testing it on Raven’s progressive matrices, a > > visual > > > IQ test that saw widespread use in the 20th century. > > > > > > a sequence of geometric shapes. > > > > > > navigatedownwide > > > > > > Geographic Knowledge > > > > > > We find that DALL·E has learned about geographic facts, landmarks, and > > > neighborhoods. Its knowledge of these concepts is surprisingly precise > in > > > some ways and flawed in others. > > > > > > a photo of the food of china > > > > > > navigatedownwide > > > a photo of alamo square, san francisco, from a street at night > > > > > > navigatedownwide > > > a photo of san francisco’s golden gate bridge > > > > > > navigatedownwide > > > > > > Temporal Knowledge > > > > > > In addition to exploring DALL·E’s knowledge of concepts that vary over > > > space, we also explore its knowledge of concepts that vary over time. > > > > > > a photo of a phone from the 20s > > > > > > navigatedownwide > > > > > > Summary of Approach and Prior Work > > > > > > DALL·E is a simple decoder-only transformer that receives both the text > > > and the image as a single stream of 1280 tokens—256 for the text and > 1024 > > > for the image—and models all of them autoregressively. The attention > mask > > > at each of its 64 self-attention layers allows each image token to > attend > > > to all text tokens. DALL·E uses the standard causal mask for the text > > > tokens, and sparse attention for the image tokens with either a row, > > > column, or convolutional attention pattern, depending on the layer. We > > > provide more details about the architecture and training procedure in > our > > > [paper](https://arxiv.org/abs/2102.12092). > > > > > > Text-to-image synthesis has been an active area of research since the > > > pioneering work of Reed et. al,[1](https://openai.com/blog/dall-e/#rf1 > ) > > > whose approach uses a GAN conditioned on text embeddings. The > embeddings > > > are produced by an encoder pretrained using a contrastive loss, not > > unlike > > > CLIP. StackGAN[3](https://openai.com/blog/dall-e/#rf3) and > > StackGAN++[4]( > > > https://openai.com/blog/dall-e/#rf4) use multi-scale GANs to scale up > > the > > > image resolution and improve visual fidelity. AttnGAN[5]( > > > https://openai.com/blog/dall-e/#rf5) incorporates attention between > the > > > text and image features, and proposes a contrastive text-image feature > > > matching loss as an auxiliary objective. This is interesting to compare > > to > > > our reranking with CLIP, which is done offline. Other work[2]( > > > > > > https://openai.com/blog/dall-e/#rf2)[6](https://openai.com/blog/dall-e/#rf6… > > ) > > > incorporates additional sources of supervision during training to > improve > > > image quality. Finally, work by Nguyen et. al[8]( > > > https://openai.com/blog/dall-e/#rf8) and Cho et. al[9]( > > > https://openai.com/blog/dall-e/#rf9) explores sampling-based > strategies > > > for image generation that leverage pretrained multimodal discriminative > > > models. > > > > > > Similar to the rejection sampling used in [VQVAE-2]( > > > https://arxiv.org/abs/1906.00446) we use [CLIP]( > > > https://openai.com/blog/clip/) to rerank the top 32 of 512 samples for > > > each caption in all of the interactive visuals. This procedure can also > > be > > > seen as a kind of language-guided search[16]( > > > https://openai.com/blog/dall-e/#rf16) and can have a dramatic impact > on > > > sample quality. > > > > > > an illustration of a baby daikon radish in a tutu walking a dog > [caption > > > 1, best 8 of 2048] > > > > > > > > > > > > navigatedownwide--------------------------------------------------------------- > > > > > > Footnotes > > > > > > - > > > > > > We decided to name our model using a portmanteau of the artist Salvador > > > Dalí and Pixar’s WALL·E. [↩︎](https://openai.com/blog/dall-e/#fnref1) > > > > > > - > > > > > > A token is any symbol from a discrete vocabulary; for humans, each > > English > > > letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has > > tokens > > > for both text and image concepts. Specifically, each image caption is > > > represented using a maximum of 256 BPE-encoded tokens with a vocabulary > > > size of 16384, and the image is represented using 1024 tokens with a > > > vocabulary size of 8192. > > > > > > The images are preprocessed to 256x256 resolution during training. > > Similar > > > to VQVAE,[14]( > > > > > > https://openai.com/blog/dall-e/#rf14)[15](https://openai.com/blog/dall-e/#r… > > ) > > > each image is compressed to a 32x32 grid of discrete latent codes > using a > > > discrete VAE[10]( > > > > > > https://openai.com/blog/dall-e/#rf10)[11](https://openai.com/blog/dall-e/#r… > > ) > > > that we pretrained using a continuous relaxation.[12]( > > > > > > https://openai.com/blog/dall-e/#rf12)[13](https://openai.com/blog/dall-e/#r… > > ) > > > We found that training using the relaxation obviates the need for an > > > explicit codebook, EMA loss, or tricks like dead code revival, and can > > > scale up to large vocabulary sizes. [↩︎]( > > > https://openai.com/blog/dall-e/#fnref2) > > > > > > - > > > > > > Further details provided in [a later section]( > > > https://openai.com/blog/dall-e/#summary) [↩︎]( > > > https://openai.com/blog/dall-e/#fnref3) > > > > > > - > > > > > > This task is called variable binding, and has been extensively studied > in > > > the literature.[17]( > > > > > > https://openai.com/blog/dall-e/#rf17)[18](https://openai.com/blog/dall-e/#r… > > ) > > > [↩︎](https://openai.com/blog/dall-e/#fnref4) > > > > > > --------------------------------------------------------------- > > > > > > References > > > > > > - Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H. > > > (2016). “[Generative adversarial text to image synthesis]( > > > https://arxiv.org/abs/1605.05396)”. In ICML 2016. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref1) > > > > > > - Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H. > (2016). > > > “[Learning what and where to draw](https://arxiv.org/abs/1610.02454)”. > > In > > > NIPS 2016. [↩︎](https://openai.com/blog/dall-e/#rfref2) > > > > > > - Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang X., Metaxas, D. > > > (2016). “[StackGAN: Text to photo-realistic image synthesis with > stacked > > > generative adversarial networks](https://arxiv.org/abs/1612.03242)”. > In > > > ICCY 2017. [↩︎](https://openai.com/blog/dall-e/#rfref3) > > > > > > - Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, > D. > > > (2017). “[StackGAN++: realistic image synthesis with stacked generative > > > adversarial networks](https://arxiv.org/abs/1710.10916)”. In IEEE > TPAMI > > > 2018. [↩︎](https://openai.com/blog/dall-e/#rfref4) > > > > > > - Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X. > > > (2017). “[AttnGAN: Fine-grained text to image generation with > attentional > > > generative adversarial networks](https://arxiv.org/abs/1711.10485). > > [↩︎]( > > > https://openai.com/blog/dall-e/#rfref5) > > > > > > - Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., Gao, J. > > > (2019). “[Object-driven text-to-image synthesis via adversarial > > training]( > > > https://arxiv.org/abs/1902.10740)”. In CVPR 2019. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref6) > > > > > > - Koh, J. Y., Baldridge, J., Lee, H., Yang, Y. (2020). “[Text-to-image > > > generation grounded by fine-grained user attention]( > > > https://arxiv.org/abs/2011.03775)”. In WACV 2021. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref7) > > > > > > - Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J. > > (2016). > > > “[Plug & play generative networks: conditional iterative generation of > > > images in latent space](https://arxiv.org/abs/1612.00005). [↩︎]( > > > https://openai.com/blog/dall-e/#rfref8) > > > > > > - Cho, J., Lu, J., Schwen, D., Hajishirzi, H., Kembhavi, A. (2020). > > > “[X-LXMERT: Paint, caption, and answer questions with multi-modal > > > transformers](https://arxiv.org/abs/2009.11278)”. EMNLP 2020. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref9) > > > > > > - Kingma, Diederik P., and Max Welling. “[Auto-encoding variational > > bayes]( > > > https://arxiv.org/abs/1312.6114).” arXiv preprint (2013). [↩︎]( > > > https://openai.com/blog/dall-e/#rfref10a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref10b) > > > > > > - Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. > > “[Stochastic > > > backpropagation and approximate inference in deep generative models]( > > > https://arxiv.org/abs/1401.4082).” arXiv preprint (2014). [↩︎]( > > > https://openai.com/blog/dall-e/#rfref11a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref11b) > > > > > > - Jang, E., Gu, S., Poole, B. (2016). “[Categorical reparametrization > > with > > > Gumbel-softmax](https://arxiv.org/abs/1611.01144)”. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref12a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref12b) > > > > > > - Maddison, C., Mnih, A., Teh, Y. W. (2016). “[The Concrete > distribution: > > > a continuous relaxation of discrete random variables]( > > > https://arxiv.org/abs/1611.00712)”. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref13a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref13b) > > > > > > - van den Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). “[Neural > > discrete > > > representation learning](https://arxiv.org/abs/1711.00937)”. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref14a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref14b) > > > > > > - Razavi, A., van der Oord, A., Vinyals, O. (2019). “[Generating > diverse > > > high-fidelity images with VQ-VAE-2](https://arxiv.org/abs/1906.00446) > ”. > > > [↩︎](https://openai.com/blog/dall-e/#rfref15a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref15b) > > > > > > - Andreas, J., Klein, D., Levine, S. (2017). “[Learning with Latent > > > Language](https://arxiv.org/abs/1711.00482)”. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref16) > > > > > > - Smolensky, P. (1990). “[Tensor product variable binding and the > > > representation of symbolic structures in connectionist systems]( > > > > > > http://www.lscp.net/persons/dupoux/teaching/AT1_2014/papers/Smolensky_1990_… > ) > > ”. > > > [↩︎](https://openai.com/blog/dall-e/#rfref17a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref17b) > > > > > > - Plate, T. (1995). “[Holographic reduced representations: convolution > > > algebra for compositional distributed representations]( > > > https://www.ijcai.org/Proceedings/91-1/Papers/006.pdf)”. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref18a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref18b) > > > > > > - Gayler, R. (1998). “[Multiplicative binding, representation > operators & > > > analogy](http://cogprints.org/502/)”. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref19a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref19b) > > > > > > - Kanerva, P. (1997). “[Fully distributed representations]( > > > http://www.cap-lore.com/RWC97-kanerva.pdf)”. [↩︎]( > > > https://openai.com/blog/dall-e/#rfref20a) [↩︎]( > > > https://openai.com/blog/dall-e/#rfref20b) > > > > > > --------------------------------------------------------------- > > > > > > Authors > > > [Aditya Ramesh](https://openai.com/blog/authors/aditya/)[Mikhail > > Pavlov]( > > > https://openai.com/blog/authors/mikhail/)[Gabriel Goh]( > > > https://openai.com/blog/authors/gabriel/)[Scott Gray]( > > > https://openai.com/blog/authors/scott/) > > > (Primary Authors) > > > [Mark Chen](https://openai.com/blog/authors/mark/)[Rewon Child]( > > > https://openai.com/blog/authors/rewon/)[Vedant Misra]( > > > https://openai.com/blog/authors/vedant/)[Pamela Mishkin]( > > > https://openai.com/blog/authors/pamela/)[Gretchen Krueger]( > > > https://openai.com/blog/authors/gretchen/)[Sandhini Agarwal]( > > > https://openai.com/blog/authors/sandhini/)[Ilya Sutskever]( > > > https://openai.com/blog/authors/ilya/) > > > (Supporting Authors) > > > --------------------------------------------------------------- > > > > > > Filed Under > > > [Research]( > > > > > > https://openai.com/blog/tags/research/)[Milestones](https://openai.com/blog… > > > ) > > > --------------------------------------------------------------- > > > > > > Cover Artwork > > > > > > Justin Jay Wang > > > > > > --------------------------------------------------------------- > > > > > > Acknowledgments > > > > > > Thanks to the following for their feedback on this work and > contributions > > > to this release: Alec Radford, Andrew Mayne, Jeff Clune, Ashley > > Pilipiszyn, > > > Steve Dowling, Jong Wook Kim, Lei Pan, Heewoo Jun, John Schulman, > Michael > > > Tabatowski, Preetum Nakkiran, Jack Clark, Fraser Kelton, Jacob Jackson, > > > Greg Brockman, Wojciech Zaremba, Justin Mao-Jones, David Luan, Shantanu > > > Jain, Prafulla Dhariwal, Sam Altman, Pranav Shyam, Miles Brundage, > Jakub > > > Pachocki, and Ryan Lowe. > > > > > > --------------------------------------------------------------- > > > > > > Contributions > > > > > > Aditya Ramesh was the project lead: he developed the approach, trained > > the > > > models, and wrote most of the blog copy. > > > > > > Aditya Ramesh, Mikhail Pavlov, and Scott Gray worked together to scale > up > > > the model to 12 billion parameters, and designed the infrastructure > used > > to > > > draw samples from the model. > > > > > > Aditya Ramesh, Gabriel Goh, and Justin Jay Wang worked together to > create > > > the interactive visuals for the blog. > > > > > > Mark Chen and Aditya Ramesh created the images for Raven’s Progressives > > > Matrices. > > > > > > Rewon Child and Vedant Misra assisted in writing the blog. > > > > > > Pamela Mishkin, Gretchen Krueger, and Sandhini Agarwal advised on > broader > > > impacts of the work and assisted in writing the blog. > > > > > > Ilya Sutskever oversaw the project and assisted in writing the blog. > > >

1 0

Re: cypherpunks Digest, Vol 106, Issue 94
by Gunnar Larson 09 Apr '22

09 Apr '22

Did Gunnar Larson, rape Mr. Mark Zuckerburg? Or, was it fair game? Finders keepers? On Fri, Apr 8, 2022, 7:56 PM <cypherpunks-request(a)lists.cpunks.org> wrote: > Send cypherpunks mailing list submissions to > cypherpunks(a)lists.cpunks.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.cpunks.org/mailman/listinfo/cypherpunks > or, via email, send a message with subject or body 'help' to > cypherpunks-request(a)lists.cpunks.org > > You can reach the person managing the list at > cypherpunks-owner(a)lists.cpunks.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of cypherpunks digest..." > > > Today's Topics: > > 1. Re: cypherpunks Digest, Vol 106, Issue 93 (Gunnar Larson) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 8 Apr 2022 19:54:15 -0400 > From: Gunnar Larson <g(a)xny.io> > To: cypherpunks <cypherpunks(a)lists.cpunks.org> > Subject: Re: cypherpunks Digest, Vol 106, Issue 93 > Message-ID: > <CAPc8xwPsCK2cA3tT1U-wjuV09T5kc= > TBMcrzLz46uyNHJXV9cg(a)mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > At first glance, this was a great article. > > On Fri, Apr 8, 2022, 7:52 PM <cypherpunks-request(a)lists.cpunks.org> wrote: > > > Send cypherpunks mailing list submissions to > > cypherpunks(a)lists.cpunks.org > > > > To subscribe or unsubscribe via the World Wide Web, visit > > https://lists.cpunks.org/mailman/listinfo/cypherpunks > > or, via email, send a message with subject or body 'help' to > > cypherpunks-request(a)lists.cpunks.org > > > > You can reach the person managing the list at > > cypherpunks-owner(a)lists.cpunks.org > > > > When replying, please edit your Subject line so it is more specific > > than "Re: Contents of cypherpunks digest..." > > > > > > Today's Topics: > > > > 1. Re: DALL-E (coderman) > > > > > > ---------------------------------------------------------------------- > > > > Message: 1 > > Date: Fri, 08 Apr 2022 23:50:53 +0000 > > From: coderman <coderman(a)protonmail.com> > > To: coderman <coderman(a)protonmail.com> > > Cc: "cy\"Cypherpunks" <cypherpunks(a)cpunks.org> > > Subject: Re: DALL-E > > Message-ID: > > > > > <a9WeFGpr9g422W0Uym9aQZyxT6mqWNzNwLsG6yKqqlD4BLpH6NxuARXLOMvBY8IdZF9HMetBKZYGjdH--qJRFZDIWnXdMRQVqr3pmMYVo5I=@ > > protonmail.com> > > > > Content-Type: text/plain; charset="utf-8" > > > > DALL·E[1](https://openai.com/blog/dall-e/#fn1) > > > > We decided to name our model using a portmanteau of the artist Salvador > > Dalí and Pixar’s WALL·E. > > > > is a 12-billion parameter version of[GPT-3]( > > https://arxiv.org/abs/2005.14165) trained to generate images from text > > descriptions, using a dataset of text–image pairs. We’ve found that it > has > > a diverse set of capabilities, including creating anthropomorphized > > versions of animals and objects, combining unrelated concepts in > plausible > > ways, rendering text, and applying transformations to existing images. > > > > --------------------------------------------------------------- > > > > Text prompt > > an illustration of a baby daikon radish in a tutu walking a dog > > AI-generated > > images > > > > Edit prompt or view more images > > Text prompt > > an armchair in the shape of an avocado. . . . > > AI-generated > > images > > > > Edit prompt or view more images > > Text prompt > > a store front that has the word ‘openai’ written on it. . . . > > AI-generated > > images > > > > Edit prompt or view more images > > Text & image > > prompt > > the exact same cat on the top as a sketch on the bottom > > AI-generated > > images > > > > Edit prompt or view more images > > --------------------------------------------------------------- > > > > GPT-3 showed that language can be used to instruct a large neural network > > to perform a variety of text generation tasks. [Image GPT]( > > https://openai.com/blog/image-gpt) showed that the same type of neural > > network can also be used to generate images with high fidelity. We extend > > these findings to show that manipulating visual concepts through language > > is now within reach. > > > > Overview > > > > Like GPT-3, DALL·E is a transformer language model. It receives both the > > text and the image as a single stream of data containing up to 1280 > tokens, > > and is trained using maximum likelihood to generate all of the tokens, > one > > after another.[2](https://openai.com/blog/dall-e/#fn2) > > > > A token is any symbol from a discrete vocabulary; for humans, each > English > > letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has > tokens > > for both text and image concepts. Specifically, each image caption is > > represented using a maximum of 256 BPE-encoded tokens with a vocabulary > > size of 16384, and the image is represented using 1024 tokens with a > > vocabulary size of 8192. > > > > The images are preprocessed to 256x256 resolution during training. > Similar > > to VQVAE,[14]( > > > https://openai.com/blog/dall-e/#rf14)[15](https://openai.com/blog/dall-e/#r… > ) > > each image is compressed to a 32x32 grid of discrete latent codes using a > > discrete VAE[10]( > > > https://openai.com/blog/dall-e/#rf10)[11](https://openai.com/blog/dall-e/#r… > ) > > that we pretrained using a continuous relaxation.[12]( > > > https://openai.com/blog/dall-e/#rf12)[13](https://openai.com/blog/dall-e/#r… > ) > > We found that training using the relaxation obviates the need for an > > explicit codebook, EMA loss, or tricks like dead code revival, and can > > scale up to large vocabulary sizes. > > > > This training procedure allows DALL·E to not only generate an image from > > scratch, but also to regenerate any rectangular region of an existing > image > > that extends to the bottom-right corner, in a way that is consistent with > > the text prompt. > > > > We recognize that work involving generative models has the potential for > > significant, broad societal impacts. In the future, we plan to analyze > how > > models like DALL·E relate to societal issues like economic impact on > > certain work processes and professions, the potential for bias in the > model > > outputs, and the longer term ethical challenges implied by this > technology. > > > > Capabilities > > > > We find that DALL·E is able to create plausible images for a great > variety > > of sentences that explore the compositional structure of language. We > > illustrate this using a series of interactive visuals in the next > section. > > The samples shown for each caption in the visuals are obtained by taking > > the top 32 of 512 after reranking with [CLIP]( > > https://openai.com/blog/clip/) but we do not use any manual > > cherry-picking, aside from the thumbnails and standalone images that > appear > > outside.[3](https://openai.com/blog/dall-e/#fn3) > > > > Further details provided in [a later section]( > > https://openai.com/blog/dall-e/#summary) > > > > Controlling Attributes > > > > We test DALL·E’s ability to modify several of an object’s attributes, as > > well as the number of times that it appears. > > > > Click to edit text prompt or view more AI-generated images > > a pentagonal green clock. a green clock in the shape of a pentagon. > > > > navigatedownwide > > a cube made of porcupine. a cube with the texture of a porcupine. > > > > navigatedownwide > > a collection of glasses is sitting on a table > > > > navigatedownwide > > > > Drawing Multiple Objects > > > > Simultaneously controlling multiple objects, their attributes, and their > > spatial relationships presents a new challenge. For example, consider the > > phrase “a hedgehog wearing a red hat, yellow gloves, blue shirt, and > green > > pants.” To correctly interpret this sentence, DALL·E must not only > > correctly compose each piece of apparel with the animal, but also form > the > > associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, > > green) without mixing them up.[4](https://openai.com/blog/dall-e/#fn4) > > > > This task is called variable binding, and has been extensively studied in > > the literature.[17]( > > > https://openai.com/blog/dall-e/#rf17)[18](https://openai.com/blog/dall-e/#r… > > ) > > > > We test DALL·E’s ability to do this for relative positioning, stacking > > objects, and controlling multiple attributes. > > > > a small red block sitting on a large green block > > > > navigatedownwide > > a stack of 3 cubes. a red cube is on the top, sitting on a green cube. > the > > green cube is in the middle, sitting on a blue cube. the blue cube is on > > the bottom. > > > > navigatedownwide > > an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, > > and yellow pants > > > > navigatedownwide > > > > While DALL·E does offer some level of controllability over the attributes > > and positions of a small number of objects, the success rate can depend > on > > how the caption is phrased. As more objects are introduced, DALL·E is > prone > > to confusing the associations between the objects and their colors, and > the > > success rate decreases sharply. We also note that DALL·E is brittle with > > respect to rephrasing of the caption in these scenarios: alternative, > > semantically equivalent captions often yield no correct interpretations. > > > > Visualizing Perspective and Three-Dimensionality > > > > We find that DALL·E also allows for control over the viewpoint of a scene > > and the 3D style in which a scene is rendered. > > > > an extreme close-up view of a capybara sitting in a field > > > > navigatedownwide > > a capybara made of voxels sitting in a field > > > > navigatedownwide > > > > To push this further, we test DALL·E’s ability to repeatedly draw the > head > > of a well-known figure at each angle from a sequence of equally spaced > > angles, and find that we can recover a smooth animation of the rotating > > head. > > > > a photograph of a bust of homer > > > > navigatedownwide > > > > DALL·E appears to be able to apply some types of optical distortions to > > scenes, as we see with the options “fisheye lens view” and “a spherical > > panorama.” This motivated us to explore its ability to generate > reflections. > > > > a plain white cube looking at its own reflection in a mirror. a plain > > white cube gazing at itself in a mirror. > > > > navigatedownwide > > > > Visualizing Internal and External Structure > > > > The samples from the “extreme close-up view” and “x-ray” style led us to > > further explore DALL·E’s ability to render internal structure with > > cross-sectional views, and external structure with macro photographs. > > > > a cross-section view of a walnut > > > > navigatedownwide > > a macro photograph of brain coral > > > > navigatedownwide > > > > Inferring Contextual Details > > > > The task of translating text to images is underspecified: a single > caption > > generally corresponds to an infinitude of plausible images, so the image > is > > not uniquely determined. For instance, consider the caption “a painting > of > > a capybara sitting on a field at sunrise.” Depending on the orientation > of > > the capybara, it may be necessary to draw a shadow, though this detail is > > never mentioned explicitly. We explore DALL·E’s ability to resolve > > underspecification in three cases: changing style, setting, and time; > > drawing the same object in a variety of different situations; and > > generating an image of an object with specific text written on it. > > > > a painting of a capybara sitting in a field at sunrise > > > > navigatedownwide > > a stained glass window with an image of a blue strawberry > > > > navigatedownwide > > a store front that has the word ‘openai’ written on it. a store front > that > > has the word ‘openai’ written on it. a store front that has the word > > ‘openai’ written on it. ‘openai’ store front. > > > > navigatedownwide > > > > With varying degrees of reliability, DALL·E provides access to a subset > of > > the capabilities of a 3D rendering engine via natural language. It can > > independently control the attributes of a small number of objects, and > to a > > limited extent, how many there are, and how they are arranged with > respect > > to one another. It can also control the location and angle from which a > > scene is rendered, and can generate known objects in compliance with > > precise specifications of angle and lighting conditions. > > > > Unlike a 3D rendering engine, whose inputs must be specified > unambiguously > > and in complete detail, DALL·E is often able to “fill in the blanks” when > > the caption implies that the image must contain a certain detail that is > > not explicitly stated. > > > > Applications of Preceding Capabilities > > > > Next, we explore the use of the preceding capabilities for fashion and > > interior design. > > > > a male mannequin dressed in an orange and black flannel shirt > > > > navigatedownwide > > a female mannequin dressed in a black leather jacket and gold pleated > skirt > > > > navigatedownwide > > a living room with two white armchairs and a painting of the colosseum. > > the painting is mounted above a modern fireplace. > > > > navigatedownwide > > a loft bedroom with a white bed next to a nightstand. there is a fish > tank > > beside the bed. > > > > navigatedownwide > > > > Combining Unrelated Concepts > > > > The compositional nature of language allows us to put together concepts > to > > describe both real and imaginary things. We find that DALL·E also has the > > ability to combine disparate ideas to synthesize objects, some of which > are > > unlikely to exist in the real world. We explore this ability in two > > instances: transferring qualities from various concepts to animals, and > > designing products by taking inspiration from unrelated concepts. > > > > a snail made of harp. a snail with the texture of a harp. > > > > navigatedownwide > > an armchair in the shape of an avocado. an armchair imitating an avocado. > > > > navigatedownwide > > > > Animal Illustrations > > > > In the previous section, we explored DALL·E’s ability to combine > unrelated > > concepts when generating images of real-world objects. Here, we explore > > this ability in the context of art, for three kinds of illustrations: > > anthropomorphized versions of animals and objects, animal chimeras, and > > emojis. > > > > an illustration of a baby daikon radish in a tutu walking a dog > > > > navigatedownwide > > a professional high quality illustration of a giraffe turtle chimera. a > > giraffe imitating a turtle. a giraffe made of turtle. > > > > navigatedownwide > > a professional high quality emoji of a lovestruck cup of boba > > > > navigatedownwide > > > > Zero-Shot Visual Reasoning > > > > GPT-3 can be instructed to perform many kinds of tasks solely from a > > description and a cue to generate the answer supplied in its prompt, > > without any additional training. For example, when prompted with the > phrase > > “here is the sentence ‘a person walking his dog in the park’ translated > > into French:”, GPT-3 answers “un homme qui promène son chien dans le > parc.” > > This capability is called zero-shot reasoning. We find that DALL·E > extends > > this capability to the visual domain, and is able to perform several > kinds > > of image-to-image translation tasks when prompted in the right way. > > > > the exact same cat on the top as a sketch on the bottom > > > > navigatedownwide > > the exact same teapot on the top with ’gpt’ written on it on the bottom > > > > navigatedownwide > > > > We did not anticipate that this capability would emerge, and made no > > modifications to the neural network or training procedure to encourage > it. > > Motivated by these results, we measure DALL·E’s aptitude for analogical > > reasoning problems by testing it on Raven’s progressive matrices, a > visual > > IQ test that saw widespread use in the 20th century. > > > > a sequence of geometric shapes. > > > > navigatedownwide > > > > Geographic Knowledge > > > > We find that DALL·E has learned about geographic facts, landmarks, and > > neighborhoods. Its knowledge of these concepts is surprisingly precise in > > some ways and flawed in others. > > > > a photo of the food of china > > > > navigatedownwide > > a photo of alamo square, san francisco, from a street at night > > > > navigatedownwide > > a photo of san francisco’s golden gate bridge > > > > navigatedownwide > > > > Temporal Knowledge > > > > In addition to exploring DALL·E’s knowledge of concepts that vary over > > space, we also explore its knowledge of concepts that vary over time. > > > > a photo of a phone from the 20s > > > > navigatedownwide > > > > Summary of Approach and Prior Work > > > > DALL·E is a simple decoder-only transformer that receives both the text > > and the image as a single stream of 1280 tokens—256 for the text and 1024 > > for the image—and models all of them autoregressively. The attention mask > > at each of its 64 self-attention layers allows each image token to attend > > to all text tokens. DALL·E uses the standard causal mask for the text > > tokens, and sparse attention for the image tokens with either a row, > > column, or convolutional attention pattern, depending on the layer. We > > provide more details about the architecture and training procedure in our > > [paper](https://arxiv.org/abs/2102.12092). > > > > Text-to-image synthesis has been an active area of research since the > > pioneering work of Reed et. al,[1](https://openai.com/blog/dall-e/#rf1) > > whose approach uses a GAN conditioned on text embeddings. The embeddings > > are produced by an encoder pretrained using a contrastive loss, not > unlike > > CLIP. StackGAN[3](https://openai.com/blog/dall-e/#rf3) and > StackGAN++[4]( > > https://openai.com/blog/dall-e/#rf4) use multi-scale GANs to scale up > the > > image resolution and improve visual fidelity. AttnGAN[5]( > > https://openai.com/blog/dall-e/#rf5) incorporates attention between the > > text and image features, and proposes a contrastive text-image feature > > matching loss as an auxiliary objective. This is interesting to compare > to > > our reranking with CLIP, which is done offline. Other work[2]( > > > https://openai.com/blog/dall-e/#rf2)[6](https://openai.com/blog/dall-e/#rf6… > ) > > incorporates additional sources of supervision during training to improve > > image quality. Finally, work by Nguyen et. al[8]( > > https://openai.com/blog/dall-e/#rf8) and Cho et. al[9]( > > https://openai.com/blog/dall-e/#rf9) explores sampling-based strategies > > for image generation that leverage pretrained multimodal discriminative > > models. > > > > Similar to the rejection sampling used in [VQVAE-2]( > > https://arxiv.org/abs/1906.00446) we use [CLIP]( > > https://openai.com/blog/clip/) to rerank the top 32 of 512 samples for > > each caption in all of the interactive visuals. This procedure can also > be > > seen as a kind of language-guided search[16]( > > https://openai.com/blog/dall-e/#rf16) and can have a dramatic impact on > > sample quality. > > > > an illustration of a baby daikon radish in a tutu walking a dog [caption > > 1, best 8 of 2048] > > > > > > > navigatedownwide--------------------------------------------------------------- > > > > Footnotes > > > > - > > > > We decided to name our model using a portmanteau of the artist Salvador > > Dalí and Pixar’s WALL·E. [↩︎](https://openai.com/blog/dall-e/#fnref1) > > > > - > > > > A token is any symbol from a discrete vocabulary; for humans, each > English > > letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has > tokens > > for both text and image concepts. Specifically, each image caption is > > represented using a maximum of 256 BPE-encoded tokens with a vocabulary > > size of 16384, and the image is represented using 1024 tokens with a > > vocabulary size of 8192. > > > > The images are preprocessed to 256x256 resolution during training. > Similar > > to VQVAE,[14]( > > > https://openai.com/blog/dall-e/#rf14)[15](https://openai.com/blog/dall-e/#r… > ) > > each image is compressed to a 32x32 grid of discrete latent codes using a > > discrete VAE[10]( > > > https://openai.com/blog/dall-e/#rf10)[11](https://openai.com/blog/dall-e/#r… > ) > > that we pretrained using a continuous relaxation.[12]( > > > https://openai.com/blog/dall-e/#rf12)[13](https://openai.com/blog/dall-e/#r… > ) > > We found that training using the relaxation obviates the need for an > > explicit codebook, EMA loss, or tricks like dead code revival, and can > > scale up to large vocabulary sizes. [↩︎]( > > https://openai.com/blog/dall-e/#fnref2) > > > > - > > > > Further details provided in [a later section]( > > https://openai.com/blog/dall-e/#summary) [↩︎]( > > https://openai.com/blog/dall-e/#fnref3) > > > > - > > > > This task is called variable binding, and has been extensively studied in > > the literature.[17]( > > > https://openai.com/blog/dall-e/#rf17)[18](https://openai.com/blog/dall-e/#r… > ) > > [↩︎](https://openai.com/blog/dall-e/#fnref4) > > > > --------------------------------------------------------------- > > > > References > > > > - Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H. > > (2016). “[Generative adversarial text to image synthesis]( > > https://arxiv.org/abs/1605.05396)”. In ICML 2016. [↩︎]( > > https://openai.com/blog/dall-e/#rfref1) > > > > - Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H. (2016). > > “[Learning what and where to draw](https://arxiv.org/abs/1610.02454)”. > In > > NIPS 2016. [↩︎](https://openai.com/blog/dall-e/#rfref2) > > > > - Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang X., Metaxas, D. > > (2016). “[StackGAN: Text to photo-realistic image synthesis with stacked > > generative adversarial networks](https://arxiv.org/abs/1612.03242)”. In > > ICCY 2017. [↩︎](https://openai.com/blog/dall-e/#rfref3) > > > > - Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D. > > (2017). “[StackGAN++: realistic image synthesis with stacked generative > > adversarial networks](https://arxiv.org/abs/1710.10916)”. In IEEE TPAMI > > 2018. [↩︎](https://openai.com/blog/dall-e/#rfref4) > > > > - Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X. > > (2017). “[AttnGAN: Fine-grained text to image generation with attentional > > generative adversarial networks](https://arxiv.org/abs/1711.10485). > [↩︎]( > > https://openai.com/blog/dall-e/#rfref5) > > > > - Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., Gao, J. > > (2019). “[Object-driven text-to-image synthesis via adversarial > training]( > > https://arxiv.org/abs/1902.10740)”. In CVPR 2019. [↩︎]( > > https://openai.com/blog/dall-e/#rfref6) > > > > - Koh, J. Y., Baldridge, J., Lee, H., Yang, Y. (2020). “[Text-to-image > > generation grounded by fine-grained user attention]( > > https://arxiv.org/abs/2011.03775)”. In WACV 2021. [↩︎]( > > https://openai.com/blog/dall-e/#rfref7) > > > > - Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J. > (2016). > > “[Plug & play generative networks: conditional iterative generation of > > images in latent space](https://arxiv.org/abs/1612.00005). [↩︎]( > > https://openai.com/blog/dall-e/#rfref8) > > > > - Cho, J., Lu, J., Schwen, D., Hajishirzi, H., Kembhavi, A. (2020). > > “[X-LXMERT: Paint, caption, and answer questions with multi-modal > > transformers](https://arxiv.org/abs/2009.11278)”. EMNLP 2020. [↩︎]( > > https://openai.com/blog/dall-e/#rfref9) > > > > - Kingma, Diederik P., and Max Welling. “[Auto-encoding variational > bayes]( > > https://arxiv.org/abs/1312.6114).” arXiv preprint (2013). [↩︎]( > > https://openai.com/blog/dall-e/#rfref10a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref10b) > > > > - Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. > “[Stochastic > > backpropagation and approximate inference in deep generative models]( > > https://arxiv.org/abs/1401.4082).” arXiv preprint (2014). [↩︎]( > > https://openai.com/blog/dall-e/#rfref11a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref11b) > > > > - Jang, E., Gu, S., Poole, B. (2016). “[Categorical reparametrization > with > > Gumbel-softmax](https://arxiv.org/abs/1611.01144)”. [↩︎]( > > https://openai.com/blog/dall-e/#rfref12a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref12b) > > > > - Maddison, C., Mnih, A., Teh, Y. W. (2016). “[The Concrete distribution: > > a continuous relaxation of discrete random variables]( > > https://arxiv.org/abs/1611.00712)”. [↩︎]( > > https://openai.com/blog/dall-e/#rfref13a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref13b) > > > > - van den Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). “[Neural > discrete > > representation learning](https://arxiv.org/abs/1711.00937)”. [↩︎]( > > https://openai.com/blog/dall-e/#rfref14a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref14b) > > > > - Razavi, A., van der Oord, A., Vinyals, O. (2019). “[Generating diverse > > high-fidelity images with VQ-VAE-2](https://arxiv.org/abs/1906.00446)”. > > [↩︎](https://openai.com/blog/dall-e/#rfref15a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref15b) > > > > - Andreas, J., Klein, D., Levine, S. (2017). “[Learning with Latent > > Language](https://arxiv.org/abs/1711.00482)”. [↩︎]( > > https://openai.com/blog/dall-e/#rfref16) > > > > - Smolensky, P. (1990). “[Tensor product variable binding and the > > representation of symbolic structures in connectionist systems]( > > > http://www.lscp.net/persons/dupoux/teaching/AT1_2014/papers/Smolensky_1990_…) > ”. > > [↩︎](https://openai.com/blog/dall-e/#rfref17a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref17b) > > > > - Plate, T. (1995). “[Holographic reduced representations: convolution > > algebra for compositional distributed representations]( > > https://www.ijcai.org/Proceedings/91-1/Papers/006.pdf)”. [↩︎]( > > https://openai.com/blog/dall-e/#rfref18a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref18b) > > > > - Gayler, R. (1998). “[Multiplicative binding, representation operators & > > analogy](http://cogprints.org/502/)”. [↩︎]( > > https://openai.com/blog/dall-e/#rfref19a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref19b) > > > > - Kanerva, P. (1997). “[Fully distributed representations]( > > http://www.cap-lore.com/RWC97-kanerva.pdf)”. [↩︎]( > > https://openai.com/blog/dall-e/#rfref20a) [↩︎]( > > https://openai.com/blog/dall-e/#rfref20b) > > > > --------------------------------------------------------------- > > > > Authors > > [Aditya Ramesh](https://openai.com/blog/authors/aditya/)[Mikhail > Pavlov]( > > https://openai.com/blog/authors/mikhail/)[Gabriel Goh]( > > https://openai.com/blog/authors/gabriel/)[Scott Gray]( > > https://openai.com/blog/authors/scott/) > > (Primary Authors) > > [Mark Chen](https://openai.com/blog/authors/mark/)[Rewon Child]( > > https://openai.com/blog/authors/rewon/)[Vedant Misra]( > > https://openai.com/blog/authors/vedant/)[Pamela Mishkin]( > > https://openai.com/blog/authors/pamela/)[Gretchen Krueger]( > > https://openai.com/blog/authors/gretchen/)[Sandhini Agarwal]( > > https://openai.com/blog/authors/sandhini/)[Ilya Sutskever]( > > https://openai.com/blog/authors/ilya/) > > (Supporting Authors) > > --------------------------------------------------------------- > > > > Filed Under > > [Research]( > > > https://openai.com/blog/tags/research/)[Milestones](https://openai.com/blog… > > ) > > --------------------------------------------------------------- > > > > Cover Artwork > > > > Justin Jay Wang > > > > --------------------------------------------------------------- > > > > Acknowledgments > > > > Thanks to the following for their feedback on this work and contributions > > to this release: Alec Radford, Andrew Mayne, Jeff Clune, Ashley > Pilipiszyn, > > Steve Dowling, Jong Wook Kim, Lei Pan, Heewoo Jun, John Schulman, Michael > > Tabatowski, Preetum Nakkiran, Jack Clark, Fraser Kelton, Jacob Jackson, > > Greg Brockman, Wojciech Zaremba, Justin Mao-Jones, David Luan, Shantanu > > Jain, Prafulla Dhariwal, Sam Altman, Pranav Shyam, Miles Brundage, Jakub > > Pachocki, and Ryan Lowe. > > > > --------------------------------------------------------------- > > > > Contributions > > > > Aditya Ramesh was the project lead: he developed the approach, trained > the > > models, and wrote most of the blog copy. > > > > Aditya Ramesh, Mikhail Pavlov, and Scott Gray worked together to scale up > > the model to 12 billion parameters, and designed the infrastructure used > to > > draw samples from the model. > > > > Aditya Ramesh, Gabriel Goh, and Justin Jay Wang worked together to create > > the interactive visuals for the blog. > > > > Mark Chen and Aditya Ramesh created the images for Raven’s Progressives > > Matrices. > > > > Rewon Child and Vedant Misra assisted in writing the blog. > > > > Pamela Mishkin, Gretchen Krueger, and Sandhini Agarwal advised on broader > > impacts of the work and assisted in writing the blog. > > > > Ilya Sutskever oversaw the project and assisted in writing the blog. > >

1 0

Re: cypherpunks Digest, Vol 106, Issue 93
by Gunnar Larson 08 Apr '22

08 Apr '22

At first glance, this was a great article. On Fri, Apr 8, 2022, 7:52 PM <cypherpunks-request(a)lists.cpunks.org> wrote: > Send cypherpunks mailing list submissions to > cypherpunks(a)lists.cpunks.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.cpunks.org/mailman/listinfo/cypherpunks > or, via email, send a message with subject or body 'help' to > cypherpunks-request(a)lists.cpunks.org > > You can reach the person managing the list at > cypherpunks-owner(a)lists.cpunks.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of cypherpunks digest..." > > > Today's Topics: > > 1. Re: DALL-E (coderman) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Fri, 08 Apr 2022 23:50:53 +0000 > From: coderman <coderman(a)protonmail.com> > To: coderman <coderman(a)protonmail.com> > Cc: "cy\"Cypherpunks" <cypherpunks(a)cpunks.org> > Subject: Re: DALL-E > Message-ID: > > <a9WeFGpr9g422W0Uym9aQZyxT6mqWNzNwLsG6yKqqlD4BLpH6NxuARXLOMvBY8IdZF9HMetBKZYGjdH--qJRFZDIWnXdMRQVqr3pmMYVo5I=@ > protonmail.com> > > Content-Type: text/plain; charset="utf-8" > > DALL·E[1](https://openai.com/blog/dall-e/#fn1) > > We decided to name our model using a portmanteau of the artist Salvador > Dalí and Pixar’s WALL·E. > > is a 12-billion parameter version of[GPT-3]( > https://arxiv.org/abs/2005.14165) trained to generate images from text > descriptions, using a dataset of text–image pairs. We’ve found that it has > a diverse set of capabilities, including creating anthropomorphized > versions of animals and objects, combining unrelated concepts in plausible > ways, rendering text, and applying transformations to existing images. > > --------------------------------------------------------------- > > Text prompt > an illustration of a baby daikon radish in a tutu walking a dog > AI-generated > images > > Edit prompt or view more images > Text prompt > an armchair in the shape of an avocado. . . . > AI-generated > images > > Edit prompt or view more images > Text prompt > a store front that has the word ‘openai’ written on it. . . . > AI-generated > images > > Edit prompt or view more images > Text & image > prompt > the exact same cat on the top as a sketch on the bottom > AI-generated > images > > Edit prompt or view more images > --------------------------------------------------------------- > > GPT-3 showed that language can be used to instruct a large neural network > to perform a variety of text generation tasks. [Image GPT]( > https://openai.com/blog/image-gpt) showed that the same type of neural > network can also be used to generate images with high fidelity. We extend > these findings to show that manipulating visual concepts through language > is now within reach. > > Overview > > Like GPT-3, DALL·E is a transformer language model. It receives both the > text and the image as a single stream of data containing up to 1280 tokens, > and is trained using maximum likelihood to generate all of the tokens, one > after another.[2](https://openai.com/blog/dall-e/#fn2) > > A token is any symbol from a discrete vocabulary; for humans, each English > letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens > for both text and image concepts. Specifically, each image caption is > represented using a maximum of 256 BPE-encoded tokens with a vocabulary > size of 16384, and the image is represented using 1024 tokens with a > vocabulary size of 8192. > > The images are preprocessed to 256x256 resolution during training. Similar > to VQVAE,[14]( > https://openai.com/blog/dall-e/#rf14)[15](https://openai.com/blog/dall-e/#r…) > each image is compressed to a 32x32 grid of discrete latent codes using a > discrete VAE[10]( > https://openai.com/blog/dall-e/#rf10)[11](https://openai.com/blog/dall-e/#r…) > that we pretrained using a continuous relaxation.[12]( > https://openai.com/blog/dall-e/#rf12)[13](https://openai.com/blog/dall-e/#r…) > We found that training using the relaxation obviates the need for an > explicit codebook, EMA loss, or tricks like dead code revival, and can > scale up to large vocabulary sizes. > > This training procedure allows DALL·E to not only generate an image from > scratch, but also to regenerate any rectangular region of an existing image > that extends to the bottom-right corner, in a way that is consistent with > the text prompt. > > We recognize that work involving generative models has the potential for > significant, broad societal impacts. In the future, we plan to analyze how > models like DALL·E relate to societal issues like economic impact on > certain work processes and professions, the potential for bias in the model > outputs, and the longer term ethical challenges implied by this technology. > > Capabilities > > We find that DALL·E is able to create plausible images for a great variety > of sentences that explore the compositional structure of language. We > illustrate this using a series of interactive visuals in the next section. > The samples shown for each caption in the visuals are obtained by taking > the top 32 of 512 after reranking with [CLIP]( > https://openai.com/blog/clip/) but we do not use any manual > cherry-picking, aside from the thumbnails and standalone images that appear > outside.[3](https://openai.com/blog/dall-e/#fn3) > > Further details provided in [a later section]( > https://openai.com/blog/dall-e/#summary) > > Controlling Attributes > > We test DALL·E’s ability to modify several of an object’s attributes, as > well as the number of times that it appears. > > Click to edit text prompt or view more AI-generated images > a pentagonal green clock. a green clock in the shape of a pentagon. > > navigatedownwide > a cube made of porcupine. a cube with the texture of a porcupine. > > navigatedownwide > a collection of glasses is sitting on a table > > navigatedownwide > > Drawing Multiple Objects > > Simultaneously controlling multiple objects, their attributes, and their > spatial relationships presents a new challenge. For example, consider the > phrase “a hedgehog wearing a red hat, yellow gloves, blue shirt, and green > pants.” To correctly interpret this sentence, DALL·E must not only > correctly compose each piece of apparel with the animal, but also form the > associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, > green) without mixing them up.[4](https://openai.com/blog/dall-e/#fn4) > > This task is called variable binding, and has been extensively studied in > the literature.[17]( > https://openai.com/blog/dall-e/#rf17)[18](https://openai.com/blog/dall-e/#r… > ) > > We test DALL·E’s ability to do this for relative positioning, stacking > objects, and controlling multiple attributes. > > a small red block sitting on a large green block > > navigatedownwide > a stack of 3 cubes. a red cube is on the top, sitting on a green cube. the > green cube is in the middle, sitting on a blue cube. the blue cube is on > the bottom. > > navigatedownwide > an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, > and yellow pants > > navigatedownwide > > While DALL·E does offer some level of controllability over the attributes > and positions of a small number of objects, the success rate can depend on > how the caption is phrased. As more objects are introduced, DALL·E is prone > to confusing the associations between the objects and their colors, and the > success rate decreases sharply. We also note that DALL·E is brittle with > respect to rephrasing of the caption in these scenarios: alternative, > semantically equivalent captions often yield no correct interpretations. > > Visualizing Perspective and Three-Dimensionality > > We find that DALL·E also allows for control over the viewpoint of a scene > and the 3D style in which a scene is rendered. > > an extreme close-up view of a capybara sitting in a field > > navigatedownwide > a capybara made of voxels sitting in a field > > navigatedownwide > > To push this further, we test DALL·E’s ability to repeatedly draw the head > of a well-known figure at each angle from a sequence of equally spaced > angles, and find that we can recover a smooth animation of the rotating > head. > > a photograph of a bust of homer > > navigatedownwide > > DALL·E appears to be able to apply some types of optical distortions to > scenes, as we see with the options “fisheye lens view” and “a spherical > panorama.” This motivated us to explore its ability to generate reflections. > > a plain white cube looking at its own reflection in a mirror. a plain > white cube gazing at itself in a mirror. > > navigatedownwide > > Visualizing Internal and External Structure > > The samples from the “extreme close-up view” and “x-ray” style led us to > further explore DALL·E’s ability to render internal structure with > cross-sectional views, and external structure with macro photographs. > > a cross-section view of a walnut > > navigatedownwide > a macro photograph of brain coral > > navigatedownwide > > Inferring Contextual Details > > The task of translating text to images is underspecified: a single caption > generally corresponds to an infinitude of plausible images, so the image is > not uniquely determined. For instance, consider the caption “a painting of > a capybara sitting on a field at sunrise.” Depending on the orientation of > the capybara, it may be necessary to draw a shadow, though this detail is > never mentioned explicitly. We explore DALL·E’s ability to resolve > underspecification in three cases: changing style, setting, and time; > drawing the same object in a variety of different situations; and > generating an image of an object with specific text written on it. > > a painting of a capybara sitting in a field at sunrise > > navigatedownwide > a stained glass window with an image of a blue strawberry > > navigatedownwide > a store front that has the word ‘openai’ written on it. a store front that > has the word ‘openai’ written on it. a store front that has the word > ‘openai’ written on it. ‘openai’ store front. > > navigatedownwide > > With varying degrees of reliability, DALL·E provides access to a subset of > the capabilities of a 3D rendering engine via natural language. It can > independently control the attributes of a small number of objects, and to a > limited extent, how many there are, and how they are arranged with respect > to one another. It can also control the location and angle from which a > scene is rendered, and can generate known objects in compliance with > precise specifications of angle and lighting conditions. > > Unlike a 3D rendering engine, whose inputs must be specified unambiguously > and in complete detail, DALL·E is often able to “fill in the blanks” when > the caption implies that the image must contain a certain detail that is > not explicitly stated. > > Applications of Preceding Capabilities > > Next, we explore the use of the preceding capabilities for fashion and > interior design. > > a male mannequin dressed in an orange and black flannel shirt > > navigatedownwide > a female mannequin dressed in a black leather jacket and gold pleated skirt > > navigatedownwide > a living room with two white armchairs and a painting of the colosseum. > the painting is mounted above a modern fireplace. > > navigatedownwide > a loft bedroom with a white bed next to a nightstand. there is a fish tank > beside the bed. > > navigatedownwide > > Combining Unrelated Concepts > > The compositional nature of language allows us to put together concepts to > describe both real and imaginary things. We find that DALL·E also has the > ability to combine disparate ideas to synthesize objects, some of which are > unlikely to exist in the real world. We explore this ability in two > instances: transferring qualities from various concepts to animals, and > designing products by taking inspiration from unrelated concepts. > > a snail made of harp. a snail with the texture of a harp. > > navigatedownwide > an armchair in the shape of an avocado. an armchair imitating an avocado. > > navigatedownwide > > Animal Illustrations > > In the previous section, we explored DALL·E’s ability to combine unrelated > concepts when generating images of real-world objects. Here, we explore > this ability in the context of art, for three kinds of illustrations: > anthropomorphized versions of animals and objects, animal chimeras, and > emojis. > > an illustration of a baby daikon radish in a tutu walking a dog > > navigatedownwide > a professional high quality illustration of a giraffe turtle chimera. a > giraffe imitating a turtle. a giraffe made of turtle. > > navigatedownwide > a professional high quality emoji of a lovestruck cup of boba > > navigatedownwide > > Zero-Shot Visual Reasoning > > GPT-3 can be instructed to perform many kinds of tasks solely from a > description and a cue to generate the answer supplied in its prompt, > without any additional training. For example, when prompted with the phrase > “here is the sentence ‘a person walking his dog in the park’ translated > into French:”, GPT-3 answers “un homme qui promène son chien dans le parc.” > This capability is called zero-shot reasoning. We find that DALL·E extends > this capability to the visual domain, and is able to perform several kinds > of image-to-image translation tasks when prompted in the right way. > > the exact same cat on the top as a sketch on the bottom > > navigatedownwide > the exact same teapot on the top with ’gpt’ written on it on the bottom > > navigatedownwide > > We did not anticipate that this capability would emerge, and made no > modifications to the neural network or training procedure to encourage it. > Motivated by these results, we measure DALL·E’s aptitude for analogical > reasoning problems by testing it on Raven’s progressive matrices, a visual > IQ test that saw widespread use in the 20th century. > > a sequence of geometric shapes. > > navigatedownwide > > Geographic Knowledge > > We find that DALL·E has learned about geographic facts, landmarks, and > neighborhoods. Its knowledge of these concepts is surprisingly precise in > some ways and flawed in others. > > a photo of the food of china > > navigatedownwide > a photo of alamo square, san francisco, from a street at night > > navigatedownwide > a photo of san francisco’s golden gate bridge > > navigatedownwide > > Temporal Knowledge > > In addition to exploring DALL·E’s knowledge of concepts that vary over > space, we also explore its knowledge of concepts that vary over time. > > a photo of a phone from the 20s > > navigatedownwide > > Summary of Approach and Prior Work > > DALL·E is a simple decoder-only transformer that receives both the text > and the image as a single stream of 1280 tokens—256 for the text and 1024 > for the image—and models all of them autoregressively. The attention mask > at each of its 64 self-attention layers allows each image token to attend > to all text tokens. DALL·E uses the standard causal mask for the text > tokens, and sparse attention for the image tokens with either a row, > column, or convolutional attention pattern, depending on the layer. We > provide more details about the architecture and training procedure in our > [paper](https://arxiv.org/abs/2102.12092). > > Text-to-image synthesis has been an active area of research since the > pioneering work of Reed et. al,[1](https://openai.com/blog/dall-e/#rf1) > whose approach uses a GAN conditioned on text embeddings. The embeddings > are produced by an encoder pretrained using a contrastive loss, not unlike > CLIP. StackGAN[3](https://openai.com/blog/dall-e/#rf3) and StackGAN++[4]( > https://openai.com/blog/dall-e/#rf4) use multi-scale GANs to scale up the > image resolution and improve visual fidelity. AttnGAN[5]( > https://openai.com/blog/dall-e/#rf5) incorporates attention between the > text and image features, and proposes a contrastive text-image feature > matching loss as an auxiliary objective. This is interesting to compare to > our reranking with CLIP, which is done offline. Other work[2]( > https://openai.com/blog/dall-e/#rf2)[6](https://openai.com/blog/dall-e/#rf6…) > incorporates additional sources of supervision during training to improve > image quality. Finally, work by Nguyen et. al[8]( > https://openai.com/blog/dall-e/#rf8) and Cho et. al[9]( > https://openai.com/blog/dall-e/#rf9) explores sampling-based strategies > for image generation that leverage pretrained multimodal discriminative > models. > > Similar to the rejection sampling used in [VQVAE-2]( > https://arxiv.org/abs/1906.00446) we use [CLIP]( > https://openai.com/blog/clip/) to rerank the top 32 of 512 samples for > each caption in all of the interactive visuals. This procedure can also be > seen as a kind of language-guided search[16]( > https://openai.com/blog/dall-e/#rf16) and can have a dramatic impact on > sample quality. > > an illustration of a baby daikon radish in a tutu walking a dog [caption > 1, best 8 of 2048] > > > navigatedownwide--------------------------------------------------------------- > > Footnotes > > - > > We decided to name our model using a portmanteau of the artist Salvador > Dalí and Pixar’s WALL·E. [↩︎](https://openai.com/blog/dall-e/#fnref1) > > - > > A token is any symbol from a discrete vocabulary; for humans, each English > letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens > for both text and image concepts. Specifically, each image caption is > represented using a maximum of 256 BPE-encoded tokens with a vocabulary > size of 16384, and the image is represented using 1024 tokens with a > vocabulary size of 8192. > > The images are preprocessed to 256x256 resolution during training. Similar > to VQVAE,[14]( > https://openai.com/blog/dall-e/#rf14)[15](https://openai.com/blog/dall-e/#r…) > each image is compressed to a 32x32 grid of discrete latent codes using a > discrete VAE[10]( > https://openai.com/blog/dall-e/#rf10)[11](https://openai.com/blog/dall-e/#r…) > that we pretrained using a continuous relaxation.[12]( > https://openai.com/blog/dall-e/#rf12)[13](https://openai.com/blog/dall-e/#r…) > We found that training using the relaxation obviates the need for an > explicit codebook, EMA loss, or tricks like dead code revival, and can > scale up to large vocabulary sizes. [↩︎]( > https://openai.com/blog/dall-e/#fnref2) > > - > > Further details provided in [a later section]( > https://openai.com/blog/dall-e/#summary) [↩︎]( > https://openai.com/blog/dall-e/#fnref3) > > - > > This task is called variable binding, and has been extensively studied in > the literature.[17]( > https://openai.com/blog/dall-e/#rf17)[18](https://openai.com/blog/dall-e/#r…) > [↩︎](https://openai.com/blog/dall-e/#fnref4) > > --------------------------------------------------------------- > > References > > - Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H. > (2016). “[Generative adversarial text to image synthesis]( > https://arxiv.org/abs/1605.05396)”. In ICML 2016. [↩︎]( > https://openai.com/blog/dall-e/#rfref1) > > - Reed, S., Akata, Z., Mohan, S., Tenka, S., Schiele, B., Lee, H. (2016). > “[Learning what and where to draw](https://arxiv.org/abs/1610.02454)”. In > NIPS 2016. [↩︎](https://openai.com/blog/dall-e/#rfref2) > > - Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang X., Metaxas, D. > (2016). “[StackGAN: Text to photo-realistic image synthesis with stacked > generative adversarial networks](https://arxiv.org/abs/1612.03242)”. In > ICCY 2017. [↩︎](https://openai.com/blog/dall-e/#rfref3) > > - Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D. > (2017). “[StackGAN++: realistic image synthesis with stacked generative > adversarial networks](https://arxiv.org/abs/1710.10916)”. In IEEE TPAMI > 2018. [↩︎](https://openai.com/blog/dall-e/#rfref4) > > - Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X. > (2017). “[AttnGAN: Fine-grained text to image generation with attentional > generative adversarial networks](https://arxiv.org/abs/1711.10485). [↩︎]( > https://openai.com/blog/dall-e/#rfref5) > > - Li, W., Zhang, P., Zhang, L., Huang, Q., He, X., Lyu, S., Gao, J. > (2019). “[Object-driven text-to-image synthesis via adversarial training]( > https://arxiv.org/abs/1902.10740)”. In CVPR 2019. [↩︎]( > https://openai.com/blog/dall-e/#rfref6) > > - Koh, J. Y., Baldridge, J., Lee, H., Yang, Y. (2020). “[Text-to-image > generation grounded by fine-grained user attention]( > https://arxiv.org/abs/2011.03775)”. In WACV 2021. [↩︎]( > https://openai.com/blog/dall-e/#rfref7) > > - Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J. (2016). > “[Plug & play generative networks: conditional iterative generation of > images in latent space](https://arxiv.org/abs/1612.00005). [↩︎]( > https://openai.com/blog/dall-e/#rfref8) > > - Cho, J., Lu, J., Schwen, D., Hajishirzi, H., Kembhavi, A. (2020). > “[X-LXMERT: Paint, caption, and answer questions with multi-modal > transformers](https://arxiv.org/abs/2009.11278)”. EMNLP 2020. [↩︎]( > https://openai.com/blog/dall-e/#rfref9) > > - Kingma, Diederik P., and Max Welling. “[Auto-encoding variational bayes]( > https://arxiv.org/abs/1312.6114).” arXiv preprint (2013). [↩︎]( > https://openai.com/blog/dall-e/#rfref10a) [↩︎]( > https://openai.com/blog/dall-e/#rfref10b) > > - Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. “[Stochastic > backpropagation and approximate inference in deep generative models]( > https://arxiv.org/abs/1401.4082).” arXiv preprint (2014). [↩︎]( > https://openai.com/blog/dall-e/#rfref11a) [↩︎]( > https://openai.com/blog/dall-e/#rfref11b) > > - Jang, E., Gu, S., Poole, B. (2016). “[Categorical reparametrization with > Gumbel-softmax](https://arxiv.org/abs/1611.01144)”. [↩︎]( > https://openai.com/blog/dall-e/#rfref12a) [↩︎]( > https://openai.com/blog/dall-e/#rfref12b) > > - Maddison, C., Mnih, A., Teh, Y. W. (2016). “[The Concrete distribution: > a continuous relaxation of discrete random variables]( > https://arxiv.org/abs/1611.00712)”. [↩︎]( > https://openai.com/blog/dall-e/#rfref13a) [↩︎]( > https://openai.com/blog/dall-e/#rfref13b) > > - van den Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). “[Neural discrete > representation learning](https://arxiv.org/abs/1711.00937)”. [↩︎]( > https://openai.com/blog/dall-e/#rfref14a) [↩︎]( > https://openai.com/blog/dall-e/#rfref14b) > > - Razavi, A., van der Oord, A., Vinyals, O. (2019). “[Generating diverse > high-fidelity images with VQ-VAE-2](https://arxiv.org/abs/1906.00446)”. > [↩︎](https://openai.com/blog/dall-e/#rfref15a) [↩︎]( > https://openai.com/blog/dall-e/#rfref15b) > > - Andreas, J., Klein, D., Levine, S. (2017). “[Learning with Latent > Language](https://arxiv.org/abs/1711.00482)”. [↩︎]( > https://openai.com/blog/dall-e/#rfref16) > > - Smolensky, P. (1990). “[Tensor product variable binding and the > representation of symbolic structures in connectionist systems]( > http://www.lscp.net/persons/dupoux/teaching/AT1_2014/papers/Smolensky_1990_…. > [↩︎](https://openai.com/blog/dall-e/#rfref17a) [↩︎]( > https://openai.com/blog/dall-e/#rfref17b) > > - Plate, T. (1995). “[Holographic reduced representations: convolution > algebra for compositional distributed representations]( > https://www.ijcai.org/Proceedings/91-1/Papers/006.pdf)”. [↩︎]( > https://openai.com/blog/dall-e/#rfref18a) [↩︎]( > https://openai.com/blog/dall-e/#rfref18b) > > - Gayler, R. (1998). “[Multiplicative binding, representation operators & > analogy](http://cogprints.org/502/)”. [↩︎]( > https://openai.com/blog/dall-e/#rfref19a) [↩︎]( > https://openai.com/blog/dall-e/#rfref19b) > > - Kanerva, P. (1997). “[Fully distributed representations]( > http://www.cap-lore.com/RWC97-kanerva.pdf)”. [↩︎]( > https://openai.com/blog/dall-e/#rfref20a) [↩︎]( > https://openai.com/blog/dall-e/#rfref20b) > > --------------------------------------------------------------- > > Authors > [Aditya Ramesh](https://openai.com/blog/authors/aditya/)[Mikhail Pavlov]( > https://openai.com/blog/authors/mikhail/)[Gabriel Goh]( > https://openai.com/blog/authors/gabriel/)[Scott Gray]( > https://openai.com/blog/authors/scott/) > (Primary Authors) > [Mark Chen](https://openai.com/blog/authors/mark/)[Rewon Child]( > https://openai.com/blog/authors/rewon/)[Vedant Misra]( > https://openai.com/blog/authors/vedant/)[Pamela Mishkin]( > https://openai.com/blog/authors/pamela/)[Gretchen Krueger]( > https://openai.com/blog/authors/gretchen/)[Sandhini Agarwal]( > https://openai.com/blog/authors/sandhini/)[Ilya Sutskever]( > https://openai.com/blog/authors/ilya/) > (Supporting Authors) > --------------------------------------------------------------- > > Filed Under > [Research]( > https://openai.com/blog/tags/research/)[Milestones](https://openai.com/blog… > ) > --------------------------------------------------------------- > > Cover Artwork > > Justin Jay Wang > > --------------------------------------------------------------- > > Acknowledgments > > Thanks to the following for their feedback on this work and contributions > to this release: Alec Radford, Andrew Mayne, Jeff Clune, Ashley Pilipiszyn, > Steve Dowling, Jong Wook Kim, Lei Pan, Heewoo Jun, John Schulman, Michael > Tabatowski, Preetum Nakkiran, Jack Clark, Fraser Kelton, Jacob Jackson, > Greg Brockman, Wojciech Zaremba, Justin Mao-Jones, David Luan, Shantanu > Jain, Prafulla Dhariwal, Sam Altman, Pranav Shyam, Miles Brundage, Jakub > Pachocki, and Ryan Lowe. > > --------------------------------------------------------------- > > Contributions > > Aditya Ramesh was the project lead: he developed the approach, trained the > models, and wrote most of the blog copy. > > Aditya Ramesh, Mikhail Pavlov, and Scott Gray worked together to scale up > the model to 12 billion parameters, and designed the infrastructure used to > draw samples from the model. > > Aditya Ramesh, Gabriel Goh, and Justin Jay Wang worked together to create > the interactive visuals for the blog. > > Mark Chen and Aditya Ramesh created the images for Raven’s Progressives > Matrices. > > Rewon Child and Vedant Misra assisted in writing the blog. > > Pamela Mishkin, Gretchen Krueger, and Sandhini Agarwal advised on broader > impacts of the work and assisted in writing the blog. > > Ilya Sutskever oversaw the project and assisted in writing the blog. >

1 0