Two days ago, while using my phone, I found a trick that worked with a Discord machine learning bot to guide its images better. The trick no longer works as well after the settings were changed, but it can be used elsewhere. First I did image->text on an image of a face, then used the resulting text as the prompt for text->image, and tweaked things until the output was stable. The problem with image generation with CLIP+VQGAN is that it doesn't understand the broad layout of objects, so this approach helps a lot. Once the output was stable, I could swap words in the prompt to swap concepts in the output. Here are images I made. A male cyborg dryad: https://cdn.discordapp.com/attachments/838682121975234571/879622138980601856... A group of zombies and robots sitting around fires in a forest: https://cdn.discordapp.com/attachments/838682121975234571/879628620249837568...
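
The word-swapping step could be sketched roughly like this in Python. This is only an illustration under my own assumptions: `swap_concepts` and the example prompt are made up here and are not part of the bot or of CLIP+VQGAN. The resulting prompt would still be pasted by hand into whatever text->image tool is in use.

```python
import re


def swap_concepts(prompt: str, replacements: dict[str, str]) -> str:
    """Replace whole words or phrases in a prompt, leaving everything else as-is.

    Hypothetical helper: the idea is to keep the stable prompt's wording
    (and so, hopefully, its layout) while changing only the subject words.
    """
    if not replacements:
        return prompt
    # Try longer keys first so multi-word phrases win over their sub-words.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in replacements.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], prompt)


# Made-up example of a prompt that came out of the image->text step.
base_prompt = "a close-up portrait of a man's face, looking at the camera"

# Swap only the subject; the rest of the stable prompt stays untouched.
new_prompt = swap_concepts(base_prompt, {"man": "cyborg dryad"})
print(new_prompt)
```

Matching on whole words means swapping "man" leaves a word like "woman" alone, so only the intended concept changes.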