Two days ago, while using my phone, I found a trick that worked with a Discord machine learning bot to guide its images better. The trick no longer works as well since the settings were changed, but it can be used elsewhere.

First I did image->text, then used the resulting text for text->image, and tweaked things until the output was stable. The problem with CLIP+VQGAN image generation is that it doesn't understand the broad layout of objects, so this approach helps a lot. Once the output was stable, I could swap words in the text to swap concepts in the output.
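The word-swapping step can be sketched in a few lines of Python. This is only an illustration of the idea, not what the bot does internally: `swap_concepts` and `guided_prompts` are names I made up, the captioner is a stand-in function, and the base caption "a man with short hair" is a hypothetical example rather than the real caption I got.

```python
import re


def swap_concepts(prompt, replacements):
    """Replace whole words or phrases in a prompt, leaving the rest untouched."""
    if not replacements:
        return prompt
    # Try longer keys first so a multi-word phrase wins over a word inside it.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(1)], prompt)


def guided_prompts(image, caption_image, variants):
    """Caption the image once (image->text), then build one prompt per swap set."""
    base = caption_image(image)
    return base, [swap_concepts(base, v) for v in variants]


# Hypothetical stand-in for an image->text model; a real one would go here.
def fake_captioner(_image):
    return "a man with short hair"


base, prompts = guided_prompts("face.jpg", fake_captioner, [{"man": "cyborg tree"}])
print(base)     # a man with short hair
print(prompts)  # ['a cyborg tree with short hair']
```

Each resulting prompt would then go to the text->image step. Matching whole words keeps the rest of the sentence, and so the layout the caption describes, as close to the stable version as possible.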

I was using an image of a face. Here are the images I made.

A male cyborg dryad:

https://cdn.discordapp.com/attachments/838682121975234571/879622138980601856/1629788417_a_cyborg_tree_with_short_hair.jpg

A group of zombies and robots sitting around fires in a forest:

https://cdn.discordapp.com/attachments/838682121975234571/879628620249837568/1629789944_a_group_of_zombies_robots_and_dryads_sitting_around_a_campfire_in_the_middle_of_a_forest.jpg