Two days ago, while using my phone, I found a trick that worked
   with a Discord machine learning bot to guide its images better.  The
   trick no longer works as well since the bot's settings were changed,
   but it can be used elsewhere.
   First I did image->text, then used the resulting text for text->image,
   and tweaked things until the output was stable.  The problem with
   image generation with CLIP+VQGAN is that it doesn't understand the
   broad layout of objects, so this approach helps a lot.  Once the
   output was stable I could replace words in the text to replace
   concepts in the output.
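   The word-replacement step can be sketched in Python.  This is only
   an illustration, not the bot's actual interface: the function name
   swap_concepts and the example prompt are made up here, and the bot
   itself is not called.

```python
import re

def swap_concepts(prompt, replacements):
    """Replace whole words or phrases in a stable prompt.

    `replacements` maps lowercase words/phrases to their substitutes.
    Matching is case-insensitive and only on word boundaries, so
    replacing "man" leaves "woman" alone.
    """
    if not replacements:
        return prompt
    # Try longer keys first so a phrase wins over a word it contains.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: replacements[m.group(0).lower()], prompt)

# Hypothetical stable prompt; not the text the bot actually produced.
stable_prompt = "a portrait of a man with short hair"
print(swap_concepts(stable_prompt, {"man": "cyborg tree"}))
```

   The rest of the prompt is left untouched, which is the point: the
   wording that keeps the layout stable stays, and only the concept
   changes.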
   I was using an image of a face.  Here are images I made.
   A male cyborg dryad:
   [1]https://cdn.discordapp.com/attachments/838682121975234571/879622138980601856/1629788417_a_cyborg_tree_with_short_hair.jpg
   A group of zombies and robots sitting around fires in a forest:
   [2]https://cdn.discordapp.com/attachments/838682121975234571/879628620249837568/1629789944_a_group_of_zombies_robots_and_dryads_sitting_around_a_campfire_in_the_middle_of_a_forest.jpg

References

   1. https://cdn.discordapp.com/attachments/838682121975234571/879622138980601856/1629788417_a_cyborg_tree_with_short_hair.jpg
   2. https://cdn.discordapp.com/attachments/838682121975234571/879628620249837568/1629789944_a_group_of_zombies_robots_and_dryads_sitting_around_a_campfire_in_the_middle_of_a_forest.jpg