Two days ago, while using my phone, I found a trick that worked with a Discord machine learning bot to guide its images better. The trick no longer works as well since the bot's settings were changed, but it can be used elsewhere.
First I did image->text, then used the resulting text for text->image, and tweaked the prompt until the output was stable. The problem with CLIP+VQGAN image generation is that it doesn't understand the broad layout of objects, so starting from a description of a real image helps a lot. Once the output was stable, I could swap words in the prompt to swap concepts in the output.
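A minimal sketch of the word-swapping step, in Python. This only handles the prompt text; the image->text and text->image steps are left to whatever tool is used. The function name `swap_concepts` and the example caption are made up for illustration and are not what the bot produced.

```python
import re

def swap_concepts(prompt, replacements):
    """Replace whole words in a stable prompt, leaving the rest untouched.

    `replacements` maps an existing word (or phrase) to its substitute.
    Matching is case-insensitive and only hits whole words, so swapping
    "man" does not alter "woman".
    """
    if not replacements:
        return prompt
    # Try longer keys first so a phrase wins over a word it contains.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in replacements.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], prompt)

# Hypothetical caption, standing in for the image->text result.
base_prompt = "a portrait of a man with a beard, facing the camera"

# Keep the layout words, change only the subject.
variant = swap_concepts(base_prompt, {"man": "cyborg dryad"})
print(variant)
```

The idea is to keep every word that pins down the layout and change only the subject, then feed the new prompt back into text->image.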
I started from an image of a face. Here are some of the images I made.
A male cyborg dryad:
A group of zombies and robots sitting around fires in a forest: