Two days ago, while using my phone, I found a trick that worked with a Discord machine learning bot to guide its images better.  The trick no longer works as well since the bot's settings were changed, but it can be used elsewhere.

First I did image->text, then fed that text into text->image, and tweaked the text until the output was stable.  The problem with CLIP+VQGAN image generation is that it doesn't understand the broad layout of objects, so starting from a description of a real image helps a lot.  Once the output was stable, I could swap words in the text to swap concepts in the output.
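The word-swapping step can be sketched in a few lines of Python.  This is only an illustration: the prompt text and the `swap_concepts` helper are made up here, and nothing in it talks to the bot or to any real image API.  It just shows the idea of keeping a stable prompt fixed and replacing whole words in it.

```python
import re
from typing import Mapping


def swap_concepts(prompt: str, swaps: Mapping[str, str]) -> str:
    """Replace whole words or phrases in a prompt, leaving the rest untouched.

    Hypothetical helper for illustration; not part of any bot or library.
    """
    if not swaps:
        return prompt
    # Try longer keys first so a multi-word phrase wins over its sub-words.
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )
    lookup = {k.lower(): v for k, v in swaps.items()}
    # \b keeps "man" from matching inside words such as "woman".
    return pattern.sub(lambda m: lookup[m.group(0).lower()], prompt)


# Assumed example of a prompt that already gives a stable layout.
stable_prompt = "a portrait of a man with a beard, looking at the camera"

# Swap one concept; the rest of the prompt keeps anchoring the layout.
variant = swap_concepts(stable_prompt, {"man": "cyborg dryad"})
print(variant)
```

Changing one word at a time makes it easier to tell which change broke the layout if the output stops being stable.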

I was using an image of a face.  Here are the images I made.

A male cyborg dryad:

A group of zombies and robots sitting around fires in a forest: