things i've found without education:
   Training a Model to Make Choices
   - you can backpropagate loss around a decision, by -> weighting the
   loss of each outcome with the likelihood of choosing it, and summing
   them <- (i.e. training on the expected loss)
   then the loss can propagate back into each likelihood, since the
   likelihood is that outcome's weight in the final sum
   you can even do it in random minibatches with small samples from the
   outcome space.
   guessing that policy-gradient rl (e.g. ppo) does something analogous
   this worked for me briefly, and only a little, for automatically
   tuning prompts
   might need some further review or rephrasing (and/or education) to
   refine and reduce inhibition around
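
   a rough sketch of the weighting-and-summing idea above, in plain
   python with the gradient worked out by hand instead of using an
   autodiff library. everything here is made up for illustration: the
   names (softmax, expected_loss_grad, train, loss_of), the toy LOSSES
   table, and the assumption that the choice likelihoods come from a
   softmax over one logit per outcome. the sampled version assumes
   outcomes are drawn uniformly and rescaled by k / n so the estimate of
   the sum stays unbiased; if you sampled by likelihood instead, the
   weighting would have to change.

```python
import math
import random

# Toy per-outcome losses (made up); outcome 1 is the best choice.
LOSSES = [3.0, 1.0, 2.0, 4.0]


def softmax(logits):
    """Turn logits into choice likelihoods that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_loss_grad(logits, loss_of, sample=None):
    """Return (sum_i p_i * L_i, gradient of that sum w.r.t. the logits).

    If `sample` is a list of outcome indices drawn uniformly, only those
    outcomes are evaluated and the sum is rescaled by k / len(sample).
    """
    probs = softmax(logits)
    k = len(logits)
    indices = range(k) if sample is None else sample
    scale = 1.0 if sample is None else k / len(sample)
    estimate = 0.0
    grad = [0.0] * k
    for i in indices:
        weighted = scale * probs[i] * loss_of(i)  # p_i * L_i
        estimate += weighted
        # softmax derivative: d p_i / d logit_j = p_i * (delta_ij - p_j)
        for j in range(k):
            grad[j] += weighted * ((1.0 if i == j else 0.0) - probs[j])
    return estimate, grad


def train(loss_of, k, steps, lr, n_samples=None, seed=0):
    """Plain gradient descent on the logits; returns final likelihoods."""
    rng = random.Random(seed)
    logits = [0.0] * k
    for _ in range(steps):
        sample = None
        if n_samples is not None:
            # small uniform sample (with replacement) of the outcome space
            sample = [rng.randrange(k) for _ in range(n_samples)]
        _, grad = expected_loss_grad(logits, loss_of, sample)
        logits = [x - lr * g for x, g in zip(logits, grad)]
    return softmax(logits)


# Full sum over every outcome.
full = train(LOSSES.__getitem__, k=4, steps=500, lr=0.5)
# Only 2 of the 4 outcomes evaluated per step; smaller steps to tame noise.
sampled = train(LOSSES.__getitem__, k=4, steps=10000, lr=0.02, n_samples=2)
print("full sum:", full)
print("sampled: ", sampled)
```

   in both runs the likelihood should drift toward the lowest-loss
   outcome. the sampled run only calls loss_of on the sampled outcomes,
   which is the point when scoring an outcome is expensive (e.g. running
   a candidate prompt), but it is noisier, so it gets a smaller step
   size here.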