To represent normal goal behavior with maximization, the return function would need to be not only incredibly complex, but also to feed back into its own evaluation, in a way these libraries don't provide for.

Daydreaming: I'm thinking of how, in ordinary life, we have many, many goals going at once (most of them "common sense" and/or "staying a living human"). Similarly, I'm thinking of how normal transformer models are trained against a loss rather than a reward. So what if the interesting event were when an agent _fails_ to meet a goal? Its reward would usually be full, 1.0, but would be multiplied by a loss factor for each goal not met. This seems much nicer to me. A minimal sketch of the idea is below.
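Here's a rough sketch of that multiplicative scheme, assuming each goal can report a satisfaction score in [0, 1]; the names (`multiplicative_reward`, `GoalCheck`, the toy goals) are purely illustrative, not any library's API:

```python
from typing import Callable, Sequence

# Hypothetical goal check: maps a state to a satisfaction score in [0, 1],
# where 1.0 means the goal is fully met.
GoalCheck = Callable[[dict], float]

def multiplicative_reward(state: dict, goals: Sequence[GoalCheck]) -> float:
    """Start from a full reward of 1.0 and multiply in a loss factor
    for every goal that is not met."""
    reward = 1.0
    for goal in goals:
        satisfaction = goal(state)  # 1.0 when met, < 1.0 when missed
        reward *= satisfaction      # each unmet goal shrinks the reward
    return reward

# Toy usage: two "common sense" goals on a made-up state.
if __name__ == "__main__":
    stay_alive = lambda s: 1.0 if s["health"] > 0 else 0.0
    stay_fed = lambda s: min(1.0, s["food"] / 10.0)
    state = {"health": 50, "food": 7}
    print(multiplicative_reward(state, [stay_alive, stay_fed]))  # 0.7
```

The nice property is that the agent sits at 1.0 by default and only loses reward when something goes wrong, so many goals can coexist without any of them needing to be "maximized" individually.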