To represent normal goal behavior with maximization, the return function would need to be not only incredibly complex, but also to feed back into its own evaluation, in a way these libraries don't provide for.

Daydreaming: I'm thinking of how, in ordinary life, we have many, many goals going at once (most of them "common sense" and/or "staying a living human"). Similarly, I'm thinking of how normal transformer models are trained against a loss rather than a reward. So what if the interesting event were when an agent _fails_ to meet a goal? Its reward would usually be full, 1.0, but would be multiplied by a loss factor for each goal not met. This seems much nicer to me. A minimal sketch of the idea is below.
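Here's a rough sketch of that multiplicative scheme, assuming each goal can report a satisfaction score in [0, 1]; the names (`multiplicative_reward`, `GoalCheck`, the toy goals) are purely illustrative, not any library's API:

```python
from typing import Callable, Sequence

# Hypothetical goal check: maps a state to a satisfaction score in [0, 1],
# where 1.0 means the goal is fully met.
GoalCheck = Callable[[dict], float]

def multiplicative_reward(state: dict, goals: Sequence[GoalCheck]) -> float:
    """Start from a full reward of 1.0 and multiply in a loss factor
    for every goal that is not met."""
    reward = 1.0
    for goal in goals:
        satisfaction = goal(state)  # 1.0 when met, < 1.0 when missed
        reward *= satisfaction      # each unmet goal shrinks the reward
    return reward

# Toy usage: two "common sense" goals on a made-up state.
if __name__ == "__main__":
    stay_alive = lambda s: 1.0 if s["health"] > 0 else 0.0
    stay_fed = lambda s: min(1.0, s["food"] / 10.0)
    state = {"health": 50, "food": 7}
    print(multiplicative_reward(state, [stay_alive, stay_fed]))  # 0.7
```

The nice property is that the agent sits at 1.0 by default and only loses reward when something goes wrong, so many goals can coexist without any of them needing to be "maximized" individually.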