[I rebooted x but it didn't boot back up, and I'm sending this from a phone.] The next step here is to put together a logical verification that the implementation of the feature completely negates the memory savings. It seems likely I can do this given the rest of these threads, but if I can't, it makes sense to assume the verification holds anyway and move forward: close the issues and pull request (they could be replaced by one noting the reason for the "feature") and change transformers to disable attention output when chunking is engaged.

Commented on the PR: I've converted this to a draft because I think this "feature" may be ridiculous, making the memory usage O(n^2) again by retaining weights across chunks.
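To make the memory argument concrete, here is a minimal sketch of the reasoning, not transformers' actual code: the function name `chunked_attention` and the `output_attentions` flag are hypothetical stand-ins. It shows that chunked attention only ever needs an O(chunk_size × n) working set per step, but retaining each chunk's weights for the attention output rebuilds the full n × n matrix.

```python
import torch

def chunked_attention(q, k, v, chunk_size, output_attentions=False):
    # q, k, v: (n, d) tensors for a single head.
    n, d = q.shape
    outputs = []
    kept_weights = [] if output_attentions else None
    for start in range(0, n, chunk_size):
        q_chunk = q[start:start + chunk_size]        # (chunk, d)
        scores = q_chunk @ k.T / d ** 0.5            # (chunk, n)
        weights = torch.softmax(scores, dim=-1)      # (chunk, n)
        outputs.append(weights @ v)                  # (chunk, d)
        if output_attentions:
            # Retaining every chunk's weights reassembles the full
            # (n, n) attention matrix -- O(n^2) memory again.
            kept_weights.append(weights)
        # Without retention, `weights` is freed each iteration and the
        # peak extra memory stays at O(chunk_size * n).
    out = torch.cat(outputs)
    attn = torch.cat(kept_weights) if output_attentions else None
    return out, attn
```

If this sketch matches what the implementation does, then disabling attention output whenever chunking is engaged is exactly what preserves the O(chunk_size × n) peak, which is the change to transformers proposed above.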