[I rebooted x but it didn't boot back up, and I'm sending this from a phone.]

The next step for me here is to verify logically that the implementation of this feature completely cancels the memory savings. Given the rest of these threads, that verification seems feasible; if I can't complete it, it makes sense to assume it would hold anyway and move forward: close the issues and pull request (they could be replaced by one noting the reason for the "feature") and change transformers to disable attention output when chunking is engaged.
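To make the quadratic argument concrete before verifying it properly, here's a minimal sketch of chunked attention; the names and shapes are my assumptions, not transformers' actual implementation. Each chunk's score matrix is transient, but retaining the per-chunk probabilities for an attention-weights output reassembles the full n × n matrix:

```python
import torch

def chunked_attention(q, k, v, chunk_size, output_attentions=False):
    """Attention over query chunks; q, k, v are (n, d) tensors."""
    n, d = q.shape
    outputs, kept_probs = [], []
    for start in range(0, n, chunk_size):
        q_chunk = q[start:start + chunk_size]      # (chunk, d)
        scores = q_chunk @ k.T / d ** 0.5          # (chunk, n) -- transient
        probs = scores.softmax(dim=-1)             # (chunk, n) -- transient
        outputs.append(probs @ v)                  # (chunk, d)
        if output_attentions:
            kept_probs.append(probs)               # retained across chunks
    out = torch.cat(outputs)
    attn = torch.cat(kept_probs) if output_attentions else None
    return out, attn                               # attn is (n, n) if kept
```

With `output_attentions=False`, each `(chunk, n)` matrix is freed before the next iteration, so peak memory stays around O(chunk_size · n); with it `True`, `kept_probs` accumulates to the full `(n, n)` matrix, so peak memory is quadratic regardless of chunk size.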

Commented on the PR:

I've converted this to a draft because I think this "feature" may be ridiculous, making the memory usage O(n^2) again by retaining attention weights across chunks.
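On the transformers side, the change I have in mind is a guard that drops attention output while chunking is active. A minimal sketch, assuming a hypothetical `attention_chunk_size` config attribute; only `output_attentions` mirrors a real transformers argument, and the exact integration point is an assumption:

```python
import warnings

def resolve_output_attentions(output_attentions: bool, attention_chunk_size) -> bool:
    """Disable attention-weight output when attention chunking is active.

    `attention_chunk_size` is a hypothetical config attribute; only
    `output_attentions` mirrors a real transformers argument.
    """
    if attention_chunk_size and output_attentions:
        warnings.warn(
            "output_attentions would retain every chunk's attention weights, "
            "restoring O(n^2) memory; disabling it while chunking is engaged."
        )
        return False
    return output_attentions
```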