Amin Rezaei commented on my work on their GitHub and pointed out that the paper says the technique is only useful for models with very large input sizes, which doesn't describe any of the models I added it to. Thinking about it briefly, that's probably down to the size of the O(n^2) data. For example, GPT-J has a total model size of 22GB or so and is trained to predict tokens from text up to 2k tokens long. 2k^2 is about 4M floats, which is much smaller than the total model size. However, when training a model, a larger computation graph can be allocated around each of those floats in order to calculate and store the gradients for backpropagation. It makes sense to run a test here to see what the memory requirements really are, since a usable implementation is readily available now. Or at least to find and understand the part of the paper that says at what sizes the technique is expected to help. It shows how far out in left field I am. But it was also a great opportunity to work near these powerful things and make something.
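
As a starting point for that test, here's a rough back-of-the-envelope sketch of the numbers. The shapes are my own assumptions for a GPT-J-like model (2048-token context, 16 heads, 28 layers, 32-bit floats), and the idea that training keeps every attention matrix around for the backward pass is a simplification, not something from the paper:

```python
# Back-of-the-envelope estimate of how big the O(n^2) attention matrices get.
# All shapes below are my own assumptions for a GPT-J-like model, not figures
# quoted from the paper.

seq_len = 2048        # context length
n_heads = 16          # attention heads per layer (assumed)
n_layers = 28         # transformer layers (assumed)
bytes_per_float = 4   # 32-bit floats

# One n x n attention matrix for a single head: the "4M floats" from above.
one_matrix = seq_len * seq_len * bytes_per_float
print(f"one head's attention matrix: {one_matrix / 2**20:.0f} MiB")

# If training keeps every head's matrix in every layer around for the
# backward pass (a crude assumption; frameworks differ), it multiplies out:
all_matrices = one_matrix * n_heads * n_layers
print(f"all heads, all layers, batch size 1: {all_matrices / 2**30:.1f} GiB")

# Gradients for those activations roughly double that again, and batch size
# multiplies everything, so the O(n^2) term stops looking small in training.
```

So a single attention matrix really is tiny next to 22GB of weights, but once heads, layers, gradients, and batch size stack up, the quadratic term is exactly the part that blows up, which fits the comment about the technique only paying off at much larger input sizes.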