Amin Rezaei commented on my work on their GitHub, pointing out that the paper advises the technique is only useful for models with very large input sizes, which doesn't describe any of the models I added it to.

Thinking briefly about that, it could be because of how the O(n^2) data compares to everything else in memory. For example, GPT-J has a total model size of roughly 22GB and is trained to predict tokens from text up to 2k tokens long. 2k^2 is about 4M floats, around 16MB at single precision, which is much smaller than the total model size.
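
To sanity-check that arithmetic, here's the back-of-envelope calculation written out as a tiny script. The 22GB figure is just the rough model size mentioned above, and this only counts a single seq x seq score matrix, not one per head or per layer:

```python
# Back-of-envelope: one seq_len x seq_len attention score matrix vs. the whole model.
# The 22 GB figure is the rough GPT-J size from above, not an exact number.
seq_len = 2048
bytes_per_float = 4  # single precision

attn_matrix_bytes = seq_len * seq_len * bytes_per_float
model_bytes = 22 * 1024**3

print(f"one score matrix: {attn_matrix_bytes / 1024**2:.0f} MiB")       # ~16 MiB
print(f"fraction of model size: {attn_matrix_bytes / model_bytes:.2%}") # ~0.07%
```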

However, when training a model, a much larger autograd graph gets allocated alongside each tensor so that the gradients for backpropagation can be calculated and stored. It makes sense to run a test here to see what the memory requirements really are, since a usable implementation is readily available now, or at least to find and work through the part of the paper that says at what sizes the technique is expected to help.
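
If I do run that test, something like the sketch below is what I have in mind. It assumes PyTorch with a CUDA device; `naive_attention` and `peak_memory_mib` are names I made up, and it measures a plain single-head attention forward plus backward rather than the paper's memory-efficient version, so it only shows the quadratic cost I'd be comparing against:

```python
import torch

def naive_attention(q, k, v):
    # Materialises the full (seq, seq) score matrix; autograd also keeps
    # the softmax output around for the backward pass.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def peak_memory_mib(seq_len, d_head=256, device="cuda"):
    torch.cuda.reset_peak_memory_stats(device)
    q, k, v = (torch.randn(seq_len, d_head, device=device, requires_grad=True)
               for _ in range(3))
    # Forward plus backward, roughly what one training step pays per head.
    naive_attention(q, k, v).sum().backward()
    return torch.cuda.max_memory_allocated(device) / 1024**2

for n in (2048, 8192, 16384):
    print(n, f"{peak_memory_mib(n):.0f} MiB")  # larger sizes may OOM on a small GPU
```

Even a quick run of this should make it clearer at what sequence lengths the quadratic term actually starts to matter relative to the rest of training memory.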

It shows how far outside my field I am. But it was also a great opportunity to work on, and make something near, these powerful things.