[ot][spam][crazy][data] transformer model 'attention' improvement

Undiscussed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Mon Jan 31 18:39:37 PST 2022


Amin Rezaei commented on my work on their GitHub, and pointed out that the
paper advises the technique is only useful for models with incredibly large
input sizes, not for any of the models I added it to.

Thinking briefly about that, it could be because of the size of the O(n^2)
attention data relative to the rest of the model. For example, GPT-J has a
total model size of roughly 22GB and is trained to predict tokens from text
up to 2k tokens long. A 2k x 2k attention matrix is about 4M floats, which
is much smaller than the total model size.
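A quick back-of-envelope sketch of that comparison, in Python. The numbers
here are the ones from the paragraph above (2k context, ~22GB model); the
real total attention footprint would also multiply by the number of heads
and layers, but the ratio makes the point either way.

    # back-of-envelope: one full n x n attention matrix vs. the whole model
    n_tokens = 2048                    # GPT-J context length
    entries = n_tokens ** 2            # ~4.2 million floats in the matrix
    matrix_bytes = entries * 4         # ~16 MiB at 32-bit precision
    model_bytes = 22 * 2**30           # ~22 GiB total model size (approximate)

    print(matrix_bytes / 2**20, "MiB per attention matrix")
    print(model_bytes / matrix_bytes, "x larger model")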

However, when training a model, the autograd framework can allocate a larger
computation graph around each intermediate value, in order to calculate and
store the gradients for backpropagation. Running a test to see what the
memory requirements really are makes sense here, since a usable
implementation is readily available now. Or at least finding and
comprehending the part of the paper that states at what sizes the technique
becomes useful.
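A minimal sketch of such a memory test, assuming PyTorch and a CUDA device.
The helpers naive_attention and peak_memory_mb are hypothetical names for
illustration, not part of any library; the attention here is the plain
O(n^2) formulation, and a chunked implementation could be swapped in to
compare.

    import torch

    def naive_attention(q, k, v):
        # materializes the full n x n score matrix, so memory is O(n^2)
        scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        return weights @ v

    def peak_memory_mb(n_tokens, d_head=256, device="cuda"):
        torch.cuda.reset_peak_memory_stats(device)
        q = torch.randn(1, n_tokens, d_head, device=device, requires_grad=True)
        k = torch.randn(1, n_tokens, d_head, device=device, requires_grad=True)
        v = torch.randn(1, n_tokens, d_head, device=device, requires_grad=True)
        out = naive_attention(q, k, v)
        out.sum().backward()   # forces the autograd graph to be built and used
        return torch.cuda.max_memory_allocated(device) / 2**20

    for n in (2048, 8192, 32768):
        print(n, "tokens:", peak_memory_mb(n), "MiB peak")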

It shows how far out in left field I am. But it was also a great opportunity
to work near these powerful things and make something.