[ot][spam][crazy][data] transformer model 'attention' improvement

Undiscussed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Tue Feb 1 13:18:41 PST 2022


Re the masks and biases: the chunking code currently assumes they are
dense matrices, but by changing the chunking code you can pass only
the slice of data each chunk actually needs. I'm working on that now.
The optimization may turn out not to be worthwhile for models that
store a dense mask or bias as an on-disk weight.
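To make the idea concrete, here is a minimal numpy sketch (my own illustration, not the actual chunking code being discussed): attention is computed one query chunk at a time, and instead of handing the full dense bias matrix to every chunk, only the rows of the bias that the current query chunk needs are sliced out and added. The function name and shapes are assumptions for the example.

```python
import numpy as np

def chunked_attention(q, k, v, bias, chunk=128):
    """Attention computed over query chunks; each chunk receives only
    its own slice of the (possibly large) dense bias matrix."""
    n, d = q.shape
    out = np.empty_like(q)
    for i in range(0, n, chunk):
        qi = q[i:i + chunk]                      # (c, d) query chunk
        scores = qi @ k.T / np.sqrt(d)           # (c, n) chunk scores
        scores = scores + bias[i:i + chunk, :]   # slice: only the rows this chunk needs
        # numerically stable softmax over the key axis
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[i:i + chunk] = w @ v
    return out
```

Chunking over keys as well would require an online softmax, but even this query-only version shows the point: the full dense mask/bias never has to be materialized inside any single chunk's computation.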


More information about the cypherpunks mailing list