[ot][spam][crazy][data] transformer model 'attention' improvement
Undiscussed Horrific Abuse, One Victim & Survivor of Many
gmkarl at gmail.com
Tue Feb 1 13:18:41 PST 2022
Re the masks and biases: the chunking code currently assumes they are
dense matrices, but by changing the chunking code you can pass each
chunk only the slice of data it needs. I'm doing that now. It may turn
out that the optimization is not worthwhile for models that store a
dense mask or bias as an on-disk weight.
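To make the idea concrete, here is a minimal sketch in plain NumPy (not the actual chunked-attention code; all names are hypothetical): attention is computed over query chunks, and each chunk receives only its rows of the dense additive bias/mask rather than the whole matrix.

```python
import numpy as np

def chunked_attention(q, k, v, bias, chunk_size):
    """Compute softmax(q @ k.T / sqrt(d) + bias) @ v over query chunks.

    q: (n, d), k: (m, d), v: (m, dv), bias: (n, m) dense additive bias/mask.
    Each chunk is handed only bias[start:end], not the full (n, m) matrix.
    """
    n, d = q.shape
    outputs = []
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        # slice the bias to the rows this query chunk actually needs
        scores = q[start:end] @ k.T / np.sqrt(d) + bias[start:end]
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v)
    return np.concatenate(outputs, axis=0)
```

The result matches unchunked attention exactly; the saving is that only a `chunk_size × m` slice of the mask needs to be materialized at a time, which is the part that breaks down if the model ships the mask as one dense on-disk weight anyway.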
More information about the cypherpunks mailing list