[ot][spam][crazy][data] transformer model 'attention' improvement
Undiscussed Horrific Abuse, Victim & Survivor of
gmkarl at gmail.com
Sun Jan 30 10:20:30 PST 2022
I'm pretty sure the "n" in the O(n^2) time and memory cost relates to the
key/query count, which are the same in self-attention. The batch
dimension is what the user adjusts to bound memory use.
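The quadratic cost being discussed can be seen in a minimal single-head
self-attention sketch (illustrative only, not from the post; learned
projections are omitted, so keys, queries, and values are all taken to
be the input itself). The n x n score matrix is where the O(n^2) memory
shows up, and each of the 8 sequences in the batch dimension is
processed independently, so batching only multiplies the cost linearly:

```python
import numpy as np

def self_attention(x):
    # x: (n, d) -- n is the key/query count (sequence length).
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (n, n): the O(n^2) term
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # (n, d)

# hypothetical sizes for illustration: batch of 8, sequence length 16
rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 16, 4))
out = np.stack([self_attention(seq) for seq in batch])
```

Per sequence the score matrix holds 16 * 16 entries; doubling the
sequence length quadruples that, while doubling the batch size only
doubles the number of such matrices.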
More information about the cypherpunks mailing list