[ot][spam][crazy][data] transformer model 'attention' improvement

Undiscussed Horrific Abuse, Victim & Survivor of gmkarl at gmail.com
Sun Jan 30 10:20:30 PST 2022


I'm pretty sure the "n" in the O(n^2) time and memory cost relates to
the key/query count, which is the same for both in self-attention:
each of the n queries attends to all n keys, so the score matrix has
n^2 entries. The batch dimension just scales that linearly, so the
user can tune batch size to set memory bounds.
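To make that concrete, here is a minimal NumPy sketch (mine, not from
any particular library) of single-head self-attention. The names n, d,
q, k, v are illustrative assumptions; the point is that the score
matrix it builds is n x n, which is where the quadratic time and
memory both come from:

    # Minimal sketch showing where the O(n^2) comes from: the score
    # matrix couples every query with every key.
    import numpy as np

    n, d = 1024, 64                # sequence length, head dimension
    q = np.random.randn(n, d)      # n queries
    k = np.random.randn(n, d)      # n keys (same count in self-attention)
    v = np.random.randn(n, d)      # n values

    scores = q @ k.T / np.sqrt(d)  # shape (n, n): the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    out = weights @ v              # shape (n, d)

    print(scores.shape)            # (1024, 1024) -> n^2 entries

Adding a batch dimension in front multiplies everything by the batch
size, which is why shrinking the batch is the easy knob for fitting
in memory.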
