[ot][spam][crazy][data] transformer model 'attention' improvement
Undiscussed Horrific Abuse, Victim & Survivor of
gmkarl at gmail.com
Sun Jan 30 06:06:59 PST 2022
"The straight-forward implementation of the attention operation above
requires us to first compute and remember s_i for all i, leading to a
O(n) time and memory complexity for each query. Transformers use
_self-attention_, which issues a separate query for each position in
the sequence, so the overall time and space complexity is O(n^2)."
Each s_i is the dot product between a single query and the i-th key. So
"s" in the paper would be "attn_weights" in the source, I think.
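To make the correspondence concrete, here is a minimal sketch of the
single-query attention the quote describes. The names (attend, query,
keys, values) are mine for illustration, not the paper's or the
source's; the point is that s holds one dot product per key, which is
the O(n) memory per query, and self-attention repeats this for every
position, giving O(n^2) overall.

```python
import numpy as np

def attend(query, keys, values):
    # s_i = <query, k_i>: storing all n of these is the O(n) memory
    # the quote refers to. These are the raw scores; after softmax
    # they would be the "attn_weights" in the source, I think.
    s = keys @ query
    w = np.exp(s - s.max())   # numerically stabilized softmax numerator
    w = w / w.sum()
    return w @ values         # weighted sum of the values

# Self-attention: one such query per sequence position -> O(n^2) total.
rng = np.random.default_rng(0)
n, d = 4, 3
Q, K, V = rng.standard_normal((3, n, d))
out = np.stack([attend(q, K, V) for q in Q])
print(out.shape)  # (4, 3)
```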
It looks likely that my contribution was in error, but I'm finding it
very hard to verify this.
More information about the cypherpunks
mailing list