[ot][spam][crazy][data] transformer model 'attention' improvement

Undiscussed Horrific Abuse, Victim & Survivor of gmkarl at gmail.com
Sun Jan 30 06:06:59 PST 2022


"The straight-forward implementation of the attention operation above
requires us to first compute and remember s_i for all i, leading to a
O(n) time and memory complexity for each query. Transformers use
_self-attention_, which issues a separate query for each position in
the sequence, so the overall time and space complexity is O(n^2)."

s_i is the dot product between key i and a single query, so the vector s
collects the scores over all keys. So "s" in the paper would be
"attn_weights" in the source. I think.
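To check my own reading, here is a minimal sketch of the straightforward
attention the quote describes. The names query/keys/values and attn_weights
are my assumptions about how s would appear in an implementation, not the
actual source being discussed:

import jax
import jax.numpy as jnp

def single_query_attention(query, keys, values):
    # s_i = <query, k_i> for every key; all n scores are materialized,
    # which is the O(n) time/memory per query the quote mentions.
    attn_weights = jnp.einsum('d,nd->n', query, keys)   # shape (n,)
    probs = jax.nn.softmax(attn_weights)                 # shape (n,)
    return jnp.einsum('n,nd->d', probs, values)          # shape (d,)

def self_attention(queries, keys, values):
    # One query per sequence position, so n queries of O(n) each:
    # O(n^2) time and space overall.
    return jax.vmap(single_query_attention,
                    in_axes=(0, None, None))(queries, keys, values)

If that reading is right, attn_weights in the source is the full (n,) score
vector for one query, and stacking it over all positions is where the n^2
comes from.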

It looks likely that my contribution was in error, but I'm finding it very
hard to cognitively verify this.
