30 Jan 2022
2:06 p.m.
"The straight-forward implementation of the attention operation above requires us to first compute and remember s_i for all i, leading to a O(n) time and memory complexity for each query. Transformers use _self-attention_, which issues a separate query for each position in the sequence, so the overall time and space complexity is O(n^2)." s_i is the dot product between the keys and a single query. So "s" in the paper would be "attn_weights" in the source. I think. It's looking likely that my contribution was in error, but it's seeming very hard for me to cognitively verify this.