[And the phrase "self-attention", when applied to an exhaustive "cross-product" of information flow parts, is reminiscent of the forgotten automation norm of having the system attend to its own effectiveness, something missing in transformer models.]

Lemme try that paragraph again: "The straightforward implementation of the attention operation above requires us to first compute and remember s_i for all i, leading to an O(n) time and memory complexity for each query. Transformers use _self-attention_, which issues a separate query for each position in the sequence, so the overall time and space complexity is O(n^2)."

For a single query q:

s_i = dot(q, k_i)
s_i' = softmax(s)_i
attention = sum_i(v_i * s_i')

So originally three ndarrays come in: q, k, and v, and q and k each have a dimension of length n. In self-attention, to take the dot of each q_i with every k_j, an ndarray s is formed that carries both the q dimension and the k dimension; it's that pair of length-n dimensions in s (attn_weights in the source) that makes the whole thing O(n^2). I think I can move that reasoning over to the source to show myself that my addition does indeed prevent the memory savings.
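To convince myself, here's a minimal toy sketch of naive self-attention in JAX (my own version, not the source; the function name naive_self_attention and the shapes are assumptions). The full [n, n] score array s it materializes is, I believe, the counterpart of attn_weights in the source, and holding it is exactly the O(n^2) memory cost:

    import jax
    import jax.numpy as jnp

    def naive_self_attention(q, k, v):
        # q, k, v: [n, d] arrays, one query/key/value vector per position.
        # s[i, j] = dot(q_i, k_j); materializing this full [n, n] array is the
        # O(n^2) memory cost (I believe it matches attn_weights in the source).
        s = jnp.einsum('id,jd->ij', q, k)
        s_prime = jax.nn.softmax(s, axis=-1)        # s_i' = softmax(s_i), per query row
        return jnp.einsum('ij,jd->id', s_prime, v)  # output_i = sum_j v_j * s_ij'

    # quick shape check: n=8 positions, d=4 features
    key = jax.random.PRNGKey(0)
    q = k = v = jax.random.normal(key, (8, 4))
    out = naive_self_attention(q, k, v)             # out.shape == (8, 4)

If my addition forces this full s to be materialized inside the source's computation, that would explain why the memory savings disappear.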