I'll review the definition of self-attention from the paper.

    s_i  = dot(q, k_i)                    # take the dot product of the query with every key
    s'_i = exp(s_i) / sum(exp(s_j), j)    # take the softmax of the result
    attn = sum(v_i * s'_i, i)             # the attention output is the softmax-weighted sum of the value vectors

This is all done in parallel. Why does it take O(n^2) memory? The paper says the calculation of the s_i needs O(n) memory for a single query, but for _self attention_, as used in transformer models, there is a different query for each of the n sequence positions: hence n parallel attention calculations, and O(n) * n = O(n^2) memory. So the bulk of this simple paper is about going from O(n) memory for a single query to O(1), by unparallelising the calculation over the keys: the softmax denominator and the weighted sum of the values can both be accumulated one key/value pair at a time, so the s_i never need to be materialised all at once (see the sketch below). Okay, that's the piece I was missing; I might review this email content again to make sure I've actually understood it.
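To convince myself, here's a minimal numpy sketch (mine, not the paper's JAX code). The first function is the single-query attention from the pseudocode above and materialises all n scores at once; the second is the "unparallelised" version that walks over the key/value pairs keeping only a running numerator, denominator, and max (the running max mirrors the numerical-stability trick the paper also uses). The function names and shapes here are my own assumptions.

```python
import numpy as np

def attention_single_query(q, K, V):
    """Single-query attention as in the pseudocode: holds all n scores at once (O(n) memory)."""
    s = K @ q                      # s_i = dot(q, k_i) for every key, an n-vector
    w = np.exp(s - s.max())        # unnormalised softmax weights (max subtracted for stability)
    return (w[:, None] * V).sum(axis=0) / w.sum()

def attention_single_query_incremental(q, K, V):
    """Same result, but visits one key/value pair at a time.

    Memory is O(1) in the sequence length n for a single query: only a running
    numerator, denominator, and max are kept, so the s_i are never all in
    memory together.
    """
    m = -np.inf                          # running max of the scores seen so far
    num = np.zeros_like(V[0], float)     # running sum of exp(s_i - m) * v_i
    den = 0.0                            # running sum of exp(s_i - m)
    for k_i, v_i in zip(K, V):
        s_i = q @ k_i
        m_new = max(m, s_i)
        scale = np.exp(m - m_new)        # rescale the old sums when the running max changes
        num = num * scale + np.exp(s_i - m_new) * v_i
        den = den * scale + np.exp(s_i - m_new)
        m = m_new
    return num / den

# Quick check that the two agree
rng = np.random.default_rng(0)
n, d = 16, 4
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
assert np.allclose(attention_single_query(q, K, V),
                   attention_single_query_incremental(q, K, V))
```

Running the incremental function once per query (rather than broadcasting all n queries against all n keys at once) is, as far as I can tell, exactly the "unparallelising" move: per-query memory drops from O(n) to O(1), at no change in the result.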