i've glanced through this paper before but don't really remember it well. i've got it open again, here are notes:

- 'attention' is a transformation from a query q and a set of key and value vectors k_i, v_i to a weighted sum of the value vectors
- the computation is roughly this ('s' being an intermediate value, i could have made an error writing this):

      s_i = dot(q, k_i)
      s_i = e^s_i / sum_j(e^s_j)
      output = sum_i(v_i * s_i)

  if that's right, i think it could be written this way:

      s_i = e^dot(q, k_i)
      output = sum_i(v_i * s_i) / sum_j(s_j)

  i think the revelation of the paper is that the normalising sum can be pulled out of the weighted sum, so the division happens once at the end. not sure. it adds more than just that, though. (first sketch at the end of these notes.)
- this is their first example demonstration:

      s_i = dot(q, k_i)
      s_i = e^s_i
      output = sum_i(v_i * s_i) / sum_j(s_j)

- it looks like their solution is to compute the output sequentially: once the division is deferred to the end, each incoming term only has to be added to a running total, so k_i and v_i can each be discarded as they are processed.
- the paper treats 'plain attention' as having a single constant query, whereas 'self attention' has a query for each element of k and v.
- Warning! the e^s_i / sum(e^s_i) operation is usually stabilised by subtracting the maximum score inside the exponent, so the exponentials don't overflow
- the paper handles this by performing a running normalisation on e^s_i. I'm just going to type that explanation in:

      We initialize the vector v and scalar s with 0, and m with -inf. As before, given key value pair k_i, v_i, we compute s_i = dot(q, k_i), but then the algorithm differs slightly from Section 2. We first compute m_i = max(m, s_i) and update v = v * e^(m - m_i) + v_i * e^(s_i - m_i) and s = s * e^(m - m_i) + e^(s_i - m_i) and m = m_i. After processing all keys and queries, we divide v / s to get final results.

  All the funny math notation is kind of strange to me, but i infer they basically keep the exponentials in a safe range by subtracting a running maximum before exponentiating, and rescale the accumulators whenever that maximum increases; the division by the accumulated sum still happens once at the end. (second sketch at the end of these notes.) There's a code implementation too on github that is likely clearer, and some code on the next page or two of the paper.
- since the approach accumulates sequentially, it doesn't map well onto gpu parallelism as-is. the paper proposes processing keys and values in chunks for gpu/tpu devices (page 2), and describes their algorithm in terms of their example code (page 3). (third sketch at the end of these notes.)
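
first sketch: a quick numpy check (my own code, not the paper's; the variable names are mine) that the standard softmax form and the divide-once-at-the-end form give the same output for a single query:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 8, 4                      # number of key/value pairs, feature dimension
    q = rng.standard_normal(d)       # single query
    K = rng.standard_normal((n, d))  # keys
    V = rng.standard_normal((n, d))  # values

    # standard form: normalise the weights first, then take the weighted sum
    s = K @ q                        # s_i = dot(q, k_i)
    w = np.exp(s) / np.exp(s).sum()  # softmax weights
    out_standard = (V * w[:, None]).sum(axis=0)

    # rearranged form: sum v_i * e^s_i, divide by sum e^s_i once at the end
    e = np.exp(s)
    out_deferred = (V * e[:, None]).sum(axis=0) / e.sum()

    print(np.allclose(out_standard, out_deferred))  # True

the only difference is where the division happens, so up to floating point error the two agree.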
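
second sketch: my reading of the quoted running-normalisation paragraph, written out in numpy (again my own code, not the code that ships with the paper). each key/value pair is looked at once; only v_acc, s_acc and m survive from one iteration to the next, so memory stays constant in the number of keys:

    import numpy as np

    def single_query_attention_streaming(q, K, V):
        d = V.shape[1]
        v_acc = np.zeros(d)   # running sum of v_i * e^(s_i - m)
        s_acc = 0.0           # running sum of e^(s_i - m)
        m = -np.inf           # running maximum of the scores seen so far
        for k_i, v_i in zip(K, V):
            s_i = q @ k_i
            m_new = max(m, s_i)
            # rescale the accumulators to the new maximum, then add this term
            v_acc = v_acc * np.exp(m - m_new) + v_i * np.exp(s_i - m_new)
            s_acc = s_acc * np.exp(m - m_new) + np.exp(s_i - m_new)
            m = m_new
        return v_acc / s_acc  # divide once at the end

    # check against a naive stabilised softmax
    rng = np.random.default_rng(0)
    q = rng.standard_normal(4)
    K = rng.standard_normal((8, 4))
    V = rng.standard_normal((8, 4))
    w = np.exp(K @ q - (K @ q).max())
    w = w / w.sum()
    print(np.allclose(single_query_attention_streaming(q, K, V), w @ V))  # True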
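
third sketch: a rough guess at what the chunked variant for gpu/tpu (page 2) looks like - this is only my illustration of the idea, not the paper's implementation. each chunk is scored with vectorised math, summarised as (unnormalised weighted sum, weight sum, max), and the summaries are merged with the same rescaling trick as above:

    import numpy as np

    def single_query_attention_chunked(q, K, V, chunk=4):
        d = V.shape[1]
        v_acc, s_acc, m = np.zeros(d), 0.0, -np.inf
        for start in range(0, K.shape[0], chunk):
            Kc, Vc = K[start:start + chunk], V[start:start + chunk]
            s = Kc @ q           # all scores in the chunk at once
            m_c = s.max()        # chunk-local maximum
            e = np.exp(s - m_c)
            v_c = e @ Vc         # chunk's unnormalised weighted sum
            s_c = e.sum()
            # merge the chunk summary into the running accumulators
            m_new = max(m, m_c)
            v_acc = v_acc * np.exp(m - m_new) + v_c * np.exp(m_c - m_new)
            s_acc = s_acc * np.exp(m - m_new) + s_c * np.exp(m_c - m_new)
            m = m_new
        return v_acc / s_acc

    # same check as before, against a naive stabilised softmax
    rng = np.random.default_rng(1)
    q = rng.standard_normal(4)
    K = rng.standard_normal((10, 4))
    V = rng.standard_normal((10, 4))
    w = np.exp(K @ q - (K @ q).max())
    w = w / w.sum()
    print(np.allclose(single_query_attention_chunked(q, K, V, chunk=3), w @ V))  # True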