These next bits starting at line 14 (and I'm trying to remember there's a /sqrt(count) line I mean to return to) must be part of the strategy to iteratively calculate the precise softmax (exp(s_i) / sum_j exp(s_j)) by doing the subtraction inside the exponent rather than a division outside it. Here's the text from Section 3 of the paper:

In practice, the softmax is implemented by subtracting the maximum score from all scores. This does not change the result of the softmax, but avoids this numerical problem. Our incremental computation of the sum of exponentiated scores (and the values times the scores) does not immediately allow for the same trick, as the maximum may depend on the last score in the sequence. But the subtraction cannot be delayed either, since the scores must be exponentiated before they can be added to the cumulative sum. To resolve this problem, we introduce an additional scalar, which keeps track of the maximum score that the incremental algorithm has seen so far, and we renormalize the sums of exponentiated values as needed: We initialize the vector v and scalar s with 0, and m with -inf. As before, given a key value pair k_i, v_i, we compute s_i = dot(q, k_i), but then the algorithm differs slightly from Section 2. We first compute m_i = max(m, s_i), and update v = v * exp(m - m_i) + v_i * exp(s_i - m_i) and s = s * exp(m - m_i) + exp(s_i - m_i) and m = m_i. After processing all keys and queries, we divide v / s to get the final result.

I want to drill down into their equations more, but it would make sense to use the variable names from the code example starting at line 14.
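Before I map this onto the code at line 14, here's a minimal sketch of how I read that update loop, in plain Python/NumPy and using the paper's own variable names (q, k_i, v_i, s_i, m, v, s). The shapes, the single-query framing, and the check against an ordinary max-subtracted softmax are my assumptions, not the paper's actual implementation:

```python
import numpy as np

def incremental_attention(q, keys, values):
    """Incremental, numerically stable softmax-weighted sum for one query.

    q:      (d,)   query vector
    keys:   (n, d) key vectors k_i
    values: (n, d) value vectors v_i
    """
    d = q.shape[0]
    v = np.zeros(d)   # running sum of exponentiated-score-weighted values
    s = 0.0           # running sum of exponentiated scores
    m = -np.inf       # maximum score seen so far

    for k_i, v_i in zip(keys, values):
        s_i = np.dot(q, k_i)      # score for this key
        m_i = max(m, s_i)         # updated running maximum
        # Rescale the old accumulators by exp(m - m_i) so they are expressed
        # relative to the new maximum, then add the new term, also shifted
        # by the new maximum.
        v = v * np.exp(m - m_i) + v_i * np.exp(s_i - m_i)
        s = s * np.exp(m - m_i) + np.exp(s_i - m_i)
        m = m_i

    return v / s  # the deferred division recovers the softmax-weighted average


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=4)
    keys = rng.normal(size=(8, 4))
    values = rng.normal(size=(8, 4))

    # Sanity check against the ordinary (all-at-once, max-subtracted) softmax.
    scores = keys @ q
    w = np.exp(scores - scores.max())
    expected = (w[:, None] * values).sum(axis=0) / w.sum()
    print(np.allclose(incremental_attention(q, keys, values), expected))  # True
```

The thing that clicks for me here is that v and s are always stored relative to the current maximum m, so nothing ever gets exponentiated without the max already subtracted; on the first step exp(m - m_i) = exp(-inf) = 0, which cleanly wipes the zero-initialized accumulators.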