[ot][spam][crazy][data] transformer model 'attention' improvement

k gmkarl at gmail.com
Tue Jan 25 03:48:38 PST 2022


These next bits starting line 14 (and I'm trying to remember there's
a /sqrt(count) line I mean to return to) must be part of the strategy
to iteratively calculate the precise softmax (exp(s_i) / sum_j exp(s_j))
by doing subtraction in the exponent rather than division outside it.
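
As a quick sanity check of that idea, here's a minimal numpy sketch
(the example scores are just made up to force an overflow; none of
this is from the paper's code):

    import numpy as np

    scores = np.array([1000.0, 1001.0, 1002.0])  # big enough to overflow exp()

    naive = np.exp(scores) / np.exp(scores).sum()    # exp() overflows -> nan
    shifted = scores - scores.max()                   # subtract max inside the exponent
    stable = np.exp(shifted) / np.exp(shifted).sum()  # same softmax, no overflow

    print(naive)   # [nan nan nan]
    print(stable)  # [0.09003057 0.24472847 0.66524096]

Subtracting the max in the exponent is the same as dividing every term
by exp(max), which cancels between numerator and denominator, so the
result is unchanged.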

Here's text from Section 3 of the paper:

In practice, the softmax is implemented by subtracting the maximum
score from all scores. This does not change the result of the softmax,
but avoids this numerical problem.

Our incremental computation of the sum of exponentiated scores (and
the values times the scores) does not immediately allow for the same
trick, as the maximum may depend on the last score in the sequence.
But the subtraction cannot be delayed either, since the scores must be
exponentiated before they can be added to the cumulative sum.

To resolve this problem, we introduce an additional scalar, which
keeps track of the maximum score that the incremental algorithm has
seen so far, and we renormalize the sums of exponentiated values as
needed: We initialize the vector v and scalar s with 0, and m with
-inf. As before, given a key value pair k_i, v_i, we compute s_i =
dot(q, k_i), but then the algorithm differs slightly from Section 2.
We first compute m_i = max(m, s_i), and update v = v * exp(m - m_i) +
v_i * exp(s_i - m_i) and s = s * exp(m - m_i) + exp(s_i - m_i) and m =
m_i.  After processing all keys and queries, we divide v / s to get
the final result.

-
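
A minimal sketch of that loop as I read it, using the variable names
from the quoted text (q, k_i, v_i, v, s, m) rather than the names from
the code example at line 14, since that code isn't reproduced here.
This is my own reading, not the paper's implementation:

    import numpy as np

    def incremental_attention(q, keys, values):
        # q: (d,), keys/values: (n, d); returns softmax(q @ K.T) @ V for one query
        d = values.shape[1]
        v = np.zeros(d)   # running sum of exp-weighted values
        s = 0.0           # running sum of exponentiated scores
        m = -np.inf       # running maximum score seen so far
        for k_i, v_i in zip(keys, values):
            s_i = np.dot(q, k_i)   # the /sqrt(d) scaling mentioned above would go here
            m_i = max(m, s_i)      # new running maximum
            # renormalize the old sums from base m to base m_i, then add the new term
            v = v * np.exp(m - m_i) + v_i * np.exp(s_i - m_i)
            s = s * np.exp(m - m_i) + np.exp(s_i - m_i)
            m = m_i
        return v / s

    # check against a direct max-subtracted softmax on random data
    rng = np.random.default_rng(0)
    K = rng.normal(size=(5, 4)); V = rng.normal(size=(5, 4)); q = rng.normal(size=4)
    w = np.exp(q @ K.T - (q @ K.T).max()); w /= w.sum()
    assert np.allclose(incremental_attention(q, K, V), w @ V)

On the first iteration m is -inf, so exp(m - m_i) is 0 and the old
(zero) sums drop out, which is why the -inf initialization works.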

I want to drill down into their equations more, but it would make
sense to use the variable names from the code example starting line
14.

