[ot][spam][crazy][data] transformer model 'attention' improvement

k gmkarl at gmail.com
Tue Jan 25 02:03:45 PST 2022


I'll review the definition of self-attention from the paper.

s_i = dot(q, k_i) # take the dot product of the query with every key
s'_i = exp(s_i) / sum(exp(s_j), j) # take the softmax of the result
attn = sum(v_i * s'_i, i) # attention is the softmax-weighted sum of
the value vectors
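
As a runnable sanity check, here is that definition for one query in
numpy (toy shapes; the names are mine, not the paper's):

import numpy as np

n, d = 8, 4                      # toy sequence length and head dimension
q = np.random.randn(d)           # a single query vector
K = np.random.randn(n, d)        # all n key vectors, one per position
V = np.random.randn(n, d)        # all n value vectors

s = K @ q                        # dot of the query with every key
p = np.exp(s) / np.exp(s).sum()  # softmax of the scores
attn = p @ V                     # softmax-weighted sum of the values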

This is done in parallel.  Why does it take O(n^2) memory?

- the paper says the calculation of s_i is O(n) for a single query,
but for _self attention_, as used in transformer models, there is a
different query for each sequence position: hence n parallel
attention calculations, each O(n), which is O(n^2) in total (see the
sketch below).
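
For concreteness, a sketch of that parallel form in numpy (my
phrasing, not the paper's code): all n queries are stacked into a
matrix, and the score matrix that gets materialised is n x n, which
is exactly where the O(n^2) memory goes:

import numpy as np

n, d = 1024, 64
Q = np.random.randn(n, d)       # a different query per sequence position
K = np.random.randn(n, d)
V = np.random.randn(n, d)

S = Q @ K.T                                       # n x n scores: the O(n^2) memory
P = np.exp(S) / np.exp(S).sum(-1, keepdims=True)  # row-wise softmax
attn = P @ V                                      # n outputs, only O(n*d)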

So, the bulk of this simple paper is converting from O(n) memory for
a single query to O(1) for a single query, by simply unparallelising
(serialising) the calculation over the keys.  Okay, I was missing
that; I might review this email again to try to comprehend what's up.
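
If I'm reading it right, the unparallelised computation for a single
query looks something like this (a sketch in numpy, names mine; the
paper apparently also carries a running maximum of the scores so the
exponentials stay numerically stable, which I've included):

import numpy as np

def attention_one_query(q, K, V):
    # visit keys/values one at a time, keeping only a running score
    # maximum plus running softmax numerator and denominator:
    # O(1) extra memory regardless of sequence length n
    m = -np.inf
    num = np.zeros(V.shape[1])
    den = 0.0
    for k_i, v_i in zip(K, V):
        s = q @ k_i
        m_new = max(m, s)
        scale = np.exp(m - m_new)              # rescale old accumulators
        num = num * scale + np.exp(s - m_new) * v_i
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den

Running that once per sequence position reproduces the same attn
vectors as the parallel version, without ever materialising the
n x n score matrix.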

