[ot][spam][crazy][data] transformer model 'attention' improvement
k
gmkarl at gmail.com
Tue Jan 25 02:03:45 PST 2022
I'll review the definition of self-attention from the paper.
s_i = dot(q, k_i)                 # dot product of the query with every key
s'_i = exp(s_i) / sum_j exp(s_j)  # softmax of the scores
attn = sum_i s'_i * v_i           # weighted sum of the value vectors
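As a sketch, the three lines above can be written directly in NumPy for a single query. The function name and shapes are my own choice for illustration; note the score vector s has length n, so this is O(n) memory per query.

```python
import numpy as np

def attention_single_query(q, K, V):
    """Attention for one query, following the definition above.

    q: (d,) query vector; K: (n, d) keys; V: (n, d) values.
    Materializes all n scores at once: O(n) memory for this query.
    """
    s = K @ q                        # s_i = dot(q, k_i), shape (n,)
    s = s - s.max()                  # subtract max for numerical stability
    w = np.exp(s) / np.exp(s).sum()  # softmax: s'_i
    return w @ V                     # sum_i s'_i * v_i, shape (d,)
```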
This is done in parallel. Why does it take O(n^2) memory?
- the paper says the calculation of s_i is O(n) memory for a single
query, but in _self attention_, as used in transformer models, there is
a different query for each of the n sequence positions: hence n parallel
calculations of attention, and O(n^2) memory overall.
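To make the n-queries-in-parallel point concrete, here is a minimal NumPy sketch of vanilla self-attention (names and shapes are my own, for illustration). The score matrix S is (n, n): that is exactly the O(n^2) memory being asked about.

```python
import numpy as np

def self_attention(Q, K, V):
    """Vanilla self-attention: one query per sequence position, in parallel.

    Q, K, V: (n, d). The score matrix S is (n, n), which is where
    the O(n^2) memory cost comes from.
    """
    S = Q @ K.T                           # (n, n): dot(q_i, k_j) for all pairs
    S = S - S.max(axis=1, keepdims=True)  # stabilize each row's softmax
    W = np.exp(S)
    W = W / W.sum(axis=1, keepdims=True)  # row-wise softmax
    return W @ V                          # (n, d)
```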
So, the bulk of this short paper is converting the per-query cost from
O(n) memory to O(1) memory, by simply unparallelising the calculation:
the scores are accumulated one at a time instead of being held all at
once. Okay, I was missing that; I might review this email content again
to try to comprehend what's up.
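A sketch of that unparallelised, O(1)-memory-per-query accumulation (my own illustrative code, assuming the running-max trick for numerical stability as in the Rabe & Staats writeup): only a running numerator, denominator, and max score are kept, never the full score vector.

```python
import numpy as np

def attention_single_query_streaming(q, K, V):
    """Attention for one query with O(1) extra memory.

    Iterates over (key, value) pairs one at a time, keeping only a
    running numerator, denominator, and max score; never stores the
    length-n score vector.
    """
    d = V.shape[1]
    num = np.zeros(d)  # running sum of exp(s_i - m) * v_i
    den = 0.0          # running sum of exp(s_i - m)
    m = -np.inf        # running max score, for numerical stability
    for k_i, v_i in zip(K, V):
        s_i = q @ k_i
        m_new = max(m, s_i)
        # rescale previous partial sums to the new max
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        num = num * scale + np.exp(s_i - m_new) * v_i
        den = den * scale + np.exp(s_i - m_new)
        m = m_new
    return num / den
```

This returns the same result as the parallel definition, just computed serially, which is the trade the paper is making.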
More information about the cypherpunks mailing list