[ot][spam][crazy][data] transformer model 'attention' improvement

Undiscussed Horrific Abuse, Victim & Survivor of gmkarl at gmail.com
Sun Jan 30 10:40:01 PST 2022


[and the phrase "self-attention" when applied to an exhaustive
"cross-product" of information flow parts is reminiscent of the
forgotten automation norm of having the system attend to its own
effectiveness. missing in transformer models.]

Lemme try that paragraph again:

"The straight-forward implementation of the attention operation above
requires us to first compute and remember s_i for all i, leading to a
O(n) time and memory complexity for each query. Transformers use
_self-attention_, which issues a separate query for each position in
the sequence, so the overall time and space complexity is O(n^2)."

s_i = dot(q, k_i)
s'_i = softmax(s)_i = exp(s_i) / sum_j exp(s_j)
attention = sum_i(v_i * s'_i)
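
In code, that single-query case is something like this minimal numpy
sketch (my own illustration of the pseudocode above, not the paper's
actual source; q is one query of size d, k and v are shaped (n, d)):

import numpy as np

def single_query_attention(q, k, v):
    s = k @ q                    # s_i = dot(q, k_i), shape (n,)
    e = np.exp(s - s.max())      # shift by max for numerical stability
    w = e / e.sum()              # s'_i = softmax(s)_i
    return w @ v                 # sum_i v_i * s'_i, shape (d,)

Only the length-n vectors s, e, w are ever held, hence O(n) memory
for one query.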

So originally three ndarrays come in: q, k, v. In self-attention, q
and k each have a sequence dimension of length n.
To take the dot of each query q_i with every key, a matrix s is formed
that carries both the q and the k sequence dimensions, i.e. n x n
entries.
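
Continuing the numpy sketch (still my own hypothetical names, not the
actual source): with one query per position, q, k, v are all (n, d),
and the scores become an n x n matrix:

def self_attention(q, k, v):
    s = q @ k.T                            # (n, n): every q_i against every k_j
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return w @ v                           # (n, d)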

So it's the pair of q and k sequence dimensions in s (attn_weights in
the source) that makes it O(n^2).
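
For contrast, the memory saving in the paper comes from never
materializing that full matrix at once. A simplified query-chunked
version (my own sketch; the real source also chunks over keys with a
running softmax, which saves even more) looks like:

def chunked_self_attention(q, k, v, chunk=64):
    # same result as self_attention, but only a (chunk, n) slice
    # of scores exists at a time: O(chunk * n) memory, not O(n^2)
    out = np.zeros((q.shape[0], v.shape[1]))
    for i in range(0, q.shape[0], chunk):
        s = q[i:i+chunk] @ k.T
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        out[i:i+chunk] = (e / e.sum(axis=-1, keepdims=True)) @ v
    return out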

I think I can move that information over to the source to show myself
that my addition does indeed prevent the memory savings.

