26 Jan 2022
9:19 a.m.
So basically a matmul is an einsum that sums over (and drops) the last axis of the first operand and the second-to-last axis of the second operand. Matrix multiplications really are sequences of dot products! Linear algebra is slowly and painfully coming back to me. Attached is a transcription of huggingface's Perceiver attention that works with the same example data. The 'keys/queries/values' axis ends up being the sequence axis. They permute the tensors so the heads dimension joins the batch dimensions, which lets the dot products run as a normal matmul rather than an einsum with its string parsing.
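A quick PyTorch sanity check of the matmul-as-einsum claim (toy shapes of my own, not from the attached transcription):

```python
import torch

# A (batched) matmul contracts the last axis of the first operand with
# the second-to-last axis of the second operand; both summed-over axes
# disappear from the output.
a = torch.randn(2, 3, 4)  # (batch, rows, inner)
b = torch.randn(2, 4, 5)  # (batch, inner, cols)

via_matmul = a @ b                               # shape (2, 3, 5)
via_einsum = torch.einsum('bij,bjk->bik', a, b)  # same contraction, spelled out
assert torch.allclose(via_matmul, via_einsum)

# "Sequences of dot products": every output cell is one dot product of a
# row of a with a column of b.
assert torch.allclose(via_matmul[0, 1, 2], torch.dot(a[0, 1], b[0, :, 2]))
```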
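And the permute trick, sketched with made-up shapes (my own reconstruction of the idea, not the actual Hugging Face code):

```python
import torch

batch, heads, seq, head_dim = 2, 8, 10, 64

# Hypothetical layout: heads sit between the sequence and feature axes.
q = torch.randn(batch, seq, heads, head_dim)
k = torch.randn(batch, seq, heads, head_dim)

# Permute heads next to batch so the trailing two axes are (seq, head_dim);
# matmul batches over the leading axes and does all per-head dot products.
q_p = q.permute(0, 2, 1, 3)           # (batch, heads, seq, head_dim)
k_p = k.permute(0, 2, 1, 3)           # (batch, heads, seq, head_dim)
scores = q_p @ k_p.transpose(-2, -1)  # (batch, heads, seq, seq)

# The same contraction written as an einsum, no permute needed:
scores_einsum = torch.einsum('bqhd,bkhd->bhqk', q, k)
assert torch.allclose(scores, scores_einsum)
```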