[ot][spam][crazy][data] transformer model 'attention' improvement

k gmkarl at gmail.com
Tue Jan 25 16:04:43 PST 2022


The first issue I have working with PerceiverSelfAttention is sorting
out the huggingface permutations of the query, key, and value
matrices.  The dot products aren't producing the same attention
weights, which indicates I'm not providing the data in the right
shape.  They reorganise the matrices to handle multiple channels, and
split them into heads a certain way.  I also have trouble intuiting
the relation between torch.matmul and einsum when computing a matrix
of dot products between feature vectors.
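As a sanity check on that last point, here is a minimal sketch (not
the actual HuggingFace Perceiver code; the tensor names and sizes are
illustrative) showing that a matmul over head-split tensors and an
explicit einsum compute the same matrix of dot products:

import torch

# Illustrative sizes, not taken from any Perceiver config.
batch, seq, heads, head_dim = 2, 5, 4, 8

q = torch.randn(batch, seq, heads * head_dim)
k = torch.randn(batch, seq, heads * head_dim)

# The usual transformers-style head split: reshape the channel axis
# into (heads, head_dim), then move heads in front of the sequence axis.
def split_heads(x):
    return x.view(batch, seq, heads, head_dim).permute(0, 2, 1, 3)

qh, kh = split_heads(q), split_heads(k)

# matmul form: (batch, heads, seq, head_dim) @ (batch, heads, head_dim, seq)
scores_matmul = torch.matmul(qh, kh.transpose(-1, -2))

# einsum form: an explicit dot product over the feature axis d for
# every pair of positions (i, j) within each head.
scores_einsum = torch.einsum('bhid,bhjd->bhij', qh, kh)

assert torch.allclose(scores_matmul, scores_einsum)

If the view/permute step is done in a different order than the library
expects, the channels get scrambled across heads, and the dot products
come out different even though the overall shapes still match -- which
would produce exactly the mismatched weights described above.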

