[ot][spam][crazy][data] transformer model 'attention' improvement

Undiscussed Horrific Abuse, Victim & Survivor of gmkarl at gmail.com
Sun Jan 30 15:21:20 PST 2022


I'm doing some work on getting the current state of the transformers
code I have working again with my test Perceiver model that converts
numbers.

These are my notes on the correct dimension shapes for the code that
ostensibly worked; a shape-check sketch follows the list. I plan to
compare these against the broken commit below.

correct context layer shape is 1,256,8,20
                attention_probs.shape = 1,8,256,96
                values.shape          = 1,8,96,20
                queries.shape         = 1,8,256,32
                keys.shape            = 1,8,96,32
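
as a sanity check, here is a minimal PyTorch sketch (not the project's
actual code; the variable names are mine) showing how those shapes
compose: queries times transposed keys gives the attention
probabilities, those times the values give the per-head context, and a
permute yields the 1,256,8,20 context layer.

import torch

queries = torch.randn(1, 8, 256, 32)  # (batch, heads, q_len, qk_dim)
keys    = torch.randn(1, 8, 96, 32)   # (batch, heads, kv_len, qk_dim)
values  = torch.randn(1, 8, 96, 20)   # (batch, heads, kv_len, v_dim)

# scores over the key axis, scaled by sqrt(qk_dim)
scores = queries @ keys.transpose(-2, -1) / 32 ** 0.5
attention_probs = scores.softmax(dim=-1)
assert attention_probs.shape == (1, 8, 256, 96)

context = attention_probs @ values      # (1, 8, 256, 20)
context = context.permute(0, 2, 1, 3)   # (1, 256, 8, 20)
assert context.shape == (1, 256, 8, 20)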

commit ca60cd579c82191d4e6696534af32e96b850015e (HEAD -> memory-efficient-attention, xloem/memory-efficient-attention)
Author: xloem <0xloem at gmail.com>
Date:   Sun Jan 30 23:17:41 2022 +0000

    commented out old perceiver code and drafted a call of the new
    attentions function that does both chunked and nonchunked. currently
    crashes due to dimension error.
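
for reference, here is a rough sketch of the chunked attention idea the
commit is wiring up. this is not the memory-efficient-attention
library's actual API, just the general technique: split the queries
along the sequence axis so only one (chunk, kv_len) block of scores is
materialized at a time. the real code also chunks the keys/values with
a streaming softmax, which this sketch skips for brevity.

import torch

def chunked_attention(queries, keys, values, chunk_size=64):
    # process queries in chunks; each chunk attends over all keys,
    # so the result equals full attention with lower peak memory.
    scale = queries.shape[-1] ** -0.5
    out = []
    for q in queries.split(chunk_size, dim=-2):
        probs = (q @ keys.transpose(-2, -1) * scale).softmax(dim=-1)
        out.append(probs @ values)
    return torch.cat(out, dim=-2)

# with the shapes above: (1, 8, 256, 20) before the final permute
out = chunked_attention(torch.randn(1, 8, 256, 32),
                        torch.randn(1, 8, 96, 32),
                        torch.randn(1, 8, 96, 20))
assert out.shape == (1, 8, 256, 20)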

