25 Jan
2022
11:53 p.m.
The reason my barebones attention got a different answer than the paper's chunked attention was that I hadn't included the division by the square root of the feature count; I had meant to come back and add it but never did. With it included, the outputs match (script attached). Next I'm comparing the output of Hugging Face's PerceiverSelfAttention class against my script and the chunked attention. Its output is different and I'm unsure why; maybe an additional post-processing step? It does include the square-root denominator.
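For my own reference, a minimal NumPy sketch of the two variants I'm comparing (my own sketch, not the paper's code or Hugging Face's): plain scaled dot-product attention and a chunked version with a running softmax. Once both include the 1/sqrt(d) factor, they should agree to numerical precision.

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention; the 1/sqrt(d) was the missing piece."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def chunked_attention(q, k, v, chunk=64):
    """Same result, processing keys/values in chunks with a running softmax."""
    d = q.shape[-1]
    m = np.full((q.shape[0], 1), -np.inf)   # running max of scores per query
    denom = np.zeros((q.shape[0], 1))       # running softmax denominator
    out = np.zeros((q.shape[0], v.shape[-1]))  # running weighted sum
    for i in range(0, k.shape[0], chunk):
        s = q @ k[i:i + chunk].T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)           # rescale earlier accumulators
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v[i:i + chunk]
        m = m_new
    return out / denom

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
assert np.allclose(attention(q, k, v), chunked_attention(q, k, v))
```

Dropping the `np.sqrt(d)` from only one of the two functions reproduces the mismatch I was seeing.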