25 Jan
2022
11:53 p.m.
The reason my barebones attention got a different answer than the paper's chunked attention was that I hadn't included the division by the square root of the feature count; I had meant to come back and add it but never did. With it included, the outputs match (script attached). Next I'm comparing the output of Hugging Face's PerceiverSelfAttention class against my script and the chunked attention. Its output is different and I'm unsure why; maybe an additional post-processing step? It does include the square-root denominator.
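For my own reference, a minimal NumPy sketch of the two variants I'm comparing (my own sketch, not the paper's code or Hugging Face's): plain scaled dot-product attention and a chunked version with a running softmax. Once both include the 1/sqrt(d) factor, they should agree to numerical precision.

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention; the 1/sqrt(d) was the missing piece."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def chunked_attention(q, k, v, chunk=64):
    """Same result, processing keys/values in chunks with a running softmax."""
    d = q.shape[-1]
    m = np.full((q.shape[0], 1), -np.inf)   # running max of scores per query
    denom = np.zeros((q.shape[0], 1))       # running softmax denominator
    out = np.zeros((q.shape[0], v.shape[-1]))  # running weighted sum
    for i in range(0, k.shape[0], chunk):
        s = q @ k[i:i + chunk].T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)           # rescale earlier accumulators
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v[i:i + chunk]
        m = m_new
    return out / denom

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 32)) for _ in range(3))
assert np.allclose(attention(q, k, v), chunked_attention(q, k, v))
```

Dropping the `np.sqrt(d)` from only one of the two functions reproduces the mismatch I was seeing.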