[ot][spam][crazy][data] transformer model 'attention' improvement

k gmkarl at gmail.com
Tue Jan 25 15:53:18 PST 2022


The reason my barebones attention got a different answer than the
paper's chunked attention was that I hadn't included the division by
the square root of the feature count, which I had intended to return
to but hadn't. With that division included, the outputs are the same;
the script is attached.
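
For illustration, here is a minimal sketch (not the attached
unchunked.py) of plain scaled dot-product attention next to a
key/value-chunked accumulation in the spirit of the paper, checking
that the two agree once the square-root division is in place. Shapes
and chunk size are made up.

import numpy as np

def attention(q, k, v):
    # q: (n, d), k: (m, d), v: (m, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # the square-root division
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

def chunked_attention(q, k, v, chunk=16):
    # Accumulate unnormalised numerators and denominators one key/value
    # chunk at a time, carrying a running max for numerical stability.
    d = q.shape[-1]
    num = np.zeros((q.shape[0], v.shape[-1]))
    den = np.zeros((q.shape[0], 1))
    running_max = np.full((q.shape[0], 1), -np.inf)
    for start in range(0, k.shape[0], chunk):
        kc, vc = k[start:start + chunk], v[start:start + chunk]
        scores = q @ kc.T / np.sqrt(d)
        new_max = np.maximum(running_max, scores.max(-1, keepdims=True))
        correction = np.exp(running_max - new_max)
        w = np.exp(scores - new_max)
        num = num * correction + w @ vc
        den = den * correction + w.sum(-1, keepdims=True)
        running_max = new_max
    return num / den

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
print(np.allclose(attention(q, k, v), chunked_attention(q, k, v)))  # True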

Next I'm comparing the output of huggingface's PerceiverSelfAttention
class with my script and with the chunked attention. Its output is
different, and I'm not sure why; maybe there is an additional
post-processing step? It also includes the square-root denominator.
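
A hedged sketch of that comparison (again, not the author's script):
instantiate PerceiverSelfAttention, then recompute attention by hand
from the module's own q/k/v projections. The import path, the
constructor arguments, and the assumption that the module
layer-normalizes its input before projecting are from memory and may
differ between transformers versions.

import math
import torch
from transformers import PerceiverConfig
from transformers.models.perceiver.modeling_perceiver import PerceiverSelfAttention

config = PerceiverConfig(attention_probs_dropout_prob=0.0)
# single head, matching q/kv widths, eval mode so dropout is off
attn = PerceiverSelfAttention(config, num_heads=1, q_dim=32, kv_dim=32).eval()

hidden = torch.randn(1, 8, 32)
with torch.no_grad():
    module_out = attn(hidden)[0]

    # Manual recomputation reusing the module's own weights.
    normed = attn.layernorm1(hidden)   # assumed pre-projection layer norm
    q, k, v = attn.query(normed), attn.key(normed), attn.value(normed)
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])
    manual_out = scores.softmax(dim=-1) @ v

# expected to print True if the internals match these assumptions
print(torch.allclose(module_out, manual_out, atol=1e-6))

If the two still disagree, the remaining difference would come from
whatever processing the class does beyond the bare q/k/v projections;
the layer norm above is only a guess at what that step is.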
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unchunked.py
Type: text/x-python
Size: 1359 bytes
Desc: not available
URL: <https://lists.cpunks.org/pipermail/cypherpunks/attachments/20220125/f652a475/attachment-0001.py>

