[ot][spam][crazy][data] transformer model 'attention' improvement

k gmkarl at gmail.com
Wed Jan 26 05:41:52 PST 2022


the AminRezaei0x443 implementation also produces the same data; output attached again.

the aminrezaei implementation applies the 1/sqrt(d) score scaling,
accepts optional mask and bias tensors, is on pypi, and has both jax
and torch implementations, so it seems the way to go.
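
to make concrete what "same data" means here, a minimal sketch (my own
toy code, not the package's api; it skips the mask/bias options) of
chunked attention matching plain scaled dot-product attention:

import jax
import jax.numpy as jnp

def standard_attention(q, k, v):
    # plain scaled dot-product attention over (seq, d) tensors
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v

def chunked_attention(q, k, v, chunk=64):
    # same result, but key/value chunks are folded into running
    # (max, numerator, denominator) accumulators, so the full
    # (seq, seq) score matrix is never materialized at once
    q = q / jnp.sqrt(q.shape[-1])
    num = jnp.zeros_like(q)
    den = jnp.zeros(q.shape[0])
    m = jnp.full(q.shape[0], -jnp.inf)
    for i in range(0, k.shape[0], chunk):
        s = q @ k[i:i + chunk].T
        m_new = jnp.maximum(m, s.max(axis=-1))
        p = jnp.exp(s - m_new[:, None])
        fix = jnp.exp(m - m_new)  # rescale old accumulators to the new max
        num = num * fix[:, None] + p @ v[i:i + chunk]
        den = den * fix + p.sum(axis=-1)
        m = m_new
    return num / den[:, None]

key = jax.random.PRNGKey(0)
q, k, v = (jax.random.normal(s, (256, 64)) for s in jax.random.split(key, 3))
print(jnp.allclose(standard_attention(q, k, v), chunked_attention(q, k, v), atol=1e-5))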

next i'll time it against the paper's implementation, the one i noted
as speedy.  just on my raspberry pi, though.  i'm guessing they're
roughly the same on good hardware with large models, where the big
batched matrix multiplies dominate the runtime.  i mostly just engage
whatever i bump into.
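
for the timing, probably a rough harness like this (reusing the two toy
functions from the sketch above; reps and sizes arbitrary).
block_until_ready matters since jax dispatches asynchronously, so a
bare call returns before the work actually finishes:

import time

def bench(fn, q, k, v, reps=10):
    fn(q, k, v).block_until_ready()  # first call pays trace/compile cost
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(q, k, v).block_until_ready()
    return (time.perf_counter() - t0) / reps

for name, fn in [("standard", standard_attention), ("chunked", chunked_attention)]:
    print(name, bench(fn, q, k, v))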

maybe it would be good just to quickly run through the source and
verify that aminrezaei's code does checkpointing and lax mapping like
the paper's implementation.
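
for reference, what i'd grep for is roughly the following pattern (a
toy sketch of the paper's tricks as i understand them, not
aminrezaei's actual source): jax.checkpoint so the backward pass
recomputes each chunk's activations instead of storing them, and
lax.map so the loop over query chunks isn't unrolled into the graph:

import jax
import jax.numpy as jnp
from jax import lax

@jax.checkpoint
def summarize_chunk(q_chunk, k, v):
    # checkpointed: the (chunk, seq) score matrix gets recomputed
    # during backprop instead of being held for the backward pass
    s = q_chunk @ k.T / jnp.sqrt(q_chunk.shape[-1])
    return jax.nn.softmax(s, axis=-1) @ v

def attention(q, k, v, chunk=64):
    # assumes seq is a multiple of chunk, for brevity
    q_chunks = q.reshape(-1, chunk, q.shape[-1])
    out = lax.map(lambda qc: summarize_chunk(qc, k, v), q_chunks)
    return out.reshape(-1, q.shape[-1])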

