26 Jan
2022
10:41 p.m.
the AminRezaei0x443 implementation also produces the same data, attached again. it does the square root, provides for optional mask and bias tensors, is on pypi, and has both a jax and a torch implementation, so it seems the way to go. next i'll be timing it against the paper's implementation, which i noted as speedy. just on my raspberry pi, though. i'm guessing the two are roughly the same on good hardware with large models, where the core batches dominate everything. sometimes i mostly engage stuff i bump into. maybe it would be good to quickly run through the source and verify that the AminRezaei0x443 code does checkpointing and lax mapping like in the paper.
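for reference while reading the source, here is a minimal sketch of the pattern i'd be looking for: key/value chunks processed with `jax.lax.map`, with the per-chunk function wrapped in `jax.checkpoint`. this is my own illustration, not the paper's code and not the AminRezaei0x443 code. the names `naive_attention`, `chunked_attention`, `summarize` and `kv_chunk` are made up, it assumes single-head 2-d inputs, and it leaves out masks, biases and query chunking.

```python
import jax
import jax.numpy as jnp


def naive_attention(q, k, v):
    # plain softmax attention; materialises the full (n_q, n_kv) score matrix
    scores = (q / jnp.sqrt(q.shape[-1])) @ k.T
    return jax.nn.softmax(scores, axis=-1) @ v


def chunked_attention(q, k, v, kv_chunk=4):
    # hypothetical sketch: q is (n_q, d), k is (n_kv, d), v is (n_kv, d_v)
    n_kv, d = k.shape
    assert n_kv % kv_chunk == 0, "sketch assumes n_kv divisible by kv_chunk"
    q = q / jnp.sqrt(d)  # the square-root scaling
    k_chunks = k.reshape(-1, kv_chunk, d)
    v_chunks = v.reshape(-1, kv_chunk, v.shape[-1])

    @jax.checkpoint  # recompute chunk internals on the backward pass
    def summarize(chunk):
        k_c, v_c = chunk
        s = q @ k_c.T                          # (n_q, kv_chunk) scores
        m = s.max(axis=-1, keepdims=True)      # per-chunk max for stability
        p = jnp.exp(s - m)                     # unnormalised weights
        return p @ v_c, p.sum(axis=-1, keepdims=True), m

    # lax.map runs summarize over the leading (chunk) axis sequentially
    vals, sums, maxes = jax.lax.map(summarize, (k_chunks, v_chunks))

    # rescale every chunk to a shared global max, then combine
    g = maxes.max(axis=0, keepdims=True)
    w = jnp.exp(maxes - g)
    return (vals * w).sum(axis=0) / (sums * w).sum(axis=0)
```

if a source file has something shaped like this (a checkpointed per-chunk function, a `lax.map` or scan over chunks, and a max-rescaled combine at the end), that is probably the part to compare against the paper.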