i made two spamjournal threads regarding 'data science'-y things: automated reverse engineering and perceiver model notes. in the automated reverse engineering one i linked a paper: https://arxiv.org/abs/2112.05682 . this paper describes an optimization to these models that seems obvious in hindsight: an algebraic transformation of the 'attention' mechanism that drops the memory requirements by orders of magnitude, though using it requires understanding how some of the operators are implemented. implementing this might help people with fewer resources than governments and large corporations train models like the automated reverse engineering one. i _think_ it basically means you can process a bigger batch size, a bigger model, or longer input/output text on a smaller gpu or tpu (or in less cpu ram). i'd like to try to implement it in both perceiver and t5.
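
a rough sketch of what i understand the core trick to be, written in plain numpy for clarity rather than taken from the paper's reference code: instead of materializing the full n×n attention score matrix, you walk over the keys/values in chunks and keep a running max, running softmax denominator, and running weighted sum per query. the function name, shapes, and chunk size here are my own choices for illustration.

```python
import numpy as np

def chunked_attention(q, k, v, key_chunk=128):
    """softmax(q @ k.T / sqrt(d)) @ v, computed without ever holding
    the full (n_q, n_k) score matrix in memory."""
    n_q, d = q.shape
    n_k, _ = k.shape
    scale = 1.0 / np.sqrt(d)

    # running statistics per query row
    m = np.full((n_q, 1), -np.inf)      # running max of scores (for numerical stability)
    denom = np.zeros((n_q, 1))          # running sum of exp(score - m)
    acc = np.zeros((n_q, v.shape[1]))   # running sum of exp(score - m) @ v

    for start in range(0, n_k, key_chunk):
        k_c = k[start:start + key_chunk]
        v_c = v[start:start + key_chunk]
        s = (q @ k_c.T) * scale          # (n_q, chunk) scores for this chunk only

        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        correction = np.exp(m - m_new)   # rescale old accumulators to the new max
        p = np.exp(s - m_new)
        denom = denom * correction + p.sum(axis=1, keepdims=True)
        acc = acc * correction + p @ v_c
        m = m_new

    return acc / denom
```

this sketch only chunks over keys/values, which is already enough to avoid the n×n matrix; the paper additionally chunks the queries to hit its stated memory bound, and a real perceiver/t5 version would need masking, multiple heads, and gradient checkpointing of the chunk loop on top of this.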