i made two spamjournal threads regarding 'data science'-y things: automated reverse engineering and perceiver model notes. in the automated reverse engineering one i linked a paper: https://arxiv.org/abs/2112.05682 . this paper describes an optimization to these models that seems obvious in hindsight: an algebraic transformation of the 'attention' mechanism that drops the memory requirements by orders of magnitude, though using it requires understanding how some of the operators are implemented. implementing this might help people with fewer resources than governments and large corporations train models like the automated reverse engineering one. i _think_ it basically means you can process a bigger batch size, a bigger model, or longer input/output text on a smaller gpu or tpu (or in less cpu ram). i'd like to try to implement it in both perceiver and t5.
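
a rough sketch of what i understand the core trick to be, written in plain numpy for clarity rather than taken from the paper's reference code: instead of materializing the full n×n attention score matrix, you walk over the keys/values in chunks and keep a running max, running softmax denominator, and running weighted sum per query. the function name, shapes, and chunk size here are my own choices for illustration.

```python
import numpy as np

def chunked_attention(q, k, v, key_chunk=128):
    """softmax(q @ k.T / sqrt(d)) @ v, computed without ever holding
    the full (n_q, n_k) score matrix in memory."""
    n_q, d = q.shape
    n_k, _ = k.shape
    scale = 1.0 / np.sqrt(d)

    # running statistics per query row
    m = np.full((n_q, 1), -np.inf)      # running max of scores (for numerical stability)
    denom = np.zeros((n_q, 1))          # running sum of exp(score - m)
    acc = np.zeros((n_q, v.shape[1]))   # running sum of exp(score - m) @ v

    for start in range(0, n_k, key_chunk):
        k_c = k[start:start + key_chunk]
        v_c = v[start:start + key_chunk]
        s = (q @ k_c.T) * scale          # (n_q, chunk) scores for this chunk only

        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        correction = np.exp(m - m_new)   # rescale old accumulators to the new max
        p = np.exp(s - m_new)
        denom = denom * correction + p.sum(axis=1, keepdims=True)
        acc = acc * correction + p @ v_c
        m = m_new

    return acc / denom
```

this sketch only chunks over keys/values, which is already enough to avoid the n×n matrix; the paper additionally chunks the queries to hit its stated memory bound, and a real perceiver/t5 version would need masking, multiple heads, and gradient checkpointing of the chunk loop on top of this.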