So the generalisation improvements for this got into that repository, but I never moved forward on my pull request to huggingface. I wasn't sure how to make the work reusable in the face of the preference against abstraction or generalisation of model components. Still, copy-pasting is a real and useful way to reuse things, and people need things to copy-paste.

Some time ago I stumbled on _another_ approach to making attention memory efficient without changing its underlying structure, top-k attention: https://github.com/ag1988/top_k_attention . Basically, for each query it only performs the k most impactful multiplies. It sounds like this approach could easily be adapted to use a dynamic 'k' chosen to preserve precision (sketched below). That sits near the ideas of pruning and distillation, too.

I also stumbled on just one library that collects efficient attention improvements together: https://github.com/idiap/fast-transformers . It was last updated in mid 2021, its tests are currently not passing, and it is pytorch-only. It already has a top-k attention implementation: https://fast-transformers.github.io/api_docs/fast_transformers/attention/exa... . I wonder if there are other such libraries somewhere. That fast-transformers repo might be a better place for unifying attention improvements. It doesn't load pretrained models, but it could be extended to.

Still, it might make the most sense to clean up the perceiver and gpt-j attention improvements and either make a fork or submit a PR that uses them. I kind of wanted to try running gpt-j on long text and see how the actual memory usage goes. That might mean shelling into a remote server or using colab to simplify the large memory allocation involved, and waiting some time for the large model to download.
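To make the top-k idea concrete, here's a minimal pytorch sketch, not the ag1988 code itself: this naive version still builds the full score matrix, so it only illustrates the per-query selection rather than the memory savings, plus one made-up way a dynamic 'k' could be chosen. `topk_attention`, `dynamic_k`, and the `mass` threshold are names I'm inventing for illustration.

```python
import torch

def topk_attention(q, k, v, topk=64):
    # q, k, v: (batch, heads, seq, dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5        # (b, h, sq, sk)
    topk = min(topk, scores.shape[-1])
    vals, idx = scores.topk(topk, dim=-1)                        # k largest scores per query
    weights = torch.softmax(vals, dim=-1)                        # softmax over the kept scores only
    # gather the value vectors that correspond to the kept scores
    idx_exp = idx.unsqueeze(-1).expand(*idx.shape, v.shape[-1])  # (b, h, sq, topk, d)
    v_exp = v.unsqueeze(2).expand(-1, -1, q.shape[2], -1, -1)    # (b, h, sq, sk, d), a view
    v_topk = v_exp.gather(3, idx_exp)                            # (b, h, sq, topk, d)
    return torch.einsum('bhqk,bhqkd->bhqd', weights, v_topk)     # (b, h, sq, d)

def dynamic_k(scores, mass=0.99):
    # one possible "dynamic k": the smallest k whose kept softmax mass reaches `mass`
    probs = torch.softmax(scores, dim=-1).sort(dim=-1, descending=True).values
    needed = (probs.cumsum(dim=-1) < mass).sum(dim=-1) + 1
    return int(needed.clamp(max=scores.shape[-1]).max().item())
```

A real version would presumably pick k per query (or per head) rather than one global k, and would avoid ever holding the full score matrix, but this is enough to play with how much mass a small k actually captures.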
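And for the gpt-j run, this is roughly what I'd try, assuming a CUDA machine with enough room for the fp16 weights (roughly 12 GB); the prompt and model id here are just placeholders for whatever I'd actually feed it:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float16,   # fp16 weights to halve the memory footprint
    low_cpu_mem_usage=True,      # avoid a second full copy in RAM while loading (may need accelerate)
).to("cuda")
model.eval()

long_text = "some long document " * 400             # placeholder; swap in real long text
inputs = tokenizer(long_text, return_tensors="pt",
                   truncation=True, max_length=2048).to("cuda")  # 2048 is gpt-j's context limit

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(**inputs)
print(f"peak CUDA memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```

Running that once with the stock attention and once with the memory-efficient version should show whether the improvement actually matters at these lengths.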