[ot][spam][crazy][data] transformer model 'attention' improvement

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Thu Feb 10 03:08:09 PST 2022


So the generalisation improvements for this made it into that
repository, but I never moved forward on my pull request to
huggingface. I wasn't sure how to make the work reusable given the
project's preference against abstracting or generalising model
components. Still, copy-pasting is a real and useful way to reuse
things, and people need things to copy-paste.

Some time ago I stumbled on _another_ approach to making attention
memory efficient without changing its underlying structure, top-k
attention: https://github.com/ag1988/top_k_attention . Basically, for
each query it keeps only the k largest attention scores and drops the
rest, so only the most impactful multiplies against the value vectors
are actually performed. It sounds like this approach could easily be
mutated to use a dynamic 'k' that preserves precision. That might be
near the ideas of pruning or distillation, too.
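
To make that concrete, here's a minimal sketch of the selection step
in plain pytorch -- not the repo's actual implementation. This naive
version still materialises the full score matrix, so it only shows the
selection; as I understand it, the repo gets its memory savings by
chunking the computation and never keeping more than the top-k scores
per query. The dynamic_k helper is just my guess at what a
precision-preserving dynamic 'k' could look like: the smallest k per
query whose retained softmax mass passes a threshold.

import torch
import torch.nn.functional as F

def topk_attention(q, k, v, topk=64):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    topk = min(topk, scores.shape[-1])
    # indices of the topk largest scores for each query position
    _, idx = scores.topk(topk, dim=-1)
    # additive mask: kept positions get 0, everything else -inf
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)
    weights = F.softmax(scores + mask, dim=-1)  # dropped scores -> weight 0
    return weights @ v

def dynamic_k(scores, mass=0.99):
    # smallest k per query whose largest probabilities sum past `mass`
    probs = F.softmax(scores, dim=-1)
    sorted_probs, _ = probs.sort(dim=-1, descending=True)
    return (sorted_probs.cumsum(dim=-1) < mass).sum(dim=-1) + 1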

I also stumbled on just one library that collects together efficient
attention improvements: https://github.com/idiap/fast-transformers .
Last updated mid-2021; tests currently not passing; pytorch-only. It
already has a top-k attention implementation:
https://fast-transformers.github.io/api_docs/fast_transformers/attention/exact_topk_attention.html
. I wonder if there are other such libraries somewhere.
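
For what it's worth, the library's builder pattern looks roughly like
the below (written from memory of its README, so treat it as a sketch;
in particular the "exact-topk" attention_type string and the topk
parameter are my guesses from the docs page above and should be
checked against the attention registry):

import torch
from fast_transformers.builders import TransformerEncoderBuilder

# Builder pattern as in the project's README (from memory).
builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=2,
    n_heads=4,
    query_dimensions=64,
    value_dimensions=64,
    feed_forward_dimensions=512,
)

# Plain softmax attention, for comparison.
builder.attention_type = "full"
full_model = builder.get()

# Top-k attention -- "exact-topk" and the `topk` parameter are my guess
# from the docs page; check the registry for the real registered name.
builder.attention_type = "exact-topk"
builder.topk = 32
topk_model = builder.get()

x = torch.rand(1, 128, 4 * 64)  # (batch, seq_len, n_heads * query_dimensions)
y = topk_model(x)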

That fast-transformers repo might be a better place to unify
attention improvements. It doesn't load pretrained models, but it
could be extended to.

It still might be most relevant to clean up the perceiver and gpt-j
attention improvements and either make a fork or submit a PR that uses
them. I kind of wanted to try running gpt-j on long text and see how
the actual memory usage goes. This might mean shelling into a remote
server or using colab to accommodate the large memory allocation
involved, and waiting some time for the large model to download.
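
If I get that far, the measurement itself could be as simple as
something like this (just a sketch; the model id, fp16 dtype, and
sequence length are placeholder choices, and it assumes a CUDA machine
with enough RAM):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: measure peak GPU memory while running gpt-j over a long input.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
).to("cuda")
model.eval()

text = "some long document " * 400   # stand-in for real long text
inputs = tokenizer(text, return_tensors="pt",
                   truncation=True, max_length=2048).to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(**inputs)
print("peak memory (GiB):", torch.cuda.max_memory_allocated() / 2**30)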

