[ot][spam][crazy][data] transformer model 'attention' improvement

Undiscussed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Wed Feb 2 01:40:41 PST 2022


Good morning, spamthread.

I commented on the PR. I believe the DRY concern relates to
researchers being able to quickly review implementations without
having to switch files, though I'm not sure.

Here's the PR log:

2 days ago, xloem:
# What does this PR do?

This begins the implementation of a central `attention()` function in
`modeling_utils.py` that calls out to
https://github.com/AminRezaei0x443/memory-efficient-attention when
configuration parameters are set, so that attention memory can be
configurably reduced from O(n^2) down to O(n), at the expense of
parallel execution.
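
(An aside for the list, not part of the PR text itself: a rough sketch
of the chunked-attention idea behind that repository and its paper,
just to make the memory/parallelism tradeoff concrete. This is neither
the PR's code nor the dependency's API; the function name, chunk
sizes, and single-head layout are mine.)

```python
import torch

def chunked_attention(q, k, v, q_chunk=128, k_chunk=128):
    """Exact softmax attention for one head, computed chunk by chunk.

    q, k, v: (seq_len, head_dim) tensors. Peak score memory is
    q_chunk * k_chunk instead of seq_len * seq_len.
    """
    scale = q.shape[-1] ** -0.5
    out = []
    for qs in range(0, q.shape[0], q_chunk):
        qc = q[qs:qs + q_chunk] * scale
        # running statistics for a numerically stable streaming softmax
        m = torch.full((qc.shape[0], 1), -float("inf"), device=q.device)
        num = torch.zeros(qc.shape[0], v.shape[-1], device=q.device)
        den = torch.zeros(qc.shape[0], 1, device=q.device)
        for ks in range(0, k.shape[0], k_chunk):
            kc, vc = k[ks:ks + k_chunk], v[ks:ks + k_chunk]
            s = qc @ kc.T                          # (q_chunk, k_chunk) scores
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            corr = torch.exp(m - m_new)            # rescale earlier chunks
            p = torch.exp(s - m_new)
            num = num * corr + p @ vc
            den = den * corr + p.sum(dim=-1, keepdim=True)
            m = m_new
        out.append(num / den)
    return torch.cat(out, dim=0)
```

For an unmasked head this matches
`torch.softmax(q @ k.T * q.shape[-1] ** -0.5, dim=-1) @ v`, while only
ever holding a q_chunk x k_chunk block of scores at a time; the
sequential loop over key chunks is where the parallelism is traded
away.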

I'm afraid the new memory-efficient-attention still needs to be added
as a dependency and some development cruft removed from the source.

I believe it is important to reuse existing projects, so that people's
work can be more effective and valued, but I believe the
memory-efficient-attention code is MIT licensed, if copying it in is
preferred.

The GPTJ and Perceiver models are altered to call out to the new
attention function.

Working on this has been very hard for me, so I am contributing what I
have now. If others have better work on this, feel free to accept
theirs before mine.

- [ ] I have commented out the rest of the PR form here, to return to
as I find capacity.

--
2 days ago, LysandreJik:

> Hey @xloem, thanks a lot for your hard work on this! It is cool to support the attention mechanism as visible in https://github.com/AminRezaei0x443/memory-efficient-attention. However, the `transformers` library does not really work with central components to be shared among many models, so we do not design layers in the `modeling_utils.py` file.
>
> This comes from the following two pieces of the "Why should I/shouldn't I use Transformers" from the README:
>
> > 4. Easily customize a model or an example to your needs:
> >
> >
> > * We provide examples for each architecture to reproduce the results published by its original authors.
> > * **Model internals are exposed as consistently as possible.**
> > * **Model files can be used independently of the library for quick experiments.**
>
> and
>
> > This library is not a modular toolbox of building blocks for neural nets. The code in the model files is not refactored with additional abstractions on purpose so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
>
> You'll see that we have other implementations of efficient attention mechanisms spread across the codebase, and each of them is linked to a single model. Recent examples of this are YOSO and Nyströmformer, which were released in v4.16 last week.
>
> cc @sgugger @patrickvonplaten for knowledge

--
20 hours ago, patrickvonplaten: (this is a name I've seen before and
recognise as a community member; I didn't know they worked for
huggingface)
> @xloem, would you be interested in adding this new attention mechanism simply as a new model - linked to its official paper: https://arxiv.org/abs/2112.05682v2 (maybe called something like `O1Transformer` ?)

--
13 hours ago, xloem (me):
Thanks so much for your replies and advice.

I hadn't noticed the readme line. Do you know of any libraries similar
to transformers that do collect building blocks? The transformers
library is far smaller than the torch and flax libraries that models
already depend on, but I imagine you know the experience of research
work better than I do and are better placed to make that call.

I've been finding a little more energy to work on this. The owner of
the dependency repo found there's still an outstanding issue: masks
and biases are built with O(n^2) memory, so more work is needed.
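
(Aside: as I understand that issue, the attention scores are built in
chunks but the full (n, n) mask and bias tensors are still
materialized up front. The shape of a fix is to build each mask block
per chunk pair inside the loop instead; a rough sketch with
illustrative names, not the dependency's actual interface.)

```python
import torch

def causal_mask_chunk(q_start, k_start, q_len, k_len, device=None):
    # True where query position q_start + i may attend to key position k_start + j
    q_idx = torch.arange(q_start, q_start + q_len, device=device).unsqueeze(-1)
    k_idx = torch.arange(k_start, k_start + k_len, device=device).unsqueeze(0)
    return k_idx <= q_idx  # (q_len, k_len) block; memory is O(q_chunk * k_chunk)
```

Each score block would then be masked with something like
`s.masked_fill(~causal_mask_chunk(qs, ks, qc.shape[0], kc.shape[0]), float("-inf"))`,
so peak mask memory tracks the chunk sizes rather than n^2.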

--
7 minutes ago, xloem (me):
Apologies for submitting this PR in such a poor state.

Unlike YOSO and Nyströmformer, this implementation is exact: the
output is mathematically identical to that produced by the standard
softmax attention.
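
(Aside: a quick numerical illustration of that exactness claim. It
combines per-chunk softmax statistics with a running max and checks
the result against ordinary softmax attention; the shapes, chunk
split, and seed are arbitrary.)

```python
import torch

torch.manual_seed(0)
s = torch.randn(4, 10)              # scores for 4 queries over 10 keys
v = torch.randn(10, 8)              # values

ref = s.softmax(dim=-1) @ v         # ordinary softmax attention

# the same result from two key chunks via the streaming softmax
m = torch.full((4, 1), -float("inf"))
num, den = torch.zeros(4, 8), torch.zeros(4, 1)
for sc, vc in ((s[:, :6], v[:6]), (s[:, 6:], v[6:])):
    m_new = torch.maximum(m, sc.max(dim=-1, keepdim=True).values)
    corr = torch.exp(m - m_new)
    p = torch.exp(sc - m_new)
    num = num * corr + p @ vc
    den = den * corr + p.sum(dim=-1, keepdim=True)
    m = m_new

print(torch.allclose(num / den, ref, atol=1e-6))  # True
```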

Although the research could be implemented as a standalone module, I'm
really hoping to help users on low-memory systems use this code to
fine-tune current pretrained models. @patrickvonplaten, how does that
hope sit with you?

After rereading the concerns, my plan is to move the code into the
model files, in keeping with the statements in the readme.
@LysandreJik, does this sound reasonable?

I also try to stick to just one source file when developing and
learning. My usual approach is to open a Python interpreter and run
`print(inspect.getsource(modulefunc))` to quickly see how an
abstraction is implemented, which is indeed quite unideal. Abstraction
can _really_ empower a domain: research and development accelerates
when abstraction is made the norm.
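
(For concreteness, that workflow is just something like the following,
here using GPT-J's attention class as an example target.)

```python
import inspect
from transformers.models.gptj.modeling_gptj import GPTJAttention

# dump the attention implementation to the terminal without leaving the interpreter
print(inspect.getsource(GPTJAttention))
```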

