Good morning, spamthread. I commented on the PR. I believe the DRY concern relates to researchers being able to quickly review implementations without having to switch files, but I'm not sure. Here's the PR log:

-- 2 days ago, xloem:

# What does this PR do?

This begins the implementation of a central `attention()` function in modeling_utils.py that calls out to https://github.com/AminRezaei0x443/memory-efficient-attention if configuration parameters are set, allocating configurably as little as O(n) memory rather than O(n^2), at the expense of parallel execution. I'm afraid memory-efficient-attention still needs to be added as a dependency, and some development cruft removed from the source. I believe it is important to reuse existing projects, so that people's work can be more effective and valued, but I also believe the memory-efficient-attention code is MIT licensed if copying it is preferred. The GPTJ and Perceiver models are altered to call out to the new attention function. Working on this has been very hard for me, so I am contributing what I have now. If others have better work on this, feel free to accept it over mine.

- [ ] I have commented out the rest of the PR form here, to return to as I find capacity.
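(An aside for this thread, not part of the PR log: the trick in the linked package and paper is a chunked, numerically stable softmax. Below is only my rough PyTorch illustration of the idea, not the PR's code and not the package's API; it skips masks, biases, and query chunking, and the function name and default chunk size are made up.)

```python
import torch

def chunked_attention(q, k, v, key_chunk_size=1024):
    """Exact softmax attention computed over key/value chunks.

    The attention-weight buffer scales with n_queries * key_chunk_size
    instead of n_queries * n_keys. No masks, biases, or dropout here.
    """
    *lead, n_q, d = q.shape
    scale = d ** -0.5
    # Running accumulators for a numerically stable streaming softmax.
    acc = torch.zeros_like(q)                           # sum of exp(score) * value
    denom = q.new_zeros(*lead, n_q, 1)                  # sum of exp(score)
    running_max = q.new_full((*lead, n_q, 1), float("-inf"))

    for start in range(0, k.shape[-2], key_chunk_size):
        k_c = k[..., start:start + key_chunk_size, :]
        v_c = v[..., start:start + key_chunk_size, :]
        scores = torch.einsum("...qd,...kd->...qk", q, k_c) * scale
        chunk_max = scores.amax(dim=-1, keepdim=True)
        new_max = torch.maximum(running_max, chunk_max)
        # Rescale everything accumulated so far to the new running maximum.
        correction = torch.exp(running_max - new_max)
        acc = acc * correction
        denom = denom * correction
        weights = torch.exp(scores - new_max)
        acc = acc + torch.einsum("...qk,...kd->...qd", weights, v_c)
        denom = denom + weights.sum(dim=-1, keepdim=True)
        running_max = new_max
    return acc / denom
```

For q, k, v of shape (batch, heads, seq, head_dim), this matches `torch.softmax(q @ k.transpose(-2, -1) * d ** -0.5, dim=-1) @ v` up to floating-point error, while only ever materializing a (seq, key_chunk_size) block of weights per head; trading the full score matrix for this loop is the parallelism cost the PR description mentions.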
-- 2 days ago, LysandreJik:

Hey @xloem, thanks a lot for your hard work on this! It would be cool to support the attention mechanism shown in https://github.com/AminRezaei0x443/memory-efficient-attention. However, the `transformers` library doesn't really work with central components shared among many models, so we don't design layers in the `modeling_utils.py` file.
This comes from the following two pieces of the "Why should I/shouldn't I use Transformers" from the README:
4. Easily customize a model or an example to your needs:
* We provide examples for each architecture to reproduce the results published by its original authors.
* **Model internals are exposed as consistently as possible.**
* **Model files can be used independently of the library for quick experiments.**
and
This library is not a modular toolbox of building blocks for neural nets. The code in the model files is not refactored with additional abstractions on purpose so that researchers can quickly iterate on each of the models without diving into additional abstractions/files.
You'll see that we have other implementations of efficient attention mechanisms spread across the codebase, and each of them is linked to a single model. Recent examples of this are YOSO and Nyströmformer, which were released in v4.16 last week.
cc @sgugger @patrickvonplaten for knowledge
-- 20 hours ago, patrickvonplaten: (this is a name I've seen before and recognise as a community member; I didn't know they worked for Hugging Face)
@xloem, would you be interested in adding this new attention mechanism simply as a new model, linked to its official paper: https://arxiv.org/abs/2112.05682v2 (maybe called something like `O1Transformer`?)
-- 13 hours ago, xloem (me):

Thanks so much for your replies and advice. I hadn't noticed the README line. Do you know of any libraries similar to transformers that do collect building blocks? The transformers library is far smaller than the torch and flax libraries that models already depend on, but I imagine you know the experience of research work better than I do to make that call. I've been finding a little more energy to work on this. The owner of the dependency repo found there's still an outstanding issue that masks and biases are built O(n^2), so more work is needed.

-- 7 minutes ago, xloem (me):

Apologies for submitting this PR in such a poor state. Unlike YOSO and Nyströmformer, this implementation is exact: the output is mathematically identical to that produced by standard softmax attention. Although the research could be implemented as a standalone module, I'm really hoping to help users on low-memory systems use this code to fine-tune current pretrained models. @patrickvonplaten, how does that hope sit with you? After rereading the concerns, my plan is to move the code into the model files to satisfy the statements in the README. @LysandreJik, does this sound reasonable? I also try to stick to just one source file when developing and learning. My usual approach is to open a Python interpreter and do `print(inspect.getsource(modulefunc))` to quickly see abstraction implementations, which is admittedly not ideal. Abstraction can _really_ empower a domain: research and development accelerates when abstraction is made a norm.
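(For the thread: the interpreter habit mentioned above looks like the snippet below in practice. The target function is just an arbitrary example I picked; anything defined in Python works.)

```python
# Dump a function's source from the REPL without opening its file.
# PreTrainedModel.save_pretrained is only an arbitrary example target.
import inspect

from transformers.modeling_utils import PreTrainedModel

print(inspect.getsource(PreTrainedModel.save_pretrained))
```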