30 Jan
1:54 p.m.
I've realised that the return_attentions addition I attempted to make to memory-efficient-transformers may have completely negated the memory savings described in the research paper, because it allocates a matrix of size queries x keys that persists for the entire execution. If so, my pull request could be confusing and harmful to developers. I should re-read the paper to understand how much memory is saved and whether my feature fits the algorithm at all. If it doesn't, it would simply be disabled in the transformers library whenever chunking is used.
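A rough back-of-the-envelope sketch of the concern, in plain Python. The sequence length, chunk sizes, and float32 element size are all assumptions picked for illustration, and the helper names are hypothetical, not taken from the paper or from either library. It only compares the size of a full queries x keys matrix against a single chunk-sized tile of scores, which is my understanding of what chunked attention keeps live at once (per head, per batch element).

```python
def full_attention_bytes(n_queries, n_keys, bytes_per_element=4):
    """Bytes needed to hold a complete (queries x keys) attention matrix."""
    return n_queries * n_keys * bytes_per_element


def chunk_tile_bytes(query_chunk, key_chunk, bytes_per_element=4):
    """Bytes for one (query_chunk x key_chunk) tile of attention scores."""
    return query_chunk * key_chunk * bytes_per_element


# Assumed sizes, for illustration only.
seq_len = 16384      # number of queries and keys
query_chunk = 1024   # hypothetical query chunk size
key_chunk = 4096     # hypothetical key chunk size

full = full_attention_bytes(seq_len, seq_len)
tile = chunk_tile_bytes(query_chunk, key_chunk)

print(f"full matrix: {full / 2**20:.0f} MiB")
print(f"one tile:    {tile / 2**20:.0f} MiB")
print(f"ratio:       {full // tile}x")
```

If these assumptions are roughly right, returning the full attention matrix would put the quadratic allocation straight back, which is why disabling the feature under chunking looks like the safer option.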