feat: output_attentions #1

_(Edited: I think the transformers flag is called `output_attentions`, not `return_weights`.)_

I'm looking into hacking some of the models in the transformers library to use this library for attention, and I don't see a way to support `output_attentions` yet. This is a flag passed in transformers; when it is set, the per-head attention weights (transformers reports the post-softmax probabilities) are preserved and returned to the user alongside the output.

I looked a little at implementing this in the torch backend, and I note that the scan() function provides for only a single tensor return value. It seems to me that scan() would be most clearly replaced by a for loop, but it could also be modified to handle tuples, or the weights could be collected as nonlocal data in some way instead of being returned through the chunk scanner. I've sketched both routes below.

I'm also not sure how the output would best be passed to the user. Providing an optional output parameter might make the most sense, although I'm not certain; the third sketch below shows what that could look like.
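On the tuple option: assuming the torch backend's scan() is a jax.lax.scan-style helper already built on a Python for loop (I haven't traced its exact signature), it could be generalized so each step may return a tuple of tensors, with every slot stacked the way the single output is today. A minimal sketch; all names here are illustrative, not the library's actual code:

```python
import torch


def scan(f, init, xs):
    """for-loop scan in the spirit of jax.lax.scan, extended so that the
    per-step output y may be a tuple of tensors; each tuple slot is
    stacked along a new leading axis, just as a single-tensor scan
    stacks its one output.

    Assumes xs has at least one step along its leading axis.
    """
    carry = init
    ys = None
    for i in range(xs.shape[0]):
        carry, y = f(carry, xs[i])
        y = y if isinstance(y, tuple) else (y,)  # normalize to a tuple
        if ys is None:
            ys = tuple([] for _ in y)            # one accumulator per slot
        for acc, t in zip(ys, y):
            acc.append(t)
    stacked = tuple(torch.stack(acc) for acc in ys)
    return carry, stacked[0] if len(stacked) == 1 else stacked


# e.g. a chunk scanner that emits both a chunk result and its weights:
def step(carry, x):
    return carry + 1, (x * 2.0, x * 3.0)  # two per-step outputs

carry, (doubled, tripled) = scan(step, 0, torch.arange(4.0))
assert carry == 4 and doubled.shape == tripled.shape == (4,)
```

Existing single-output callers would keep working unchanged, which is why I lean toward this over a hard fork to a plain loop.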
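And for the nonlocal-data route, here's a toy, numerically stable chunked attention (no masking) where the chunk scanner stashes each chunk's scores in a closure list instead of returning them, and the wrapper mirrors transformers' convention of returning a tuple when `output_attentions` is set. This is my own sketch under those assumptions, not the library's kernel:

```python
import math

import torch


def chunked_attention(q, k, v, chunk_size: int = 128,
                      output_attentions: bool = False):
    """Numerically stable attention, chunked over the key axis.

    With output_attentions=True, the chunk scanner appends each chunk's
    score matrix to a list in the enclosing scope (the "nonlocal data"
    route), sidestepping scan()'s single-tensor return entirely.
    """
    scale = 1.0 / math.sqrt(q.shape[-1])
    collected = []  # closure side channel for per-chunk scores

    # running max / numerator / denominator for a streaming softmax
    m = torch.full(q.shape[:-1] + (1,), -math.inf,
                   device=q.device, dtype=q.dtype)
    num = torch.zeros_like(q)
    den = torch.zeros(q.shape[:-1] + (1,), device=q.device, dtype=q.dtype)

    def scan_chunk(start):
        nonlocal m, num, den
        k_c = k[..., start:start + chunk_size, :]
        v_c = v[..., start:start + chunk_size, :]
        s = (q @ k_c.transpose(-2, -1)) * scale     # (..., q_len, chunk)
        if output_attentions:
            collected.append(s)                     # stash, don't return
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        p = torch.exp(s - m_new)
        corr = torch.exp(m - m_new)                 # rescale old partials
        num = num * corr + p @ v_c
        den = den * corr + p.sum(dim=-1, keepdim=True)
        m = m_new

    for start in range(0, k.shape[-2], chunk_size):
        scan_chunk(start)

    out = num / den
    if output_attentions:
        # full (..., q_len, kv_len) score matrix; softmax over the last
        # dim recovers the probabilities transformers reports
        return out, torch.cat(collected, dim=-1)
    return out
```

Calling `out, scores = chunked_attention(q, k, v, output_attentions=True)` and then `scores.softmax(dim=-1)` would reproduce the per-head probability matrices. The obvious caveat: materializing the concatenated scores reintroduces the O(n²) memory the chunking exists to avoid, so it should stay strictly opt-in.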
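For the optional output parameter, I'm imagining something numpy `out=`-like: the caller preallocates the (..., q_len, kv_len) buffer, so the quadratic memory only exists when it's asked for. In this sketch `weights_out` is a hypothetical name, and `torch.nn.functional.scaled_dot_product_attention` (PyTorch ≥ 2.0) stands in for this library's kernel:

```python
import math
from typing import Optional

import torch
import torch.nn.functional as F


def attention(q, k, v, chunk_size: int = 128,
              weights_out: Optional[torch.Tensor] = None) -> torch.Tensor:
    """If the caller supplies a preallocated (..., q_len, kv_len) buffer,
    the pre-softmax scores are written into it chunk by chunk; with no
    buffer, nothing of size q_len * kv_len is materialized here."""
    if weights_out is not None:
        scale = 1.0 / math.sqrt(q.shape[-1])
        for start in range(0, k.shape[-2], chunk_size):
            k_c = k[..., start:start + chunk_size, :]
            weights_out[..., start:start + k_c.shape[-2]] = (
                (q @ k_c.transpose(-2, -1)) * scale)
    # stand-in for this library's memory-efficient kernel
    return F.scaled_dot_product_attention(q, k, v)


# caller opts in by handing over the buffer:
q = k = v = torch.randn(2, 4, 256, 64)   # (batch, heads, seq, dim)
weights = torch.empty(q.shape[:-1] + (k.shape[-2],))
out = attention(q, k, v, weights_out=weights)
```

The flag-plus-tuple return matches what transformers callers already expect, while the buffer keeps the library's signature returning a single tensor; I don't have a strong preference between the two.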