i am risking my focus to spend a little time learning the difference between local attention and transient global attention in longt5. as a side note, it's notable that people seem to be using longt5 heavily.