22 Jul 2022
7:36 a.m.
long story short: per the LongT5 paper (https://arxiv.org/pdf/2112.07916.pdf), transient global (TGlobal) attention is a new attention mechanism introduced for LongT5, and it appears to reliably outperform local attention in the same architecture. that result, combined with the finetunes i saw on the Hugging Face Hub using TGlobal attention, seems like enough reason to try the TGlobal models without fully understanding them in the moment.