baffo32 2022-10-06 00:09 UTC

There’s a paper that has been quietly circulating that speeds up model training by up to 370x, flattens architectures to a single layer, cuts memory requirements by 10x, and effectively supports long context: . It’s slated for the upcoming NeurIPS 2022. I’ve been turning it over in my mind a little, and I think any implementation of this paper at all, by anybody at all, would be so useful and so likely to be appreciated that it’s worth attempting. The concept of the paper is not complex, but it’s quite hard for me to approach, as usual.