- other works: Transformer-XL (segment-level recurrence for GPT-style sequence training), ERNIE-Doc (recurrence across segments designed to build up more complex state over time), sentence-transformers
these links were given in reply to the message of mine below in the eleutherai #research channel. i ended up having a couple of extensive and somewhat disjoint/strange interactions that could bear pasting, but i guess i am not doing that yet.

me: I spent a couple days working with RWKV. Quite exciting:
- if you are looking for cuda work, it would be helpful to quickly mention the optimization techniques you're thinking of. i'm more a software dev than an ml researcher, and i think i've read of the max_k approach in a paper but don't remember it offhand. [i think i found this; i'm looking at it a little. the max value can be pulled out of the quotient; see the stabilization sketch below]
- similarly, i briefly websearched for cuda divide-by-zero approaches, and basically the solution i saw mentioned was to mutate the output from inf or nan after the fact, so it would be helpful to know in which step these invalid values are generated, to weigh the impacts of the available approaches. [the sketch below also guards the denominator]
- as posted above in this chat, i took some time to derive how to use the GPT form in a recurrent manner, so as to train with long context right at the start. i'm curious why you didn't do this in your existing training code. [see the recurrent-form sketch below]
- i'm not an ml researcher, but i'm worried that the architecture of the model appears to refrain from passing the output of a layer to the input of the analogous layer in a further recurrent pass. i'd love a mention of where to learn more about this concern. it seems the lack of recursion must prevent the model from summarising its memories into complex meaning over time, or from pulling relevant information back out of such summaries, and must require a lot more parameters to handle long context lengths. it seems it is decaying its hidden states, with inference complexity limited by model depth, in a unidirectional manner. if i were rearchitecting it, i would try having a "recursive" part where output ends up reaching input again over time; in the gpt form the depth of the recursive construct would then be as long as the context length, to keep the two forms equivalent. i'm thinking of shortening the context length, or maybe the recursive part could be made very simple, or given dropout so it runs less often. [see the feedback sketch below]

reply:
- yeah we need both [pulling out max_k] and [fixing divide-by-zero] for it to work.
- yeah you can do that and it will be a bit like Transformer-XL. on the other hand, train + fine-tune to a longer ctxlen is very fast and easy too
- that will be like https://arxiv.org/pdf/2012.15688.pdf and yes you can certainly do that. i am using some simple training code at this moment.
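
a minimal sketch of the max_k idea from the first two bullets, assuming the kernel computes a ratio of exponentially weighted sums: pull the largest exponent out of both the numerator and the denominator so exp() never overflows, and clamp the denominator up front instead of rewriting inf/nan afterwards. the function name and exact formula are mine for illustration, not the actual RWKV cuda code.

```python
import torch

def stabilized_ratio(k, v, w):
    """
    compute sum_i(w_i * exp(k_i) * v_i) / sum_i(w_i * exp(k_i)) without
    letting exp() overflow: divide numerator and denominator by exp(max(k)),
    which cancels in the ratio. k, v, w are 1-d tensors of equal length,
    w being whatever positive time-decay weights apply at this position.
    """
    max_k = k.max()
    e = w * torch.exp(k - max_k)   # every exponent is <= 0, so exp() <= 1
    num = (e * v).sum()
    den = e.sum()
    # guard against a zero denominator (everything underflowed) up front,
    # rather than mutating inf/nan in the output after the fact
    return num / den.clamp_min(torch.finfo(den.dtype).tiny)
```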
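
and a sketch of reading the GPT form recurrently, as in the third bullet: with an exponential time decay, the weighted sums can be carried forward as a small per-channel state (numerator, denominator, running max exponent), so a long sequence can be processed in chunks while keeping the full context. the real time-mix has extra terms (e.g. a separate bonus for the current token), so treat this as the general pattern rather than the model's actual code; `recurrent_decayed_mix` is a made-up name.

```python
import torch

def recurrent_decayed_mix(k, v, w, state=None):
    """
    evaluate out_t = sum_{i<=t} exp(-w*(t-i) + k_i) * v_i
                   / sum_{i<=t} exp(-w*(t-i) + k_i)
    one step at a time, carrying (num, den, max_exp) between segments.
    k, v: (T, C) tensors; w: positive per-channel decay, shape (C,).
    """
    T, C = k.shape
    if state is None:
        num = torch.zeros(C)
        den = torch.zeros(C)
        max_exp = torch.full((C,), -1e38)   # running max of the exponents seen so far
    else:
        num, den, max_exp = state

    out = torch.empty(T, C)
    for t in range(T):
        decayed = max_exp - w                  # decaying the sums by exp(-w) shifts the stored max
        new_max = torch.maximum(decayed, k[t])
        e_old = torch.exp(decayed - new_max)   # rescale the carried sums
        e_new = torch.exp(k[t] - new_max)      # contribution of the current token
        num = e_old * num + e_new * v[t]
        den = e_old * den + e_new
        max_exp = new_max
        # one of e_old / e_new is always exp(0) = 1, so den stays >= 1 here
        out[t] = num / den
    return out, (num, den, max_exp)
```

carrying `state` across chunks (e.g. `y, state = recurrent_decayed_mix(k_chunk, v_chunk, w, state)` inside a loop over segments) is what lets training see long context right at the start, and is the part the reply compares to Transformer-XL.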
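
finally, a toy sketch of the "recursive"/feedback idea in the last bullet: mix the previous timestep's top-layer output back into the bottom-layer input so information can climb the whole stack repeatedly over time. nothing here is RWKV code; `FeedbackWrapper`, `core`, and `feedback_proj` are invented names, and this is just one of several ways such a loop could be wired.

```python
import torch
import torch.nn as nn

class FeedbackWrapper(nn.Module):
    """
    blend the previous timestep's top-layer output into the bottom-layer
    input, so state can pass through the whole stack again on later steps
    instead of only decaying within each layer.
    """
    def __init__(self, core: nn.Module, dim: int):
        super().__init__()
        self.core = core                              # any (B, C) -> (B, C) stack of layers
        self.feedback_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x_t, prev_top=None):
        if prev_top is not None:
            x_t = x_t + self.feedback_proj(prev_top)  # output reaches input again over time
        top = self.core(x_t)
        return top, top                               # (output, feedback state for the next step)
```

unrolled into the gpt form, T steps of this loop give an effective depth that grows with the context length, which is the equivalence the message above is pointing at; making `feedback_proj` very small, or dropping it out on most steps, matches the "made very simple / given dropout to run less" idea.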