[ot][spam][journal] last few days of ML work for me

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Tue Jun 21 05:51:51 PDT 2022


> - other works: Transformer-XL (recurrent gpt sequence training), ERNIE
> (recursive layer to develop complex state), sentence-transformers

these links were given in reply to the message of mine below, in
eleutherai #research. i ended up having a couple of extensive and
somewhat disjoint/strange interactions that could bear pasting, but i
guess i am not doing that yet.

me:

I spent a couple days working with RWKV. Quite exciting:
- if you are looking for cuda work, it would be helpful to quickly
mention the optimization techniques you're thinking of. i'm more of a
software dev than an ml researcher, and i think i've read of the max_k
approach in a paper but don't remember it offhand. [i think i found
it; i'm looking at it a little. the max value can be pulled out of
the quotient, since exp(max) cancels from numerator and denominator.]
- similarly, i briefly websearched for cuda divide-by-zero approaches,
and basically the only solution i saw mentioned was to patch the
output after the fact, replacing the inf or nan values. it would be
helpful to know in which step these invalid values are generated, to
weigh the impact of the available approaches. (a toy sketch covering
this and the max_k point follows after this list.)
- as posted above in this chat, i took some time to derive how to use
the GPT form in a recurrent manner, so as to train with long context
right from the start. i'm curious why you didn't do this in your
existing training code. (a second sketch after this list shows the
chunked recurrent idea.)
- i'm not an ml researcher, but i'm worried that the architecture of
the model appears to refrain from passing the output of a layer back
to the input of the analogous layer on a later recurrent pass. i'd
love a pointer to where i can learn more about this concern. it seems
the lack of recursion must prevent the model from summarising its
memories into complex meaning over time, or from pulling relevant
information back out of such summaries, and must require a lot more
parameters to handle long context lengths. as it stands, it decays its
hidden states with inference complexity limited by model depth, in a
unidirectional manner. if i were rearchitecting it, i would try adding
a "recursive" part where output ends up reaching input again over
time; in the gpt form, the depth of that recursive construct would
then be as long as the context length, to keep the two forms
equivalent. i'm thinking of shortening up the context length, or maybe
the recursive part could be made very simple, or given dropout to run
less. (a toy sketch of this feedback idea appears after the reply, at
the end of this message.)
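
for concreteness, a tiny numpy sketch of the first two points above.
it is not the rwkv cuda kernel (the names and the toy weighted average
are my own), but the trick is the same: exp(max) cancels from
numerator and denominator, and in this toy the shifted denominator is
at least 1, which also removes the 0/0 case.

import numpy as np

def naive_weighted_average(scores, values):
    # overflows to inf once scores get large, then inf/inf gives nan
    w = np.exp(scores)
    return (w * values).sum() / w.sum()

def stable_weighted_average(scores, values):
    # exp(m) factors out of both numerator and denominator, so the
    # quotient is unchanged; after subtracting the max, one term is
    # exp(0) = 1, so the denominator cannot underflow to zero here.
    m = scores.max()
    w = np.exp(scores - m)
    return (w * values).sum() / w.sum()

scores = np.array([800.0, 801.0, 802.0])  # exp() of these overflows float64
values = np.array([1.0, 2.0, 3.0])
print(naive_weighted_average(scores, values))   # nan
print(stable_weighted_average(scores, values))  # ~2.575, finite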
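
and a toy sketch of the chunked recurrent idea from the third point.
this is not rwkv's actual time-mixing (no keys, values, or per-channel
decays here), just a plain exponentially-decayed sum computed both
ways, with the recurrent state carried across chunks so short training
chunks still see the long context.

import numpy as np

decay = 0.9                         # made-up scalar decay, exp(-w) in spirit
x = np.random.randn(16)

# parallel / gpt form: every position sums over its whole past at once
t = np.arange(len(x))
weights = np.tril(decay ** (t[:, None] - t[None, :]))
parallel = weights @ x

# recurrent / rnn form: one scalar of state, fed chunk by chunk
def run_chunk(chunk, state):
    out = np.empty_like(chunk)
    for i, xi in enumerate(chunk):
        state = decay * state + xi
        out[i] = state
    return out, state

state = 0.0
outputs = []
for chunk in np.split(x, 4):            # four short chunks, not one long one
    y, state = run_chunk(chunk, state)  # state carries context between chunks
    outputs.append(y)
recurrent = np.concatenate(outputs)

print(np.allclose(parallel, recurrent))  # True: same result either way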

reply:

- yeah we need both [pull off max_k] and [fix divide-by-zero] for it to work.
- yeah you can do that and it will be a bit like transformer-XL.  on
the other hand, train+fine-tune to longer ctxlen is very fast and easy
too
- that will be like https://arxiv.org/pdf/2012.15688.pdf and yes you
can certainly do that. i am using some simple training code at this
moment.
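
since the linked paper is about this kind of layer-level recurrence,
here is the toy feedback sketch promised above. it is my own
construction, not anything from the rwkv code: the top layer's last
output is mixed back into the bottom layer's input on the next chunk,
so information can pass through the depth repeatedly over time instead
of only once.

import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 4, 8
layer_weights = [rng.standard_normal((width, width)) * 0.1
                 for _ in range(n_layers)]

def layer(w, h):
    return np.tanh(h @ w)            # stand-in for a real block

chunks = [rng.standard_normal((5, width)) for _ in range(3)]
feedback = np.zeros(width)           # top-layer output from the previous chunk

for chunk in chunks:
    h = chunk + feedback             # bottom layer sees last chunk's top output
    for w in layer_weights:
        h = layer(w, h)
    feedback = h[-1]                 # summary carried into the next chunk
print(feedback[:3])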

