[ot][spam][journal] last few days of ML work for me
this is a snapshot of a channel I made in a discord called 'Land of AI'. I'm an honorary moderator in the discord, but never really come in. I used to spam the chatbots to help myself move through my issues. I made this channel in my subsection so as to work on a machine learning model I saw in online feeds.

I've matured a little in my spamming. Rather than sending flurries of stuff, I edited my posts over the course of hours, shortening and consolidating the information. Maybe that won't stay my behavior, who knows. But others sometimes express value in what I say, so I tried to make it more organised. This is what it looks like now:

baffo32 — 06/18/2022
an independent researcher from china shared a model in #research in eleutherai that downscales and has unlimited context. it's an rnn that reuses state like gptb. they made a demo that infers clientside in a web browser on a phone.

i'm thinking of seeing what would be needed to make it into a text adventure engine. it could retain consistent room and map state and save/load games, across unlimited game size, if finetuned to. it of course, like all these models, would be useful for many other wonderful and scary things, and likely will be, as others are. it's also small enough that i can do more on my low-end systems. (the system i was building, i broke the motherboard before i got it to work.)

the old version of the model is at https://github.com/BlinkDL/RWKV-LM and the newer version is at https://github.com/BlinkDL/RWKV-v2-RNN-Pile . the main repo is more organised for reuse, but the internal layout is a little different, and it appears to be missing a number of optimizations (not certain) and sugar for things like tokenization.

the pile-trained model has two modes, an RNN mode and a GPT-like mode, which I believe processes context faster because it doesn't have to predict each next state (not sure). They developed with the RNN mode, so it is more put-together, but it's slower to read a prompt. You can cache prompts by storing the state generated after reading them.

This is the run step of the RNN model from the second repo (src.model.RWKV_RNN.run). It produces the next state from the preceding one. I'm adding comments to describe each step.

    # ctx is provided as a list of tokens
    def run(self, ctx):
        # self.w is just an object to hold all the weights
        w = self.w

        # ignore all tokens but the last, and convert it to embeddings
        x = w.emb.weight[ctx[-1]]

        # pass the values through every layer
        # every step updates the state, held as self.xx, self.aa, and self.bb
        for i in range(n_layer):
            x = self.LN(x, w.blocks[i].ln1)
            x = x + self.SA(x, w.blocks[i].att, f'att.{i}')
            x = self.LN(x, w.blocks[i].ln2)
            x = x + self.FF(x, w.blocks[i].ffn, f'ffn.{i}')

        # i'm guessing these final weights are used only for producing vocab logits;
        # this doesn't appear to update the state, just converts to words
        x = self.LN(x, w.ln_out)
        x = w.head.weight @ x

        # looking at the example, it appears these are log probs that can be softmaxed
        # or argmaxed to pick the most likely next token.
        # convert from tensor to python list before returning
        x = x.tolist()
        return x

baffo32 — 06/18/2022
the dev states that the GPT model can be used to quickly produce the states, but given they are held in a different manner it isn't immediately clear to me how to do that. looking at the training code, it appears the GPT model is used for training (providing for more batched processing), and that explains its increased complexity.
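[ed: a rough sketch of what caching a prompt's state might look like with this interface; untested, and the constructor arguments, .tokenizer, and the .xx/.aa/.bb attribute names are assumptions taken from the run() walkthrough above and the draft demo script below:]

    import copy
    import torch
    from src.model import RWKV_RNN

    model = RWKV_RNN(MODEL_NAME='all-10803')
    tokenizer = model.tokenizer

    # read the prompt one token at a time so the recurrent state accumulates
    for token_id in tokenizer.encode('Once upon a time,'):
        logits = model.run([token_id])

    # snapshot the per-layer state and write it out; reloading it later would let
    # generation continue without re-reading the prompt (or act as a "save game"
    # for the text adventure idea)
    state = copy.deepcopy({'xx': model.xx, 'aa': model.aa, 'bb': model.bb})
    torch.save(state, 'prompt-state.pth')

    # restore
    state = torch.load('prompt-state.pth')
    model.xx, model.aa, model.bb = state['xx'], state['aa'], state['bb']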
baffo32 — 06/18/2022
draft rnn demo script

    git clone https://github.com/BlinkDL/RWKV-v2-RNN-Pile
    cd RWKV-v2-RNN-Pile
    wget https://github.com/BlinkDL/RWKV-v2-RNN-Pile/releases/download/20220615-10803...
    unzip 20220615-10803.zip  # all-10803.pth
    python3 test.py

    #!/usr/bin/env python3
    import sys
    from src.model import RWKV_RNN

    model = RWKV_RNN(MODEL_NAME='all-10803')  # loads contents of all-10803.pth
    tokenizer = model.tokenizer

    def process(token_id):
        # output the token
        token = tokenizer.decode(token_id)
        sys.stdout.write(token)
        sys.stdout.flush()
        # pass it through the model
        logits_list = model.run([token_id])
        # use max(range, key) as an argmax over a python list, i.e. greedy sampling
        token_id = max(range(tokenizer.vocab_size), key=lambda token_id: logits_list[token_id])
        return token_id

    # read context (prompt)
    print('Enter the context, terminate with EOF (^D):')
    token_id = 0
    for next_token_id in tokenizer.encode(sys.stdin.read()):
        process(token_id)
        token_id = next_token_id

    # generate
    while True:
        token_id = process(token_id)

example use (or run it bare and hit ctrl-D to terminate the context input):

    echo -n 'Once upon a time,' | python3 test.py

baffo32 — 06/18/2022
So, something you could do with this is have it e.g. process an entire encyclopedia, and then save the state (.xx, .aa, .bb) for reuse later. It would also need to be finetuned to care about information that far apart: i dunno, but i'm guessing it would work to finetune it while it reads, if the text is long enough. the model would also need to be large enough to comprehend the kind and variety of information you want. i tried this locally on a file and it tends to just repeat its last completion still, with basic pattern changes. i don't really know what i'm doing. i think the researcher says in their repo that you need to finetune it for longer contexts. anyway, next i imagine is gpt, maybe training

[:party:] i made a model run

<discord owner> — 06/18/2022
@baffo32 thank you that is awesome thanks for sharing your findings

[ed: i'm kind of awkward here, these are the first things said in months]

baffo32 — 06/18/2022
notes on finding the states in the faster GPT form:
- states are stored like torch parameters, indexed by name, associated with "att" (TimeMix) and "ffn" (ChannelMix).
- xx is the input passed to each named layer
- a and b in the RNN form are wkv and wk in the GPT form. aa and bb, the token-to-token state, roughly coincide with A[t] and B[t] in the math docs, but the RNN code is a better guide; it is consistent with the GPT code. they're the time-weighted coefficients without the time_first coefficient, which is only used for output.
- the aa and bb state outputs are only used internally in TimeMix's conv1d, and must be calculated separately to export them, and summed into the conv1d to import them. when this is done, the GPT model can exchange state with the recurrent model in either direction, and can train rapidly for unbounded contexts in a recurrent manner (might need time-decay tweaking to really be unbounded, dunno).
- gpt->rnn, in TimeMix.forward:

      aa = (torch.exp(torch.exp(self.time_decay) * self.time_curve)[None,:,-(T):-1] * kv[:,:,:-1]).sum(dim=-1) + kv[:,:,-1]
      bb = (torch.exp(torch.exp(self.time_decay) * self.time_curve)[None,:,-(T):-1] * k[:,:,:-1]).sum(dim=-1) + k[:,:,-1]

- rnn->gpt, not quite figured yet. i found this before but have not reconstructed it.

above took all day. i had to reread this stuff over and over and over again.
- modified the time_shifts so as to shift in xx

baffo32 — 06/19/2022
- simplification of gpt->rnn:

      bb = (torch.exp(torch.exp(self.time_decay) * self.time_curve)[None,:,-(T):-1] * k[:,:,:-1]).sum(dim=-1) + k[:,:,-1]
      bb = (wk[...,-1] - torch.exp(self.time_first[:,0]) * k[...,-1]) * torch.exp(-torch.exp(self.time_decay[:,0])) + k[...,-1]
      bb = (wk[...,-1] - w[...,0,-1] * k[...,-1]) * w[...,0,-3] + k[...,-1]
      aa = (wkv[...,-1] - w[...,0,-1] * kv[...,-1]) * w[...,0,-3] + kv[...,-1]

- rnn->gpt:

      # replace both time_shifts with:
      torch.cat([xx[...,None,:], x], dim=-2)[...,:-1,:]
      # then, after kv is calculated:
      k = torch.cat([bb[...,None], k], dim=-1)
      kv = torch.cat([aa[...,None], kv], dim=-1)
      # uses of T need to be adjusted to handle the increased kv sizes, and wk and
      # wkv shrunk to remove the extra value from the start

- when doing this, it was quite helpful to set torch.set_printoptions(precision=16); this works even in the debugger, and shows inaccuracies much earlier
- fork draft: https://github.com/xloem/RWKV-v2-RNN-Pile/commit/0e3e8a4c262855c7381a4678f8b...
- it turns out that there are a number of parallel implementations of the model. the one i updated is the one used for inference, not training. oops.
- the model has a context length limit on it that is pending applying some tricks to the cuda kernel used in training. the documentation of the tricks is vague, although derivable; it says to contact the author to contribute. comments in src/model_train.py
- another work was linked in #research, https://magenta.tensorflow.org/perceiver-ar

baffo32 — Yesterday at 7:18 AM
- I'm chatting a little with the model dev and the other researchers in eleutherai #research. there's an existing script for validating model output; i had made my own, and should switch to the existing one.
- other works: Transformer-XL (recurrent gpt sequence training), ERNIE (recursive layer to develop complex state), sentence-transformers (memory lookup, has a function to do this, mentioned months ago on this discord). integrating the concepts of the first two could be more familiar to me, since i've already been thinking on the topics.
- I'm working a little on implementing the removal of the clamp() call from this rwkv model i've been journaling about, the presence of which is limiting context length. The optimization in question appears to be a usual one done in implementations of softmax (haven't verified): k = exp(k - max(k)) instead of k = exp(k), to prevent floating point overflow in exp. this becomes equivalent to rwkv = r*(wkv*exp(-max(k)))/(wk*exp(-max(k))), where the common factor cancels on its own. (a toy sketch of this is further below, after these notes.)
- I tried preventing floating point errors by setting the mantissa of the maximum to 1/2 (binary 0.1000...) as such:

      # break into mantissa, exponent
      max_k = torch.frexp(max_k)
      # reconstruct with mantissa of 0.5 so no rounding is needed when used
      max_k = torch.ldexp(torch.tensor(0.5, device=k.device), max_k.exponent)

  to get the same output as the original model when the torch.exp(k - max_k) change is made; EDIT: not sure this is doing anything yet, another error remains.
- I've found I can preserve accuracy across recurrent runs enough to see success by calculating this_aa = last_aa * torch.exp(last_max_k - this_max_k) to shift max_k's; there might be a better solution here that preserves more accuracy.
- with those two it seems to work correctly for a single gpt recurrence. next: bugs, optimization.

baffo32 — Yesterday at 8:27 AM
[
- i'm losing it and may choose to not pursue completion of this work, unsure.
  i don't want to start spamming #research in a psychotic state of mind.
- the frexp snippet above might help other models and projects if they aren't aware of it yet; the calculation it affects is within every softmax operation.
- i have a multi-hour appointment today
]
- i somehow resolved some errors via fudging while troubleshooting issues, and found a cool thing around 15:00 UTC; the model retains memories much better with the max_k fix, and can form longer/more-accurate completions from its training data that it could not form before the change, generalising a little, as the dev expected in their claims about its value
- shared my max_k solution in #research around 17:25 UTC
- spent a long time trying to push it through normalised tests after a response in #research to that effect. i'm relatively sure it gets the exact same scores as the original model
- regarding divide-by-zero, i'm thinking on how that space is a little comparable to the max_k space; both numerator and denominator have already been scaled by max_k. hence adding the epsilon to the denominator, which it is currently doing, changes the results a little (small sketch further below)
- torch has a nan_to_num function that can replace nan with zero: https://pytorch.org/docs/stable/generated/torch.nan_to_num.html
- 01:00 UTC, i've made a branch in my fork called 'longctx' that contains a draft of code for training on data with unbounded length, using gpt-recurrence and the max_k/clamp and epsilon-removal improvements

baffo32 — Today at 8:24 AM
- petering off. the dev invited me to their telegram chat. not much activity there. i asked about the huggingface integration work, which would be more productive for me to engage with. no reply yet. earlier, they said the branch i am working on is the one to share changes in.
- when i set up telegram, it thought i was somebody from taiwan, maybe a reused phone number. i didn't understand, and people were talking to me like my name was something else. (i have visual cortex issues and didn't notice all the chat rooms were different from mine.) i messaged telegram support (no reply, although they read the message) and switched accounts; a bit of scuffling, deleted a message i had posted from the other account. left things in a bit of a confused state.
- i got my branch running inference correctly using the cuda training accelerations, and added scripts. haven't tested training itself yet, nor decided on how to test it. [they have a handmade cuda kernel they use for training with] [could put shell example here of cloning my branch, compressing a prompt to a short state file, and completing from the state file using gpt, gpt-cuda, and rnn, all with same results]
- maybe first work on verifying numerical accuracy of gpt-cuda, if following through with the plan to
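[ed: a tiny standalone toy of the max_k idea above, with made-up numbers; my own illustration, not code from the repo:]

    import torch

    # toy values: one large logit is enough to overflow exp() in float32
    k = torch.tensor([5.0, 80.0, 120.0])
    v = torch.tensor([1.0, 2.0, 3.0])

    naive_wkv = (torch.exp(k) * v).sum()   # inf: exp(120) overflows float32
    naive_wk = torch.exp(k).sum()          # inf
    print(naive_wkv / naive_wk)            # nan

    # subtract the maximum first, as in a numerically stable softmax;
    # the common factor exp(-max_k) appears in both numerator and denominator,
    # so the ratio wkv/wk is unchanged
    max_k = k.max()
    wkv = (torch.exp(k - max_k) * v).sum()
    wk = torch.exp(k - max_k).sum()
    print(wkv / wk)                        # finite, ~3.0, dominated by v[2]

    # across recurrent chunks the stored state was scaled by the previous max_k,
    # so when the running maximum changes it can be rescaled, roughly as in the
    # note above: this_aa = last_aa * torch.exp(last_max_k - this_max_k)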
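[ed: and a similar toy for the divide-by-zero note, contrasting the epsilon-in-denominator approach the code currently uses with patching invalid outputs afterwards via torch.nan_to_num; again made-up numbers, not repo code:]

    import torch

    r = torch.tensor([0.5, 0.5, 0.5])
    wkv = torch.tensor([2.0, 0.0, 0.0])
    wk = torch.tensor([4.0, 0.0, 0.0])    # wk underflowing to zero gives 0/0 -> nan

    # current approach: add an epsilon to the denominator, which nudges every
    # result slightly, not just the degenerate entries
    eps = 1e-8
    print(r * wkv / (wk + eps))           # ~[0.2500, 0.0000, 0.0000]

    # alternative: divide directly, then patch the invalid outputs afterwards
    # (nan_to_num can also replace inf via its posinf/neginf arguments)
    out = r * wkv / wk                    # [0.2500, nan, nan]
    print(torch.nan_to_num(out, nan=0.0)) # [0.2500, 0.0000, 0.0000]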
- other works: Transformer-XL (recurrent gpt sequence training), ERNIE (recursive layer to develop complex state), sentence-transformers
these links were given in reply to the below message of mine in eleutherai #research. i ended up having a couple of extensive and somewhat disjoint/strange interactions that could bear pasting, but i guess i am not doing that yet.

me:
I spent a couple days working with RWKV. Quite exciting:
- if you are looking for cuda work, it would be helpful to quickly mention the optimization techniques you're thinking of. i'm more a software dev than an ml researcher, and i think i've read of the max_k approach in a paper but don't remember it offhand. [i think i found this, i'm looking at it a little; the max value can be pulled out of the quotient]
- similarly, i briefly websearched for cuda divide-by-zero approaches, and basically the solution i saw mentioned was to mutate the output from inf or nan after the fact, so it would be helpful to know in which step these invalid values are generated, to weigh the impacts of the available approaches
- as posted above in this chat, i took some time to derive how to use the GPT form in a recurrent manner, so as to train with long context right at the start. i'm curious why you didn't do this in your existing training code.
- i'm not an ml researcher, but i'm worried that the architecture of the model appears to refrain from passing the output of a layer to the input of the analogous layer in a further recurrent pass. i'd love a mention of where to learn more about this concern. it seems the lack of recursion must prevent the model from summarising its memories into complex meaning over time, or from pulling relevant information out of such summaries, and require a lot more parameters to handle long context lengths. it seems it is decaying its hidden states with inference complexity limited by model depth, in a unidirectional manner. if i were rearchitecting it, i would try having a "recursive" part where output ends up reaching input again over time, and then in the gpt form the depth of the recursive construct would be as long as the context length, to make them equivalent. i'm thinking of shortening up the context length, or maybe the recursive part could be made very simple, or given dropout so it runs less often.

reply:
- yeah we need both [pull off max_k] and [fix divide-by-zero] for it to work.
- yeah you can do that and it will be a bit like transformer-XL. on the other hand, train+fine-tune to longer ctxlen is very fast and easy too
- that will be like https://arxiv.org/pdf/2012.15688.pdf and yes you can certainly do that. i am using some simple training code at this moment.