[ot][spam][journal] last few days of ML work for me

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Tue Jun 21 05:40:24 PDT 2022

this is a snapshot of a channel I made in a discord called 'Land of
AI'. I'm an honorary moderator in the discord, but never really come
in. I used to spam the chatbots to help myself move through my issues.
I made this channel in my subsection so as to work on a machine
learning model I saw in online feeds.

I've matured a little in my spamming. Rather than sending flurries of
stuff, I edited my posts throughout hours, shortening and
consolidating the information. Maybe that won't stay my behavior, who
knows. But others sometimes express value in what I say, so I tried to
make it more organised. This is what it looks like now:

baffo32 — 06/18/2022
an independent researcher from china shared a model in #research in
eleutherai that downscales and has unlimited context. it's an rnn that
reuses state like gptb. they made a demo that infers clientside in a
web browser on a phone.

i'm thinking of seeing what would be needed to make it into a text
adventure engine. it could retain consistent room and map state and
save/load games, across unlimited game size, if finetuned to. it of
course like all these models would be useful for many other wonderful
and scary things, and likely will be, as others are.

it's also small enough i can do more on my low-end systems. the system
i was building, i broke the motherboard before i got it to work.

the old version of the model is at https://github.com/BlinkDL/RWKV-LM
and the newer version is at
https://github.com/BlinkDL/RWKV-v2-RNN-Pile . the main repo is more
organised for reusing but the internal layout is a little different,
and appears to be missing a number of optimizations, not certain, and
sugar for things like tokenization.

the pile-trained model has two modes, an RNN mode and a GPT-like mode
which I believe processes context faster because it doesn't have to
predict the next state, not sure. They developed with the RNN mode, so
it is more put-together, but it's slower to read a prompt. You can
cache prompts by storing the state generated after.
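
A minimal sketch of the prompt-caching idea, assuming a toy recurrent model: `ToyRNN` and its single `state` number are hypothetical stand-ins and are not the RWKV code. It only illustrates that a snapshot taken after reading a prompt lets a fresh model continue identically without re-reading the prompt.

```python
import copy

class ToyRNN:
    """Hypothetical stand-in for a recurrent model; not the RWKV code."""
    def __init__(self):
        self.state = 0.0  # stands in for the real per-layer state

    def run(self, token_id):
        # fold the token into the state and return a fake score
        self.state = 0.5 * self.state + token_id
        return self.state

prompt = [3, 1, 4, 1, 5]  # made-up token ids

# slow path: read the whole prompt once, then snapshot the state
model = ToyRNN()
for token_id in prompt:
    model.run(token_id)
cached_state = copy.deepcopy(model.state)

# fast path: a fresh model restored from the snapshot skips the prompt
restored = ToyRNN()
restored.state = copy.deepcopy(cached_state)

# both models now produce the same output for the same next token
out_original = model.run(9)
out_restored = restored.run(9)
```

With a real model the snapshot would be whatever the model keeps between tokens, and it could be written to disk rather than held in memory.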

This is the run step of the RNN model from the second repo
(src.model.RWKV_RNN.run). It produces the next state from the
preceding one. I'm adding comments to describe each step.

    # ctx is provided as a list of tokens
    def run(self, ctx):
        # self.w is just an object to hold all the weights
        w = self.w
        # ignore all tokens but the last, and convert it to embeddings
        x = w.emb.weight[ctx[-1]]
        # pass the values through every layer
        # every step updates the state, held as self.xx, self.aa, and self.bb
        for i in range(n_layer):
            x = self.LN(x, w.blocks[i].ln1)
            x = x + self.SA(x, w.blocks[i].att, f'att.{i}')
            x = self.LN(x, w.blocks[i].ln2)
            x = x + self.FF(x, w.blocks[i].ffn, f'ffn.{i}')
        # i'm guessing these final weights are used only for producing
        # vocab logits
        # this doesn't appear to update the state, the conversion to words
        x = self.LN(x, w.ln_out)
        x = w.head.weight @ x
        # looking at the example, it appears these are log probs that
        # can be softmaxed or argmaxed to pick the most likely next token
        # convert from tensor to python list before returning
        x = x.tolist()
        return x
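
Since run() hands back a plain Python list, both choices mentioned in the comments can be done without torch. This is a small sketch on made-up scores for a 4-token vocabulary; `softmax` and `argmax` here are illustrative helpers, not functions from the repo.

```python
import math

def softmax(logits):
    # subtract the max first so exp() cannot overflow
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    # index of the largest entry in a python list
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.0, 3.0, 0.5, 2.0]      # made-up scores, one per token id
probs = softmax(logits)            # probabilities that sum to 1
greedy_token_id = argmax(logits)   # greedy pick: the highest-scoring token
```

Softmax does not change the ordering, so argmax over the raw scores and over the probabilities picks the same token.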

baffo32 — 06/18/2022

he states that the GPT model can be used to quickly produce the
states, but given they are held in a different manner it isn't
immediately clear to me how to do that. looking at the training code,
it appears the GPT model is used for training (providing for more
batched processing), and that explains its increased complexity.

baffo32 — 06/18/2022
draft rnn demo script

git clone https://github.com/BlinkDL/RWKV-v2-RNN-Pile
cd RWKV-v2-RNN-Pile
wget https://github.com/BlinkDL/RWKV-v2-RNN-Pile/releases/download/20220615-10803/20220615-10803.zip
unzip 20220615-10803.zip # all-10803.pth

#!/usr/bin/env python3
import sys
from src.model import RWKV_RNN
model = RWKV_RNN(MODEL_NAME='all-10803') # loads contents of all-10803.pth
tokenizer = model.tokenizer

def process(token_id):
  # output token
  token = tokenizer.decode(token_id)
  print(token, end='', flush=True)
  # pass through the model
  logits_list = model.run([token_id])
  # use max(range,key) as an argmax for python list to do greedy sampling
  token_id = max(range(tokenizer.vocab_size), key=lambda token_id: logits_list[token_id])
  return token_id

# read context (prompt)
print('Enter the context, terminate with EOF (^D):')
token_id = 0
for next_token_id in tokenizer.encode(sys.stdin.read()):
    token_id = next_token_id

# generate
while True:
  token_id = process(token_id)

example use, or hit ctrl-D to terminate context input

echo -n 'Once upon a time,' | python3 test.py

baffo32 — 06/18/2022
So, something you could do with this, is have it e.g. process an
entire encyclopedia, and then save the state (.xx, .aa, .bb) for reuse
later.  It would also need to be finetuned to care about information
that far apart: i dunno, but i'm guessing it would work to finetune it
while it reads it if it's long enough. the model would also need to be
large enough to comprehend the kind and variety of information you

i tried this locally on a file and it tends to just repeat its last
completion still, with basic pattern changes. i don't really know what
i'm doing. i think the researcher says in their repo that you need to
finetune it for longer contexts. anyway, next i imagine is gpt, maybe

[:party:]  i made a model run

<discord owner> — 06/18/2022
@baffo32 thank you
that is awesome thanks for sharing your findings
[ed: i'm kind of awkward here, these are the first things said in months]

baffo32 — 06/18/2022
notes on finding the states in the faster GPT form:
- states are stored like torch parameters, indexed by name associated
with "att" (TimeMix) and "ffn" (ChannelMix).
- xx is the input logits passed to each named layer
- a and b in the RNN form are wkv and wk in the GPT form. aa and bb,
the token-to-token-state, roughly coincide with A[t] and B[t] in the
math docs, but the RNN code is a better guide; it is consistent with
the GPT code. they're the time-weighted coefficients without the
first_time coefficient which is only for output.
- the  aa and bb state outputs are only used internally in TimeMix's
conv1d, and must be separately calculated if output, and summed into
the conv1d if input. when this is done, the GPT model can be used as
output or input with the recurrent model, and can train rapidly for
infinite contexts in a recurrent manner (might need time decay
tweaking to really be infinite, dunno).
- gpt->rnn, in TimeMix.forward:

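# the last term of aa was cut off in the original message; kv[:,:,-1] is
# inferred from the matching bb line
aa = (torch.exp(torch.exp(self.time_decay) * self.time_curve)[None,:,-(T):-1]
      * kv[:,:,:-1]).sum(dim=-1) + kv[:,:,-1]
bb = (torch.exp(torch.exp(self.time_decay) * self.time_curve)[None,:,-(T):-1]
      * k[:,:,:-1]).sum(dim=-1) + k[:,:,-1]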

- rnn->gpt, not quite figured yet. i found this before but have not
reconstructed it. above took all day. i had to reread this stuff over
and over and over again.
- modified the time_shifts so as to shift in

baffo32 — 06/19/2022
- simplification of gpt->rnn:

bb = (torch.exp(torch.exp(self.time_decay) * self.time_curve)[None,:,-(T):-1]
      * k[:,:,:-1]).sum(dim=-1) + k[:,:,-1]
bb = ((wk[...,-1] - torch.exp(self.time_first[:,0]) * k[...,-1])
      * torch.exp(-torch.exp(self.time_decay[:,0])) + k[...,-1])
bb = (wk[...,-1] - w[...,0,-1] * k[...,-1]) * w[...,0,-3] + k[...,-1]
aa = (wkv[...,-1] - w[...,0,-1] * kv[...,-1]) * w[...,0,-3] + kv[...,-1]
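
A scalar sanity check of the last two lines, as a sketch under assumptions: one channel, made-up numbers, and a GPT-form `wk` assumed to be a decayed sum of earlier values (the previous token weighted by 1) plus `first` times the newest value. `decay` stands in for `w[...,0,-3]` and `first` for `w[...,0,-1]`; none of this is taken from the repo code.

```python
import math

decay = 0.9                 # assumed per-step decay factor
first = 1.5                 # assumed weight on the newest token in wk
k = [0.2, 1.0, 0.7, 0.4]    # made-up per-token values for one channel

# recurrent form: decay the running sum, then add the newest value
bb = 0.0
for k_t in k:
    bb = bb * decay + k_t

# assumed parallel form of wk at the last position
T = len(k)
wk_last = sum(decay ** (T - 2 - s) * k[s] for s in range(T - 1)) + first * k[-1]

# the simplification from the notes, on scalars:
# strip the `first` term, decay one step, add the newest value unweighted
bb_from_wk = (wk_last - first * k[-1]) * decay + k[-1]
```

If the assumed form of `wk` matches the model, the two values agree, which is why the state can be read off the GPT outputs without redoing the sum.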

- rnn->gpt:

# replace both time_shifts with:
torch.cat([xx[...,None,:], x], dim=-2)[...,:-1,:]
# then after kv is calculated
k = torch.cat([bb[...,None],k], dim=-1)
kv = torch.cat([aa[...,None],kv], dim=-1)
# uses of T need to be engaged to handle the increased kv sizes, and
# wk and wkv shrunk to remove the extra value from the start

- when doing this, it was quite helpful to set
torch.set_printoptions(precision=16); this works even in the debugger,
and shows inaccuracies much earlier
- fork draft: https://github.com/xloem/RWKV-v2-RNN-Pile/commit/0e3e8a4c262855c7381a4678f8b9370228281b4a
- it turns out that there are a number of parallel implementations of
the model. the one i updated is the one used for inference, not
training. oops.
- the model has a context length limit on it that is pending applying
some tricks to the cuda kernel used in training. the documentation of
the tricks is vague, although derivable; says contact author to
contribute. comments in src/model_train.py
- another work was linked in #research,

baffo32 — Yesterday at 7:18 AM
- I'm chatting a little with the model dev and the other researchers
in eleutherai #research. there's an existing script for validating
model output; i had made my own, and should switch to the existing
one.

- other works: Transformer-XL (recurrent gpt sequence training), ERNIE
(recursive layer to develop complex state), sentence-transformers
(memory lookup, has a function to do this, mentioned months ago on
this discord). integrating the concepts of the first two could be more
familiar to me, since i've already been thinking on the topics.

- I'm working a little on implementing the removal of the clamp() call
from this rwkv model i've been journaling about, the presence of which
is limiting context length. The optimization in question appears to be
a usual one done in the implementation of softmax (haven't verified):
k = exp(k - max(k)) instead of k = exp(k) to prevent floating point
overflow in exp. this becomes equivalent to rwkv =
r*(wkv*exp(-max(k)))/(wk*exp(-max(k))), where the exp(-max(k)) factor
cancels out on its own.
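
A small sketch of why the shift is safe, using scalars and made-up keys and values (the `r` factor is left out, and `wkv`/`wk` here are plain sums, not the model's tensors). The keys are chosen large enough that a bare exp() overflows, while the shifted ratio stays finite and matches the same ratio computed on small keys with identical differences.

```python
import math

k = [1000.0, 999.0, 998.5]   # made-up keys, too large for a bare exp()
v = [0.1, 0.5, -0.3]         # made-up values

# a bare exp() on the largest key overflows in double precision
try:
    math.exp(k[0])
    overflowed = False
except OverflowError:
    overflowed = True

# shifted version: every exponent is <= 0, so nothing overflows
max_k = max(k)
wkv = sum(math.exp(k_i - max_k) * v_i for k_i, v_i in zip(k, v))
wk = sum(math.exp(k_i - max_k) for k_i in k)
ratio = wkv / wk

# reference: the same ratio on small keys with the same differences
small = [k_i - 990.0 for k_i in k]
direct = (sum(math.exp(s) * v_i for s, v_i in zip(small, v))
          / sum(math.exp(s) for s in small))
```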

- I tried preventing floating point errors by setting the mantissa of
the maximum to 1/2 (binary 1000000...) as such:

        # break into mantissa, exponent
        max_k = torch.frexp(max_k)
        # reconstruct with mantissa of 0.5 so no need of rounding when used
        max_k = torch.ldexp(torch.tensor(0.5,device=k.device), max_k.exponent)

to get same output as original model when the torch.exp(k - max_k)
change is made; EDIT: not sure this is doing anything yet, another
error remains.
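
The same frexp/ldexp step can be tried without torch using Python's math module. This is a sketch for a single positive float; `power_of_two_floor` is a hypothetical helper name, and the value 37.25 is made up.

```python
import math

def power_of_two_floor(x):
    # x == mantissa * 2**exponent, with 0.5 <= mantissa < 1 for positive x
    mantissa, exponent = math.frexp(x)
    # rebuild with the mantissa forced to 0.5, giving an exact power of two
    return math.ldexp(0.5, exponent)

max_k = 37.25                        # made-up maximum
rounded = power_of_two_floor(max_k)  # 32.0, the largest power of two <= 37.25
```

The result is never larger than the input, so using it in place of the true maximum leaves the shifted exponents slightly above zero at most.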

- I've found I can preserve accuracy across recurrent runs enough to
see success by calculating  this_aa = last_aa * torch.exp(last_max_k -
this_max_k) to shift max_k's; there might be a better solution here
that preserves more accuracy.

- with those two it seems to work correctly for a single gpt
recurrence. next: bugs, optimization.
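
A scalar sketch of the max_k shifting described above, under assumptions: one channel, a plain running maximum, a single made-up `decay` factor, and sums stored relative to exp(max_k). It is not the code from the fork. It checks that the rescaled recurrence matches a naive one on small keys, and that adding 1000 to every key (which would overflow the naive version) leaves the ratio unchanged.

```python
import math

def run_shifted(keys, values, decay):
    # aa and bb hold the true sums divided by exp(max_k)
    aa, bb, max_k = 0.0, 0.0, float("-inf")
    for k_t, v_t in zip(keys, values):
        this_max_k = max(max_k, k_t)
        # re-express the old sums relative to the new maximum
        shift = math.exp(max_k - this_max_k)
        aa = decay * aa * shift + math.exp(k_t - this_max_k) * v_t
        bb = decay * bb * shift + math.exp(k_t - this_max_k)
        max_k = this_max_k
    return aa / bb

def run_naive(keys, values, decay):
    # same recurrence without any shifting; overflows for large keys
    aa = bb = 0.0
    for k_t, v_t in zip(keys, values):
        aa = decay * aa + math.exp(k_t) * v_t
        bb = decay * bb + math.exp(k_t)
    return aa / bb

values = [0.1, 0.5, -0.3, 0.8]               # made-up values
small_keys = [0.0, 2.5, 1.25, 3.0]           # made-up keys
large_keys = [k + 1000.0 for k in small_keys]

shifted_small = run_shifted(small_keys, values, 0.9)
naive_small = run_naive(small_keys, values, 0.9)
shifted_large = run_shifted(large_keys, values, 0.9)
```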

baffo32 — Yesterday at 8:27 AM
- i'm losing it and may choose to not pursue completion of this work,
unsure. i don't want to start spamming #research in a psychotic state
of mind.
- the frexp snippet above might help other models and projects if they
aren't aware of it yet; the calculation it affects is within every
softmax operation.
- i have a multi-hour appointment today
- i somehow resolved some errors via fudging while troubleshooting
issues, and found a cool thing around 15:00UTC; the model retains
memories much better with the max_k fix, and can form
longer/more-accurate completions from its training data it could not
form before the change, generalising a little as the dev expected it
to in their claims of value to it
- shared my max_k solution in #research around 17:25 UTC
- spent a long time trying to push it through normalised tests after
response in #research to that effect. i'm relatively sure it gets the
exact same scores as the original model
- regarding divide-by-zero, i'm thinking on how the space is a little
comparable to the max_k space; both numerator and denominator have
been already scaled by max_k. hence adding the epsilon to the
denominator, which it is currently doing, changes the results a little
- torch has a nan_to_num function that can replace nan with zero:
- 01:00 UTC, i've made a branch in my fork called 'longctx' that
contains a draft of code for training on data with unbounded length,
using gpt-recurrence and the max_k/clamp and epsilon removal

baffo32 — Today at 8:24 AM
- petering off. the dev invited me to their telegram chat. not much
activity there. i asked about the huggingface integration work, which
would be more productive for me to engage. no reply yet. earlier, they
said the branch i am working on is the one to share changes in.
- when i set up telegram, it thought i was somebody from taiwan, maybe
a reused phone number. i didn't understand and people were talking to
me like my name was something else. (i have visual cortex issues and
didn't notice all the chat rooms were different from mine). i messaged
telegram support (no reply, although they read the message), and
switched accounts, a bit of scuffling, deleted a message i posted from
the other account. left things in a bit of a confused state.
- i got my branch running inference correctly using the cuda training
accelerations, and added scripts. haven't tested training itself yet,
nor decided on how to test it. [they have a handmade cuda kernel they
use for training with]
[could put shell example here of cloning my branch, compressing a
prompt to a short state file, and completing from the state file using
gpt, gpt-cuda, and rnn, all with same results]
- maybe first work on verifying numerical accuracy of gpt-cuda, if
following through with the plan to

More information about the cypherpunks mailing list