[ot][spam][crazy][personal] morning journal maybe

Sat Jun 3 10:10:13 PDT 2023

i ended up being able to spend time on a cool algorithmic challenge
inside the self-attention function of transformer models

i went through it over and over and figured some things out about keys
and queries and values

then i got discouraged and my habits spread that to the
inhibitions around the puzzle and it got abandoned for now, but it
was still lots of fun

i was using a github repo called picogpt to experiment with idempotent
mutations of self-attention, which first meant learning about the
function itself

i haven’t studied it so i don’t quite know what it’s formally about.
i’m mentally modeling the embeddings that flow through transformer
layers as representing the log probabilities of properties held by
each token that are useful for the model.

the multi-headed self-attention mechanism as implemented in picogpt
appears to break each of these embeddings into groups (“heads”) and
process them in parallel. this likely reduces ram usage significantly,
maybe under the assumption that there are about as many independent
kinds of meaning as there are groups.
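
a minimal sketch of what i mean by breaking embeddings into heads. this
is not copied from picogpt; split_heads and merge_heads are names i made
up for illustration, and the sizes are arbitrary:

```python
import numpy as np

def split_heads(x, n_head):
    # x: [n_tokens, n_embd] -> list of n_head arrays,
    # each [n_tokens, n_embd // n_head] (a contiguous slice of the columns)
    return np.split(x, n_head, axis=-1)

def merge_heads(heads):
    # inverse of split_heads: put the column slices back side by side
    return np.hstack(heads)

x = np.arange(24, dtype=float).reshape(3, 8)  # 3 tokens, 8 properties each
heads = split_heads(x, n_head=2)              # 2 heads of 4 properties each
restored = merge_heads(heads)                 # same array as x again
```

each head would then get its own attention pass over the same tokens,
only ever seeing its own slice of the properties.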

the properties (broken into heads) seem to be processed differently
depending on the tokens. usually when training, all tokens are
predicted, but you can for example mutate the function to only predict
the final token (for inference) and then the query/key score matmul is
O(N) in the number of tokens instead of O(N^2).
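
a rough sketch of that mutation for a single head, assuming q, k and v
have already been computed for every token (random stand-ins here). the
causal mask and the division by sqrt(d) are the usual scaled dot-product
details, which i am adding on top of my simplified description:

```python
import numpy as np

def softmax(x):
    # subtract the row max first so exp() cannot overflow
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 5, 4
q = rng.normal(size=(n_tokens, d))  # stand-in queries, one row per token
k = rng.normal(size=(n_tokens, d))  # stand-in keys
v = rng.normal(size=(n_tokens, d))  # stand-in values

# all tokens predicted: an [N, N] score matrix, with a causal mask so
# each token only looks at itself and earlier tokens
scores_full = q @ k.T / np.sqrt(d)
causal_mask = np.triu(np.ones((n_tokens, n_tokens)), k=1) * -1e10
out_full = softmax(scores_full + causal_mask) @ v

# final token only: a [1, N] score row, and no mask is needed because
# the last token is allowed to look at every token
scores_last = q[-1:] @ k.T / np.sqrt(d)
out_last = softmax(scores_last) @ v
```

the last row of out_full should match out_last, so for one attention
call the final-token version does N dot products instead of N*N.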

the conventional function is something like
  softmax((final_x @ q) @ (all_x @ k).T) @ (all_x @ v)
where:
- softmax is exp(x)/sum(exp(x)) i.e. an arithmetically normalized
exponentiated vector
- q is a trained linear transform (a matrix and an offset) that
appears to be used to select the properties that are of interest for
predicting the next token
- k is a trained linear transform that appears to be used to select
the properties that are specific to each token for comparing with q’s
output
- v is a trained linear transform that appears to be used to select
the possible new information a token can provide
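
the formula and definitions above could be sketched like this for one
head. the weights are random stand-ins (wq, bq and friends are
hypothetical names, not trained values), "x @ q" is spelled out as a
matrix plus an offset, and the sqrt scaling is an extra detail from the
usual scaled dot-product form that my formula leaves out:

```python
import numpy as np

def softmax(x):
    # exp(x) / sum(exp(x)), shifted by the max for numerical stability
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def linear(x, w, b):
    # a linear transform: a matrix and an offset
    return x @ w + b

rng = np.random.default_rng(0)
n_tokens, n_embd = 6, 8
all_x = rng.normal(size=(n_tokens, n_embd))  # embeddings for every token
final_x = all_x[-1:]                         # embedding of the last token

# hypothetical stand-ins for the trained q, k, v transforms
wq, bq = rng.normal(size=(n_embd, n_embd)), rng.normal(size=n_embd)
wk, bk = rng.normal(size=(n_embd, n_embd)), rng.normal(size=n_embd)
wv, bv = rng.normal(size=(n_embd, n_embd)), rng.normal(size=n_embd)

scores = linear(final_x, wq, bq) @ linear(all_x, wk, bk).T / np.sqrt(n_embd)
weights = softmax(scores)                    # [1, n_tokens], sums to 1
out = weights @ linear(all_x, wv, bv)        # [1, n_embd]
```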

so the q and k products are matmul’d, exp’d and normalized, and i’m
guessing this produces a probability vector of real scalars over the
tokens relating to how much their properties are relevant for picking a
new one, since the final matmul multiplies this by v for all the tokens.