i ended up being able to spend time on a cool algorithmic challenge inside the self-attention functions of transformer models. i went through it again and again and figured out keys, queries, and values. then i got discouraged, and my habits spread that discouragement into inhibitions around the puzzle, so it's abandoned right now, but it was still lots of fun. i was using a github repo called picogpt to experiment with idempotent mutations of self-attention, which first meant learning about the function itself. i haven't studied it formally, so i don't quite know the official story.

i'm mentally modeling the embeddings that flow through transformer layers as representing the log probabilities of properties held by each token that are useful to the model. the multi-headed self-attention mechanism as implemented in picogpt appears to break each of these embeddings into groups ("heads") and process them in parallel. this likely reduces ram usage significantly, maybe under the assumption that the meaning splits into as many mutually independent groups as there are heads. the properties (broken into heads) seem to be processed differently depending on the tokens.

usually when training, all tokens are predicted, but you can for example mutate the function to only predict the final token (for inference), and then the attention matmul is O(N) instead of O(N^2).

the conventional function is something like

softmax((final_x @ q) @ (all_x @ k).T) @ (all_x @ v)

(with final_x being all of the tokens during training, or just the last one for the mutation above) where:

- softmax is exp(x)/sum(exp(x)), i.e. a vector exponentiated elementwise and then normalized to sum to 1
- q is a trained linear transform (a matrix and an offset) that appears to be used to select the properties that are of interest for predicting the next token
- k is a trained linear transform that appears to be used to select the properties that are specific to each token, for comparing with q's output
- v is a trained linear transform that appears to be used to select the possible new information a token can provide

so the q and k products are matmul'd and softmaxed, and i'm guessing this produces a probability vector of scalars over the tokens saying how relevant each token's properties are for picking a new one, since the final matmul takes that weighted combination of v's outputs across all the tokens. some sketches of all this are below.
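here's that function sketched in numpy (picogpt is numpy-based). the weight and bias names (wq, bq, and so on) are mine, not picogpt's, and i've added the divide-by-sqrt(d_k) scaling that real implementations apply before the softmax, while leaving out the causal mask that stops tokens from attending to later ones:

```python
import numpy as np

def softmax(x):
    # exp(x)/sum(exp(x)) along the last axis; subtracting the max first
    # avoids overflow without changing the result
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(all_x, wq, bq, wk, bk, wv, bv):
    # all_x: [n_tokens, d] embeddings; the w/b pairs are trained
    # linear transforms (a matrix and an offset each)
    q = all_x @ wq + bq  # the properties of interest for predicting
    k = all_x @ wk + bk  # the properties specific to each token
    v = all_x @ wv + bv  # the new information each token can provide
    # [n_tokens, n_tokens] relevance scores, scaled by sqrt(d_k)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # each row becomes a probability vector over the tokens, which then
    # picks a weighted combination of the v outputs
    return softmax(scores) @ v
```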
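and the head-splitting as i read it, reusing softmax from above. as far as i can tell picogpt splits the q/k/v projections along the feature axis and loops over the pieces, roughly like this (the real thing also runs the concatenated result through one more trained linear transform, which i'm skipping):

```python
def multi_head_attention(all_x, n_head, wq, bq, wk, bk, wv, bv):
    q = all_x @ wq + bq
    k = all_x @ wk + bk
    v = all_x @ wv + bv
    # break each projection into n_head groups of properties and attend
    # within each group independently, as if the groups carry meaning
    # that is not interdependent (assumes d divides evenly by n_head)
    heads = [
        softmax(qh @ kh.T / np.sqrt(kh.shape[-1])) @ vh
        for qh, kh, vh in zip(
            np.split(q, n_head, axis=-1),
            np.split(k, n_head, axis=-1),
            np.split(v, n_head, axis=-1),
        )
    ]
    # glue the per-head results back into full-width embeddings
    return np.concatenate(heads, axis=-1)
```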
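and the final-token mutation: the query side shrinks to a single row, so the score matmul produces [1, n_tokens] instead of [n_tokens, n_tokens], which is the O(N) instead of O(N^2) i mentioned. conveniently the causal mask i skipped would be a no-op here anyway, since the last token is allowed to look at the whole context:

```python
def attention_last_token(all_x, wq, bq, wk, bk, wv, bv):
    # same trained transforms as before, but only one query row
    final_x = all_x[-1:]                     # [1, d]: just the last token
    q = final_x @ wq + bq                    # [1, d_k]
    k = all_x @ wk + bk                      # [n_tokens, d_k]
    v = all_x @ wv + bv                      # [n_tokens, d_v]
    scores = q @ k.T / np.sqrt(k.shape[-1])  # [1, n_tokens]: O(N) work
    # one probability vector over the context, then a weighted
    # combination of the v outputs
    return softmax(scores) @ v               # [1, d_v]
```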