the written text adventure toolkit. welcome to wtat. where are you?
huh? what do you mean?
what is the starting room?
oh um let's call it room 1
you are in room 1. what is here? are there any exits?

2225 one could say one of the dawnings of this era would be that aidungeon was closed source and had absolutely no room graph. i wonder what more there is now —

2228 i finally came up with a way to do transformers peer to peer recently but i don’t remember it. maybe i can come up with another. uhhhh ummmmm ideas:
- squishing to one layer would increase parallelism
- [oops brain issue, wasn’t expecting brain issue here! surprised. um. usually i try to continue and it watches me, placing triggers to make it hard to both continue and repeat. as i keep trying more avenues this gets more thorough.]

right now i’m on an old ipad. running a pretrained model would be slow, large, and power hungry. one idea could be squishing a model to one layer, having many peers perform operations in parallel, and combining them. i think it is well-known that that doesn’t work here. transformer layers have a few combination points where all data is summed or such, i’ve noticed from looking at them. i suspect some of these are needed less than others for inference, don’t really know. maybe i can look at one and wonder about it more [some complaint maybe relates edited turn of phrase]

2234 i’m wondering if i could make progress on guessing how transformers work enough to consider symbolically swapping depth for width.

2234 i’m looking at hf llama source

(2237) the device is functioning poorly and it is difficult to do

(2238) looks like a llama layer is:
    x += attention(rmsnorm1(x))
    x += mlp(rmsnorm2(x))
so one maybe could think of x as a sum of 3 values: its initial value, its self attention calculation, and its mlp calculation

2243
rmsnorm: mat * (x / sqrt(mean(x^2)))  # i think it scales the data to have a stddev of 1, and then applies a constant linear transformation (mat) specific to the instance of the call
mlp: down_mat(act_fn(gate_mat(x)) * up_mat(x))  # gate_mat and up_mat perform dimension stretching via linear transforms whereas down_mat undoes the stretching via another. that is, they are rectangular matrices where down_mat has swapped dimensions

so, act_fn here applies a nonlinearity, ie something threshold based, to a set of properties or metrics that are all linear combinations of x after attention, and then linearly recombines them together with x to create a value to sum into it. i’m wondering if one might consider this a vector of conditionals of simple arithmetic functions of x, which then add another simple arithmetic function into x for their instances that evaluate true. i’m thinking of the act_fn relu, which i think looks like x = y > 0 ? y : 0. not sure. it might use a different act_fn. (i sketch the layer and this conditional reading below, after these notes.)

225 :s some of us are holding things next to knowledge that the current pursuit could be solved in public research already with reasonable likelihood. there’s interest in heading off for the night, here we go

2259 —-

0643 that was so much fun the transformer poking!

0643 we’re scared of “multinational cartels”
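
a rough sketch of the (2238) layer shape, written as throwaway numpy on a single token vector rather than the real hf/torch code; the attention/mlp callables and the weight names here are placeholders, not the hf names:

    import numpy as np

    def rmsnorm(x, weight, eps=1e-6):
        # scale x to unit rms, then apply the per-call learned scale ("mat" above;
        # in llama i believe this weight is elementwise rather than a full matrix)
        return weight * (x / np.sqrt(np.mean(x ** 2) + eps))

    def llama_layer(x, attention, mlp, norm1_w, norm2_w):
        attn_out = attention(rmsnorm(x, norm1_w))   # x += attention(rmsnorm1(x))
        x = x + attn_out
        mlp_out = mlp(rmsnorm(x, norm2_w))          # x += mlp(rmsnorm2(x))
        x = x + mlp_out
        return x   # literally: initial x + attention term + mlp term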
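
and a sketch of the 2243 mlp plus the "vector of conditionals" reading, assuming relu for the act_fn (the hf llama configs i’ve seen default to silu, which i take to be a smoothed version of the same threshold idea); the loop form does the same arithmetic as the matrix form, just one hidden unit at a time:

    import numpy as np

    def llama_mlp(x, gate_mat, up_mat, down_mat):
        # gate_mat and up_mat are (hidden, dim): they stretch dim -> hidden
        # down_mat is (dim, hidden): the swapped shape that un-stretches hidden -> dim
        act = np.maximum(gate_mat @ x, 0.0)          # relu: y > 0 ? y : 0, elementwise
        return down_mat @ (act * (up_mat @ x))

    def mlp_as_conditionals(x, gate_mat, up_mat, down_mat):
        out = np.zeros(down_mat.shape[0])
        for i in range(gate_mat.shape[0]):           # one conditional per hidden unit
            g = gate_mat[i] @ x                      # a linear metric of x
            if g > 0:                                # the threshold test
                out += down_mat[:, i] * (g * (up_mat[i] @ x))   # a simple arithmetic function of x
        return out                                   # the value that gets summed back into x

    # the two forms agree on random data
    rng = np.random.default_rng(0)
    dim, hidden = 8, 32
    x = rng.standard_normal(dim)
    gate = rng.standard_normal((hidden, dim))
    up = rng.standard_normal((hidden, dim))
    down = rng.standard_normal((dim, hidden))
    assert np.allclose(llama_mlp(x, gate, up, down), mlp_as_conditionals(x, gate, up, down))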