Recently I tried to make an MLP! An MLP is just a linear transform of data into and out of "activation functions," where activation functions are just vectorized conditionals, often smoothed a little so they have gradients around the knee. The linear transforms decide where each conditional sits. (I could be totally wrong; this is just what it looks like to me.) There's a minimal sketch of this picture below.

A transformer adds two things on top of MLPs: log space, which lets it handle disparate scales, and attention, which I guess lets it shrink unknown complex combinations into smaller relevant combinations (I don't remember attention well at this point).

Transformers seem to be missing the ability to model periodic functions, which could likely be addressed by adding a Fourier transform anywhere, the same way logarithms are added; the second sketch below shows one way to read that.

The linear parts also let it use "skip" connections, which are just additions of one representation to another, so the network can select which representation is most useful for the attention kernels and MLPs (third sketch below).

I don't really know this stuff!
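Here's a minimal numpy sketch of the "linear, into and out of smoothed conditionals" picture. The layer sizes and the softplus activation are my own illustrative choices, not anything canonical:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # A smoothed version of the conditional max(0, x): it matches ReLU
    # far from zero but keeps a nonzero gradient around the "knee".
    return np.logaddexp(0.0, x)

# Weights for a 2-layer MLP: 3 inputs -> 8 hidden units -> 1 output.
W1 = rng.normal(size=(3, 8))
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))
b2 = np.zeros(1)

def mlp(x):
    # The first linear transform decides where each conditional sits:
    # each hidden unit tests a learned half-space of the input.
    h = softplus(x @ W1 + b1)
    # The second linear transform mixes the conditional outputs back out.
    return h @ W2 + b2

x = rng.normal(size=(5, 3))   # batch of 5 inputs
print(mlp(x).shape)           # (5, 1)
```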
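And one concrete reading of "add a Fourier transform anywhere": map the input through fixed sin/cos features (so-called Fourier features) before the linear layers, which makes a periodic target linearly reachable. The frequency choices here are illustrative assumptions, not something from any particular model:

```python
import numpy as np

def fourier_features(x, freqs):
    # x: (batch, 1); freqs: (k,). Returns (batch, 2k) sin/cos features.
    angles = 2 * np.pi * x * freqs  # broadcasts to (batch, k)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

x = np.linspace(0, 4, 200).reshape(-1, 1)
target = np.sin(2 * np.pi * x)  # a periodic function

# A plain linear fit on raw x cannot track a sine wave...
raw = np.hstack([x, np.ones_like(x)])
w_raw, *_ = np.linalg.lstsq(raw, target, rcond=None)
err_raw = np.abs(raw @ w_raw - target).max()

# ...but on Fourier features it becomes a trivial linear fit.
feats = fourier_features(x, freqs=np.array([1.0, 2.0, 3.0]))
w_ff, *_ = np.linalg.lstsq(feats, target, rcond=None)
err_ff = np.abs(feats @ w_ff - target).max()

print(f"max error, raw x: {err_raw:.3f}")             # large
print(f"max error, fourier features: {err_ff:.3e}")   # near zero
```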
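Finally, the skip connection really is just addition. A residual block adds its input to its output, so later layers can lean on whichever representation turns out to be more useful. The block body and sizes below are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8)) * 0.1

def block(x):
    # Any sub-layer (an MLP or attention block in a transformer);
    # ReLU(xW) is just a placeholder here.
    return np.maximum(0.0, x @ W)

def residual(x):
    # The skip connection: the input representation is added to the
    # block's output, rather than being replaced by it.
    return x + block(x)

x = rng.normal(size=(4, 8))
print(residual(x).shape)  # (4, 8)
```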