Attached is code that trains a model which can itself produce a trained model, in roughly the same time it would take to train the produced model directly (up to the randomness of initialization). I wanted to keep working on it and turn it into something useful, but it's gotten too intense for me to be around, so I'm stopping; that's why it is messy. I take a crack at this every year or so, and I'm sure many others have already finished it. This is the farthest I've gotten before abandoning it!

The basis of the theory is that a transformer performs an amount of computation proportional to the length of its input and output sequences, so you can increase its power by increasing the input and output size while keeping the model itself small. This lets you work with the entirety of a large model using a smaller one, for example.

A novelty I like in this approach is how I labeled the training weights by providing additional dimensions of data directly on the inputs ("encoded" in the source), rather than using position embeddings. I'm not sure what this is called, but it seems possibly more effective than position embeddings, since the first linear encoder can learn arbitrary encoding forms yet only needs to be as wide as the model dimension. Rough sketches of both ideas follow below.

Crazy Karl
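
P.S. A minimal sketch of the first idea (not the attached code, and assuming a PyTorch setup): a small transformer reads a larger model's flattened weights as a long sequence of raw-value chunks and emits an updated sequence, so its compute scales with the number of parameters it touches rather than with its own size. The names WeightEditor and chunk_size are my own illustrative labels.

```python
import torch
import torch.nn as nn

class WeightEditor(nn.Module):
    """Small transformer that edits a larger model's flattened weights."""
    def __init__(self, chunk_size=64, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        # Each "token" is one chunk of raw weight values, not a vocabulary item.
        self.encoder_in = nn.Linear(chunk_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers)
        self.decoder_out = nn.Linear(d_model, chunk_size)
        self.chunk_size = chunk_size

    def forward(self, flat_weights):
        # flat_weights: (batch, n_params); pad so it splits into equal chunks.
        b, n = flat_weights.shape
        pad = (-n) % self.chunk_size
        x = nn.functional.pad(flat_weights, (0, pad))
        tokens = x.view(b, -1, self.chunk_size)       # (batch, seq_len, chunk)
        h = self.body(self.encoder_in(tokens))        # compute grows with seq_len
        return self.decoder_out(h).view(b, -1)[:, :n] # back to flat weights

# The editor stays small; only the sequence length grows with the target model.
target = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
flat = torch.cat([p.detach().flatten() for p in target.parameters()]).unsqueeze(0)
new_flat = WeightEditor()(flat)
print(flat.shape, new_flat.shape)
```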
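
And a sketch of the labeling trick, again with names of my own choosing rather than the ones in the source: each input chunk carries a few extra coordinate channels describing which weights it holds, and a single linear layer maps the concatenated (values + labels) vector to the model dimension instead of adding a position embedding.

```python
import torch
import torch.nn as nn

d_model, chunk, n_label = 128, 64, 4
# The first linear encoder sees values and labels together, so it can learn
# an arbitrary "encoding form" while being only d_model wide at its output.
encoder_in = nn.Linear(chunk + n_label, d_model)

values = torch.randn(1, 10, chunk)            # 10 weight chunks
labels = torch.rand(1, 10, n_label)           # per-chunk coordinates (layer, row, ...), fixed, not learned
tokens = torch.cat([values, labels], dim=-1)  # labels ride along with the data
h = encoder_in(tokens)                        # (1, 10, d_model); no position embedding added
```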