- I'm surprised I kept the causal mask here. I did make code for removing it. If you remove the causal mask, the model generates a useful output for every input, as if it were run separately for each item in its context, which seems much more useful for outputting large data (there's a sketch of the mask change below this list).
- I heard it's of interest to extend the context to billions or trillions of items (though if you make a general metamodel at a small size, it could generalize to larger sizes more flexibly on its own). To do this on limited RAM you would need a tighter attention pass. It sounds fun to see whether you can do the attention in multiple passes through the whole model by changing the attention kernel to process only the highest-impact tokens and cache their sum for reuse; the concept might simplify to something like flash attention (see the second sketch below).
- I mostly don't remember the theories for now, other than the infinite-scratchpad one that I actually wrote in this thread. So this morning I've been imagining removing the sister models and instead just using one model that outputs an entire other model, by removing the causal mask. I imagine I didn't do this when I made the thread because I wanted more avenues through which to ensure success, and I wanted it to stay more similar to existing things.
- I'm guessing that when you do this, the important things to do might include: talking effectively and inclusively about what to do; demonstrating the ability to solve problems to people interested in publicly and inclusively solving them; working on mental and emotional health issues, since those drive our decisions; and protecting things from harm without stimulating harm.
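
A minimal sketch of what I mean by dropping the causal mask, written as a toy single-head PyTorch attention (this is just an illustration, not the actual code I made):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    # q, k, v: (seq_len, d) for a single head; toy sketch only
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    if causal:
        # causal mask: position i can only attend to positions <= i,
        # so only the last position "sees" the whole context
        seq_len = q.shape[0]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    # with causal=False every position attends to the full context,
    # so every position's output is conditioned on all of the inputs at once
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# e.g. attention(x, x, x, causal=False) gives a full-context output per item
```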
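
And a rough sketch of the multi-pass / cached-sum idea, here interpreted as streaming attention over key chunks with an online softmax (which is basically the flash-attention trick); the chunk size and the commented-out top-k filter are just illustrative assumptions:

```python
import torch

def chunked_attention(q, k, v, chunk_size=1024):
    # Streaming attention for one query over key/value chunks, so the full
    # score matrix never has to fit in memory at once.
    # q: (d,); k, v: (seq_len, d). Toy single-head sketch.
    d = q.shape[-1]
    running_max = torch.tensor(float("-inf"))
    denom = torch.tensor(0.0)   # cached softmax normalizer
    numer = torch.zeros(d)      # cached weighted sum of values
    for start in range(0, k.shape[0], chunk_size):
        k_c = k[start:start + chunk_size]
        v_c = v[start:start + chunk_size]
        scores = (k_c @ q) / d ** 0.5
        # optional approximation: keep only the highest-impact keys in this chunk
        # top = scores.topk(min(64, scores.numel())).indices
        # scores, v_c = scores[top], v_c[top]
        new_max = torch.maximum(running_max, scores.max())
        scale = torch.exp(running_max - new_max)  # rescale the cached partial sums
        weights = torch.exp(scores - new_max)
        denom = denom * scale + weights.sum()
        numer = numer * scale + weights @ v_c
        running_max = new_max
    return numer / denom
```

The cached (numerator, denominator) pair is what would get reused across passes; whether that actually buys anything beyond what flash attention already does is the open question.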