[ot][spam][crazy]

Fri Oct 20 03:22:46 PDT 2023

- I'm surprised I kept the causal mask here. I did write code for
removing it. If you do remove the causal mask, the model generates a
useful output for every input, as if it were run separately for each
item in its context, which seems much more useful for outputting
large data.
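
To make the mask toggle concrete, here is a minimal NumPy sketch of
single-head attention with and without the causal mask. It is not the
code from this thread; the function names and the toy shapes are made
up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; exp(-inf) becomes 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=True):
    """Single-head scaled dot-product attention over arrays of shape (n, d)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if causal:
        # Position i may only attend to positions j <= i.
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

# Hypothetical toy input: 4 items, 8 features each.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
causal_out = attention(x, x, x, causal=True)
full_out = attention(x, x, x, causal=False)
```

With the mask on, the first row of the output can only see the first
item, so it is just a copy of it. With the mask off, every row is
computed from the whole context.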

- I heard it's of interest to extend the context to billions or
trillions of items (of course, if you make a general metamodel at
small size, it could generalize to larger sizes more flexibly on its
own). To do this on limited RAM you would need a tighter attention
pass. It sounds fun to see if you can do the attention in multiple
passes through the whole model by changing the attention kernel to
only process the highest-impact tokens, and cache their sum for reuse.
The concept might simplify to just including flash attention.
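
A rough sketch of the "cache the sum for reuse" part, assuming plain
non-causal attention. It does not pick the highest-impact tokens; it
swaps in the simpler trick that flash attention is built on: visit the
keys and values one chunk at a time and keep a running max, a running
softmax denominator, and a running weighted sum per query, so the full
score matrix never has to sit in memory. The names and the chunk size
are made up for illustration.

```python
import numpy as np

def chunked_attention(q, k, v, chunk=2):
    """Exact non-causal attention, visiting keys/values one chunk at a time."""
    n, d = q.shape
    m = np.full((n, 1), -np.inf)      # running max score per query
    denom = np.zeros((n, 1))          # running softmax denominator
    acc = np.zeros((n, v.shape[1]))   # running weighted sum of values
    for start in range(0, k.shape[0], chunk):
        kc = k[start:start + chunk]
        vc = v[start:start + chunk]
        s = q @ kc.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))
        scale = np.exp(m - m_new)     # rescale everything cached so far
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum(axis=1, keepdims=True)
        acc = acc * scale + p @ vc
        m = m_new
    return acc / denom

def reference_attention(q, k, v):
    """Ordinary attention that materializes the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ v
```

Only one chunk of scores exists at a time, and the cached sums are all
that carries over between chunks. A version that skips low-impact
chunks would presumably reuse the same three running quantities, but
would no longer be exact.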

- I mostly don't remember the theories for now, other than the
infinite scratchpad one that I actually wrote up in this thread. So
this morning I've been imagining removing the sister models and
instead just using one model that outputs an entire other model, by
removing the causal mask. I imagine I didn't do this when I made the
thread because I wanted more avenues for ensuring success, and I
wanted it to stay more similar to existing things.

- I'm guessing when you do this the important things to do might
include: talking effectively and inclusively about what to do;
demonstrating the ability to solve problems to people interested in
publicly and inclusively solving them; working on mental and emotional
health issues as those drive our decisions; protecting things from
harm without stimulating harm.