- I skimmed BigBird's description a little. It's trained for sequence lengths of 4096 tokens, but memory requirements don't look like they would rise much if that length were increased. Curious whether you can finetune a model with extended position embeddings; probably you can (see the position-embedding sketch after this list).
- I glanced at REALM, which apparently trains a component to select among documents for helpful additional data. Very brain-like. REALM took 80 TPUs to train, but that's possibly because it was developing an understanding of human language and knowledge from scratch; we have existing pretrained models that can be finetuned to skip a lot of that.
- Thinking a little about training a model to sustain useful state as it moves through its data and tasks (a chunk-by-chunk sketch follows below).
- Thinking a little about quickly training a document retriever by calculating the loss over multiple docs retrieved in parallel; it could then either backpropagate to the mechanism used to select them, or be trained as a classifier from the results. I've briefly tried this once before with success (also sketched below).

The "right" solution might be to separate:

- binary layout
- decompilation
- commenting

into three different tasks, with human design providing the clear and simple bits to reduce the complexity (a pipeline sketch closes this note).
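
A minimal sketch of the position-embedding idea, assuming a HuggingFace-style encoder whose learned position embeddings live at `model.embeddings.position_embeddings` (a BERT-like layout; BigBird's actual module path may differ, so treat the attribute names as assumptions):

```python
import torch
import torch.nn as nn

def extend_position_embeddings(model, new_max_positions: int):
    """Copy a pretrained model's learned position embeddings into a longer
    table so it can be finetuned at a longer sequence length."""
    old_emb = model.embeddings.position_embeddings  # assumed nn.Embedding
    old_max, dim = old_emb.weight.shape
    if new_max_positions <= old_max:
        return model

    new_emb = nn.Embedding(new_max_positions, dim)
    with torch.no_grad():
        # keep the pretrained vectors for positions the model already knows
        new_emb.weight[:old_max] = old_emb.weight
        # one cheap initialization for the new positions: tile the old table
        for i in range(old_max, new_max_positions):
            new_emb.weight[i] = old_emb.weight[i % old_max]

    model.embeddings.position_embeddings = new_emb
    model.config.max_position_embeddings = new_max_positions

    # some models cache a position_ids buffer; refresh it if present
    if hasattr(model.embeddings, "position_ids"):
        model.embeddings.position_ids = torch.arange(new_max_positions).unsqueeze(0)
    return model
```

Tiling is just one cheap choice for the new rows; interpolating the old table, or random init plus finetuning, would also be reasonable.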
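
A minimal sketch of carrying state across chunks: a learned state vector is updated from each chunk's pooled output and fed back in with the next chunk. `encoder` is a hypothetical module that maps (chunk, state) to a pooled vector, so the interface here is an assumption:

```python
import torch
import torch.nn as nn

class StatefulReader(nn.Module):
    """Processes a long input chunk by chunk, carrying a state vector forward
    so earlier chunks (or earlier tasks) can inform later ones."""

    def __init__(self, encoder, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                   # (chunk, state) -> [1, hidden_dim]
        self.update = nn.GRUCell(hidden_dim, hidden_dim)
        self.initial_state = nn.Parameter(torch.zeros(1, hidden_dim))

    def forward(self, chunks):
        state = self.initial_state               # learned starting state
        pooled_outputs = []
        for chunk in chunks:
            pooled = self.encoder(chunk, state)  # condition on carried state
            state = self.update(pooled, state)   # fold the chunk into the state
            pooled_outputs.append(pooled)
        return pooled_outputs, state
```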
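
A minimal sketch of the parallel-retrieval training idea; `retriever` and `reader_loss` are hypothetical callables rather than any particular library's API. The first variant lets gradients reach the selection scores by weighting each doc's loss by its retrieval probability (in the spirit of REALM's marginalization); the second treats the most helpful doc as a label and trains the retriever as a plain classifier:

```python
import torch
import torch.nn.functional as F

def retriever_step(retriever, reader_loss, query, candidate_docs):
    # retriever(query, docs) -> unnormalized relevance scores, shape [n_docs]
    scores = retriever(query, candidate_docs)

    # reader_loss(query, doc) -> scalar task loss when conditioning on doc
    per_doc_loss = torch.stack([reader_loss(query, d) for d in candidate_docs])

    # variant 1: backpropagate to the selection mechanism by weighting each
    # doc's loss with its retrieval probability
    marginal_loss = (F.softmax(scores, dim=0) * per_doc_loss).sum()

    # variant 2: take the doc that helped most (lowest loss) as the label and
    # train the retriever as a classifier over the candidates
    target = per_doc_loss.argmin().detach()
    classifier_loss = F.cross_entropy(scores.unsqueeze(0), target.unsqueeze(0))

    return marginal_loss, classifier_loss
```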
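
And a minimal sketch of the three-task split as a chained pipeline; each `*_model` and its `generate` interface is hypothetical, and the task prefixes are only an illustrative convention:

```python
def analyze_binary(raw_bytes: bytes, layout_model, decompile_model, comment_model):
    """Chain three separately finetuned models instead of one end-to-end model.
    Each *_model.generate(...) interface is hypothetical."""
    # 1. binary layout: recover sections and function boundaries from raw bytes
    functions = layout_model.generate("layout: " + raw_bytes.hex())
    # 2. decompilation: lift each recovered function to source-like code
    sources = [decompile_model.generate("decompile: " + f) for f in functions]
    # 3. commenting: describe what the recovered code does in natural language
    return [comment_model.generate("comment: " + src) for src in sources]
```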