- I skimmed bigbird's description a little. it's trained for sequence lengths of 4096 tokens, but since its attention is block-sparse, memory requirements don't look like they would rise too much if that length were increased somehow. curious whether you can finetune a model with enlarged position embeddings; probably you can (rough sketch below).
- I glanced at realm, which apparently trains a retriever to select among documents for helpful additional data. very brain-like.
realm took 80 tpus to train, but that's possibly because it was developing an understanding of human language and knowledge from scratch. we have existing pretrained models that can be finetuned to skip a lot of that.
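rough sketch of the bigbird finetuning idea, i.e. growing the learned position embeddings before finetuning on longer inputs. the huggingface attribute names here are my assumption about the module layout, not something I've verified:

```python
# grow a pretrained model's position embeddings so it can be finetuned on
# longer sequences. attribute paths assume the huggingface BigBird classes;
# they may differ in practice.
import torch
from transformers import BigBirdModel

model = BigBirdModel.from_pretrained("google/bigbird-roberta-base")

old_emb = model.embeddings.position_embeddings      # nn.Embedding(4096, hidden)
old_len, hidden = old_emb.weight.shape
new_len = 8192                                      # target sequence length

new_emb = torch.nn.Embedding(new_len, hidden)
with torch.no_grad():
    # keep the trained rows and tile them to cover the new positions,
    # as a starting point for finetuning (random init for the tail also works)
    for start in range(0, new_len, old_len):
        end = min(start + old_len, new_len)
        new_emb.weight[start:end] = old_emb.weight[: end - start]

model.embeddings.position_embeddings = new_emb
model.config.max_position_embeddings = new_len

# some versions register a position_ids buffer sized to the old length;
# refresh it if present so positions past 4096 resolve
if hasattr(model.embeddings, "position_ids"):
    model.embeddings.position_ids = torch.arange(new_len).expand((1, -1))
```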
- thinking a little about training a model to sustain useful state as it moves through its data and tasks
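not sure yet what that looks like concretely; one toy reading, purely as an illustration, is carrying a small memory state across chunks of a stream and truncating backprop every few steps (the encoder, memory size, and task head here are placeholder choices):

```python
# toy sketch of carrying state across a stream of chunks (truncated BPTT)
import torch
import torch.nn as nn

hidden = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
    num_layers=2,
)
memory_update = nn.GRUCell(hidden, hidden)   # folds each chunk into the state
readout = nn.Linear(hidden, 1)               # stand-in task head

opt = torch.optim.Adam(
    list(encoder.parameters()) + list(memory_update.parameters()) + list(readout.parameters())
)

def train_on_stream(chunks, targets, bptt=4):
    """chunks: list of (1, seq, hidden) tensors; targets: list of (1, 1) tensors."""
    state = torch.zeros(1, hidden)
    loss = 0.0
    for step, (chunk, target) in enumerate(zip(chunks, targets), 1):
        # prepend the carried state as an extra token so the encoder can use it
        inputs = torch.cat([state.unsqueeze(1), chunk], dim=1)
        encoded = encoder(inputs)
        state = memory_update(encoded[:, 0], state)   # update the persistent state
        loss = loss + nn.functional.mse_loss(readout(state), target)
        if step % bptt == 0:                          # truncate the backprop window
            opt.zero_grad()
            loss.backward()
            opt.step()
            state = state.detach()
            loss = 0.0
```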
- thinking a little about quickly training a document retriever by calculating loss from multiple docs retrieved in parallel (the loss could then either backpropagate to the mechanism used to select them, or the selector could be trained as a classifier on the results; I've briefly tried this once before with success)
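rough sketch of that parallel-docs trick; the embeddings and per-doc losses would come from whatever encoder/reader is in play (placeholders here), and the two functions match the two options above (backprop through the selection vs. classify on the outcomes):

```python
# score several candidate docs in parallel and let the task loss flow back
# into the retrieval scores (REALM-style marginalisation), or alternatively
# treat the best-performing doc as a classification label.
import torch
import torch.nn.functional as F

def retrieval_loss(query_emb, doc_embs, per_doc_task_loss):
    """
    query_emb:         (d,)   embedding of the query/context
    doc_embs:          (k, d) embeddings of the k docs fetched in parallel
    per_doc_task_loss: (k,)   task loss when each doc is used (lower = better)
    """
    scores = doc_embs @ query_emb            # relevance scores, (k,)
    p_doc = F.softmax(scores, dim=0)         # retrieval distribution
    # expected task loss under the retrieval distribution; gradients reach
    # both the reader (via per_doc_task_loss) and the retriever (via p_doc)
    return (p_doc * per_doc_task_loss).sum()

def classifier_loss(query_emb, doc_embs, per_doc_task_loss):
    # the other option: train the scorer as a classifier on which doc helped most
    scores = doc_embs @ query_emb
    best = per_doc_task_loss.argmin()        # doc that gave the lowest loss
    return F.cross_entropy(scores.unsqueeze(0), best.unsqueeze(0))
```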
the "right" solution might likely be to separate:
- binary layout
- decompilation
- commenting
into three different tasks, with human design handling the clear and simple bits to reduce the complexity.
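an interface-level sketch of that three-way split; nothing here is a real implementation, and the assumption that the two learned stages are seq2seq models is mine. it's only meant to show where hand-written structure ends and learned models begin:

```python
# three-stage pipeline: layout recovery (mostly rules), decompilation (learned),
# commenting (learned). names and types are placeholders.
from dataclasses import dataclass

@dataclass
class RecoveredFunction:
    name: str            # from symbols if present, else a generated label
    machine_code: bytes

def recover_layout(binary: bytes) -> list[RecoveredFunction]:
    """task 1: binary layout. largely deterministic parsing (sections, symbols,
    function boundaries); a good place for human-designed rules."""
    raise NotImplementedError

def decompile(fn: RecoveredFunction) -> str:
    """task 2: decompilation. a seq2seq model from machine code to source-like text."""
    raise NotImplementedError

def add_comments(source: str) -> str:
    """task 3: commenting. a second model that annotates the recovered source."""
    raise NotImplementedError

def process(binary: bytes) -> dict[str, str]:
    # chaining the three keeps each model's job narrow and separately trainable
    return {fn.name: add_comments(decompile(fn)) for fn in recover_layout(binary)}
```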