[crazy][hobby][spam] Automated Reverse Engineering

k gmkarl at gmail.com
Mon Jan 17 04:40:59 PST 2022


note:
- additionally, the perceiver model structure may not need tokenization
(see the byte-input sketch after this list)
- and, google made a new T5 called LongT5 that can already handle much
larger inputs; the code will likely be released in the coming months
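
a minimal sketch of what tokenizer-free input could look like,
assuming a perceiver-style model that consumes raw utf-8 bytes.  the
function name and max_len value are illustrative, not the model's real
interface:

import torch

def bytes_to_ids(source: str, max_len: int = 2048) -> torch.Tensor:
    # encode the text as utf-8 bytes; each byte value (0-255) serves
    # as an input id directly, so no vocabulary or learned tokenizer
    # is needed
    ids = list(source.encode("utf-8"))[:max_len]
    return torch.tensor(ids, dtype=torch.long).unsqueeze(0)  # (1, seq_len)

input_ids = bytes_to_ids("int add(int a, int b) { return a + b; }")
print(input_ids.shape)  # torch.Size([1, 39])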

given that many functions are short, i might skip the length problem
for now.  but now that something is training and looks to have some
success (and to be improvable with management of embedded strings), it
could make sense to:
- collect data for other languages
- organise code better
- implement reduced memory usage for faster training
- address string encoding for faster training (see the sketch after
this list)
- improve the training itself
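
here's a rough sketch of one way managing embedded strings could work:
pull string literals out of the source and swap in short placeholders,
so long literals don't eat up sequence length.  the regex and the
placeholder format are guesses on my part, not a settled design:

import re

STR_RE = re.compile(r'"(?:\\.|[^"\\])*"')  # double-quoted C string literals

def extract_strings(source: str):
    strings = []
    def repl(match):
        strings.append(match.group(0))
        return f'"<STR{len(strings) - 1}>"'
    return STR_RE.sub(repl, source), strings

code, strings = extract_strings('puts("hello, world"); puts("bye");')
print(code)     # puts("<STR0>"); puts("<STR1>");
print(strings)  # ['"hello, world"', '"bye"']

the placeholders could be substituted back after generation, since the
model only needs to learn where the strings go, not their contents.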

the right priorities will be clearer after seeing results of the
current training run.  it's helpful to actually look at results.

oh, here we go: the script needs to save the model so training can
continue if interrupted, and so the model is usable after training.
that's important since colab could halt partway through this test.
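
a sketch of how periodic checkpointing could be wired up, assuming the
run uses huggingface's Trainer.  the output path, step counts, and the
model/train_dataset names are placeholders standing in for whatever
the current setup already uses:

import os
from transformers import Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

# model and train_dataset come from the existing training setup;
# they are stand-ins here, not part of this sketch
args = TrainingArguments(
    output_dir="drive/MyDrive/re-model",  # save to Drive so a Colab reset doesn't lose it
    save_steps=500,                       # write a checkpoint every 500 steps
    save_total_limit=2,                   # keep only the two newest checkpoints
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# resume from the latest checkpoint if one exists, else start fresh
last = get_last_checkpoint(args.output_dir) if os.path.isdir(args.output_dir) else None
trainer.train(resume_from_checkpoint=last)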

