[crazy][hobby][spam] Automated Reverse Engineering

k gmkarl at gmail.com
Mon Jan 17 04:18:54 PST 2022


the T5 tokenizer the current code uses strips linebreaks when
encoding, so the decoded output source isn't recompilable.
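
for reference, a minimal repro of the problem, assuming the
huggingface t5-small checkpoint (the code here might load a
different T5 variant):

    from transformers import T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-small")
    src = "def f(x):\n    return x + 1\n"
    ids = tok(src).input_ids
    # prints 'def f(x): return x + 1' -- the linebreaks and
    # indentation are collapsed, so in general the round-tripped
    # source no longer parses as the same program
    print(repr(tok.decode(ids, skip_special_tokens=True)))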

last night i added functions to find_pycode.py to train a tokenizer
for the source that preserves linebreaks.
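
the new functions aren't pasted here; roughly the idea is a
byte-level BPE like the one in the huggingface tokenizers library,
which round-trips every byte, linebreaks included (file names below
are placeholders, not the real corpus):

    from tokenizers import ByteLevelBPETokenizer

    # placeholder corpus; find_pycode.py gathers the real one
    files = ["corpus/example1.py", "corpus/example2.py"]

    tok = ByteLevelBPETokenizer()
    tok.train(files, vocab_size=32000, min_frequency=2)
    tok.save("source_tokenizer.json")

    src = "def f(x):\n    return x + 1\n"
    # byte-level BPE keeps all 256 byte values in its base
    # alphabet, so decode reproduces the source exactly
    assert tok.decode(tok.encode(src).ids) == src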

there is a further big issue: embedded strings are not tokenized on
the input side, so to produce correctly tokenized output the model
has to learn the tokenizer's segmentation patterns, effectively
reproducing the tokenizer internally.
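
concretely: which bytes of a string literal merge into which ids is
an artifact of BPE training that the model never sees directly (a
toy illustration; exact pieces depend on the trained vocab):

    from tokenizers import ByteLevelBPETokenizer

    tok = ByteLevelBPETokenizer()
    tok.train_from_iterator(
        ["def greet():\n    print('hello world')\n"], vocab_size=300)

    # the literal is raw bytes on the input side, but the training
    # target is this id sequence, so the model has to internalize
    # the tokenizer's segmentation to emit it correctly
    enc = tok.encode("'hello world'")
    print(list(zip(enc.tokens, enc.ids)))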

i think a tokenizer just converts text to numbers in a compressing
way that preserves some of the structure, e.g. roughly one number
per frequent subword (syllable-ish chunks).
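
e.g. with T5's subword vocab, a frequent word is a single id while a
rarer one splits into several pieces (assuming the t5-small
checkpoint again):

    from transformers import T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-small")
    print(tok.tokenize("the"))         # frequent word: one piece
    print(tok.tokenize("detokenize"))  # rarer word: several pieces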

i'm thinking the best solution here might be to make a tokenizer that
preserves nonascii bytes in its output.  but i don't think it's the
most pressing issue.
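
fwiw the byte-level BPE sketched above already has that property:
all 256 byte values sit in its base alphabet no matter what the
training data looks like, so nonascii bytes survive the round trip:

    from tokenizers import ByteLevelBPETokenizer

    tok = ByteLevelBPETokenizer()
    # tiny placeholder corpus; the full byte alphabet is included
    # regardless of the training data
    tok.train_from_iterator(["placeholder corpus"], vocab_size=300)

    s = "na\u00efve \u00ff bytes\nacross lines"
    assert tok.decode(tok.encode(s).ids) == s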

