17 Jan 2022
12:18 p.m.
the T5 tokenizer the current code uses strips linebreaks, so the output source isn't recompilable. last night i added functions to find_pycode.py to train a tokenizer on the source that preserves linebreaks.

there is a further big issue: embedded strings are not tokenized on the input side, so to produce correct tokenized output the model has to learn the tokenizer's segmentation patterns and reproduce them internally. as i understand it, a tokenizer just maps words to numbers in a compressing way that preserves some structure, e.g. roughly one number per syllable. i'm thinking the best solution here might be a tokenizer that preserves non-ascii bytes in its output, but i don't think it's the most pressing issue.
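for the record, a minimal sketch of the kind of tokenizer training i mean, using the HuggingFace tokenizers library (an assumption; find_pycode.py may do this differently, and the corpus, vocab size, and special tokens here are placeholders). a byte-level BPE keeps every byte, including linebreaks and non-ascii, so decoded output stays recompilable:

# sketch: train a byte-level BPE that round-trips source exactly
# (assumes the HuggingFace tokenizers library; corpus is a placeholder)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder

tokenizer = Tokenizer(BPE())
# byte-level pre-tokenization keeps every byte, including \n and non-ascii
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

trainer = BpeTrainer(
    vocab_size=32000,  # placeholder
    special_tokens=["<pad>", "</s>", "<unk>"],
    # seed the vocab with all 256 byte symbols so every input is encodable
    initial_alphabet=ByteLevel.alphabet(),
)
corpus = ["def f():\n    return 'héllo'\n"]  # stand-in for real source files
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("source_tokenizer.json")

# round trip: linebreaks and embedded strings survive verbatim
src = "def f():\n    return 'héllo'\n"
assert tokenizer.decode(tokenizer.encode(src).ids) == src

the same byte-level property would also cover the non-ascii idea above: since every byte maps to a vocab symbol, string literals come back verbatim on decode.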