i posted before but my email disappeared on my end, so here's another. my energy for continuing this is waning at the moment; maybe that will change. the first model was lost after a day or two when the notebook closed itself.

reusing the token ids of the T5 tokenizer really speeds up training from the T5 model. i spent some time hacking a tokenizer that both handles raw byte data and tokenizes embedded strings, but it seems it would be better to keep the previous tokenizer's token ids, so that more of the pretrained model is reused (a sketch of what i mean is at the end of this message).

there's a very early model at https://huggingface.co/baffo32/pyc2py_alpha that could maybe seed the .ipynb, or its associated .py file, in https://github.com/xloem/techsketball . the model doesn't succeed yet, it just gets more likely to.

thinking of the intensity of this task, and how [indescribably] hard it can be to continue: my example here was with pyc files, for which the source is almost always already available, so paired training data is easy to generate (second sketch at the end). that gave me more ease when starting the project. however, more useful example data can give a bigger return when a project struggles.
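here's a minimal sketch of the token-id reuse point, assuming the huggingface transformers library; the <byte_XX> token names and the t5-small checkpoint are placeholders for illustration, not the project's actual code:

# new byte tokens are appended after the existing T5 vocabulary, so every
# id the checkpoint was trained with keeps its pretrained embedding row;
# only the appended rows start out untrained
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# hypothetical byte tokens, one per possible byte value in a .pyc stream
byte_tokens = [f"<byte_{i:02x}>" for i in range(256)]
tokenizer.add_tokens(byte_tokens)

# grows the embedding matrix; existing rows (and their ids) are untouched
model.resize_token_embeddings(len(tokenizer))

def encode_pyc_bytes(data: bytes) -> list[int]:
    # raw bytecode maps to the appended byte tokens; embedded strings such
    # as docstrings or names can still go through the original subword
    # vocabulary, which is the part that reuses the pretrained model most
    return tokenizer.convert_tokens_to_ids([f"<byte_{b:02x}>" for b in data])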
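and here's the kind of thing i mean about pyc files making paired data easy, again a rough illustration rather than code from the repo: when the source is on hand, input/target pairs can be generated by just compiling it.

import marshal, pathlib

def make_pair(py_path: str) -> tuple[bytes, str]:
    # compile a .py file and return (marshalled code object, source text);
    # the marshalled bytes stand in for the body of a .pyc file
    source = pathlib.Path(py_path).read_text()
    code = compile(source, py_path, "exec")
    return marshal.dumps(code), source

# e.g. run this over the standard library or any installed package to get
# (bytecode, source) examples for the bytecode-to-source model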