[crazy][hobby][spam] Automated Reverse Engineering

k gmkarl at gmail.com
Mon Jan 17 23:41:16 PST 2022


i posted this earlier but the email disappeared for me, so here's another.

my momentum on this is waning at the moment; maybe that will change.

the first model was lost after a day or two when the notebook closed itself.

reusing the token ids of the T5 tokenizer really speeds up training from
the T5 model.  i spent some time hacking a tokenizer that both handles
raw byte data and tokenizes embedded strings, but it seems it would be
better to keep the token ids of the previous tokenizer, so that more of
the pretrained model is reused.
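as a rough sketch of what i mean: keep the stock T5 vocabulary intact for embedded strings, and append 256 new ids (one per raw byte) after it, instead of retraining a tokenizer from scratch.  the vocab size constant and the printable-ASCII splitting heuristic below are illustrative assumptions, not the project's actual code:

```python
T5_VOCAB_SIZE = 32100          # size of the stock T5 vocabulary (assumed)
BYTE_ID_BASE = T5_VOCAB_SIZE   # raw byte b maps to id BYTE_ID_BASE + b

def tokenize_mixed(data: bytes, text_tokenize):
    """Tokenize binary data: pass printable-ASCII runs (embedded
    strings) to an existing text tokenizer (str -> list[int]), and
    map every other byte to a dedicated appended byte id."""
    ids, run = [], bytearray()

    def flush():
        # hand any accumulated printable run to the text tokenizer
        if run:
            ids.extend(text_tokenize(run.decode("ascii")))
            run.clear()

    for b in data:
        if 0x20 <= b < 0x7f:     # printable ASCII: part of an embedded string
            run.append(b)
        else:                    # raw byte: use the appended byte ids
            flush()
            ids.append(BYTE_ID_BASE + b)
    flush()
    return ids
```

the point is that any text the old tokenizer already covered keeps its old ids, so the pretrained embeddings for those ids stay meaningful; only the 256 byte embeddings are new.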

there's a very early model at
https://huggingface.co/baffo32/pyc2py_alpha that could maybe seed the
.ipynb or its associated .py file in
https://github.com/xloem/techsketball .  the model doesn't produce
correct output yet; it just gets more likely to.

thinking of the intensity of this task, how [indescribably] hard it
can be to continue: my example here was with pyc files, for which the
source is almost always already available.  that made it easier for me
to start the project.  however, more useful example data could give a
bigger return when projects struggle.
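pyc is an easy example precisely because the training pairs come for free: compile the .py you already have and you get a (bytecode, source) pair.  a minimal sketch, using only the standard library (the helper name and paths are mine, not from the project's notebook):

```python
import pathlib
import py_compile

def make_pair(source_path: str):
    """Compile a .py file and return a (pyc_bytes, source_text)
    training pair: model input is the compiled bytecode, target is
    the original source."""
    src = pathlib.Path(source_path)
    pyc = src.with_suffix(".pyc")
    # doraise=True turns compile errors into exceptions instead of
    # printing them and returning silently
    py_compile.compile(str(src), cfile=str(pyc), doraise=True)
    return pyc.read_bytes(), src.read_text()
```

running this over any pile of python files yields supervised data with no manual labeling, which is what made starting here easy.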


More information about the cypherpunks mailing list