this is currently auto-uploading new snapshots of the model as it trains, for as long as google lets my notebook stay running. the loss is presently between 1.0 and 2.0, and the decompilations it produces no longer contain weird symbols. it's training on only a little under 30k unreviewed, unprocessed function examples, so the quality of the result is limited. the tokenizer can also make "errors" when strings sit adjacent to binary data that happens to fall within the ascii range; switching to single-byte tokenization might make that a non-issue (there's a small sketch of what i mean at the end of this post).

original function:

def example_sum(left, right):
    sum = left + right
    return sum

compiled bytecode:

b'\xe3\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00C\x00\x00\x00s\x0c\x00\x00\x00|\x00|\x01\x17\x00}\x02|\x02S\x00)\x01N\xa9\x00)\x03\xda\x04left\xda\x05right\xda\x03sumr\x01\x00\x00\x00r\x01\x00\x00\x00\xfa\x07demo.py\xda\x0bexample_sum\x1d\x00\x00\x00s\x04\x00\x00\x00\x00\x01\x08\x01'

decompiled function as of today:

\00 def example_sum(left, right, sum): sum

it doesn't look like much, but it's progress: it now puts the colon and the line break at the end of the function signature, which it wasn't doing reliably before, and the body looks better to me than it did. there are no extraneous symbols any more, aside from the nul character at the start, which i only just noticed. it might take me a bit to figure out what a really helpful next step is here, but hopefully i'll figure out how to get more pieces in of some kind or another, somewhere.
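
for anyone curious how a blob like the one above can be produced, here's a minimal sketch. it assumes the bytes are the marshal serialization of the function's code object (the \xe3 tag at the start and the embedded demo.py filename are consistent with that), so treat it as an illustration rather than the exact pipeline:

import marshal

def example_sum(left, right):
    sum = left + right
    return sum

# serialize the function's code object into a raw byte blob;
# the leading \xe3 is marshal's tag for a code object.
# note: marshal output is specific to the cpython version that produced it.
blob = marshal.dumps(example_sum.__code__)
print(blob)

going the other direction, marshal.loads(blob) gives the code object back, and dis.dis() on it shows the instruction stream the model has to learn to read.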
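
and this is the single-byte tokenization idea, also as a minimal sketch — the function names here are placeholders i'm using for illustration, not anything from the actual training code:

def bytes_to_tokens(blob):
    # every byte becomes its own token id (0-255), so an ascii-looking
    # string sitting next to raw binary can't confuse the tokenizer:
    # there are no multi-byte merges to get wrong
    return list(blob)

def tokens_to_bytes(ids):
    # exact inverse, so decoding back to bytes is lossless
    return bytes(ids)

tokens = bytes_to_tokens(b'\xe3\x02\x00\x00')  # -> [227, 2, 0, 0]

the trade-off is longer sequences, since every byte of the blob becomes its own token, but it removes the string-versus-binary ambiguity entirely.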