this is currently auto-uploading new snapshots of the model as it trains, for as long as google lets my notebook stay running. the loss is presently between 1.0 and 2.0, and the decompilations it produces no longer contain weird symbols. it's training on only a little under 30k unreviewed, unprocessed function examples, so the quality of the result is limited. the tokenizer can also make "errors" when strings sit adjacent to binary data that happens to fall within the ascii range; switching to single-byte tokenization might make that a non-issue (there's a small sketch of what i mean at the end of this post).

original function:

def example_sum(left, right):
    sum = left + right
    return sum

compiled bytecode:

b'\xe3\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00C\x00\x00\x00s\x0c\x00\x00\x00|\x00|\x01\x17\x00}\x02|\x02S\x00)\x01N\xa9\x00)\x03\xda\x04left\xda\x05right\xda\x03sumr\x01\x00\x00\x00r\x01\x00\x00\x00\xfa\x07demo.py\xda\x0bexample_sum\x1d\x00\x00\x00s\x04\x00\x00\x00\x00\x01\x08\x01'

decompiled function as of today:

\00 def example_sum(left, right, sum): sum

it doesn't look like much, but it's progress: it now puts the colon and the line break at the end of the function signature, which it wasn't doing reliably before, and the body looks better to me than it did. there are no extraneous symbols any more, aside from the nul character at the start, which i only just noticed. it might take me a bit to figure out what a really helpful next step is here, but hopefully i'll figure out how to get more pieces in of some kind or another, somewhere.
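
for anyone curious how a blob like the one above can be produced, here's a minimal sketch. it assumes the bytes are the marshal serialization of the function's code object (the \xe3 tag at the start and the embedded demo.py filename are consistent with that), so treat it as an illustration rather than the exact pipeline:

import marshal

def example_sum(left, right):
    sum = left + right
    return sum

# serialize the function's code object into a raw byte blob;
# the leading \xe3 is marshal's tag for a code object.
# note: marshal output is specific to the cpython version that produced it.
blob = marshal.dumps(example_sum.__code__)
print(blob)

going the other direction, marshal.loads(blob) gives the code object back, and dis.dis() on it shows the instruction stream the model has to learn to read.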
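
and this is the single-byte tokenization idea, also as a minimal sketch — the function names here are placeholders i'm using for illustration, not anything from the actual training code:

def bytes_to_tokens(blob):
    # every byte becomes its own token id (0-255), so an ascii-looking
    # string sitting next to raw binary can't confuse the tokenizer:
    # there are no multi-byte merges to get wrong
    return list(blob)

def tokens_to_bytes(ids):
    # exact inverse, so decoding back to bytes is lossless
    return bytes(ids)

tokens = bytes_to_tokens(b'\xe3\x02\x00\x00')  # -> [227, 2, 0, 0]

the trade-off is longer sequences, since every byte of the blob becomes its own token, but it removes the string-versus-binary ambiguity entirely.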