Re: [crazy][hobby][spam] Automated Reverse Engineering
i posted but my email disappeared on me, so here's another. my momentum on this is waning at the moment, though that may change. the first model was lost after a day or two when the notebook closed itself. reusing the token ids of the T5 tokenizer really speeds up training from the T5 model. i spent some time hacking a tokenizer that both handles byte data and tokenizes embedded strings, but it seems it would be better to keep the token ids of the previous tokenizer, to reuse more of the model. there's a very early model at https://huggingface.co/baffo32/pyc2py_alpha that could maybe seed the .ipynb, or its associated .py file, in https://github.com/xloem/techsketball . the model doesn't succeed yet, it just gets more likely to. thinking of the intensity of this task, how [indescribably] hard it can be to continue: my example here was with pyc files, for which the source is almost always already available. that made it easier to start the project, but more useful example data can give a bigger return when a project struggles.
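to illustrate the token-id reuse point only (this isn't the project's tokenizer hack), the stock t5 tokenizer already covers code-ish text, so a model seeded from t5 keeps its input/output space:

from transformers import AutoTokenizer

# loads the sentencepiece vocab that the t5 checkpoints were trained with
tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(tokenizer.tokenize("def example_sum(a, b): return a + b"))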
note: in my bumbling i found this doc which gives a general intro to flax/jax/huggingface from google: https://github.com/huggingface/transformers/blob/master/examples/research_pr... . i'm wondering if stuff like that doc is how jax reached me.
- show and tell -

the checkpoint on huggingface currently has a loss of around 2.1, so it doesn't succeed yet. but it turns out it can produce an output, and guesses a simple signature correctly:

git clone https://github.com/xloem/techsketball
cd techsketball
python3 demo.py

it compiles a very short function and then passes it through the model.

example_sum(a,b): return a + b

is decompiled into something like

example_sum(a,br): a, br, a

- problems -

after i saw this success running with vmem on my pi, my colab terminal disconnected trying to reproduce it, and the notebook training restarted on its own. the loss is presently back up to 3 or 4, losing the 12 hours of training. my system isn't responding well enough to try the example again at the moment or fix it, but hopefully soon it will turn around if it hasn't already, and i can figure out how to plug the saved checkpoint in to resume.
um er - i went back to that and it turned out i had just scrolled up, and the training was all there - i think i may have uploaded another snapshot - i let it train for several more hours, but when i returned the vm had run out of ram and X wasn't accepting keyboard input. it took me some time to figure out that the problem was with X and not the website, and i ended up losing the training - it's running again for a bit
a jax contributor kindly shared this with me. you can store tpu models precompiled, which significantly speeds launch time, by using a compilation cache folder:

from jax.experimental.compilation_cache import compilation_cache as cc
cc.initialize_cache("/path/name/here", max_cache_size_bytes=32 * 2**30)

not presently relevant as i'm on gpu, but i should try to use this with tpus.
this is currently auto-uploading new snapshots of the model training as it goes, for as long as google lets my notebook stay running. it's presently between 1.0 and 2.0 loss and is making decompilations that don't have weird symbols in them. it's training on only a little under 30k unreviewed and unprocessed function examples, so the quality of the result is limited. the tokenizer can make "errors" when there are strings adjacent to binary data that happen to be within the ascii range; using single-byte tokenization might make that moot.

original function:

def example_sum(left, right):
    sum = left + right
    return sum

compiled bytecode:

b'\xe3\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00C\x00\x00\x00s\x0c\x00\x00\x00|\x00|\x01\x17\x00}\x02|\x02S\x00)\x01N\xa9\x00)\x03\xda\x04left\xda\x05right\xda\x03sumr\x01\x00\x00\x00r\x01\x00\x00\x00\xfa\x07demo.py\xda\x0bexample_sum\x1d\x00\x00\x00s\x04\x00\x00\x00\x00\x01\x08\x01'

decompiled function as of today:

\00 def example_sum(left, right, sum):
    sum

it doesn't look like much, but it's progress because it put the : and the line break at the end of the function signature, which it wasn't doing reliably before, and the body looks better than before to me. it doesn't have extraneous symbols any more, aside from the nul character at the start, which i only just noticed. might take me a bit to figure out what a really helpful next step is here, but hopefully i'll figure out how to get more parts in of some kind or another, somewhere.
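a minimal sketch, not the notebook's actual pipeline, of producing a bytecode/source pair like the one above; on recent cpython 3.x the marshalled code object begins with the b'\xe3' byte shown:

import marshal

source = (
    "def example_sum(left, right):\n"
    "    sum = left + right\n"
    "    return sum\n"
)
namespace = {}
exec(compile(source, "demo.py", "exec"), namespace)
# serialize the function's code object the way .pyc files store code
blob = marshal.dumps(namespace["example_sum"].__code__)
print(blob[:8])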
On 1/19/22, k <gmkarl@gmail.com> wrote:
decompiled function as of today: \00 def example_sum(left, right, sum): it doesn't look like much, but it's progress
There will be a party if your new ghidra prints printf("Hello world.\n"); https://github.com/NationalSecurityAgency/ghidra
might take me a bit to figure out what a really helpful next step is here, but hopefully i'll figure out how to get more parts in of some kind or another, somewhere.
Maybe feed in lots of tiny source code example unit tests that have only one possible reverse, then two possible reverses using the ones as discriminator, then three using the twos, etc. Probably no one has yet written a complete testbook for all the functions of any given language that could then be compiled and dumped in, but it might be possible to automate the creation of one by grokking all the function definitions from the source of whatever language (a sketch follows). The closer to machine instruction language, the greater the chance of correct reversal. So perhaps step the work from the machine base backward, in intermediate stages, from the hardware level up the tree of abstraction layers to the specific human language, instead of trying to go straight from, say, some high-level python lang directly to low-level x86 lang. And slam entire linux kernels and windows apps through it for noisy fun.
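A hedged sketch of the "grok all the function definitions" part, in python since that's the language already in play; the directory path is just a placeholder:

import ast
import pathlib

for path in pathlib.Path("/usr/lib/python3.9").rglob("*.py"):
    try:
        tree = ast.parse(path.read_text(errors="ignore"))
    except (SyntaxError, ValueError):
        continue
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # each definition could become one tiny compilable example
            print(path, node.name, node.lineno)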
this has been going slower than needed because colab was bailing during compilation when i tried to run the model on google's tpus. today i made a google cloud vm, precompiled the model in their shell, and added precompilation support to the notebook. it was _really_ hard to make the vm, my psychosis kept having me forget i was doing it and do something else, over and over and over again. but now, using the tpus, it is much faster, days turn into minutes. i haven't poked around with it much yet.

i also found there is a series of t5 models pretrained on individual bytes instead of tokens: https://huggingface.co/docs/transformers/model_doc/byt5 (a loading sketch is below)

exciting developments. big next steps:
- have the code gather much much more data.
- try a bigger model that can learn more complex things.

i've been running the model on their shell for maybe an hour and the loss is down to 0.5 or so. they charge by the hour so i should really turn it off. i'm using the lowest-end tpus so that it models how the notebooks should perform after i terminate the vm.
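for reference, a minimal sketch of loading that byte-level t5; these are the public google/byt5-small weights, nothing from this project:

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# byt5 tokenizes the raw utf-8 bytes of the input, so there is no subword vocab to fight with
ids = tokenizer("def example_sum(left, right):", return_tensors="pt").input_ids
print(ids.shape)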
- it turns out that deserialization of compiled tpu code isn't implemented in colab notebooks yet. might be easy to implement, might be nearly impossible, haven't looked. so not too much was accomplished by the use of tpu vms other than realising they're there for when a lot of speed is needed.
- i started some code for gathering more data, currently at https://github.com/xloem/techsketball/blob/main/apt_.py . it enumerates packages from all architectures of debian and ubuntu distributions and pairs the source, debug symbols, and binaries together. it parses line number/address mapping information for all files from the dwarf information.

collecting that data is a big thing. it's enough data for a large entity to solve this problem without issue. the intensity of it might change my task, unsure. it's notable that commonly available research models might need another year or two to really make this task simple, clear, and easy for a raspberry pi user like me, but that it's also still quite reasonable to shore up the situation with some understanding of binary layouts and model architectures.
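a hedged sketch (not the actual apt_.py code) of pulling that line-number/address mapping out of a binary's dwarf info with pyelftools; the path is a placeholder:

from elftools.elf.elffile import ELFFile

with open("/usr/lib/debug/some_binary", "rb") as f:
    elf = ELFFile(f)
    if elf.has_dwarf_info():
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            lineprog = dwarf.line_program_for_CU(cu)
            if lineprog is None:
                continue
            for entry in lineprog.get_entries():
                if entry.state is None:
                    continue
                # each state row pairs a machine address with a source file index and line
                print(hex(entry.state.address), entry.state.file, entry.state.line)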
- a large T5 model could be tpu-compiled on colab notebooks by calling pmap() on individual blocks rather than the whole model (a rough sketch of the idea is just below)
- much larger models could be trained by masking the training weights to reduce autograd memory load, as has been done for at-home training of large text generation models
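a toy sketch of the per-block pmap idea, not flax t5 itself; the block function, parameters, and shapes are stand-ins:

import jax
import jax.numpy as jnp

devices = jax.devices()

# stand-in for one transformer block; the real blocks would come from the flax t5 module tree
def block_apply(params, x):
    return jnp.tanh(x @ params["w"])

# toy parameters for two "blocks"
block_params = [{"w": jnp.eye(8)} for _ in range(2)]

# pmap (and so compile) each block separately instead of the whole model,
# keeping each individual xla compilation small
block_fns = [jax.pmap(block_apply) for _ in block_params]

# pmap expects a leading device axis on every argument
replicated = [jax.device_put_replicated(p, devices) for p in block_params]
x = jnp.zeros((len(devices), 4, 8))  # (devices, batch, features)

for fn, params in zip(block_fns, replicated):
    x = fn(params, x)
print(x.shape)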
- a large pretrained model that has significant understanding of english logic and knowledge could be finetuned on bytes by training perceiver-like cross-attention embedding/tokenization encoders and decoders to match the behaviors of its original tokenizer and embeddings, but accept byte streams.
- the perceiver masked lm model uses cross attention roughly as such:

  PerceiverLayer(config, is_cross_attention=True, qk_channels=qk_channels, v_channels=v_channels, num_heads=num_heads, q_dim=q_dim, kv_dim=kv_dim, widening_factor=config.widening_factor, use_query_residual=config.use_query_residual)

  and calls it as:

  cross_attention(trained_embeddings, attention_mask=None, head_mask=None, inputs=inputs, inputs_mask=inputs_mask)

- I'm curious what kind of memory and computation bounds there are on data input size for the mainstream trained models. Could we feed an entire binary in? Could we feed an entire tarball in?
- I'm curious what the state of the large-input models is, like bigbird. Are they helpful here?
- I'd also like to run the current model on colab by finding a workable trick to prevent the compilation crash, by either compiling in smaller chunks or using a different framework, possibly without compilation.
- I skimmed bigbird's description a little. it's trained for sequence lengths of 4096 tokens, but it doesn't look like memory requirements would rise too much if that were increased somehow. curious if you can finetune a model with increased position embeddings, probably can.
- I glanced at realm, which apparently trains something to select among documents for helpful additional data. very brain-like. realm took 80 tpus to train, but that's possibly because it was developing understanding of human language and knowledge from scratch. we have existing pretrained models that can be finetuned to skip a lot of that.
- thinking a little about training a model to sustain useful state as it moves through its data and tasks
- thinking a little about quickly training a document retriever by calculating loss from multiple docs retrieved in parallel (it then might either backpropagate to the mechanism used to select them, or can be trained as a classifier from the results; i've tried this before briefly once with success)

the "right" solution might be to separate:
- binary layout
- decompilation
- commenting
into three different tasks, with human design providing for the clear and simple bits to reduce the complexity.
idea: a model could be trained to guess the source layout by sequentially producing filepaths and selecting areas of the source code to consider, like an agent that's similar to language generation except the output words/phrases are unordered: a set of filepaths. might be interesting to try training a model based on there being multiple correct answers rather than just one. it likely works great but could run into an issue.
I'm thinking I'd like to try training a bytes tokenizer for bigbird and extending its sequence length to entire binaries (a rough sketch is below). I expect the result to be about 30% successful, given my lack of experience and time.
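A hedged sketch of that plan; the vocab layout, sizes, and file path are made up for illustration:

from transformers import BigBirdConfig, BigBirdForMaskedLM

# 256 byte values plus a few special tokens, similar in spirit to byt5
config = BigBirdConfig(vocab_size=259, max_position_embeddings=65536)
model = BigBirdForMaskedLM(config)  # from scratch, no pretrained weights

# a raw binary then maps straight to ids, here with a small offset reserved for specials
ids = [b + 3 for b in open("/usr/bin/true", "rb").read()]
print(len(ids))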
Another idea: We could design something using human knowledge or ghidra, then review it and figure out how a model could have designed it on its own.
Note: I won't be effective at using the cutting edge here, because I am not hanging in research chats on discord collaborating with researchers sharing their latest work. Anybody can do that by hopping through the chat servers, asking around. It feels a little overwhelming for me.
uhhh the discord i remember the best is eleutherai's. they made gptj and also an open source coding assistant app for vscode.
regarding the idea for saving state, that could work here. basically you take a fancy text generation model and finetune it to produce its own embeddings, by feeding it one token at a time instead of a document and each time feeding back its generated state as embeddings. it then is possibly bound by state size and complexity rather than input size and output size, and can possibly generate sensical documents of arbitrary length. I have a git repo somewhere where I implemented this last year or so.

also, I learned more about jax jit compilation while working on the memory-efficient attention improvements. jax has an option to inline or not inline subfunctions, so the issue is likely bisectable and removable by jit()ing subparts of the jax compilation that fails on colab. theoretically they aren't inlined by default (inlining would prevent the approach).
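a minimal sketch of that bisection idea with stand-in functions; the point is only that jit()ed subparts stay separate sub-computations unless inlining is requested via jit's inline option:

import jax
import jax.numpy as jnp

@jax.jit
def attention_ish(x):
    # stand-in for whichever sub-part of the model trips the colab compiler
    return jax.nn.softmax(x @ x.T) @ x

@jax.jit
def whole_model(x):
    # the outer jit still wraps everything; the inner jit is traced as its own call
    return attention_ish(x) + x

print(whole_model(jnp.ones((4, 4))))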
i'm suspecting some people have been using fairseq for things like this: https://github.com/pytorch/fairseq . it's a facebook project focused on training sequence transformer models. i noticed there was a deep-learning-related repo on the old gitopia, too; could be meaningful to look through such things for community-built efforts
so maybe:

pip3 install git+https://github.com/xloem/GPTb

import torch
from GPTB import GPTBLMHeadModel
from transformers.models.gpt2.configuration_gpt2 import GPT2Config

config = GPT2Config()  # pass settings, or pull the config from some pretrained model and tweak it
config.rebias = True  # additional parameter for GPTB
vocab_size = config.vocab_size

model = GPTBLMHeadModel(config)
model.train()
optimizer = torch.optim.AdamW(model.parameters())  # any optimizer

model.zero_grad()
optimizer.zero_grad()
past_hidden_states = None
past_logits = None
for batch_of_tokens in data:  # shape of batch_of_tokens is (batchsize, 1)
    if past_logits is not None:
        # the logits produced last step should predict this step's tokens
        loss = torch.nn.functional.cross_entropy(
            past_logits.view(-1, vocab_size), batch_of_tokens.view(-1)
        )
        loss.backward()
        optimizer.step()
        model.zero_grad()
        optimizer.zero_grad()
    # feed the generated state back in as embeddings for the next step
    past_logits, past_hidden_states, extra = model(
        batch_of_tokens, past_hidden_states=past_hidden_states
    )
It's getting more normal to use recurrent models that no longer have bounds on their input and output sizes. This removes half the challenge of this task. https://github.com/BlinkDL/RWKV-LM
participants (4)
- grarpamp
- k
- Undiscussed Horrific Abuse, One Victim & Survivor of Many
- Undiscussed Horrific Abuse, One Victim of Many