Re: [crazy][hobby][spam] Automated Reverse Engineering
note: the huggingface demo passes information to the model using token ids. token ids are just indices into a vocabulary of character sequences that tend to occur together (the tokenizer counts these and decides the vocabulary). with something based on math, since it's going to be learning using linear algebra, i'm wondering if it might make sense to retain the numeric value of the inputs. this could mean bypassing the use of integer input ids. the integer input ids are converted into high-dimensional vectors using an 'embedding' matrix at the very start of the model. this matrix could be hand-altered by finding it among the model's parameters, or removed/skipped entirely. a thought. the coefficients in the embedding matrix are trained via backpropagated gradients to determine the vectors. i think they end up being roughly random, with some of their dimensions clustering similar data near each other and such.
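here's a toy illustration of what that embedding step is, in case it helps: each token id just selects a row of a trained matrix. the sizes and values below are stand-ins, not the demo's actual weights.

import jax
import jax.numpy as jnp

# stand-in embedding matrix, roughly t5-small sized; the real one is a trained parameter
vocab_size, d_model = 32128, 512
embedding = jax.random.normal(jax.random.PRNGKey(0), (vocab_size, d_model))

token_ids = jnp.array([37, 1782, 55])   # what the tokenizer emits
vectors = embedding[token_ids]          # (3, 512): the model's actual first step is this row lookup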
ok, this was great motion. i think vanilla models have a maximum sequence length. this can be expanded by altering the attention function so it isn't O(n^2) in memory; there's a paper out there on one approach to this (a rough sketch of the idea is below). another idea is to chunk the text in some way and train models to handle the chunked text. that could also become two small projects instead of one if something like reinforcement learning is used to make that effective.
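here's a rough sketch of that idea in jax, just to make it concrete: the O(n^2) memory comes from materializing the full attention matrix, and processing queries a chunk at a time keeps only a slice of it in memory. this is a simplification of the paper linked in the next message (which also chunks the keys and streams the softmax); illustrative only, not that paper's algorithm.

import jax
import jax.numpy as jnp

def chunked_attention(q, k, v, chunk_size=128):
    # q, k, v: (seq_len, dim)
    scale = 1.0 / jnp.sqrt(q.shape[-1])
    outputs = []
    for start in range(0, q.shape[0], chunk_size):
        q_chunk = q[start:start + chunk_size]     # (chunk, dim)
        scores = (q_chunk @ k.T) * scale          # (chunk, seq) -- never the full (seq, seq)
        weights = jax.nn.softmax(scores, axis=-1)
        outputs.append(weights @ v)               # (chunk, dim)
    return jnp.concatenate(outputs, axis=0)

q = k = v = jnp.ones((1024, 64))
out = chunked_attention(q, k, v)                  # same result as plain attention, less peak memory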
i summarized some things at https://github.com/xloem/techsketball/blob/main/README.md including a link to that memory-reducing paper at https://arxiv.org/abs/2112.05682 and some python import statements. there's code for this paper at https://github.com/AminRezaei0x443/memory-efficient-attention but i think the implementation there leaves out some oft-used attention features and may need adaptation, unknown. i'm looking at the huggingface source and seeing this model doesn't by default provide for raw embeddings as inputs. some of their models do.
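for reference, here's a hedged sketch of the raw-embeddings route on the pytorch side of transformers, which does accept precomputed vectors via inputs_embeds / decoder_inputs_embeds; the flax class in the notebook would need adaptation to do the same.

from transformers import T5TokenizerFast, T5ForConditionalGeneration

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

ids = tok("some input text", return_tensors="pt").input_ids
vectors = model.get_input_embeddings()(ids)            # (1, seq_len, d_model)
out = model(inputs_embeds=vectors, decoder_inputs_embeds=vectors)
print(out.logits.shape)                                # (1, seq_len, vocab_size)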
i wrote a quick call-and-go class to generate short pairs of bytecode and source code from the python runtime at https://github.com/xloem/techsketball/blob/main/find_pycode.py . it might be reasonable to use this as a proof of concept, filtering on input length, since others are likely already adding the research paper to T5 somewhere.
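this isn't the repo's code, but the idea is roughly the sketch below: walk the functions already loaded in the interpreter and pair their disassembly with their source (the real class may use raw co_code bytes rather than disassembly text).

import dis, inspect, sys

def short_pairs(max_source_len=512):
    # yield (bytecode, source) pairs for short functions in loaded modules
    for module in list(sys.modules.values()):
        for name, func in inspect.getmembers(module, inspect.isfunction):
            try:
                source = inspect.getsource(func)
            except (OSError, TypeError):
                continue
            if len(source) <= max_source_len:
                yield dis.Bytecode(func).dis(), source

bytecode, source = next(short_pairs())
print(bytecode)
print(source)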
yay progress! time for me to spin in circles a bit. [crazy]
i think this example notebook shows training a transformer model on the free tpus https://colab.research.google.com/github/huggingface/notebooks/blob/master/e...
On Wed, 29 Dec 2021 17:44:57 -0500 k <gmkarl@gmail.com> wrote:
i think this example notebook shows training a transformer model on the free tpus https://colab.research.google.com
again, fuck you karl and your fucking JOOGLE SPAM. Take it elsewhere.
yeah i dunno =/ but hey, big corps guiding advanced tech to use big computing resources and then monopolising control of them is just like how we spam lists to do things, maybe! use whatcha got? gotta figure out how to turn the problem into a different solution
see + marked lines below

commit 2752300c472e56598ccce9b3887588779c9fd22c (HEAD -> main, origin/main)
Author: xloem <0xloem@gmail.com>
Date: Wed Dec 29 19:30:10 2021 -0500

    Update README.md

diff --git a/README.md b/README.md
index b000b37..3d13e2b 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,8 @@ I have trouble doing things, and don't really know how to do that, so just a few
 I recommend using jax/flax and google cloud, because google has a TPU research program free trail w/ application that could be leveraged for compute once a setup is designed.
+I do NOT normally recommend using google cloud, because your muscle contraction timing will be harvested by javascript to guide your behavior for some of the world's largest marketing computers.
+
 Google's systems are accessible on the web to the public for use at https://colab.research.google.com/
i got some simple data prepared and into the training implementation, but i haven't written the training part yet, so it's hard to continue. i'm at https://colab.research.google.com/github/huggingface/notebooks/blob/master/e... code is https://github.com/xloem/techsketball/blob/main/model_import_sketch.py . the model = line is commented out near the top to make it faster to run when hunting for typos and such.
The linked function will need to be mutated for T5 per the T5 page linked earlier in this thread and farther down in my repo readme. Or the page's instructions could simply be used, rather than this TPU-oriented tutorial.
I've pasted a training function into the .ipynb in https://github.com/xloem/techsketball/ . it's not mutated into a .py yet. i've also added mmapping functionality to the data generator so data larger than ram can be used and cached between tests; it is not used yet. i code with dense bugs due to spasms i navigate, so the next step is roughly rote debugging of the .ipynb
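the mmapping is roughly along these lines (an illustrative sketch, not the actual code in the repo): cache fixed-length token rows in a numpy memmap so the dataset can exceed ram and be reused between runs.

import numpy as np

num_examples, seq_len = 100_000, 512
cache = np.memmap("pairs_cache.dat", dtype=np.int32, mode="w+",
                  shape=(num_examples, seq_len))

def write_example(index, token_ids):
    row = np.zeros(seq_len, dtype=np.int32)    # pad to fixed length
    row[:len(token_ids)] = token_ids[:seq_len]
    cache[index] = row

write_example(0, [101, 2023, 2003, 102])
cache.flush()                                  # reopen later with mode="r" to reuse the cache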
looks like i pasted together data batching code that doesn't line up. basically the code needs to be mutated such that each batch is a dict, rather than one dict for the whole data. the example at https://github.com/huggingface/transformers/blob/master/examples/flax/langua... uses transformers.BatchEncoding for this.
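something like this sketch is the shape it needs to end up in (array contents here are placeholders):

import numpy as np
from transformers import BatchEncoding

def batches(input_ids, labels, batch_size=16):
    # yield one dict-like BatchEncoding per batch, not one for the whole dataset
    for start in range(0, len(input_ids), batch_size):
        yield BatchEncoding({
            "input_ids": np.asarray(input_ids[start:start + batch_size]),
            "labels": np.asarray(labels[start:start + batch_size]),
        })

data = np.ones((64, 128), dtype=np.int32)
for batch in batches(data, data):
    print(batch["input_ids"].shape)   # (16, 128)
    break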
so, the jax/flax hugging face t5 output doesn't include loss the way the huggingface t5 documentation implies. the pytorch output does. here's the loss from the huggingface pytorch t5 code; for me this is around line 1643 of my old checkout of github.com/huggingface/transformers, in src/transformers/models/t5/modeling_t5.py:

if labels is not None:
    loss_fct = CrossEntropyLoss(ignore_index=-100)
    loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))
    # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb...

CrossEntropyLoss is a very common function in transformer models that takes a vector of logs of odds of the options plus which option is correct, and returns how close the scores are to selecting the correct one. if you look it up it does something like take the log of them all, the difference of one, and divide by the sum, or something not too complex and relatively intuitive.
oh, and .view(-1, ...) means to reshape an n-dimensional tensor so that it has the dimension sizes listed, where -1 means to make that dimension as large as needed to fit all the elements. so .view(-1) turns it into a 1-dimensional array.
wow those two emails are _full_ of errors. don't take the log of logits; you'll get a double-log probability, and nobody will know what to do with that except people investigating the insides of neural network models that manipulate other neural network models or something
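to pin the loss down properly this time, here's a small hedged sketch in jax/optax: cross entropy takes raw logits, applies a log-softmax (not a plain log), and returns the negative log-probability assigned to the correct option.

import jax
import jax.numpy as jnp
import optax

logits = jnp.array([2.0, 0.5, -1.0])   # raw scores straight out of the model, not probabilities
label = 0                              # index of the correct option

by_hand = -jax.nn.log_softmax(logits)[label]
via_optax = optax.softmax_cross_entropy(logits, jax.nn.one_hot(label, 3))
print(by_hand, via_optax)              # same value both ways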
i'm looking at https://github.com/huggingface/transformers/blob/master/examples/flax/summar... , which is flax code for a summarization task, and noting that the decoder input ids are the labels shifted by one. i'm thinking that summarization is basically the same as translation: seq2seq. don't really know. here's their flax loss function: https://github.com/huggingface/transformers/blob/master/examples/flax/summar... and at https://github.com/huggingface/transformers/blob/master/examples/flax/summar... i'm thinking that the labels are _not_ passed to the model ('pop'), which lines up with not seeing the parameter in the source code for the flax model
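a hedged sketch of that shift, so i remember what it means (the real example uses a shift_tokens_right helper; the special token ids below are placeholders):

import jax.numpy as jnp

def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    # decoder inputs are the labels moved one position right, with a start token in slot 0
    shifted = jnp.zeros_like(labels)
    shifted = shifted.at[:, 1:].set(labels[:, :-1])
    shifted = shifted.at[:, 0].set(decoder_start_token_id)
    return jnp.where(shifted == -100, pad_token_id, shifted)   # masked label slots become padding

labels = jnp.array([[42, 43, 44, -100]])
decoder_input_ids = shift_right(labels)
# the labels themselves get popped out of the batch before calling the flax model;
# they're only used in the loss.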
i've addressed bugs enough that it actually gets to the point where the tpus evaluate the model with passed data. so far the first evaluation pass hasn't returned, maybe because this demo is low-end, unsure. i have no idea how long it should take and should try a smaller model to continue debugging.
[after a number of psychotic breaks] the training loop runs now. it's likely not running very effectively. for the notebook to run right now, an uncommitted change is needed:

  # compute loss
  loss = optax.softmax_cross_entropy(logits, flax.training.common_utils.onehot(labels, logits.shape[-1]))
- padding_mask = decoder_attention_mask
+ padding_mask = batch['decoder_attention_mask']
  loss = (loss * padding_mask).sum() / padding_mask.sum()
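for context, with that change the loss computation is roughly the function below (pieced together from the snippet above and the huggingface flax examples, not the notebook verbatim):

import optax
from flax.training.common_utils import onehot

def loss_fn(logits, labels, batch):
    # token-level cross entropy, averaged over non-padding decoder positions
    loss = optax.softmax_cross_entropy(logits, onehot(labels, logits.shape[-1]))
    padding_mask = batch["decoder_attention_mask"]
    return (loss * padding_mask).sum() / padding_mask.sum()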
[missing change was committed] t5-base with a batch size of 6 is looking for 22 GB of hbm (tpu memory). the crashes complain that only 7 GB is available; might be a notebook limit or a time-of-day thing
hbm limits relate to the TPU linked to the notebook. a v2-8 (i think?) has 64 GB which gets split into 8x 8GB if all 8 cores are used. TRC provides larger TPUs, but it still raises the memory size issue.
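the way the flax examples handle that split, roughly (a sketch using flax's shard helper; the batch contents are placeholders), is to reshape each batch so its leading dimension is the device count, which is why the per-core budget is what matters:

import jax
import numpy as np
from flax.training.common_utils import shard

devices = jax.local_device_count()                  # 8 on a v2-8
batch = {"input_ids": np.ones((16, 128), dtype=np.int32)}
sharded = shard(batch)                              # (16, 128) -> (devices, 16 // devices, 128)
print(devices, sharded["input_ids"].shape)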
it's successfully fitting the model to the task on the colab gpu. the tpu compilation times out colab's rpc connection to google's cloud. the eta for 10 runs through my example data is within 520 hours (3 weeks) on the free colab gpu notebook using a batch size of 16.
batch size of 20 is about the same speed. redaction: this is not actually the free colab; to make it work on the free colab, you'd drop the batch size so it fit in ram. while frustrated with the tpu rpc timeouts i bought the paid colab. it didn't help, turns out, because the timeout is hardcoded in the tensorflow source; the google cloud sdk shouldn't have the timeout. this notebook is using a single tesla p100 with 16 GB of vram. batchsize=24 exhausts the vram. might let it run for a bit, see how fast it fits
the T5 tokenizer the current code uses removes linebreaks, so the output source isn't recompilable. last night i added functions to find_pycode.py to train a tokenizer for the source that preserves linebreaks. there is a further big issue: embedded strings are not tokenized on the input side, so the model has to learn the patterns of the tokenizer, reproducing it internally, to succeed in producing the tokenized output. i think a tokenizer just converts words to numbers in a compressing way that preserves some meaning of structure, e.g. number-per-syllable or such. i'm thinking the best solution here might be to make a tokenizer that preserves nonascii bytes in its output, but i don't think it's the most pressing issue.
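one hedged way to get such a tokenizer (not the functions i added to find_pycode.py, just a sketch) is a byte-level bpe trained on the source strings, which keeps linebreaks and arbitrary bytes recoverable:

from tokenizers import ByteLevelBPETokenizer

sources = ["def f(x):\n    return x + 1\n", "def g():\n    return 'caf\u00e9'\n"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(sources, vocab_size=8192, min_frequency=2)

ids = tokenizer.encode(sources[0]).ids
print(repr(tokenizer.decode(ids)))   # the linebreaks survive the round trip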
note:
- additionally, the perceiver model structure may not need tokenization
- and, google made a new T5 called LongT5 that can handle much larger data already; the code is usually released in the coming months

given many functions are short, i might skip the length problem for now. but now that something is training and looks to have some success (and to be improvable with management of embedded strings), it could make sense to:
- collect data for other languages
- organise code better
- implement reduced memory usage for faster training
- address string encoding for faster training
- improve the training itself

it will be clearer for me after seeing results of the current training. it's helpful to kind of look at results. oh here we go: it needs to save the model for continued training if interrupted, and for use after training. that's important since colab could halt during this test.
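the save/resume part should be straightforward with the standard transformers calls; a sketch (paths are illustrative; flax models support the usual save_pretrained / from_pretrained round trip):

from transformers import FlaxT5ForConditionalGeneration

model = FlaxT5ForConditionalGeneration.from_pretrained("t5-small")
# ... training steps update the params ...
model.save_pretrained("checkpoint-step-1000", params=model.params)

# later, or after a colab disconnect:
model = FlaxT5ForConditionalGeneration.from_pretrained("checkpoint-step-1000")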