[ot][spam][crazy] adapters for semibalanced trees?
i wonder if i can use the adapter finetuning stuff to feel like i am moving forward on the semibalanced trees a little. i'm imagining a model that mutates code in a way that is helpful: stabilising it, simplifying it, anything helpful for the goal. if the code were small enough, it might fit right into a language model. there may even be long-context language models on huggingface, like transformer-xl, that adapter-transformers could just load right up.
i seem to be a little interested now in discerning whether transformer-xl will run on my local system. there's a doc page at https://huggingface.co/docs/transformers/model_doc/transfo-xl . it starts off with this example:

```python
from transformers import TransfoXLConfig, TransfoXLModel

# Initializing a Transformer XL configuration
configuration = TransfoXLConfig()

# Initializing a model from the configuration
model = TransfoXLModel(configuration)

# Accessing the model configuration
configuration = model.config
```

finding a pretrained transformer-xl model would be very helpful for using adapter-transformers with it.
i ended up searching huggingface's github for closed issues that mentioned 'transformer-xl'. it showed me results for other long-context transformers than just that one; the most recent one it showed completed was, i think, longt5. i didn't find transformer-xl models in the huggingface hub. the search was made more difficult since 'xl' has been adopted as an acronym for large size rather than long context, so i may have missed them. but there are longt5 pretrained models. if one of them is small enough to do local inference on my system, i may be able to plug it into adapter-transformers. don't know for sure.
i am risking my focus in order to spend a little time learning what the difference might be between local attention and transient global attention in longt5. as a side note, it looks a little like people are using longt5 heavily.
long story short, from https://arxiv.org/pdf/2112.07916.pdf , transient global attention is a new approach to attention invented for longt5, which appears to reliably outperform local attention in the same architecture. that, combined with finetunings i saw on the huggingface hub using tglobal attention, seems enough reason to try the tglobal models without fully understanding things in the moment.
it looks like this is the smallest of the most official longt5 models: https://huggingface.co/google/long-t5-tglobal-base
i had to install the transformers package from git in order to use LongT5Model. that's likely something like `pip3 install --upgrade git+https://github.com/huggingface/transformers`.
something i could theoretically try would be to manually process the weights to reduce their precision.
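a rough sketch of that idea, assuming a float16 copy of the weights stays usable for inference (untested):

```python
from transformers import AutoModelForSeq2SeqLM

# load the float32 checkpoint, cast the weights to float16, and save a local copy
model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")
model = model.half()
model.save_pretrained("long-t5-tglobal-base-fp16")
```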
the example code in the docs doesn't work, but the model loads when i put it in a normal huggingface 'pipeline' class. it appears to do forward inference on my cpu; i haven't figured out how to put it on the gpu.
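for reference, a minimal sketch of putting it on the gpu through the pipeline class; the task name here is my assumption, not necessarily what i ran:

```python
from transformers import pipeline

# device=0 selects the first cuda gpu; the default (device=-1) stays on cpu
summarize = pipeline(
    "summarization",
    model="google/long-t5-tglobal-base",
    device=0,
)
print(summarize("some long document text ...", max_length=32))
```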
there appears to be useful information on expected ram requirements at https://huggingface.co/docs/transformers/v4.16.2/en/performance#model-weight...
long-t5-tglobal-base is using 1539 MB nvidia GPU ram for me with float32 weights. I unloaded my desktop environment and it would perform simple forward inference for me.
it turns out adapter-transformers actually has a mutated entire copy of the transformers repository inside it, and replaces this on the user's system. they update this regularly, but their current version does not have longt5. i added longt5, forked the repo, and pushed it. it doesn't work with adapters yet until the instructions at https://github.com/adapter-hub/adapter-transformers/blob/master/adapter_docs... are followed. i'm interested in trying to follow these for a little bit.
i actually did this: added adapters to longt5 in my repo branch, and successfully trained on a csv file consisting of lines equal to "a,b". after 400 data items, loss dropped to around 0.6, and eval loss was 0.0. i don't really know if it's _working_ or just looks like it is, nor how it is interpreting the csv columns, yet.

```bash
TRANSFORMERS_OFFLINE=1 python3 run_summarization.py \
  --model_name_or_path google/long-t5-tglobal-base \
  --do_train --do_eval \
  --output_dir test \
  --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
  --overwrite_output_dir --predict_with_generate \
  --train_file test.csv --validation_file test.csv \
  --train_adapter True \
  --num_train_epochs 100
```
it's surprising that i did this! i did not review the implementation structures to see if i missed anything once i got it to run. i also did not implement parallelism, tests, or documentation.
the readme says that the first column is the input, and the second column is the output. sounds reasonable. the script can also extract specific columns from csv or jsonlines input. a normal thing to do here (assuming the code works; i could test on one of their examples) might be to train a model in an unsupervised manner on git commits, and then use something analogous to prompt tuning to teach it to do the desired form of refactoring etc.
below is as far as i got. when i tried running it on xsum, i ran out of gpu ram. i imagine there are a number of ways to address that. i'm thinking a simplest approach might be to use a smaller model, even if it doesn't have the long-context support. another idea is to produce a shorter set of encodings of the data, which might mesh with other uses. i didn't try using the model in a forward manner, so i don't know whether it actually succeeded for sure.

```bash
# training
git clone https://github.com/xloem/adapter-transformers --branch=longt5
pip3 install ./adapter-transformers
for ((ct=0; ct<4; ct++)); do echo a,b >> test.csv; done
python3 adapter-transformers/examples/pytorch/summarization/run_summarization.py \
  --model_name_or_path google/long-t5-tglobal-base \
  --do_train --do_eval \
  --output_dir test \
  --per_device_train_batch_size=1 --per_device_eval_batch_size=1 \
  --overwrite_output_dir --predict_with_generate \
  --train_file test.csv --validation_file test.csv \
  --train_adapter True \
  --num_train_epochs 100
```
xlnet seems a more normative way to do this than longt5. notably the longt5 tokenizer doesn't include tokens for linebreaks. https://github.com/xloem/codefudge

# CodeFudge

## Ingredients:

1. one git repository of your choosing, with the same parent folder as this one
2. bash, python, pytorch, datasets, rouge-score and git+https://github.com/xloem/adapter-transformers.git@longt5
3. gpu set up with pytorch
4. the guts to handle possibly-exhausted disk space or ram if tight

## Steps

1. make funny noises
2. in the git repository of your choosing, run hist.bash. generates *.file and *.commit files in parent directory
3. squirm around confusedly
4. while waiting for hist.bash to boil, run hist2json.py . generates test.json from *.file and *.commit files.
5. forget what you are doing by accident, then return.
6. while hist2json.py simmers, run example_run_summarization.bash . optionally modify to taste: up MAX_IN_LEN, MAX_OUT_LEN, BATCH_SIZE, or EPOCHS if you have more than 2GB gpu ram, and more speed and time.
7. blend way too much and let sit for an hour or however long you feel like
8. have fun trying to figure out how to use the model adapter trained on git history.

serves 2-3 confused software developers and/or machine learning hobbyists.

## Explanation

PLEASE EXPLAIN
so basically, codefudge is some scripts to train an adapter based on the history of a git repository. hist.bash breaks the git history into *.file files containing commit message and file content, and *.commit files that contain the file diff within the commit. at time of writing, each file changed in the commit is a separate pair, because i was low on gpu ram; that could be improved. hist2json simply converts those files into data in the format the huggingface training scripts take: the *.file files are the input, and the *.commit (diff data) files are the output the model is trained on. so, the model would possibly learn to produce file changes that match pairs of commit messages and file contents. i tried this briefly and my loss got to around 1.5 or so (from 9 or 10) within a few hours on google colab [before the session was terminated and gpu access disabled for it, despite my purchase of their $50 high-end plan, which apparently gives me the same 16GB gpu as the public plan atm; they say it depends on usage].
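the real hist2json.py is in the codefudge repo; as a sketch of the shape of the conversion (the 'text'/'summary' column names and the parent-directory layout are assumptions of mine), it's roughly:

```python
import glob, json, os

# pair each ../*.file (commit message + file content) with its ../*.commit (diff)
# and write jsonlines that run_summarization.py can read via --text_column/--summary_column
with open("test.json", "w") as out:
    for file_path in sorted(glob.glob("../*.file")):
        commit_path = file_path[:-len(".file")] + ".commit"
        if not os.path.exists(commit_path):
            continue
        with open(file_path) as f, open(commit_path) as c:
            record = {"text": f.read(), "summary": c.read()}
        out.write(json.dumps(record) + "\n")
```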
the final trained adapter is only a few megabytes large, despite the model being several gigabytes. a nagging part of me keeps considering the pretraining content of the model. i'm not sure what t5 models are trained on, but i imagine a generative or masked model would have more general-purpose knowledge and general language understanding than something that was simply trained on translation or summarization type tasks, where word-mapping without comprehension could get you very far. i should probably look this up at some point. but anyway, i'm thinking of switching to xlnet.
i ran it again until colab logged me out. the loss dropped to 0.7 or so. apparently colab lets you have a gpu again if you disable background execution. i'm running it some more, just for effective use of time.

i looked into how longt5 works: basically it locally contextualises the regions of its inputs, but not the regions of its outputs during generation (it would simply be a flag to change this, but it is how they pretrained it). so it is good at reading very long things and then outputting very short things that conclude from them. it is also documented as having a limit of 16k tokens, so it is not general.

while working i added a tokenizer hack to approach things like linebreaks. i haven't tested it yet, since the unsupervised training (i'm calling it grooming to help stuff) is effective whether the data is perfect or not. [unsupervised grooming is possibly a larger issue]. i also added some stubs for other models: xlnet and transformer-xl, which i found the repo for. unfortunately, transformer-xl uses a different kind of tokenizer that my hack doesn't quite work for. still, the time to train another adapter would give space to figure out how to make the other tokenizer work.

i think what makes sense next for me, after reviewing the details of this longt5 model i've spent a couple days on, is to find a way to combine the different commit files into one input. this would make the model much more effective, as it could learn the meaning between files rather than memorising what files are in a repository in order to output specific updates for individual files. i also found the huggingface interface to longt5 lets you 'prompt' the t5 model with initial decoder ids, so if the model accepted all relevant files as input, you could prompt it with each separate file in order to produce output for each one in smaller bundles, since it has a much smaller output window than input window.
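a minimal sketch of that decoder 'prompting', assuming generate() accepts decoder_input_ids the way i think it does (the prompt string is just an illustration):

```python
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5ForConditionalGeneration.from_pretrained("google/long-t5-tglobal-base")

# very long input: every relevant file bundled together
inputs = tokenizer("... all files for the commit ...", return_tensors="pt")

# seed the decoder with the start of the output wanted for one particular file,
# so generation continues from there instead of from scratch
prompt = tokenizer("diff --git a/hello_world.py", add_special_tokens=False,
                   return_tensors="pt").input_ids
out = model.generate(inputs.input_ids, decoder_input_ids=prompt, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```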
having trouble focusing on combining the files into the input. probably hesitating due to lack of knowledge of how much input the vm's gpu ram can hold. it makes sense to separate the file data from the commit message data, so that an arbitrary number of files can be included. similarly, it would be possible for the script that combines them to do so automatically and reliably if they were delimited in some way. this would also give the model this needed information too. so maybe i'll change how they're generated to include delimiters. i'll look up common tokenizer delimiters.
special tokens in these transformers (key: eos=end-of-stream, bos=beginning-of-stream):

- longt5: eos_token='</s>', unk_token='<unk>', pad_token='<pad>' (note: longt5 says it uses pad as bos)
- xlnet: bos_token='<s>', eos_token='</s>', unk_token='<unk>', sep_token='<sep>', pad_token='<pad>', cls_token='<cls>'
- transxl: eos_token='<eos>', unk_token='<unk>', additional: '<formula>'

i guess i'll make a variable and use <pad> for longt5
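these can be double-checked straight from the tokenizers, something like:

```python
from transformers import AutoTokenizer

# print each tokenizer's special-token map to confirm the list above
for name in ("google/long-t5-tglobal-base", "xlnet-base-cased", "transfo-xl-wt103"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.special_tokens_map)
```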
i'm having trouble continuing to work on this, but i seem able to let it train for a bit, so maybe i can try to keep it training. i'm thinking a little of trying the other model types too, dunno. what i thought would be good to do was bundling the commit files up together into the input stream, so it has more context. otherwise it will need to learn the layouts of its training data to succeed, which isn't the right thing to learn.
I bundled up some inputs (poorly), extended the tokenizers to process symbols found in code, and continued with the same adapter even though those things had changed. The loss isn't dropping very fast anymore. There's also a bug I'm running into, where code I did not write just hangs and needs a reboot. I've been considering blaming the tokenization changes, which might be resolved by enabling fine-tuning for the embedding layer of the model. This would increase performance, too. Not sure whether the systems support that. It could also be the bundling of the input, and if there is a mistake in data generation, that could do it too. It could also be changing the data format on it midway. Before I started messing with it, the loss had hit .6 or .7 and seemed likely to drop more. It's been hanging at 1.4 now. I feel like it's the new tokens, or how I did them, but I don't know for sure. Anyway, I'm having trouble guiding my body etc around it this afternoon, so I'm likely to just let it keep chugging.
adapters do indeed support training embeddings. it is a parameter passed when enabling adapter training: https://github.com/adapter-hub/adapter-transformers/pull/245/files#diff-b31f... it looks like the trained embeddings are not then saved nor used unless additional functions are called to save and use them. another option would be using the vanilla tokenizers and simply replacing missing tokens with unique strings. this would keep ram usage down, but the training would not be quite as powerful since embeddings would not be included, and it would make it hard to process data containing those unique strings. i'm thinking the vanilla tokenizer might be the way to go for me, to reduce the areas for delay and error. additionally, the frankensteined tokenizers have begun decreasing their loss :S
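going from that PR and its docs (i haven't verified these exact call names against my install), training and then saving the extra embeddings looks roughly like:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")
old_tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
new_tokenizer = AutoTokenizer.from_pretrained("./code-tokenizer")  # hypothetical path

# register a new embedding matrix for the new tokenizer, copying over rows for
# tokens the old tokenizer already had
model.add_embeddings("code", new_tokenizer,
                     reference_embedding="default",
                     reference_tokenizer=old_tokenizer)
model.set_active_embeddings("code")

# train the adapter and the new embeddings together
model.add_adapter("codefudge")
model.train_adapter("codefudge", train_embeddings=True)

# ... training loop / Trainer goes here ...

# the embeddings are not stored with the adapter; they need their own save call
model.save_adapter("./adapter/codefudge", "codefudge")
model.save_embeddings("./embeddings/code", "code")
```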
another thing that would make sense here would be to just generate a bunch of data, and then share it with somebody interested in making a model that handles it
[my current data generator is quite slow, unfortunately; but it's better than nothing]
to speed up the data generator and make a mass of data, it would make sense to combine hist.bash with hist2json.py, and to generate the json manually rather than using a general lib, since the format is very simple. this could also be done in a compiled language, although i presently remember python the best. git would likely be engaged by calling a subprocess, to hasten development by reducing the library learning needed. for best speed, its output would be processed in a streaming manner, which takes more learning to figure out how to do, i guess. i think there's a python subprocess facility that lets you handle process output as a stream, not sure. there's likely one in a compiled language too, maybe c++, which would come to me quickly?
so a bash script would just run git and pipe the output to a program that processes it and produces, say, a jsonlines format that data scientists are used to accepting.
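a tiny sketch of the streaming part in python (the field choices are just for illustration):

```python
import json, subprocess, sys

# read `git log` output through a pipe line-by-line instead of collecting it all,
# and emit one json object per commit
proc = subprocess.Popen(["git", "log", "--format=%H %s"],
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:          # iterating the pipe streams the output
    commit_hash, _, subject = line.rstrip("\n").partition(" ")
    sys.stdout.write(json.dumps({"commit": commit_hash, "message": subject}) + "\n")
proc.wait()
```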
my desire around the idea is to see if any data scientists would be interested in processing AGPL-only code. i'd try to use the rareness of the idea to try to support free software, and if it turned out it wasn't rare then i could benefit from prior work, or compromise.
here is a github search for AGPL repositories with over 100 stars, less than 100 MB large, created before 2015: https://github.com/search?l=&o=desc&q=created%3A%3C2015+stars%3A%3E100+size%3A%3C100000+license%3Aagpl-3.0&s=stars&type=Repositories . the top-starred one in python is searx.
so basically, each commit has a tree preceding it. the tree is composed of git objects that can be listed with `git ls-tree`. the format i currently have -----

another possible issue with the current adapter thing is that, um, the tokenizer uses raw spaces. usually these models replace spaces with a special token. i don't know much about it.
i guess those could be represented with a merge file, like with the <<< === >>> markers or any merge format, and then a diff of the resulting file with that. seems that would best be done not in a bash script.
this bash script outputs at 100-200 KB/s after taking some time getting going, on my local system. it outputs in an ascii format that would be easy to parse in c++ with std::cin or in python with .split(' ').
might make more sense to just dump the git objects, i thought of that after i had mostly written it; dunno. less looking-things-up this way i suppose. a little slow, but faster than what i currently have.
`git cat-file --batch --batch-all-objects` outputs a pretty simple format at about 50 MB/s; much faster. raw commit objects are pretty clear. raw blob objects are just data. raw tree objects appear to have a binary format; however, it is pretty simple, just a list of <6-byte ascii mode><space><0-terminated filename><20-byte hash> or very similar. dunno how newhash is interpreted.
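for my own notes, a small python sketch of parsing those raw tree entries (assuming 20-byte sha1 hashes, i.e. not the newhash format):

```python
def parse_tree(data: bytes):
    # entries are "<ascii mode> <filename>\0<20-byte binary sha1>", repeated
    entries = []
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)
        nul = data.index(b"\0", space)
        mode = data[pos:space].decode("ascii")
        name = data[space + 1:nul].decode("utf-8", errors="replace")
        sha1 = data[nul + 1:nul + 21].hex()
        entries.append((mode, name, sha1))
        pos = nul + 21
    return entries
```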
this is bare-bones c++ code to parse the output of `git cat-file --batch --batch-all-objects`. it doesn't _do_ anything with its parsed data; it parrots tree and commit object fields to stdout.
speed drops from 140 MB/s to 4 MB/s . somewhat surprising. i cut corners for clarity and all, and i guess that mattered some. of course, catting is different from parsing, too. still definitely speedy enough.
this is where i left off process.cpp. given that i'll need to figure out how to implement (or otherwise handle) diffing files, possibly including moving them (it might make sense to use an existing git library), it seems a better investment of time would be setting up adapters to train embeddings, since generating data slowly is not a stopping issue, but the model may not succeed at useful data without a slightly better tokenizer.
the model with the funky tokenizer is starting to look useful, as if it produced code on its own. i have not actually seen it do that yet. [it's hard handling how it's starting to look useful.] bugs and personal issues prevent checking how useful it actually is.

- i accidentally dropped the output length on colab to a very short number, and the loss dropped to 0.22, which is very accurate here imo. that was likely because it was only backpropagating on the patch header, and not the actual data.
- i worked on forward.py a bit, and generated novel data. after the patch header, it hit something wrong, and terminated without outputting anything else. but it is the first time i've seen a good patch header (mostly because i rarely work on forward.py).

i typed this:

```
hello world example <pad>hello_world.py</pad>
```

and it output this:

```
<pad>diff --git a/hello_world.py b/hello_world.py
--- a/hello_world.py
+++ b/hello_world.py
@@ -31,7 +31,7 @@<CRASH>
```

the answer actually shows a bug: since my input didn't include any preceding content, it should have patched /dev/null with new data.
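forward.py itself isn't pasted here; the general shape under adapter-transformers is something like the following (the adapter path and name are assumptions of mine):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")

# load the trained adapter from the training output directory and activate it
adapter_name = model.load_adapter("test/summarization")  # hypothetical path
model.set_active_adapters(adapter_name)

inputs = tokenizer("hello world example <pad>hello_world.py</pad>", return_tensors="pt")
out = model.generate(inputs.input_ids, max_new_tokens=256)
print(tokenizer.decode(out[0]))
```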
note: the latest model url is in the w3put.log file in the repository. it can be out of date due to bugs and latency, so maybe check date information if there's a question.
i've drafted some code for training embeddings. i think i'd like to let it groom for a bit before enabling it, because of the time it spent with the very short output length that only included the header; i'm trying to kind of undo that without adding too many variables to the model shape.
some trouble retaining my thoughts.

- the loss is reducing again (it's at 1.10-1.14)
- i'm training a tokenizer, but don't have code to save it if the vm times out before i use it. it looks like it will take 3-4 hours just to preprocess this data. ok, i should move the tokenizer training to a system i control, without a gpu.
- i also started using cppgit2, which can diff trees, produce merge output, etc. it's an api to learn. the api looks like it is in the header files, luckily.
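a sketch of the save-it-immediately idea, assuming the data is in a jsonlines file with a 'text' column and that retraining from the base tokenizer is acceptable:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

data = load_dataset("json", data_files="test.json", split="train")

def text_batches(batch_size=1000):
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]["text"]

# train a fresh vocabulary on the generated code data, then save it right away
# so a vm timeout doesn't lose the work
base = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
tokenizer = base.train_new_from_iterator(text_batches(), vocab_size=32000)
tokenizer.save_pretrained("./code-tokenizer")
```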
i generated some data locally; it's in one of the *.w3put.log files in the repo (i also trained a tokenizer). unfortunately, with the long input lengths, it seems i keep crashing colab loading and tokenizing it. i haven't caught the error happening yet, so i'm not certain. when it happens, the runtime disconnects but won't reconnect for some time. in the past when it does reconnect, it has been deleted and recreated.
i'm having a little trouble (including funny anomalies) with web3.storage and ipfs transferring the data generated on my own system (asciinema recording of one attempt at https://asciinema.org/a/YTPm9RYdmnkpvItRELRaoKunF ), so i'm planning to use magic-wormhole instead, but magic-wormhole doesn't let other people access the new work, which worries me.
note: i needed to install https://github.com/pyca/pynacl to make magic-wormhole work. i don't know why it didn't install a sodium library it worked with, on installation. could be related to python lib dirs crossed with sodium distributions
still figuring out how to handle colab having cut me off from gpu usage. i have paperspace set up with a 6-hour free machine. successful results will transfer much better if the model is groomed on data with a very wide variety of contexts within the transfer set. i navigated my inhibitions enough to get a list of all the agpl repositories on github, using the github api access provided by their cli tool. it is on ipfs at https://dweb.link/ipfs/bafybeibhxofetsqyyz3bni3kr5rply4nw7v2jzcinfw5iwpdh62u... . count=~176k
gathering data from these could be aided by using git's support for object filtering, which lets git access a remote repository without downloading all the objects.
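e.g. a partial clone, which i believe fetches commits and trees but defers blobs until they're actually needed (the repository url is just an example):

```python
import subprocess

# --filter=blob:none requests a partial clone; --no-checkout avoids pulling blobs
# just to populate a worktree
subprocess.run(
    ["git", "clone", "--filter=blob:none", "--no-checkout",
     "https://github.com/searx/searx", "searx-partial"],
    check=True,
)
```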
another to-do when pulling from many repositories is to balance the languages included, so there are roughly the same number of examples per language. i know there is a way of handling that issue when it's present, but i do not know what it is, and it seems to me that balancing the set the model is built off of would be the most robust approach; see the sketch below. currently the data is pretty c-family heavy. when i tried to use it for a python file, it ignored the .py ending and made c code.
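the simplest balancing i can think of is bucketing by file extension and downsampling to the smallest bucket; a sketch, assuming each record carries a 'path' field (mine don't yet):

```python
import collections, json, random

buckets = collections.defaultdict(list)
with open("test.json") as f:
    for line in f:
        record = json.loads(line)
        ext = record.get("path", "").rsplit(".", 1)[-1]
        buckets[ext].append(record)

# keep the same number of examples per language/extension
smallest = min(len(examples) for examples in buckets.values())
balanced = [r for examples in buckets.values() for r in random.sample(examples, smallest)]
```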
i burned some time figuring out the longest number of patch tokens i could generate with the present config on colab, which let me have a gpu again today. the number was 2168. given colab gives me bigger gpus, it seemed to make sense to invest a little bit in figuring out how to use them better before they time out for the day. i've enabled embeddings for training, after figuring out the 2168 number, and i'm a little stuck on storage: the embeddings need to be paired with their trained tokenizer, but it's not being uploaded with the model.
when the model is saved, i think it actually saves the tokenizer alongside it, so i would just want to make sure i am loading the tokenizer saved with the model
at the same time, i'll want to figure out how to make the scripts use the tokenizer i trained. they're still using the extended vanilla tokenizer. i actually trained a tokenizer on source code.
the tokenizer loading code usually loads from a folder. it looks like the tokenizer files don't clash with the model files, so the same folder can be used for both. that's helpful.
my extended trained tokenizer has, for some reason, 1 fewer vocab word than the tokenizer used for the model the past few days. get to debug that!
mis-saw; it's not off by one. the new tokenizer has 9 special tokens; the old tokenizer has 100. i should fill it out with more vocab.
# groomed/trained tokenizer and embeddings

well, i put the tokenizer in with the embeddings code and it seems to work fine. but the embeddings are a couple orders of magnitude larger than the adapter, and very slow to train. i did not copy the embeddings from the previous tokenizer (a happenstance of running it, since the token names happened to all slightly mismatch after i changed whitespace handling) ... but it is indeed reducing loss, just much more slowly. might take a few weeks with this approach to train the embeddings; could be just one week if things go well. the tokenizer was just a draft, but i figure if there is another tokenizer made, it will have the same whitespace handling, so a lot of the embedding training work will copy right over. i also daydreamed a little about speeding it up by seeding embeddings with other similar tokens; might be a more relevant thing to do.

# long running execution and behavior

i've found i can keep colab running 24/7 if i step in to reboot the instance every 8-12 hours. paperspace has a similar gimmick. the key with colab is to reboot it before it is terminated due to usage; then it seems the usage limits don't get as stringent. i have broken the power supply to my laptop. it now only charges outdoors on a 12v power supply. this will change things for me. uploading checkpoints frequently is a little verbose when the loss is reducing so very slowly. i'm somewhat confused around it.

# data generation

i coded a data generator in c++. it has a few bugs pending, such as a crash when more than one repository is passed. without further optimizations, it can produce data at 1-20 MB/s depending on repository structure. [a quick thing i'd like to add is to use rapidjson's string buffer for accumulating data rather than std::string; dunno if it will make a difference though.] it's in datagen/process2.cpp and there is a neighboring makefile. it needs a few dependencies installed.

in the process of drafting the data generator i ended up learning a little rust, which is really good for being relevant in software, in order to call out to huggingface's tokenizers library. there was work in progress to make a c++ interface to this library, but apparently the guy implementing it got sick a couple years ago, then again, and this prevented making progress on the work since 2020. i shared what i came up with at https://github.com/huggingface/tokenizers/issues/185#issuecomment-1197338906 .

i also ended up learning libgit2 and cppgit2 some, in order to process the git data. the cppgit2 repository, for some reason, has been archived since 2020 as if the project had been terminated by the author, without any explanation: https://github.com/p-ranav/cppgit2 . somebody had forked the repository, so i used their fork at https://github.com/bold84/cppgit2 . when i ran my code i ran into a segfault happening inside libgit2 and cppgit2 in response to a standard call of a standard api function. it happened every time. i patched this quickly and submitted it to bold84 at https://github.com/bold84/cppgit2/pull/1 :
## fix for segfault when null hunks are passed to callbacks
I'm not certain what the "right" solution for this is, but as-is there is a segfault when diffs are printed.
A null hunk pointer is passed from https://github.com/libgit2/libgit2/blob/main/src/libgit2/diff_print.c#L164 via https://github.com/bold84/cppgit2/blob/master/src/diff.cpp#L199 to the hunk constructor, causing dereference of null.
This quick patch changes the hunk constructor to initialise to zeros if passed a null value.
This breaks parallelism with line and delta, which do not have such a check, but it prevents the segfault.
The bug looked like it had been around for a long time; it was strange to see.

## psychology

it seems cool and important to work on this because it breaks through a number of personal boundaries i've experienced, but the complexity of my life is suffering some. .... there are pluses though; i've started playing chess again, which could help my cognition. it feels like a pressure in my head ....

i think part of me is really confused that i'm doing something rare and possibly very productive, but that it is very slow and i recently made it much slower by adding untrained custom embeddings. it used to be much faster. i think we want to be quite careful not to make it _more_ slow, and that may not be guaranteed yet, as i have new data coming and likely a new tokenizer to match it.

it can be very hard for me to stay on task for multiple days, as my inhibitions have times here and there when they get creative around stopping it [i think they learn some when i sleep, and some when i engage trigger concepts, not sure]. while i'm in this funnier state of mind, i'm having a little more trouble tracking other things, like my emails and the wim hof trick and stuff.

something that seems helpful is to remember that i started this behavior just in order to make a small design improvement to a data structure. i could kind of step back, and just think about working on that data structure, and maybe that might inspire me around what i'm really trying to do here.

i'd like to fix the multi-repo bug in the data generator right now or soon!
i think one of the inhibitions around speed is that if i implement distributed training it could be a helpful experience. there are other solutions to speed, but it's harder psychologically for me to learn the processes of training one of these models. distributed training would help build experiences or code that could give me more avenues to train models together with groups of people. [another confusion is that i am using slow machine learning at all; this is related to it being mainstream nowadays, i suppose.]
So, the thing to do here is apparently to use a language adapter. These mutate embeddings intended for other models such that minimal training is needed. If training one's own tokenizers, it would make sense to reduce the vocab size so there are fewer embeddings, but you could just use a tokenizer from any model trained on similar data, with a language adapter. RWKV does long context as well and is starting to take off; in their chat somebody mentioned making a mobile app that uses it. No adapters yet. I have downtime ATM as I can barely move; my limbs spasm when I try to stand up [or perform fine motor tasks to move forward on these]. It passes with time. Still keeping the embeddings doing their thing on colab. Excited to eventually fix that data bug.
untested, exfiltrated from the old ipad I wrote it on via qr code after multiple sudden reboots

```diff
diff --git a/datagen/process2.cpp b/datagen/process2.cpp
index 1e59d6e..b8388ff 100644
--- a/datagen/process2.cpp
+++ b/datagen/process2.cpp
@@ -49,6 +49,20 @@ struct oid_pair_hash
 	}
 };
 
+struct repo_commits
+{
+	cppgit2::repository repository;
+	std::vector<cppgit2::oid> commits;
+	repo_commits(char const * path)
+	: repository(cppgit2::repository::open(path))
+	{
+		repository.for_each_commit([this](const cppgit2::commit & c)
+		{
+			commits.push_back(c.id());
+		});
+	}
+};
+
 int main(int argc, char **argv)
 {
 	int max_diffs_per_commit = 1;
@@ -72,16 +86,19 @@ int main(int argc, char **argv)
 
 	srand(seed);
 
+	static thread_local unordered_map<string, repo_commits> repos;
+
 	for (unsigned int repo_cycle = 0; repo_cycle < cycles_over_repos; ++ repo_cycle) {
 	for (char **pathptr = &argv[1]; pathptr != &argv[argc]; ++ pathptr) {
-		auto repository = repository::open(*pathptr);
-		static thread_local vector<oid> commit_oids;
-		repository.for_each_commit([](const commit & c)
-		{
-			commit_oids.push_back(c.id());
-		});
+		if (!repos.count(*pathptr)) {
+			repos.emplace(*pathptr, *pathptr);
+		}
+		repo_commits & repo_entry = repos.at(*pathptr);
+		auto & repository = repo_entry.repository;
+		auto & commit_oids = repo_entry.commits;
+
 		random_shuffle(commit_oids.begin(), commit_oids.end());
 		int commits_output = 0;
```