# groomed/trained tokenizer and embeddings

well, i put the tokenizer in with the embeddings code and it seems to work fine, but the embeddings are a couple of orders of magnitude larger than the adapter and very slow to train. i did not copy the embeddings over from the previous tokenizer (a happenstance of how i ran it: the token names all slightly mismatched because i changed the whitespace handling) ... but it is indeed reducing loss, just much more slowly. training the embeddings might take a few weeks with this approach; maybe just one week if things go well. the tokenizer was just a draft, but i figure if another tokenizer gets made it will have the same whitespace handling, so a lot of the embedding training work should copy right over. i also daydreamed a little about speeding it up by seeding new embeddings from similar existing tokens; that might be a more relevant thing to do.

# long running execution and behavior

i've found i can keep colab running 24/7 if i step in to reboot the instance every 8-12 hours. paperspace has a similar gimmick. the key with colab is to reboot it before it gets terminated for hitting usage limits; then the limits don't seem to get as stringent. i have broken the power supply to my laptop; it now only charges outdoors on a 12v supply. this will change things for me. uploading checkpoints frequently is a little verbose when the loss is reducing so very slowly. i'm somewhat confused around it.

# data generation

i coded a data generator in c++. it has a few bugs pending, such as a crash when more than one repository is passed. without further optimizations it can produce data at 1-20 MB/s depending on repository structure. [a quick thing i'd like to add is using rapidjson's string buffer for accumulating data rather than std::string; dunno if it will make a difference though.] it's in datagen/process2.cpp and there is a neighboring makefile; it needs a few dependencies installed.

in the process of drafting the data generator i ended up learning a little rust (a useful language to know these days) in order to call out to huggingface's tokenizers library. there was work in progress on a c++ interface to that library, but apparently the person implementing it got sick a couple of years ago, then again, and that has kept the work stalled since 2020. i shared what i came up with at https://github.com/huggingface/tokenizers/issues/185#issuecomment-1197338906 .

i also ended up learning some libgit2 and cppgit2 in order to process the git data. for some reason the cppgit2 repository has been archived since 2020, as if the author terminated the project, with no explanation given: https://github.com/p-ranav/cppgit2 . somebody had forked the repository, so i used their fork at https://github.com/bold84/cppgit2 . when i ran my code i hit a segfault inside libgit2 and cppgit2 in response to a standard call of a standard api function. it happened every time.
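for context, the call in question is roughly the standard libgit2 diff-print pattern. here's a minimal sketch against the raw C api; my actual code goes through cppgit2's wrapper, which forwards to the same libgit2 call and callbacks, so treat the names and the specific diff call below as illustrative rather than a copy of process2.cpp:

```cpp
#include <git2.h>
#include <cstdio>

// line callback for git_diff_print. note that for some lines (e.g. file
// headers) libgit2 passes hunk == nullptr, so a wrapper that blindly
// dereferences the hunk pointer will crash.
static int print_line(const git_diff_delta *delta, const git_diff_hunk *hunk,
                      const git_diff_line *line, void *payload) {
  (void)delta; (void)hunk; (void)payload;
  fwrite(line->content, 1, line->content_len, stdout);
  return 0;
}

int main(int argc, char **argv) {
  git_libgit2_init();
  git_repository *repo = nullptr;
  git_diff *diff = nullptr;
  if (git_repository_open(&repo, argc > 1 ? argv[1] : ".") == 0 &&
      git_diff_index_to_workdir(&diff, repo, nullptr, nullptr) == 0) {
    // patch format emits file headers, hunk headers and content lines;
    // some of those lines arrive at the callback without a hunk.
    git_diff_print(diff, GIT_DIFF_FORMAT_PATCH, print_line, nullptr);
  }
  git_diff_free(diff);
  git_repository_free(repo);
  git_libgit2_shutdown();
  return 0;
}
```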
i patched this quickly and submitted it to bold84 at https://github.com/bold84/cppgit2/pull/1 :

## fix for segfault when null hunks are passed to callbacks
I'm not certain what the "right" solution for this is, but as-is there is a segfault when diffs are printed.
A null hunk pointer is passed from https://github.com/libgit2/libgit2/blob/main/src/libgit2/diff_print.c#L164 via https://github.com/bold84/cppgit2/blob/master/src/diff.cpp#L199 to the hunk constructor, causing dereference of null.
This quick patch changes the hunk constructor to initialise to zeros if passed a null value.
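In sketch form (member names are approximate and this is not the literal diff; see the PR for the actual change), the idea is:

```cpp
#include <git2.h>
#include <cstring>

// approximate shape of the patched constructor
class hunk {
public:
  explicit hunk(const git_diff_hunk *c_ptr) {
    if (c_ptr)
      c_struct_ = *c_ptr;  // usual case: copy the libgit2 hunk struct
    else
      std::memset(&c_struct_, 0, sizeof(c_struct_));  // null hunk: zero-initialise instead of dereferencing
  }

private:
  git_diff_hunk c_struct_;  // owned copy of the underlying C struct
};
```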
This breaks parallelism with line and delta, which do not have such a check, but it prevents the segfault.
The bug looked like it had been around for a long time; it was strange to see.

# psychology

it seems cool and important to work on this because it breaks through a number of personal boundaries i've experienced, but the complexity of my life is suffering some .... there are pluses though; i've started playing chess again, which could help my cognition. it feels like a pressure in my head .... i think part of me is really confused that i'm doing something rare and possibly very productive, but that it is very slow, and i recently made it much slower by adding untrained custom embeddings. it used to be much faster. i think we want to be quite careful not to make it _slower_ still, and that isn't guaranteed yet, as i have new data coming and likely a new tokenizer to match it. it can be very hard for me to stay on task for multiple days, as my inhibitions have times here and there when they get creative about stopping it [i think they learn some when i sleep, and some when i engage trigger concepts, not sure]. while i'm in this funnier state of mind, i'm having a little more trouble tracking other things, like my emails and the wim hof trick and stuff. something that seems helpful is to remember that i started this behavior just to make a small design improvement to a data structure. i could kind of step back and just think about working on that data structure, and maybe they might inspire me around what i'm really trying to do here. i'd like to fix the multi-repo bug in the data generator right now, or soon!