[ot][spam][crazy] adapters for semibalanced trees?

Undiscussed Groomed for Male Slavery, One Victim of Many gmkarl at gmail.com
Thu Jul 28 02:57:09 PDT 2022


# groomed/trained tokenizer and embeddings

well, i put the tokenizer in with the embeddings code and it seems to
work fine, but the embeddings are a couple of orders of magnitude larger
than the adapter and very slow to train. i did not copy the embeddings
over from the previous tokenizer (a happenstance of how i ran it: the
token names all slightly mismatched because i changed the whitespace
handling) ... but it is indeed reducing loss, just much more slowly.
training the embeddings might take a few weeks with this approach, or
just one week if things go well.

the tokenizer was just a draft, but i figure if another tokenizer is
made, it will have the same whitespace handling, so a lot of the
embedding training work will copy right over. i also daydreamed a
little about speeding it up by seeding embeddings from other, similar
tokens; that might be a more relevant thing to do.
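
a rough sketch of that seeding idea, in c++ just for illustration; the names and container types here are made up, since the real embedding matrix lives in the training code:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// seed rows of a new embedding matrix by copying the row of a "similar"
// token from the old vocabulary, leaving everything else at its random init.
void seed_embeddings(
    std::vector<std::vector<float>> &new_embed,           // [new_vocab][dim]
    const std::vector<std::vector<float>> &old_embed,     // [old_vocab][dim]
    const std::unordered_map<std::string, size_t> &old_ids,
    const std::vector<std::string> &new_tokens,
    const std::unordered_map<std::string, std::string> &similar_old_token)
{
    for (size_t i = 0; i < new_tokens.size(); ++i) {
        auto sim = similar_old_token.find(new_tokens[i]);
        if (sim == similar_old_token.end())
            continue;                                      // no similar token known
        auto old_it = old_ids.find(sim->second);
        if (old_it != old_ids.end())
            new_embed[i] = old_embed[old_it->second];      // copy old row as a starting point
    }
}
```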

# long running execution and behavior

i've found i can keep colab running 24/7 if i step in to reboot the
instance every 8-12 hours. paperspace has a similar gimmick. the key
with colab is to reboot it before it is terminated due to usage; then
it seems the usage limits don't get as stringent.

i have broken the power supply to my laptop. it now only charges
outdoors on a 12v power supply. this will change things for me.

uploading checkpoints frequently feels a little verbose when the loss
is reducing so very slowly. i'm somewhat confused about it.

# data generation

i coded a data generator in c++. it has a few bugs pending, such as a
crash when more than one repository is passed. without further
optimizations, it can produce data at 1-20 MB/s depending on
repository structure. [a quick thing i'd like to add is to use
rapidjson's string buffer for accumulating data rather than
std::string; dunno if it will make a difference though.] it's in
datagen/process2.cpp and there is a neighboring makefile. it needs a
few dependencies installed.
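
for reference, a minimal sketch of what accumulating a record through rapidjson's StringBuffer (instead of appending to std::string) could look like; the field names are illustrative, not what process2.cpp actually emits:

```cpp
#include <cstdio>
#include <rapidjson/stringbuffer.h>
#include <rapidjson/writer.h>

int main() {
    // build one JSON record in rapidjson's own growable buffer
    rapidjson::StringBuffer buf;
    rapidjson::Writer<rapidjson::StringBuffer> writer(buf);

    writer.StartObject();
    writer.Key("path");
    writer.String("src/main.c");
    writer.Key("diff");
    writer.String("@@ -1,1 +1,1 @@ ...");
    writer.EndObject();

    // GetString()/GetSize() expose the accumulated bytes; Clear() resets
    // the buffer for the next record without freeing its storage.
    std::fwrite(buf.GetString(), 1, buf.GetSize(), stdout);
    std::fputc('\n', stdout);
    return 0;
}
```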

in the process of drafting the data generator i ended up learning a
little rust (which is good for staying relevant in software) in order
to call out to huggingface's tokenizers library. there was work in
progress on a c++ interface to that library, but apparently the person
implementing it got sick a couple of years ago, then again, and that
has prevented progress on the work since 2020. i shared what i came up
with at
https://github.com/huggingface/tokenizers/issues/185#issuecomment-1197338906
.
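
the gist of the approach, sketched from the c++ side: a small rust shim built as a c library exposes an extern "C" surface that the c++ program links against. the function names below are placeholders for illustration, not the actual interface in the linked comment:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// placeholder declarations for a rust cdylib/staticlib shim that wraps the
// tokenizers crate (Tokenizer::from_file / Tokenizer::encode) behind
// #[no_mangle] extern "C" functions. these names are made up.
extern "C" {
    void  *tokenizer_from_file(const char *path);
    size_t tokenizer_encode(void *tok, const char *text,
                            uint32_t *ids_out, size_t ids_cap);
    void   tokenizer_free(void *tok);
}

int main() {
    void *tok = tokenizer_from_file("tokenizer.json");
    if (!tok) return 1;

    uint32_t ids[512];
    size_t n = tokenizer_encode(tok, "int main() { return 0; }", ids, 512);
    for (size_t i = 0; i < n; ++i)
        std::printf("%u ", ids[i]);
    std::printf("\n");

    tokenizer_free(tok);
    return 0;
}
```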

i also ended up learning some libgit2 and cppgit2 in order to process
the git data. for some reason the cppgit2 repository has been archived
since 2020, as if the project had been terminated by the author,
without any explanation: https://github.com/p-ranav/cppgit2 . somebody
had forked the repository, so i used their fork at
https://github.com/bold84/cppgit2 .
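
for context, a minimal sketch of the kind of standard diff-printing call involved, using the libgit2 C api directly (cppgit2 wraps this same path). the print callback is handed delta/hunk/line pointers, and for some lines the hunk pointer arrives null, which is what tripped the wrapper described next:

```cpp
#include <cstdio>
#include <git2.h>

// matches git_diff_line_cb; libgit2's patch printer calls it for every line.
static int print_line(const git_diff_delta *delta, const git_diff_hunk *hunk,
                      const git_diff_line *line, void *payload) {
    (void)delta; (void)hunk; (void)payload;
    std::fwrite(line->content, 1, line->content_len, stdout);
    return 0;
}

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : ".";
    git_libgit2_init();

    git_repository *repo = nullptr;
    if (git_repository_open(&repo, path) != 0) return 1;

    // diff HEAD's tree against the working directory and stream it as a patch
    git_object *head_tree = nullptr;
    if (git_revparse_single(&head_tree, repo, "HEAD^{tree}") != 0) return 1;

    git_diff *diff = nullptr;
    git_diff_tree_to_workdir_with_index(&diff, repo, (git_tree *)head_tree, nullptr);
    git_diff_print(diff, GIT_DIFF_FORMAT_PATCH, print_line, nullptr);

    git_diff_free(diff);
    git_object_free(head_tree);
    git_repository_free(repo);
    git_libgit2_shutdown();
    return 0;
}
```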

when i ran my code i hit a segfault inside libgit2 and cppgit2 in
response to a standard call of a standard api function. it happened
every time. i patched it quickly and submitted the patch to bold84 at
https://github.com/bold84/cppgit2/pull/1 :

> ## fix for segfault when null hunks are passed to callbacks
>
> I'm not certain what the "right" solution for this is, but as-is there is a
> segfault when diffs are printed.
>
> A null hunk pointer is passed from
> https://github.com/libgit2/libgit2/blob/main/src/libgit2/diff_print.c#L164 via
> https://github.com/bold84/cppgit2/blob/master/src/diff.cpp#L199 to the
> hunk constructor, causing dereference of null.
>
> This quick patch changes the hunk constructor to initialise to zeros if
> passed a null value.
>
> This breaks parallelism with line and delta, which do not have such a
> check, but it prevents the segfault.

the bug looked like it had been around for a long time; it was strange to see.
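
roughly the shape of that quick patch, for clarity; the class and member names here are my own stand-ins for illustration, not cppgit2's actual ones:

```cpp
#include <cstring>
#include <git2.h>

// the wrapper keeps a copy of libgit2's git_diff_hunk; the constructor used
// to dereference the incoming pointer unconditionally, so a null hunk from
// diff_print crashed. the patch zero-initialises the struct instead.
class hunk_wrapper {
public:
    explicit hunk_wrapper(const git_diff_hunk *c_ptr) {
        if (c_ptr)
            c_struct_ = *c_ptr;                              // normal case: copy the hunk
        else
            std::memset(&c_struct_, 0, sizeof(c_struct_));   // null hunk: zero-initialise
    }
private:
    git_diff_hunk c_struct_;
};
```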

## psychology

it seems cool and important to work on this because it breaks through
a number of personal boundaries i've experienced, but the complexity
of my life is suffering some. .... there are pluses though; i've
started playing chess again, which could help my cognition. it feels
like a pressure in my head ....

i think part of me is really confused that i'm doing something rare
and possibly very productive, but that it is very slow, and i recently
made it much slower by adding untrained custom embeddings. it used to
be much faster. i think we want to be quite careful not to make it
_more_ slow, and that may not be guaranteed yet, as i have new data
coming and likely a new tokenizer to match it. it can be very hard for
me to stay on task for multiple days, as my inhibitions have times
here and there when they get creative around stopping it [i think they
learn some when i sleep, and some when i engage trigger concepts, not
sure].

while i'm in this funnier state of mind, i'm having a little more
trouble tracking other things, like my emails and the wim hof trick
and stuff.

something that seems helpful is to remember that i started this
behavior just in order to make a small design improvement to a data
structure. i could kind of step back and just think about working on
that data structure, and maybe that would inspire me around what i'm
really trying to do here.

i'd like to fix the multi-repo bug in the data generator right now, or soon!

