[ot][spam][crazy][log] transformer?

Undescribed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Fri Oct 7 11:19:08 PDT 2022


2022-10-07 0935+0000 notes while away:
- causal shape exception seems like a sequence length issue elsewhere,
sounds fun
- exponential decay may need to be updated every epoch
- epochs should have multiple batches
1203 it actually looks like using a causal mask here would mean
improving on the research in the paper, or at least understanding the
math more in-depth. without a working example of anything at all yet,
it makes sense not to use one at first, which effectively means
classification or regression data rather than sequential generation
data
1229 i glanced through this gpt implementation and it seems to always
have the causal mask enabled. i quickly copied the core changes from
the gpt algorithm over to bert, untested, and pushed to git. excited
to switch to bert. it was hard to make the decision.
1425 popping on between stuff to note that causal masking can likely
be done simplistically by changing the binding sum to an accumulation.
this might broadcast the unbinding across the sequence, slowing the
kernel, but it wouldn’t change the overall complexity. could be wrong
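
editor’s note: a minimal sketch of the accumulation idea above,
assuming binding is circular convolution computed with FFTs as in the
HRR literature; the function names and torch usage here are mine, not
the repo’s:

    import torch

    def fft_bind(a, b):
        # hrr binding: circular convolution, done in the fourier domain
        return torch.fft.irfft(torch.fft.rfft(a) * torch.fft.rfft(b),
                               n=a.shape[-1])

    def fft_unbind(s, q):
        # hrr unbinding: circular correlation with q, the approximate
        # inverse of binding
        return torch.fft.irfft(torch.fft.rfft(s) * torch.fft.rfft(q).conj(),
                               n=s.shape[-1])

    def hrr_attention(q, k, v, causal=False):
        # q, k, v: (batch, seq, dim). superpose bound key/value pairs
        # into a trace, then unbind each query against the trace.
        bound = fft_bind(k, v)                      # (batch, seq, dim)
        if causal:
            trace = bound.cumsum(dim=1)             # position t sees only <= t
        else:
            trace = bound.sum(dim=1, keepdim=True)  # one shared trace
        return fft_unbind(trace, q)                 # broadcasts when non-causal

with causal=True the trace is materialized per position, so the unbind
runs over (batch, seq, dim) instead of one shared (batch, 1, dim) sum:
heavier in practice but the same overall complexity, which i take to
be what the note above means by broadcasting the unbinding.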
1636 i’ve patched cumsum into the gpt variant for causal masking, and
gotten my test to run. loss is actually reducing steadily on the first
run O_O kind of dissociated and pushed to git quickly
1727 in my tiny test (2M tokens of binary addition), i see slightly
better results w/out HRR, and faster training w/ HRR. i’m using the
cumsum hack and an incredibly tiny gpt model with only 4 params. (i
dropped the param count to see whether the approaches would diverge
more; they didn’t.) i’m not seeing order-of-magnitude differences with
this tiny setup.
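
editor’s note: the log doesn’t record the data format; a hypothetical
generator for a binary-addition task like this might look like the
following, with the sample layout being a guess rather than what the
script actually did:

    import random

    def addition_sample(max_bits=8):
        # one hypothetical sample: train the model to continue the
        # prompt "a+b=" with the binary sum
        a = random.getrandbits(max_bits)
        b = random.getrandbits(max_bits)
        return f"{a:b}+{b:b}={a + b:b}"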
1755 i added a layer and separated the data gen overhead out from the
actual model passes, and there is actually a 60% training speedup with
comparable loss, at 8 params and with the cumsum slowdown. my interest
has waned some. i’d like to port t5, since seq2seq models are supposed
to perform much better, and organize my code, …
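
editor’s note: separating the data gen overhead from the model passes
can be done with plain wall-clock splits around each phase of the
loop; this is a generic sketch, not the actual script, and make_batch
/ train_step / num_batches are hypothetical stand-ins:

    import time

    gen_seconds, model_seconds = 0.0, 0.0
    for _ in range(num_batches):
        t0 = time.perf_counter()
        batch = make_batch()        # data generation only
        t1 = time.perf_counter()
        loss = train_step(batch)    # forward, backward, optimizer step
        t2 = time.perf_counter()    # on gpu, torch.cuda.synchronize()
                                    # before each timestamp keeps these honest
        gen_seconds += t1 - t0
        model_seconds += t2 - t1
    print(f"data gen: {gen_seconds:.1f}s, model passes: {model_seconds:.1f}s")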
1812 https://github.com/mosaicml/composer/issues/1602

editor’s note for visitors not experienced with machine learning:
models can have billions of parameters and operate on very complex
data, so my differing results with only a few params and unimaginably
trivial data would not be considered relevant

