[spam] [personal] perceiver model notes

k gmkarl at gmail.com
Fri Jan 21 09:24:24 PST 2022


- there's already a function to convert these models in transformers'
perceiver subfolder
- once you get the tokenizer right (byte + 6), torch runs google's
example fine (the byte + 6 scheme is sketched after this list)
- there's some code in that file that sets two additional model
parameters i wasn't setting (config.qk_channels = 8 * 32,
config.v_channels = config.d_latents); also sketched after the list
- i eventually went to google's original paper and looked at their
training parameters for masked language modeling.  their learning rate
was 0.00125.  mine was 0.0001.
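
for reference, roughly what i mean by "byte + 6" -- i'm assuming the
offset just reserves the low ids for the tokenizer's special tokens,
so this is a sketch rather than the exact transformers code:

    # "byte + 6": raw utf-8 bytes shifted up by 6, which (i assume)
    # leaves ids 0-5 free for special tokens
    def encode(text):
        return [b + 6 for b in text.encode("utf-8")]

    def decode(ids):
        return bytes(i - 6 for i in ids if i >= 6).decode("utf-8")

    print(encode("hi"))          # [110, 111]
    print(decode(encode("hi")))  # hi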
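
and a sketch of plumbing those two extra fields into a fresh model --
the class names are the ones in transformers, but the surrounding
setup is just illustrative, not the conversion script itself:

    from transformers import PerceiverConfig, PerceiverForMaskedLM

    config = PerceiverConfig()
    config.qk_channels = 8 * 32           # from the conversion code
    config.v_channels = config.d_latents  # from the conversion code
    model = PerceiverForMaskedLM(config)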

- when i set my learning rate to 0.001, with the config change, the
model learns a constant output somewhat quickly.  i think i also
changed from the SGD optimizer to Adam.  the paper used LAMB with a
learning rate warmup, a cosine cycle, and weight decay (roughly
sketched below).
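
roughly the training setup i'm describing -- Adam at 1e-3 here; LAMB
isn't in core torch, so a warmup + cosine schedule from transformers
stands in for the paper's recipe, and the warmup/step/decay numbers
are made up:

    import torch
    from transformers import (PerceiverConfig, PerceiverForMaskedLM,
                              get_cosine_schedule_with_warmup)

    model = PerceiverForMaskedLM(PerceiverConfig())
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 weight_decay=0.01)      # made-up decay
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=1_000,      # made-up warmup length
        num_training_steps=100_000)  # made-up total steps

    def training_step(batch):
        # batch: dict with input_ids, attention_mask, labels tensors
        loss = model(inputs=batch["input_ids"],
                     attention_mask=batch["attention_mask"],
                     labels=batch["labels"]).loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        return loss.item()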

- i think i ended up trying a large batch size (256 or 512) and
small config parameters (depth=3, width ~256; roughly the config
sketched below).  the loss got stuck around the 2's, then suddenly
burst up to 4 and dropped down to 1.3, and the model was outputting
numbers of roughly the right length, often with the right first
digit, but it wouldn't proceed further.  i didn't note these
parameters and couldn't reproduce it.
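
for concreteness, roughly the kind of "small" config i mean -- mapping
depth to num_self_attends_per_block and width to d_latents/d_model is
my guess at the correspondence, not something from the paper:

    from transformers import PerceiverConfig, PerceiverForMaskedLM

    small_config = PerceiverConfig(
        num_self_attends_per_block=3,  # "depth=3" (my mapping)
        d_latents=256,                 # "~width=256" (likewise)
        d_model=256)
    small_model = PerceiverForMaskedLM(small_config)
    # the 256/512 batch size would just be the DataLoader's batch_size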

- after fiddling with things a bit on colab's gpu, this is the first
set of parameters i found to solve the problem, around step 2000:
https://bafkreie7gyyy3alribjyl72hlm4pk4allyul7xem7yqmpl66yzcidumfnq.ipfs.dweb.link/
.  i think it could do it faster.

