- There's already a function to convert these models in the perceiver transformers subfolder.
- Once you get the tokenizer right (byte + 6; see the first sketch after this list), torch runs Google's example fine.
- There's some code in that file that sets two additional model parameters I wasn't setting: config.qk_channels = 8 * 32 and config.v_channels = config.d_latents (second sketch below).
- I eventually went to Google's original paper and looked at their training parameters for masked language modeling. Their learning rate was 0.00125; mine was 0.0001.
- When I set my learning rate to 0.001, with the config change, the model now learns a constant output fairly quickly. I think I also changed from the SGD optimizer to Adam; the paper used LAMB with a learning-rate warmup, a cosine cycle, and weight decay (third sketch below).
- I think I ended up trying a large batch size (256 or 512) with small config parameters (depth=3, width≈256). The loss got stuck in the 2's, then suddenly burst up to 4 and dropped down to 1.3, and the model was outputting numbers of roughly the right length, often with the right first digit, but it wouldn't progress further. I didn't note these parameters and couldn't reproduce the run.
- After fiddling with things a bit on Colab's GPU, this is the first set of parameters I found that solves the problem, around step 2000: https://bafkreie7gyyy3alribjyl72hlm4pk4allyul7xem7yqmpl66yzcidumfnq.ipfs.dwe... I think it could do it faster.
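For concreteness, here's a minimal sketch of what I mean by "byte + 6". It assumes the six special tokens ([PAD], [BOS], [EOS], [MASK], [CLS], [SEP]) occupy ids 0 through 5, which is how transformers' PerceiverTokenizer lays out its vocabulary as far as I can tell, so raw UTF-8 bytes get shifted up by 6:

```python
# Minimal sketch of the "byte + 6" id scheme, assuming the six special
# tokens take ids 0-5 (as in transformers' PerceiverTokenizer).
def bytes_to_ids(text: str) -> list[int]:
    # utf-8 bytes shifted past the special-token ids
    return [b + 6 for b in text.encode("utf-8")]

def ids_to_text(ids: list[int]) -> str:
    # drop special-token ids (< 6), shift back, decode
    return bytes(i - 6 for i in ids if i >= 6).decode("utf-8", errors="replace")

assert ids_to_text(bytes_to_ids("12 + 34 = 46")) == "12 + 34 = 46"
```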
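The config change from the conversion code, plus the small depth/width I was experimenting with, look roughly like this. The field names are from transformers' PerceiverConfig, but mapping depth to num_self_attends_per_block and width to d_latents is my assumption about which knobs I meant:

```python
from transformers import PerceiverConfig, PerceiverForMaskedLM

# Small config I was experimenting with; depth=3 -> num_self_attends_per_block
# and width~256 -> d_latents is my reading of my own notes, not canonical.
config = PerceiverConfig(
    num_self_attends_per_block=3,  # "depth=3"
    d_latents=256,                 # "~width=256"
    qk_channels=8 * 32,            # the two parameters the conversion code sets
    v_channels=256,                # i.e. config.v_channels = config.d_latents
)
model = PerceiverForMaskedLM(config)
```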
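And the optimizer side, as a sketch rather than my exact training loop: Adam at 1e-3 is what I switched to, and the commented lines approximate the paper's LAMB setup (warmup, cosine cycle, weight decay). torch has no built-in LAMB, so the Lamb import assumes a third-party package like torch_optimizer, and the step counts and weight decay are placeholder values, not the paper's:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in; use the Perceiver model from the sketch above

# what I switched to: plain Adam at 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# rough paper-style alternative (assumes a third-party LAMB implementation):
# from torch_optimizer import Lamb
# optimizer = Lamb(model.parameters(), lr=1.25e-3, weight_decay=0.1)  # placeholder decay

total_steps = 10_000   # placeholder; match your run length
warmup_steps = 1_000   # placeholder warmup
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
# call scheduler.step() after each optimizer.step() in the training loop
```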