So, the thing to do here is apparently to use a language adapter. These adapt embeddings that were trained for another model so that only minimal extra training is needed.
If training one's own tokenizer, it would make sense to reduce the vocab size so there are fewer embeddings, but you could also just reuse the tokenizer from any model trained on similar data and pair it with a language adapter.
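For my own notes, here's a rough sketch of what I mean by an adapter over borrowed embeddings. This is just an assumption of how it could look in PyTorch, not anyone's actual implementation: the donor embedding table is frozen and only a small bottleneck is trained, so very little training is needed. `donor_embeddings` is a placeholder, not real weights.

```python
import torch
import torch.nn as nn

class EmbeddingLanguageAdapter(nn.Module):
    """Reuse a donor model's embedding table, but pass it through a small
    trainable bottleneck so it can be adapted to a new tokenizer/language
    with minimal training. The donor embeddings stay frozen."""

    def __init__(self, donor_embedding: torch.Tensor, bottleneck_dim: int = 64):
        super().__init__()
        vocab_size, embed_dim = donor_embedding.shape
        # Frozen copy of the donor model's embeddings.
        self.embed = nn.Embedding.from_pretrained(donor_embedding, freeze=True)
        # Small trainable adapter: down-project, nonlinearity, up-project,
        # added back residually so we start close to the donor embeddings.
        self.down = nn.Linear(embed_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, embed_dim)
        self.act = nn.GELU()

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        e = self.embed(token_ids)
        return e + self.up(self.act(self.down(e)))

# Hypothetical usage: donor_embeddings would come from whichever model's
# tokenizer/embedding table is being borrowed.
donor_embeddings = torch.randn(32000, 768)  # placeholder, not real weights
adapter = EmbeddingLanguageAdapter(donor_embeddings)
trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
print(trainable, "trainable params")
```

The residual form is just one choice; the point is that the trainable part is tiny compared to the vocab-sized embedding table.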
RWKV handles long context as well and is starting to take off; in their chat somebody mentioned building a mobile app that uses it. No adapters for it yet, though.
I have downtime ATM as I can barely move; my limbs spasm when I try to stand up (or perform the fine motor tasks needed to move forward on these). It passes with time.
Still keeping the embeddings doing their thing on Colab.
Excited to eventually fix that data bug.