[ot] 3B / 3GB quantized edge language model
I'm wondering if this would be useful for hobby finetuning. Only 8k context length though (some models have 128k now), although ALiBi is purported to be extendable to longer context lengths than it was trained on.

https://huggingface.co/papers/2309.11568

BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

We introduce the Bittensor Language Model, called "BTLM-3B-8K", a new state-of-the-art 3 billion parameter open-source language model. BTLM-3B-8K was trained on 627B tokens from the SlimPajama dataset with a mixture of 2,048 and 8,192 context lengths. BTLM-3B-8K outperforms all existing 3B parameter models by 2-5.5% across downstream tasks. BTLM-3B-8K is even competitive with some 7B parameter models. Additionally, BTLM-3B-8K provides excellent long context performance, outperforming MPT-7B-8K and XGen-7B-8K on tasks up to 8,192 context length. We trained the model on a cleaned and deduplicated SlimPajama dataset; aggressively tuned the μP hyperparameters and schedule; used ALiBi position embeddings; and adopted the SwiGLU nonlinearity. On Hugging Face, the most popular models have 7B parameters, indicating that users prefer the quality-size ratio of 7B models. Compacting the 7B parameter model to one with 3B parameters, with little performance impact, is an important milestone. BTLM-3B-8K needs only 3GB of memory with 4-bit precision and takes 2.5x less inference compute than 7B models, helping to open up access to a powerful language model on mobile and edge devices. BTLM-3B-8K is available under an Apache 2.0 license on Hugging Face: https://huggingface.co/cerebras/btlm-3b-8k-base.
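If the 3GB-at-4-bit figure holds, trying it locally should be cheap. Here's a minimal sketch of loading it quantized with transformers + bitsandbytes; I haven't verified these exact settings against this model, and the quantization config is just the usual NF4 defaults rather than anything from the paper:

```python
# Minimal sketch: load BTLM-3B-8K in 4-bit for local experimentation.
# Assumes transformers + bitsandbytes are installed; the quantization
# settings are illustrative, not taken from the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "cerebras/btlm-3b-8k-base"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~3GB of weights, per the paper's claim
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,                 # BTLM ships custom model code on the Hub
)

prompt = "The Bittensor Language Model was trained on"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```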
Long Sequence Lengths

To enable long sequence applications, we use ALiBi position embeddings and trained on 470B tokens at the context length of 2,048 followed by 157B tokens trained at 8,192 context length. To assess BTLM's long sequence capability, we evaluate it on the SlimPajama test set with 32,768 context length and plot loss at each token position. Although ALiBi allows extrapolation in theory, 2,048 context length training alone does not extrapolate well in practice. Thankfully variable sequence length training allows for substantially improved extrapolation. BTLM-3B extrapolates well up to 10k context length but the performance degrades slightly beyond this.

[Figure 5, loss by token position: https://huggingface.co/cerebras/btlm-3b-8k-base/raw/main/figure_5_xentropy_w...]

The figure shows very good longer-context performance (charted out to roughly 35k, almost constant loss), which implies it would be much better still if they had slowly ramped the length up during training. There's another paper, YaRN (Llama 2), that cites a paper downplaying ALiBi's performance; I checked that paper, and they had only tried a length of 20 and did not do multistep training.
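As a reminder of why ALiBi can extrapolate at all: it adds a fixed, head-specific linear penalty to attention scores instead of any learned position embedding, so nothing in the model is tied to a maximum length. A rough sketch of the bias construction follows (slopes per the ALiBi paper; the function names and shapes are my own, not BTLM's actual implementation):

```python
# Rough sketch of ALiBi: each attention head h adds -m_h * (query_pos - key_pos)
# to its pre-softmax scores, with slopes m_h forming a geometric sequence.
# No learned position parameters, so the same construction works at any length.
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence starting at (and with ratio) 2^(-8/n_heads),
    # as in the ALiBi paper for power-of-two head counts.
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    # Element [i, j] is (j - i): zero on the diagonal, more negative the
    # farther a key sits behind the query. Future positions clamp to zero
    # (they get masked out by the causal mask anyway).
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)
    slopes = alibi_slopes(n_heads)                       # (n_heads,)
    return slopes[:, None, None] * distance[None, :, :]  # (n_heads, seq_len, seq_len)

# The bias extends to any seq_len -- 8,192, 32,768, whatever -- which is the
# "extrapolation in theory" part; whether the loss stays low is the empirical part.
print(alibi_bias(n_heads=8, seq_len=16).shape)  # torch.Size([8, 16, 16])
```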
https://huggingface.co/cerebras/btlm-3b-8k-base/discussions/25
Context length schedule and performance #25, opened by baffo32
Hey,
I’m looking at your chart showing the incredible performance improvement from greatly extending the context length with a smaller portion of training at the end.
It’s quite notable that most of the gains are at untrained context lengths.
It looks to me like steadily increasing the context length throughout training could possibly flatten the chart entirely, given how big these relative gains are.
Has anyone tried training on steadily increasing context lengths?
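For concreteness, the kind of schedule I mean would be something like the sketch below; the linear ramp shape and the multiple-of-256 rounding are my own arbitrary choices, and the token budget is just BTLM's published total, not a real training recipe:

```python
# Hypothetical context-length curriculum: instead of two fixed phases
# (2,048 then 8,192), ramp the training sequence length across the run.
# The interpolation shape and rounding below are illustrative only.

def context_length_at(tokens_seen: int, total_tokens: int,
                      start_len: int = 2048, end_len: int = 8192) -> int:
    """Linearly interpolate the training context length over the token budget,
    rounded down to a multiple of 256 so batches pack cleanly."""
    frac = min(tokens_seen / total_tokens, 1.0)
    length = start_len + frac * (end_len - start_len)
    return int(length) // 256 * 256

# Example schedule over a 627B-token budget (BTLM's total; the ramp itself is hypothetical):
total = 627_000_000_000
for point in (0.0, 0.25, 0.5, 0.75, 1.0):
    seen = int(point * total)
    print(f"{point:.0%} of tokens -> train at seq_len {context_length_at(seen, total)}")
```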