Long Sequence Lengths

To enable long sequence applications, we use ALiBi position embeddings and train on 470B tokens at a context length of 2,048, followed by 157B tokens at a context length of 8,192. To assess BTLM's long sequence capability, we evaluate it on the SlimPajama test set at a context length of 32,768 and plot the loss at each token position. Although ALiBi allows extrapolation in theory, training at a 2,048 context length alone does not extrapolate well in practice. Thankfully, variable sequence length training allows for substantially improved extrapolation. BTLM-3B extrapolates well up to a 10k context length, but performance degrades slightly beyond this.

Figure 5: https://huggingface.co/cerebras/btlm-3b-8k-base/raw/main/figure_5_xentropy_w...

The figure shows very good longer-context behavior (nearly constant loss out to the ~35k positions charted), which suggests it would have been even better if they had slowly ramped the context length up during training. There is another paper, YaRN (Llama 2), that cites a paper downplaying the performance of ALiBi. I checked that paper: they had only tried a length of 20 and did not do multi-step training.
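To make the mechanism concrete, here is a minimal sketch of how an ALiBi bias can be added to attention logits. This is a generic illustration of the technique from Press et al., not BTLM's actual implementation; the slope formula assumes a power-of-two head count, and the function names are my own.

```python
# Minimal ALiBi sketch: a per-head linear bias, proportional to the distance
# between query and key positions, is added to the attention logits.
import math
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two).
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope_h * (i - j); more negative the farther back the key is.
    slopes = alibi_slopes(n_heads)                       # (H,)
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]               # (L, L), element [i, j] = j - i
    return slopes[:, None, None] * distance[None, :, :]  # (H, L, L)

def attention_with_alibi(q, k, v):
    # q, k, v: (batch, heads, seq, head_dim); causal attention with ALiBi bias.
    b, h, L, d = q.shape
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)      # (B, H, L, L)
    logits = logits + alibi_bias(h, L).to(device=q.device, dtype=q.dtype)
    mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=q.device), diagonal=1)
    logits = logits.masked_fill(mask, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v
```

Because the bias depends only on relative distance (no learned embedding per absolute position), the same formula applies at any sequence length, which is what makes extrapolation beyond the training context possible at all.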
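And a sketch of the kind of evaluation behind the figure: average cross-entropy loss at each token position over long-context test sequences. `model` and `batches` are placeholders (assuming a HuggingFace-style model that returns `.logits`), not the actual BTLM evaluation harness.

```python
# Sketch: per-position cross-entropy over long test sequences, so loss can be
# plotted against token position (e.g. up to 32,768).
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_per_position(model, batches, seq_len=32768):
    total = torch.zeros(seq_len - 1)
    n_batches = 0
    for input_ids in batches:                      # each: (batch, seq_len) token ids
        logits = model(input_ids).logits           # (batch, seq_len, vocab)
        # Predict token t+1 from position t; keep the loss at every position.
        losses = F.cross_entropy(
            logits[:, :-1].transpose(1, 2),        # (batch, vocab, seq_len - 1)
            input_ids[:, 1:],                      # (batch, seq_len - 1)
            reduction="none",
        )
        total += losses.float().mean(dim=0).cpu()  # average over the batch
        n_batches += 1
    return total / n_batches                       # mean loss at each position
```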