[ot] 3B / 3GB quantized edge language model

Undescribed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Fri Sep 22 18:07:56 PDT 2023


Long Sequence Lengths

To enable long sequence applications, we use ALiBi position embeddings
and train on 470B tokens at a context length of 2,048, followed by
157B tokens at a context length of 8,192. To assess BTLM’s long
sequence capability, we evaluate it on the SlimPajama test set at a
32,768 context length and plot the loss at each token position.
Although ALiBi allows extrapolation in theory, training at a 2,048
context length alone does not extrapolate well in practice.
Thankfully, variable sequence length training allows for substantially
improved extrapolation. BTLM-3B extrapolates well up to about 10k
context length, but performance degrades slightly beyond that.
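
For reference, here is a minimal PyTorch sketch of how ALiBi biases are
constructed (the head count and function names are illustrative, not
BTLM's actual code); each attention head gets a fixed slope that
linearly penalizes attention to distant keys, which is what makes
evaluating beyond the training length possible at all:

import torch

def alibi_slopes(n_heads):
    # Geometric sequence of slopes from the ALiBi paper (power-of-two head counts).
    start = 2 ** (-8 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    # bias[h, i, j] = -slope_h * (i - j) for keys j at or before query i.
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)
    return -alibi_slopes(n_heads)[:, None, None] * distance

# Added to raw attention scores before softmax:
scores = torch.randn(1, 8, 16, 16)   # (batch, heads, queries, keys)
scores = scores + alibi_bias(8, 16)  # broadcasts over the batch dimension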

Figure 5 (cross-entropy loss vs. sequence length): https://huggingface.co/cerebras/btlm-3b-8k-base/raw/main/figure_5_xentropy_with_sequence_lengths.svg
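
A rough way to reproduce that kind of curve, assuming the Hugging Face
checkpoint linked above (the loading flags and the memory needed for a
32k forward pass are assumptions on my part), is to run the model over
long test sequences and keep the cross-entropy at every token position
instead of averaging it away:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "cerebras/btlm-3b-8k-base", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("cerebras/btlm-3b-8k-base")

def per_position_loss(text, max_len=32768):
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len]
    with torch.no_grad():
        logits = model(ids).logits
    # Position t predicts token t+1, so shift by one and keep per-token losses.
    return torch.nn.functional.cross_entropy(
        logits[0, :-1], ids[0, 1:], reduction="none")

# Averaging per_position_loss over many SlimPajama documents gives a
# loss-vs-position curve like the one in the figure.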

The image shows very good long-context behavior (charted up to roughly
35k with nearly constant loss), which implies it would be even better
if they had slowly ramped the sequence length up during training.
There's another paper, YaRN for Llama 2, that cites a paper downplaying
ALiBi's performance. I checked that paper: they had only tried a length
of 20 and did not do multistep training.
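
To make the "slowly ramp the length" idea concrete, here is a
hypothetical schedule (the stage boundaries and lengths are made up,
not anything from the BTLM paper) that steps the context window up over
the course of training instead of jumping straight from 2,048 to 8,192:

def context_length_schedule(step, total_steps):
    # Hypothetical curriculum: progressively double the context window.
    stages = [2048, 4096, 8192, 16384]
    stage = min(int(step / total_steps * len(stages)), len(stages) - 1)
    return stages[stage]

# With 100,000 total steps this gives 2,048 tokens for the first quarter
# of training, then 4,096, 8,192, and finally 16,384 for the last quarter.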

