[old][ai] Open Source Alternative to Megatron-based Language Models Released February
The latest free language model from EleutherAI is 20B parameters large. It doesn't look like it is integrated into mainstream hubs like Hugging Face yet; that takes somebody to do the integration work, plus time to coordinate with the people who run the hub. Maybe the team is looking for more community-oriented efforts to develop. Last time, EleutherAI released a free service where people on low-end systems could play with their model. I made a bot that automated it at high frequency on Discord by inspecting their JavaScript. Now the service is for-pay, but the cost is microcents.

https://github.com/EleutherAI/gpt-neox
https://blog.eleuther.ai/announcing-20b/

February, 2022 - links removed from plaintext

Announcing GPT-NeoX-20B, a 20 billion parameter model trained in collaboration with CoreWeave. As of February 9, 2022, GPT-NeoX-20B checkpoints are available for download from The Eye under Apache 2.0. More in-depth information on GPT-NeoX-20B can be found in our preliminary technical report. Looking for a demo? Try GPT-NeoX-20B via CoreWeave and Anlatan’s new inference service, GooseAI!

After a year-long odyssey through months of chip shortage-induced shipping delays, technical trials and tribulations, and aggressively boring debugging, we are happy to finally announce EleutherAI’s latest open-source language model: GPT-NeoX-20B, a 20 billion parameter model trained using our GPT-NeoX framework on GPUs generously provided by our friends at CoreWeave.

GPT-NeoX-20B is, to our knowledge, the largest publicly accessible pretrained general-purpose autoregressive language model, and we expect it to perform well on many tasks. We hope that the increased accessibility of models of this size will aid in research towards the safe use of AI systems, and encourage anyone interested in working in this direction to reach out to us.

As a thank you to our generous compute donors, we are delaying the public downloadable release of the model by 7 days. On February 9, 2022, the full model weights will be downloadable for free under a permissive Apache 2.0 license from The Eye. There will be a #20b channel set up in our Discord for discussions of this model.

Please note that much like our other language models and codebases, GPT-NeoX and GPT-NeoX-20B are very much research artifacts and we do not recommend deploying either in a production setting without careful consideration. In particular, we strongly encourage those looking to use GPT-NeoX-20B to read the paper and datasheet on our training data. There are still bugs to be ironed out and many inefficiencies that could be addressed—but hey, we do this in our free time, give us a break lol

**Task**|**Category**|**Babbage**|**Curie**|**GPT-J-6B**|**FairSeq-13B**|**GPT-NeoX-20B**|**DaVinci**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
LAMBADA|Sentence Completion|62.49%|69.51%|68.29%|70.95%|71.98%|75.16%
ANLI R1|Natural Language Inference|32.40%|32.80%|32.40%|34.00%|33.50%|36.30%
ANLI R2|Natural Language Inference|30.90%|33.50%|34.00%|33.00%|34.40%|37.00%
ANLI R3|Natural Language Inference|33.75%|35.50%|35.50%|34.75%|35.75%|36.83%
WSC|Coreference Resolution|54.54%|49.54%|49.54%|55.44%|49.04%|59.18%
WinoGrande|Coreference Resolution|59.51%|64.56%|64.01%|67.40%|65.27%|69.93%
HellaSwag|Sentence Completion|40.38%|54.81%|36.53%|57.69%|53.61%|63.46%
Total| |39.40%|42.57%|40.28%|44.67%|43.31%|48.40%

Accuracy on standard language modeling tasks.
**Subject Group**|**Babbage**|**Curie**|**GPT-J-6B**|**FairSeq-13B**|**GPT-NeoX-20B**|**DaVinci**
:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:
Humanities|27.01%|26.48%|28.07%|27.27%|28.70%|32.30%
Social Science|27.94%|29.24%|28.73%|27.94%|31.63%|35.87%
STEM|25.83%|24.25%|25.71%|24.63%|26.27%|28.60%
Other|26.86%|28.84%|27.95%|27.33%|29.83%|36.85%
Total|26.78%|26.90%|27.38%|26.53%|28.77%|32.86%

Zero-shot accuracy of factual knowledge by subject group, as measured by the HendrycksTest evaluation.
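Results like the ones in these tables are typically produced by zero-shot scoring: each multiple-choice option is scored by the log-likelihood the model assigns to it given the prompt, and the highest-scoring option is taken as the prediction. Below is a minimal sketch of that scoring loop using the Hugging Face transformers API; since the 20B checkpoint isn't on the hub yet, it uses gpt2 as a stand-in, and the question and answer choices are made up for illustration.

```python
# Minimal sketch of zero-shot multiple-choice scoring with a causal LM.
# gpt2 is a stand-in model; the question/choices are made-up examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap for a larger checkpoint if you have the memory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the tokens belonging to the choice, each predicted from its prefix.
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[0, pos - 1, token_id].item()
    return total

prompt = "Q: What is the boiling point of water at sea level?\nA:"
choices = [" 100 degrees Celsius", " 0 degrees Celsius", " 50 degrees Celsius"]
scores = [choice_logprob(prompt, c) for c in choices]
print(choices[scores.index(max(scores))])
```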
https://mystic.the-eye.eu/public/AI/models/GPT-NeoX-20B/
https://arxiv.org/abs/2101.00027
https://arxiv.org/abs/2201.07311

# GPT-NeoX-20B

## Model Description

GPT-NeoX-20B is an autoregressive transformer language model trained using [GPT-NeoX](https://github.com/EleutherAI/gpt-neox). "GPT-NeoX" refers to the aforementioned framework, while "20B" represents the number of trainable parameters.

**Hyperparameter**|**Value**
:-----:|:-----:
Num. parameters|20,556,201,984
Num. layers|44
D\_model|6,144
D\_ff|24,576
Num. Heads|64
Context Size|2,048
Vocab Size|50257/50432*
Positional Encoding|[Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864)
Rotary Dimensions|25%
Tensor Parallel Size|2
Pipeline Parallel Size|4

\* The embedding matrix is padded up to 50432 in order to be divisible by 128, but only 50257 entries are used by the tokenizer.

The model consists of 44 layers with a model dimension of 6144, and a feedforward dimension of 24,576. The model dimension is split into 64 heads, each with a dimension of 96. Rotary Position Embedding is applied to the first 24 dimensions of each head. The model is trained with the same vocabulary size as in GPT-2/GPT-3, but with a new tokenizer trained on [the Pile](https://pile.eleuther.ai/), our curated pretraining dataset (described below).

## Training data

GPT-NeoX-20B was trained on [the Pile](https://pile.eleuther.ai/), a large-scale curated dataset created by EleutherAI.

## Training procedure

GPT-NeoX-20B was trained for 470 billion tokens over 150,000 steps on 96 40GB A100 GPUs for around three months. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.

## Intended Use and Limitations

GPT-NeoX-20B learns an inner representation of the English language that can be used to extract features useful for downstream tasks. The model is best at what it was pretrained for, however, which is generating text from a prompt. Due to the generality of the pretraining set, it has acquired the ability to generate completions across a wide range of tasks - from programming to fiction writing.

## Limitations and Biases

The core functionality of GPT-NeoX-20B is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-NeoX-20B it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-NeoX-20B to produce factually accurate output.

GPT-NeoX-20B was trained on [the Pile](https://pile.eleuther.ai/), a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon use case GPT-NeoX-20B may produce socially unacceptable text.
See Sections 5 and 6 of [the Pile paper](https://arxiv.org/abs/2101.00027), or [the Pile Datasheet](https://arxiv.org/abs/2201.07311), for a more detailed analysis of the biases in the Pile.

As with all language models, it is hard to predict in advance how GPT-NeoX-20B will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
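As a sanity check, the headline parameter count can be roughly reproduced from the hyperparameter table in the model card above. This is a back-of-the-envelope sketch that ignores biases and layer-norm weights and assumes untied input and output embeddings; the small shortfall against the reported 20,556,201,984 comes from the omitted terms.

```python
# Rough parameter count from the published hyperparameters
# (biases and layer norms omitted; input/output embeddings assumed untied).
d_model = 6144
d_ff = 24576
n_layers = 44
padded_vocab = 50432

attention = 4 * d_model * d_model        # Q, K, V and output projections
feedforward = 2 * d_model * d_ff         # up- and down-projection
per_layer = attention + feedforward      # ~453M per layer
embeddings = 2 * padded_vocab * d_model  # input embedding + untied LM head

total = n_layers * per_layer + embeddings
print(f"{total:,}")  # 20,551,041,024 -- within ~0.03% of the reported 20,556,201,984
```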
The full weight files weigh in at 308033580802 bytes (286.88 GiB). The slim weight files, which usually means precision is reduced to float16 (sometimes float8), weigh in at 41112854242 bytes (38.29 GiB).
Just a note that I might be wrong here about what full and slim mean.
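One quick check is bytes per parameter. Dividing the file sizes by the published parameter count, the slim files come out to almost exactly 2 bytes per parameter, consistent with float16 weights only, while the full files come out to roughly 15 bytes per parameter, which is in the ballpark of a mixed-precision training checkpoint that also carries fp32 master weights and Adam optimizer state. This is just arithmetic, not a statement about what the release actually contains.

```python
# Back-of-the-envelope check of checkpoint sizes against the parameter count.
n_params = 20_556_201_984
full_bytes = 308_033_580_802
slim_bytes = 41_112_854_242

print(f"slim: {slim_bytes / n_params:.2f} bytes/param")  # ~2.00 -> float16 weights only
print(f"full: {full_bytes / n_params:.2f} bytes/param")  # ~14.98 -> weights + training state?
# Mixed-precision Adam training typically stores ~14-16 bytes/param:
# fp16 weights (2) + fp32 master copy (4) + fp32 Adam momentum (4) + variance (4).
```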
Traditionally the entire model is loaded into VRAM to evaluate it, although it can also be streamed in and out or distributed across multiple machines with some hacks. There is overhead beyond the weights themselves, and significantly more overhead if the model is further being trained for a specific task.
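For the streaming/splitting point, one concrete route (assuming the checkpoint eventually lands on the Hugging Face hub, e.g. as EleutherAI/gpt-neox-20b) is the accelerate library's big-model loading, which maps layers across whatever GPUs are present and offloads the rest to CPU RAM or disk. This is a hedged sketch of that approach, not something from the original post; the model id and offload path are assumptions.

```python
# Sketch: load a large causal LM with layers spread across available GPUs
# and the remainder offloaded to CPU/disk (requires `transformers` + `accelerate`).
# The hub id assumes a later upload as "EleutherAI/gpt-neox-20b"; the offload
# folder name is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # ~2 bytes/param, matching the "slim" weights
    device_map="auto",          # place layers on GPU(s) first, then CPU
    offload_folder="offload",   # spill whatever still doesn't fit to disk
)

inputs = tokenizer("EleutherAI's latest model is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```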
Can also add that people have been training models on low-end hardware by tracing and training only a subset of the parameters at a time, whereas traditionally all of them are updated at once. Training systems also support a form of checkpointing that discards intermediate activations during the forward pass and recomputes them when they are needed for the backward pass, as I've mentioned in a spamlog somewhere.
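A minimal sketch of both ideas in PyTorch: freeze most parameters so the optimizer only tracks a small subset, and wrap the expensive blocks in torch.utils.checkpoint so their activations are recomputed during backward instead of stored. The tiny model here is a made-up stand-in, not GPT-NeoX-20B.

```python
# Sketch: train only a subset of parameters, and use activation checkpointing
# so intermediate activations are recomputed instead of stored.
# The model is a tiny made-up stand-in, not GPT-NeoX-20B.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList([nn.Sequential(nn.Linear(256, 256), nn.GELU()) for _ in range(8)])
head = nn.Linear(256, 10)

# Freeze everything except the last block and the head; the optimizer only
# ever sees (and stores state for) the trainable subset.
for block in blocks[:-1]:
    for p in block.parameters():
        p.requires_grad_(False)
trainable = [p for p in list(blocks.parameters()) + list(head.parameters()) if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

x = torch.randn(4, 256)
target = torch.randint(0, 10, (4,))

h = x
for block in blocks:
    # checkpoint() discards this block's intermediate activations on the forward
    # pass and recomputes them during backward, trading compute for memory.
    h = checkpoint(block, h, use_reentrant=False)
loss = nn.functional.cross_entropy(head(h), target)
loss.backward()
optimizer.step()
```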