carperai released large diff models last week

Undescribed Horrific Abuse, One Victim & Survivor of Many gmkarl at gmail.com
Thu Feb 2 17:32:13 PST 2023


https://carper.ai/diff-models-a-new-way-to-edit-code/ [full of links
that i did not paste through]

Diff Models – A New Way to Edit Code
Jan 27, 2023
CarperAI is releasing a series of diff models—models trained to
predict a code diff, trained on millions of commits scraped from
GitHub. We are releasing 3 models of different sizes, all fine-tuned
from Salesforce’s CodeGen code synthesis models:
diff-codegen-350m
diff-codegen-2b
diff-codegen-6b
The dataset of diffs we scraped to train these models will be released
separately in the near future. We hope these models will be useful for
suggesting intelligent changes to existing code, controllable through
a specific commit message describing the change. We will continue to
iterate on our diff models, so stay tuned for further releases.
Read on for more details on how the models were trained, with benchmark results!
Introduction
A diff model is an autoregressive language model trained on edits to a
piece of text, formatted in Unified Diff Format. These diff models can
suggest, given a section of text and a description of the desired
change, an intelligent change to the text that fits the description,
marking the lines added, changed, and deleted in diff format. The
primary use case for these models is for suggesting changes to code—as
such, the models we are releasing are fine-tuned versions of models
already trained on code datasets.
In comparison to few-shot prompting of normal code generation models,
diff models are specialized for suggesting intelligent changes to
existing code, particularly longer pieces of code and where a change
is required to follow some natural language text description (provided
in the form of a commit message).
Prior work by Microsoft Research (Li et al., 2022) and OpenAI (Ray and
McCandlish, 2020 [1]; Lehman et al., 2022) identified the potential for
diffs as a source of rich data on how to make changes to code, and
trained models on diffs, but did not release any diff models or
publish an analysis of how to obtain good performance.
[1] Alex Ray and Sam McCandlish, OpenAI. Independent contribution:
Training diff models, 2020.
A Diff Dataset
Our dataset for this fine-tune consists of commits from GitHub,
obtained using the Google BigQuery Public Dataset, a public, up-to-date
snapshot of a large number of open-source GitHub repositories. We
filtered this dataset in BigQuery on the number of stars in each
repository to exclude repos with fewer than 100 stars, and further
restricted the query to repositories with open-source, non-copyleft
licenses (e.g. MIT, Apache) and commits with more
than 10 characters in the commit message. We also restricted ourselves
to a list of 22 popular programming, scripting, and markup languages,
including Python, HTML, Bash scripts, SQL, C++, etc. This resulted in
a dataset of 19 million commits after filtering.
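As a rough, hedged sketch of what this filtering step could look like
with the google-cloud-bigquery Python client, see below; the project,
table, and column names are placeholders for illustration and do not
reflect the exact schema or query used for the release.

# Sketch of the commit-filtering query described above. The project,
# table, and column names are hypothetical placeholders, not the real schema.
from google.cloud import bigquery

client = bigquery.Client()

LANGUAGES = ["Python", "HTML", "Shell", "SQL", "C++"]  # 22 languages in the real dataset

query = """
SELECT repo_name, commit_hash, message
FROM `my-project.github_snapshot.commits`              -- placeholder table
WHERE stars >= 100                                      -- drop repos with fewer than 100 stars
  AND license IN ('mit', 'apache-2.0', 'bsd-3-clause')  -- permissive, non-copyleft licenses
  AND LENGTH(message) > 10                              -- non-trivial commit messages
  AND language IN UNNEST(@languages)
"""

job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("languages", "STRING", LANGUAGES)
        ]
    ),
)
print(f"{sum(1 for _ in job.result())} candidate commits")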
At this point we had the commit hashes, repository names, and other
metadata for the commits we wanted in our dataset. We then ran git
clone on every repository in our dataset and used a Python script to
obtain the raw code files before the diff is applied, together with
the diff itself in Unified Diff Format. These were processed into
Apache Parquet format using Dask with Apache Arrow to load them
efficiently into a dataframe, with one row per file changed (i.e. if a
diff affected multiple files it was split up), and we included only rows
where each file + diff was short enough to fit into the context of the
language model.
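A minimal sketch of that extraction step is shown below; the helper
name is ours, but the underlying git commands (git show against the
parent revision, and git diff restricted to one path) are standard.

# Sketch: extract the pre-change file and its per-file unified diff for
# one commit. extract_file_and_diff is our own illustrative helper.
import subprocess

def extract_file_and_diff(repo_dir, commit_sha, path):
    # File contents as they were before the commit (the parent revision).
    before = subprocess.run(
        ["git", "show", f"{commit_sha}^:{path}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    # Unified diff for this single file between the parent and the commit.
    diff = subprocess.run(
        ["git", "diff", f"{commit_sha}^", commit_sha, "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return before, diff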
From there, we processed the dataset into EleutherAI’s lm_dataformat,
a utility to create compressed data files for efficient language model
training. The final format of the data seen by the language model
consisted of the filename changed by the diff, the file before
changes, the commit message, and the diff itself, all concatenated
together with delineating tags in between:
<NME> {filename}

<BEF> {file_before_changes}

<MSG> {commit_message}

<DFF> {diff}
The model is then typically prompted with everything up to <DFF>, but
you can also optionally include the section heading of the unified
diff format immediately after <DFF>, which specifies which lines
exactly the model should change. For example, appending @@ -1,3 +1,9 @@
after the diff tag would instruct the model to edit a hunk starting at
line 1 that spans 3 lines before the change and 9 lines after it, a net
addition of 9 - 3 = 6 lines. We do not add these four tags as
special tokens, since we prioritized leaving the tokenizer unchanged.
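For concreteness, assembling a prompt in this format might look like
the sketch below; build_prompt is our own illustrative helper, and the
optional hunk header argument corresponds to the @@ ... @@ line
described above.

# Sketch: build a prompt in the tagged format described above, up to <DFF>.
# build_prompt is an illustrative helper, not part of the released code.
def build_prompt(filename, file_before, commit_message, hunk_header=None):
    prompt = (
        f"<NME> {filename}\n"
        f"<BEF> {file_before}\n"
        f"<MSG> {commit_message}\n"
        f"<DFF>"
    )
    # Optionally pin the model to a specific region, e.g. "@@ -1,3 +1,9 @@".
    if hunk_header is not None:
        prompt += f" {hunk_header}"
    return prompt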
The final dataset consisted of 1.4 million files from 19 million
commits, which resulted in 1.086 billion tokens after tokenizing with
a modified GPT-2 tokenizer to include whitespace tokens—an average of
888 tokens per sample.
Fine-tuning CodeGen
The model suite we worked with as a base was Salesforce’s CodeGen
series of models, which are decoder-only transformer language models
trained to predict the next token in a sequence. These models were
first pre-trained on The Pile, an 800GB dataset of diverse text
released by EleutherAI, and then further trained on a large dataset of
permissively licensed code from GitHub BigQuery in 6 programming
languages, before finally being trained on Python only code from the
same source. Note that the code in these pre-training datasets will
inevitably overlap to some degree with our diff dataset, although they
do not contain diffs.
Salesforce have released variants of their models at 4 scales (350M,
2B, 6B, and 16B parameters) with 3 variants at each scale
corresponding to the 3 different stages of pre-training described
above. We chose to fine-tune the “mono” variants at each model scale,
i.e. the versions trained on Python-only code in addition to
multi-language code.
In order to fine-tune these models on our diff dataset, we used
HuggingFace’s standard fine-tuning script with slight modifications to
customize to CodeGen’s architecture, using the default hyperparameters
and without freezing any layers. To pre-process the data we
concatenated each sample (file with changes) together in the format
described above and cut it into chunks of 2048 tokens (the context
length of the CodeGen models). We then fine-tuned all of the model
sizes with this dataset as an initial trial run and baseline for
further experiments. For all fine-tuning experiments in this post, we
used 64 Nvidia A100 GPUs—we thank Stability AI for access to their
compute resources!
To test a range of hyperparameters, we ran a 12-run sweep with the
350M model across a range of learning rates and batch sizes, and
settled on a learning rate of 3e-5 and a batch size of 1024 samples.
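A minimal sketch of this kind of fine-tune with the HuggingFace Trainer
is given below, using the learning rate from the sweep; the data
loading, padding, per-device batch size, and multi-GPU configuration
are simplified assumptions rather than the exact released setup.

# Sketch: fine-tune a CodeGen "mono" checkpoint on tagged diff samples
# with the HuggingFace Trainer. This is an illustrative outline; the
# per-device batch size, padding, and toy sample below are assumptions.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_name = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder sample in the tagged format (real training used the full dataset).
train_texts = [
    "<NME> hello.py\n<BEF> print('helo')\n<MSG> Fix typo\n"
    "<DFF> @@ -1,1 +1,1 @@\n-print('helo')\n+print('hello')\n"
]

def chunk_samples(texts, block_size=2048):
    """Concatenate samples and cut them into fixed-length chunks."""
    ids = []
    for t in texts:
        ids.extend(tokenizer(t).input_ids)
    chunks = [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
    # Pad the final chunk (simplified; padded positions should ideally be
    # excluded from the loss).
    chunks[-1] += [tokenizer.eos_token_id] * (block_size - len(chunks[-1]))
    return [{"input_ids": c, "labels": c} for c in chunks]

args = TrainingArguments(
    output_dir="diff-codegen-350m",
    learning_rate=3e-5,             # value from the sweep described above
    per_device_train_batch_size=4,  # assumption; the post reports 1024 samples per batch overall
    num_train_epochs=1,
)
Trainer(model=model, args=args, train_dataset=chunk_samples(train_texts)).train()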
Token Masking
We then experimented with masking tokens in the loss computation, as
described in the ELM paper. Specifically, we include only the tokens
in the diff (including the tag <DFF>) in the loss, which is intended
to encourage the model to predict the diff and not memorize the file
and commit message. For example, the filename in <NME> and the file
context in <BEF> are given by the prompt, while the diff after <DFF> is
the only generation target. It is therefore natural to ignore these
unrelated prediction targets and exclude tokens before <DFF> from the
loss computation. We fine-tuned the full suite of
models with this modification to compare the results across model
scale.
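One way to implement this masking, sketched below under our own naming,
is to set the label of every token that ends before the <DFF> tag to
-100, the ignore index used by the cross-entropy loss in
PyTorch/HuggingFace; it relies on a fast tokenizer's offset mapping.

# Sketch: ELM-style loss masking, keeping only <DFF> and the diff after
# it as prediction targets. mask_labels_before_dff is an illustrative
# helper and assumes a fast tokenizer (for return_offsets_mapping).
def mask_labels_before_dff(text, tokenizer):
    enc = tokenizer(text, return_offsets_mapping=True)
    labels = list(enc.input_ids)
    dff_start = text.index("<DFF>")
    for i, (start, end) in enumerate(enc.offset_mapping):
        if end <= dff_start:      # token lies entirely before the <DFF> tag
            labels[i] = -100      # ignored by the cross-entropy loss
    return {"input_ids": enc.input_ids, "labels": labels}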
File Truncation
We also experimented with different ways of truncating the file before
changes to fit more of it into the context length. Without any
truncation, roughly half of the files in the original dataset fit into
the 2048 context length, for a total of 1.086 billion tokens. If we
crop the file before changes to only contain the lines in the diff
file, we can then fit 95% of the original dataset in the context, for
a total of 2.181 billion tokens (see Figure 1). We hoped that
including the extra data at the cost of some context in the file being
changed would improve the model’s performance. However, we found that
this experiment resulted in a model significantly worse than the one
trained without truncation, likely because being able to see the entire
class or function that a change relies on is important for modelling.
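As a rough sketch of this truncation, one can parse the @@ hunk headers
out of the diff and keep only the corresponding line ranges of the
pre-change file; the helper and the context window of 3 lines below are
our own illustrative choices.

# Sketch: crop the pre-change file to the line ranges referenced by the
# diff's hunk headers, plus a little surrounding context. Illustrative only.
import re

HUNK_RE = re.compile(r"^@@ -(\d+),?(\d*) \+\d+,?\d* @@", re.MULTILINE)

def crop_to_diff_lines(file_before, diff, context=3):
    lines = file_before.splitlines()
    keep = set()
    for match in HUNK_RE.finditer(diff):
        start = int(match.group(1))
        length = int(match.group(2) or 1)
        lo = max(1, start - context)
        hi = min(len(lines), start + length - 1 + context)
        keep.update(range(lo, hi + 1))
    return "\n".join(lines[i - 1] for i in sorted(keep))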

Figure 1: Histograms of the samples in the dataset, ordered by length
of file in tokens on the x-axis. (left) The baseline dataset, showing
that the 2048 context length of our language models cuts off around
50% of the files. (right) The result of truncating the code file
before changes to only contain lines altered by the diff and
surrounding context.
Results
To evaluate our models, we test their bug fixing capabilities on two
tasks: 4-Parity, a simple toy benchmark where the model is required to
fix basic bugs in a Python function to calculate the parity of a 4-bit
sequence, and a more complex dataset of many synthetic and real Python
bugs scraped from GitHub repositories by He et al. (2022). These
benchmarks provide a simple testbed for whether diff LLMs can make
multiple coordinated and effective changes to code.
For 4-Parity, we generate completions using a prompt consisting of the
original function followed by the commit message <MSG> # Fixed bugs.
We generate 3200 completions for each model, apply the resulting diff
patches to the original function, execute the generated code and
report the percentage of generations where the resulting 4-Parity
function is correct across all test cases, at the best model
temperature from {0.7, 0.8, 0.9}. We report results for 1-5 bugs
synthetically introduced into the original function.
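A stripped-down version of this evaluation loop might look like the
sketch below; the buggy function, the use of the standard patch utility
to apply the generated diff, and the test harness are our own
illustrative stand-ins for the actual benchmark code.

# Sketch: apply a model-generated diff to a buggy 4-parity function with
# the standard `patch` utility, then execute and test it on all 16 bit
# patterns. Every name here is an illustrative stand-in for the real
# benchmark harness.
import itertools
import pathlib
import subprocess
import tempfile

BUGGY_SOURCE = '''\
def parity(b1, b2, b3, b4):
    """Return the parity of a sequence of input bits."""
    return (b1 + b2 + b3 + b4) % 3   # introduced bug: should be % 2
'''

def apply_diff(source, diff_text):
    """Apply a unified diff to `source`; return the patched text or None."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp, "parity.py")
        patch_file = pathlib.Path(tmp, "fix.diff")
        src.write_text(source)
        patch_file.write_text(diff_text)
        result = subprocess.run(["patch", str(src), str(patch_file)],
                                capture_output=True, text=True)
        if result.returncode != 0:
            return None               # the generated diff did not apply cleanly
        return src.read_text()

def is_correct(patched_source):
    """Execute the patched function and check it against the true parity."""
    namespace = {}
    try:
        exec(patched_source, namespace)
        parity = namespace["parity"]
        return all(parity(*bits) == sum(bits) % 2
                   for bits in itertools.product([0, 1], repeat=4))
    except Exception:
        return False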
For the latter task of real Python bugs, we filter the dataset down to
1000 bugs across several bug-fixing problem classes (e.g. the wrong
binary operator and incorrect variable name problems), where we generate a
diff for each bug and measure the exact string match accuracy between
the generated function after applying the diff, and the correct
(bug-free) function. The commit message for this task is Fix
{bug_class}, where the bug class might be, for example, “incorrect
binary operator”. Note that in this case we do not execute the
generated code to test it, since these bugs are scraped from many
different GitHub repositories and execution would be impractical.
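As a small sketch, the metric for this task reduces to a string
comparison after applying each generated diff; exact_match_accuracy is
our illustrative helper and reuses the apply_diff sketch from the
4-Parity example above.

# Sketch: exact string match accuracy over generated bug fixes.
# Reuses the illustrative apply_diff helper sketched above.
def exact_match_accuracy(examples):
    """examples: iterable of (buggy_source, generated_diff, fixed_source)."""
    hits, total = 0, 0
    for buggy, diff, fixed in examples:
        total += 1
        patched = apply_diff(buggy, diff)
        if patched is not None and patched == fixed:
            hits += 1
    return hits / max(total, 1)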
The results from 4-Parity, shown in Figure 2, demonstrate that our
diff models can perform basic bug fixing at comparable skill to the
CodeGen models, which are prompted with the bugged function followed
by # Fixed bugs. There is a clear performance increase with scale, and
the 350M diff model performs better than its CodeGen counterpart at the
bug fixing task. We can
also see that the loss masking approach described above results in
significantly better diff models on this task.

Figure 2: Results from evaluating our diff models on the simple
4-Parity bug fixing task. The x-axis is the number of progressively
introduced bugs in the 4-Parity function. The bolded lines show our
best diff models, while the dot-dash lines show the CodeGen models we
used as a starting point, and the baseline models are trained without
loss masking.
Table 1 shows the results from our diff models on the synthetic + real
bugs benchmark, using the pass at k metric with k = 1 (defined as the
fraction of problems solved when the model generates k code samples
per problem). We can see that the masked diff models perform slightly
better than the baseline diff models.
Model                 Pass at 1 (Synthetic + Real Bugs)
Baseline Diff 350M    0.9%
Baseline Diff 2B      1.9%
Baseline Diff 6B      2.3%
Masked Diff 350M      1.7%
Masked Diff 2B        3.9%
Masked Diff 6B        4.8%
CodeGen 350M          2.0%
CodeGen 2B            3.8%
CodeGen 6B            4.5%
Table 1: Pass at 1 (Chen et al., 2021) on the real + synthetic bugs
benchmark (He et al., 2022).
Qualitatively, we also evaluated the accuracy of the line numbers in
the generated diff hunk headers, and noticed that the larger models do
very well at generating line numbers which correspond to the lines that
the diff body actually changes. This opens the door
to prompting the model with specific line numbers to change, add, or
remove, allowing for more control over the code generation in
comparison with a non-diff model.
We also noticed that diff models (especially the 2B and 6B) tend to do
better when prompted with longer code generation tasks (such as fixing
bugs in a large function), and that varying the prompt induces greater
diversity in generated code in comparison with the normal CodeGen
models.
In further work, we hope to examine in greater detail the enhanced
diversity and localised mutation abilities that diff models offer over
standard code generation models, across many model scales.
Accelerated Inference with Triton and FasterTransformer
We also investigated the use of Nvidia’s FasterTransformer (FT)
framework with the Triton Inference Server using an FT backend to
achieve significantly accelerated inference. FasterTransformer is a
collection of fused CUDA kernels optimized for inference, written in
C++. The Triton Inference Server is an optimized system for serving
large language models at scale, in both multi-GPU and multi-node
setups using Docker containers.
Converting the CodeGen models to FT involved significant technical
work, since CodeGen is not supported natively in FT. We first
converted the CodeGen weights to GPT-J format via a linear algebra
trick, since GPT-J has a very similar architecture, building on
Brendan Dolan-Gavitt’s work with the Fauxpilot framework. From there,
we used the FT script to convert the GPT-J HuggingFace checkpoint into
FT’s format, which can be run with the Triton server. We struggled to
get this to run on our cluster (which does not use Docker), but
eventually succeeded and achieved a significant speedup on inference
of our models—in some cases up to an order of magnitude faster.
Model          Time: HuggingFace Transformers    Time: FasterTransformer + Triton Inference Server
CodeGen 350M   5m 44s                            31s
CodeGen 2B     9m 38s                            1m 27s
CodeGen 6B     10m 45s                           2m 9s
Table 2: Time benchmark results for the base CodeGen models on the
4-Parity task described above, comparing HuggingFace Transformers
inference speed with FasterTransformer using the Triton Inference
Server.
Our scripts to convert and run these models with FasterTransformer and
Triton are available in the OpenELM library.
We hope that this work inspires others to take our models and
experiment with the potential of diff-based code generation!
To cite this blog post, please use the following entry:
H. Bradley, H. Fan, H. Saini, R. Adithyan, S. Purohit, and J. Lehman.
(Jan 2023). Diff Models - A New Way to Edit Code. CarperAI Blog.
https://carper.ai/diff-model/.
Or
@article{bradley2023diffmodels,
  title   = "Diff Models - A New Way to Edit Code",
  author  = "Bradley, Herbie and Fan, Honglu and Saini, Harry and
Adithyan, Reshinth and Purohit, Shivanshu and Lehman, Joel",
  journal = "CarperAI Blog",
  year    = "2023",
  month   = "Jan",
  url     = "https://carper.ai/diff-model/"
}
Change Log: Changed y-axis on Figure 2 to be clearer.
Acknowledgements
The CarperAI diff models team consisted of Herbie Bradley, Honglu Fan,
Harry Saini, Reshinth Adithyan, Shivanshu Purohit, and Joel Lehman.
We thank Stability AI for providing compute resources.

