VersiCode: Towards Version-controllable Code Generation

1 Monash University, Australia; 2 Nanjing University of Posts and Telecommunications, China; 3 ByteDance Ltd., China; 4 CSIRO's Data61, Australia

VersiCode is the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions.


Abstract

Significant research and development efforts have gone into enhancing large language models' performance on code-related tasks due to their practical importance. Models' performance on these tasks is typically measured on public benchmark datasets. Current datasets, however, are oblivious to the notion of \emph{versions}, an essential concept in professional software development. In this paper, we introduce VersiCode, the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions. VersiCode encompasses 301 libraries across more than 2,000 versions spanning 9 years. We design two dedicated evaluation tasks: version-specific code completion (VSCC) and version-aware code editing (VACE). Comprehensive experiments benchmark the performance of LLMs on these tasks and reveal their challenging nature: even state-of-the-art LLMs struggle to generate version-correct code. This dataset, together with the proposed tasks, sheds light on LLMs' capabilities and limitations in handling version-specific code generation, and opens up an important new area of research for further investigation.

VersiCode: A Benchmark for Version-controllable Code Generation

VersiCode is a large-scale code generation benchmark focusing on evolving library dependencies. We propose two tasks that simulate real-world applications, version-specific code completion and version-aware code editing, incorporating version information as a constraint on code generation.

The data statistics of VersiCode are summarized in the tables below.
| | Python | Java | C# | JavaScript |
|---|---|---|---|---|
| Data Source | StackOverflow; Library Source Code; Downstream Application | StackOverflow | StackOverflow | StackOverflow |
| Num. of Libraries | 301 | 19 | 16 | 33 |
| Num. of Versions | 2,208 | 25 | 16 | 60 |
| Size of Meta Data | 11,269 | 29 | 16 | 62 |

| Language | Task Type | Granularity | Avg. Input Tokens | Avg. Output Tokens | Num. of Instances |
|---|---|---|---|---|---|
| Python | Completion | Token | 1,233 | 1 | 13,533 |
| Python | Completion | Line | 1,212 | 9 | 13,531 |
| Python | Completion | Block | 44 | 77 | 1,618 |
| Python | Editing (old to new) | Block | 115 | 70 | 49,346 |
| Python | Editing (new to old) | Block | 116 | 69 | 49,346 |
| Java | Completion | Block | 47 | 220 | 32 |
| C# | Completion | Block | 51 | 148 | 21 |
| JavaScript | Completion | Block | 56 | 131 | 82 |

Dataset Curation and Collection

As shown in Figure 2, we first collected permissively licensed Python repositories from GitHub, ranked by popularity (stars). For each library, we gathered data from three sources.

1. Library Source Code: We collected all available versions from GitHub and verified them against PyPI to ensure they are officially released and pip-installable, then extracted official usage examples for each API from the docstrings (a minimal extraction sketch is shown after this list).

2. Downstream Application Code: We collected source code accompanying top-tier research papers from the past 10 years, given Python's popularity in scientific programming. These applications are lightweight, diverse in topic, and have release timelines tied to their publication venues, so they implicitly cover evolving library versions.

3. StackOverflow: Using library names as queries, we collected FAQ data from StackOverflow, which provides real user queries and diverse user answers.

Figure 2: The dataset curation and collection pipeline.
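As a concrete illustration of step 1, the sketch below pulls doctest-style usage examples out of a library's docstrings using Python's standard ast and doctest modules. This is a minimal sketch of the idea, not the exact extraction pipeline used to build VersiCode; the example file path is hypothetical.

```python
import ast
import doctest

def extract_docstring_examples(source_path: str):
    """Collect doctest-style usage examples from every function/class docstring
    in one Python source file. Illustrative only; the real VersiCode pipeline
    may parse library documentation differently."""
    with open(source_path, encoding="utf-8") as f:
        tree = ast.parse(f.read(), filename=source_path)

    examples = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if not doc:
                continue
            # doctest recognizes ">>> " examples embedded in docstrings.
            for ex in doctest.DocTestParser().get_examples(doc):
                examples.append({"api": node.name, "example": ex.source.strip()})
    return examples

# e.g. extract_docstring_examples("requests/api.py")  # hypothetical path
```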

Task Design for Version-controllable Code Generation

As shown in Figure 3, we define each meta-instance as \( m_i = [l_i, v_i, d_i, c_i] \in M \), where \( l_i \), \( v_i \), \( d_i \), and \( c_i \) denote the library name, version, functionality description, and code snippet, respectively. We then design the following two version-controllable code generation tasks.

  • Version-Specific Code Completion (VSCC): Given an input \( x = [l_i, v_i, d_i, c'_i] \), where \( c'_i \) is the code snippet \( c_i \) with selective masking that replaces the library- and version-sensitive content with a special token. Depending on the length of the masked content, the special token is “[token-mask]”, “[line-mask]”, or “[block-mask]”, reflecting code completion at different granularity levels. The output \( y \) is the masked content, typically a function name or variable.
  • Version-Aware Code Editing (VACE): Given a pair of meta-instances \( (m_i, m_j) \) with \( l_i = l_j \), \( d_i = d_j \), and \( v_i \neq v_j \), the input is \( x = [l_i, v_i, d_i, c_i, v_j] \) and the output is \( y = c_j \). Note that version editing may require refactoring the code structure, making it difficult to pose the task at the token or line granularity as in code completion. Additionally, depending on the numerical relationship between \( v_i \) and \( v_j \), different scenarios arise, such as editing from an old version to a new one, or vice versa. Illustrative instances of both tasks are sketched below.
Figure 3: Task design for version-controllable code generation.
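To make the two task formats concrete, here is a minimal sketch of what a VSCC instance and a VACE instance might look like. The field names, the torch example, and the "somelib" library are illustrative assumptions, not the released dataset schema.

```python
# Hypothetical VSCC instance: given the library, version, and description, the
# model must recover the masked, version-sensitive content. Field names are
# illustrative, not the released dataset schema.
vscc_instance = {
    "library": "torch",
    "version": "1.5.0",
    "description": "Run inference without tracking gradients.",
    "masked_code": "with torch.[token-mask]():\n    preds = model(batch)",
    "answer": "no_grad",  # the masked token-level content
}

# Hypothetical VACE instance: same library and functionality, different versions;
# the model rewrites the source-version snippet for the target version.
# "somelib" and its APIs are placeholders.
vace_instance = {
    "library": "somelib",
    "description": "Load a configuration file.",
    "source_version": "1.2.0",
    "source_code": "cfg = somelib.load_config('cfg.yaml')",
    "target_version": "2.0.0",
    "target_code": "cfg = somelib.Config.from_file('cfg.yaml')",
}
```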

Comparison between VersiCode and other datasets

Code editing datasets

  • VersiCode stands out as the largest annotated dataset specifically tailored for version adaptation.

Code completion datasets

  • VersiCode stands out for its annotated data size, marking it as the first dataset tailored for version-specific code generation.

Results and Analysis


Main Results

VersiCode is challenging: even state-of-the-art LLMs struggle to generate version-correct code.


Analysis


(A) Even token-level code completion is challenging

We present the Pass@1 results of token-level code completion for LLMs on VersiCode, sorted by release time (see Figure 4-a1, in green). Compared to the Pass@1 results on HumanEval (in blue) and MBPP (in orange), all models perform significantly worse on VersiCode. This indicates the difficulty of disambiguating and recalling version-specific library usage. It is worth noting that the larger and more recent models, such as GPT-4o (M13) and LLaMA3-70B (M12), achieve significantly better performance than the other models (see Appendix D.1 for an error analysis of GPT-4o). However, the gap to HumanEval and MBPP remains large, at least 15 points. Thus, even on the simplest token-level completion task, state-of-the-art LLMs struggle to achieve satisfactory performance.
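For reference, Pass@k is presumably computed with the standard unbiased estimator popularized by the HumanEval evaluation; the sketch below shows that formula, under the assumption that VersiCode's Pass@1 follows the same convention (the paper's exact implementation may differ).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): n generated samples,
    c of which are judged correct. Assumed, not confirmed, to be the exact
    convention behind VersiCode's Pass@1 numbers."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 2 correct, the pass@1 estimate is simply 2/10.
print(pass_at_k(n=10, c=2, k=1))  # 0.2
```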

(B) Differences in LLM performance across different data sources

We present the Pass@1 results of token-level code completion for LLMs on VersiCode, categorized by data source, in Figure 4-a2. Across the three data sources, most models perform much better on Stack Overflow than on the other two, especially the source code from downstream applications. This may be due to the high diversity of downstream applications, which demands stronger generalization. It may also suggest that Stack Overflow is highly represented in the pre-training data of LLMs, increasing the likelihood that models have memorized specific content, and hence a greater chance of data leakage. Similar to Figure 4-a1, the outliers are again GPT-4o (M13) and LLaMA3-70B (M12), which excel at handling downstream application code. Please refer to our paper for the full numeric results.

(C) Challenges with intermediate library versions

We present Pass@1 results for the token-level code completion task, categorized by lifespan feature: addition (in blue), deprecation (in orange), and general (i.e., intermediate versions; in green); see Figure 4-b. Most models perform well on the addition and deprecation cases, as newly added or deprecated APIs are likely emphasized in documentation or by the community. However, most models struggle to reason about and adapt to intermediate versions. Viewed alongside Figure 4-a2, it is evident that models such as LLaMA3-70B, which perform better on downstream applications, are also better at intermediate versions, benefiting from the diversity of use cases.

(D) Reduced context increases error risk in code generation

Based on the token-level code completion performance of each model, we selected the top models for further analysis and present a multi-granularity comparison in Table 3. Comparing line-level and block-level code completion, we observe that smaller models fail more frequently at block-level completion, where there is less code context and more content must be generated, which aligns with our intuition. Note that the results shown here have been filtered by grammar verification, a post-generation validation step that only counts code that successfully compiles in Python. Without grammar verification, the overall block-level completion performance in Table 11 (Appendix E.3) is comparable to the line-level completion results in Table 3. This suggests that while the models can predict code-style content, they cannot guarantee correct programming grammar.
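As an illustration of what such a grammar-verification filter can look like, the snippet below accepts a generated Python snippet only if it parses with the standard ast module. This is a minimal stand-in for the step described above; the paper's actual verification may differ in detail.

```python
import ast

def passes_grammar_check(generated_code: str) -> bool:
    """Return True if the generated snippet is syntactically valid Python.
    A minimal stand-in for the grammar-verification filter described above."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False

print(passes_grammar_check("with torch.no_grad():\n    preds = model(batch)"))  # True
print(passes_grammar_check("with torch.no_grad(:\n    preds ="))                # False
```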

(E) The context code in another version is still helpful, but its benefits are limited

The comparison between block-level code completion and block-level code editing is shown in Table 3. There is a significant improvement across most models, except for LLaMA3-70B and GPT-4o. When provided with code in another version as context (i.e., in the code editing task), these models generate correct code at a much higher rate. However, a bottleneck is evident for LLaMA3-70B and GPT-4o, where the code context hinders their performance compared to code completion.

(F) Major version matters in version-aware code editing

As shown in Table 3, “old to new” and “new to old” editing show similar performance across models. In Figure 4, we categorize editing instances according to their source and target versions, distinguishing between major and minor versions. It is evident that when a major version serves as the source, editing performance is inferior to the other scenarios.
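One way to implement this categorization, assuming the libraries follow semantic versioning, is sketched below with the packaging library; this is an illustrative bucketing, not necessarily the exact rule used in the paper.

```python
from packaging.version import Version

def is_major_release(version_string: str) -> bool:
    """Treat 'X.0.0'-style releases as major versions (semantic-versioning
    assumption); everything else counts as a minor/patch release."""
    v = Version(version_string)
    return v.minor == 0 and v.micro == 0

def editing_bucket(source_version: str, target_version: str) -> str:
    """Label an editing instance by whether its source/target is a major release."""
    src = "major" if is_major_release(source_version) else "minor"
    tgt = "major" if is_major_release(target_version) else "minor"
    return f"{src} -> {tgt}"

print(editing_bucket("2.0.0", "2.3.1"))  # "major -> minor"
print(editing_bucket("1.4.2", "2.0.0"))  # "minor -> major"
```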

(G) The programming knowledge of LLMs, particularly regarding version-specific information, is surprisingly outdated

In Figure 5, we present the Pass@1 performance for token-level code completion grouped by year, covering 2015-2023, together with a histogram of the data distribution for each year. To ensure precise timestamps and minimize noise, we only used instances collected from library source code. As shown in Figure 5-a, there is a general trend: the more recent the version, the worse the models' performance. This is counter-intuitive compared to temporal knowledge question answering [48], where performance initially increases before declining. We further filtered for “deprecation” (Figure 5-b) and “addition” (Figure 5-c) to identify version-sensitive cases. While the sparsity of the data reduces confidence in the results, we observe a consistent decreasing trend over time in both cases. This suggests that LLMs hold outdated programming knowledge, highlighting the need for rapid adaptation to newer libraries and APIs.

Figure 5: Time analysis of token-level code completion performance.

BibTeX

@article{versicode,
  author       = {Tongtong Wu and Weigang Wu and Xingyu Wang and Kang Xu and Suyu Ma and Bo Jiang and Ping Yang and Zhenchang Xing and Yuan-Fang Li and Gholamreza Haffari},
  title        = {VersiCode: Towards Version-controllable Code Generation},
  journal      = {CoRR},
  volume       = {abs/2406.07411},
  year         = {2024},
  url          = {https://arxiv.org/abs/2406.07411},
}