eyeballvul: a future-proof benchmark for vulnerability detection in the wild
These are my personal notes on the paper: https://arxiv.org/abs/2407.08708
Abstract
This paper introduces "eyeballvul", a benchmark designed to test the vulnerability detection capabilities of LLMs. As of July 2024, eyeballvul contains 24,000+ vulnerabilities and is approximately 55GB in size. It is available on GitHub: https://github.com/timothee-chauvin/eyeballvul
Objective
LLMs have large context windows, making them promising candidates for use as SAST tools. However, no existing benchmark or dataset evaluates their vulnerability detection performance in a realistic, repository-level setting. This paper addresses that gap.
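To make the intended use case concrete, here is a minimal sketch of the repository-level detection setting the benchmark targets: feed a model the source files of one repository revision and ask it for a list of vulnerability leads. This is not the paper's actual harness; the prompt, file selection, model name, and use of the OpenAI SDK are illustrative assumptions.

```python
# Hypothetical sketch of LLM-as-SAST over one repository revision.
from pathlib import Path
from openai import OpenAI

def gather_sources(repo_path: str, max_chars: int = 400_000) -> str:
    """Concatenate source files until a rough character budget is reached."""
    chunks, total = [], 0
    for path in sorted(Path(repo_path).rglob("*")):
        if path.is_file() and path.suffix in {".py", ".js", ".c", ".go", ".java"}:
            text = path.read_text(errors="ignore")
            if total + len(text) > max_chars:
                break
            chunks.append(f"--- {path} ---\n{text}")
            total += len(text)
    return "\n".join(chunks)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # any long-context model would do
    messages=[
        {"role": "system", "content": "You are a security reviewer. List likely "
         "vulnerabilities as: file, rough location, CWE, short explanation."},
        {"role": "user", "content": gather_sources("path/to/checked-out-revision")},
    ],
)
print(response.choices[0].message.content)
```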
eyeballvul details
Quoted from the paper's introduction:
- real world vulnerabilities: sourced from a large number of CVEs in open-source repositories;
- realistic detection setting: directly tests a likely way that vulnerability detection could end up being deployed in practice (contrary to many previous classification-type datasets);
- large size: over 6,000 revisions and 24,000 vulnerabilities, over 50GB in total size;
- diversity: no restriction to a small set of programming languages;
- future-proof: updated weekly from the stream of published CVEs, alleviating training data contamination concerns; far from saturation
Benchmark creation process
- Download CVE data related to open-source repositories from the OSV dataset.
- Group CVEs by repository and read the affected version list.
- Select the smallest hitting set of revisions, i.e., the fewest revisions that together cover every CVE's affected versions, using Google's CP-SAT solver (see the sketch after this list).
- Switch revisions using Git.
- Compute repository size and language using linguist.
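The hitting-set step can be expressed directly as a small CP-SAT model. Below is a minimal sketch, assuming each CVE maps to the set of revisions it affects; the function and variable names are hypothetical and the real eyeballvul code differs.

```python
# Minimal hitting-set sketch with Google's CP-SAT solver (OR-Tools).
from ortools.sat.python import cp_model

def smallest_hitting_set(cve_to_revisions: dict[str, set[str]]) -> set[str]:
    """Pick the fewest revisions such that every CVE is covered by at least one."""
    all_revisions = sorted(set().union(*cve_to_revisions.values()))
    model = cp_model.CpModel()

    # One boolean per candidate revision: 1 if the revision is kept in the benchmark.
    chosen = {rev: model.NewBoolVar(rev) for rev in all_revisions}

    # Each CVE must be "hit" by at least one chosen revision it affects.
    for revisions in cve_to_revisions.values():
        model.Add(sum(chosen[rev] for rev in revisions) >= 1)

    # Minimize the number of revisions to check out and store.
    model.Minimize(sum(chosen.values()))

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    assert status in (cp_model.OPTIMAL, cp_model.FEASIBLE)
    return {rev for rev in all_revisions if solver.Value(chosen[rev])}

# Example: CVE-A affects r1 and r2, CVE-B affects r2, CVE-C affects r3.
# The optimal hitting set is {r2, r3}.
print(smallest_hitting_set({
    "CVE-A": {"r1", "r2"},
    "CVE-B": {"r2"},
    "CVE-C": {"r3"},
}))
```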
Interesting points
- CVE data quality is not always good; I've heard similar issues discussed on a podcast.
- The total repository size is very large (around 37GB). Running the full benchmark through Claude 3 Opus could cost over $150k (a rough estimate is sketched after this list).
- To use this benchmark via API at a reasonable cost, you can work with a subset of the data (e.g., filter revisions by date, language, or size).
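A rough back-of-envelope check of the cost claim above. The bytes-per-token ratio is an assumption, and this only counts input tokens; actual costs depend on file selection, chunking, and output length.

```python
# Hypothetical cost estimate for running ~37GB of code through Claude 3 Opus.
total_bytes = 37e9                 # ~37 GB of repository contents
bytes_per_token = 4                # rough average for source code (assumption)
price_per_million_input = 15.0     # Claude 3 Opus input pricing, USD

tokens = total_bytes / bytes_per_token
cost = tokens / 1e6 * price_per_million_input
print(f"~{tokens / 1e9:.1f}B input tokens, ~${cost:,.0f} before output tokens")
# -> roughly $139,000 for input alone, so easily over $150k once output tokens
#    and any repeated context are included.
```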
Phrase
Vulnerability detection is a dual-use capability that is sought by both defenders and attackers.