eyeballvul: a future-proof benchmark for vulnerability detection in the wild
These are my personal notes on the paper: https://arxiv.org/abs/2407.08708
Abstract
This paper introduces "eyeballvul", a benchmark designed to test the vulnerability detection capabilities of LLMs. As of July 2024, eyeballvul contains 24,000+ vulnerabilities and is approximately 55GB in size. It is available on GitHub: https://github.com/timothee-chauvin/eyeballvul
Objective
LLMs have large context windows, making them promising candidates for use as SAST tools. However, no existing benchmark or dataset evaluates their vulnerability detection performance in a realistic, repository-level setting. This paper addresses that gap.
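To make the intended use case concrete, here is a minimal sketch of the repository-level detection setting the benchmark targets: feed a model the source files of one repository revision and ask it for a list of vulnerability leads. This is not the paper's actual harness; the prompt, file selection, model name, and use of the OpenAI SDK are illustrative assumptions.

```python
# Hypothetical sketch of LLM-as-SAST over one repository revision.
from pathlib import Path
from openai import OpenAI

def gather_sources(repo_path: str, max_chars: int = 400_000) -> str:
    """Concatenate source files until a rough character budget is reached."""
    chunks, total = [], 0
    for path in sorted(Path(repo_path).rglob("*")):
        if path.is_file() and path.suffix in {".py", ".js", ".c", ".go", ".java"}:
            text = path.read_text(errors="ignore")
            if total + len(text) > max_chars:
                break
            chunks.append(f"--- {path} ---\n{text}")
            total += len(text)
    return "\n".join(chunks)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # any long-context model would do
    messages=[
        {"role": "system", "content": "You are a security reviewer. List likely "
         "vulnerabilities as: file, rough location, CWE, short explanation."},
        {"role": "user", "content": gather_sources("path/to/checked-out-revision")},
    ],
)
print(response.choices[0].message.content)
```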
eyeballvul details
Quoted from the paper's introduction:
- real world vulnerabilities: sourced from a large number of CVEs in open-source repositories;
- realistic detection setting: directly tests a likely way that vulnerability detection could end up being deployed in practice (contrary to many previous classification-type datasets);
- large size: over 6,000 revisions and 24,000 vulnerabilities, over 50GB in total size;
- diversity: no restriction to a small set of programming languages;
- future-proof: updated weekly from the stream of published CVEs, alleviating training data contamination concerns; far from saturation
Benchmark creation process
- Download CVE data related to open-source repositories from the OSV dataset.
- Group CVEs by repository and read the affected version list.
- Select the smallest hitting set of revisions, i.e., the fewest revisions that together cover every CVE's affected versions, using Google's CP-SAT solver (see the sketch after this list).
- Switch revisions using Git.
- Compute repository size and language using linguist.
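The hitting-set step can be expressed directly as a small CP-SAT model. Below is a minimal sketch, assuming each CVE maps to the set of revisions it affects; the function and variable names are hypothetical and the real eyeballvul code differs.

```python
# Minimal hitting-set sketch with Google's CP-SAT solver (OR-Tools).
from ortools.sat.python import cp_model

def smallest_hitting_set(cve_to_revisions: dict[str, set[str]]) -> set[str]:
    """Pick the fewest revisions such that every CVE is covered by at least one."""
    all_revisions = sorted(set().union(*cve_to_revisions.values()))
    model = cp_model.CpModel()

    # One boolean per candidate revision: 1 if the revision is kept in the benchmark.
    chosen = {rev: model.NewBoolVar(rev) for rev in all_revisions}

    # Each CVE must be "hit" by at least one chosen revision it affects.
    for revisions in cve_to_revisions.values():
        model.Add(sum(chosen[rev] for rev in revisions) >= 1)

    # Minimize the number of revisions to check out and store.
    model.Minimize(sum(chosen.values()))

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    assert status in (cp_model.OPTIMAL, cp_model.FEASIBLE)
    return {rev for rev in all_revisions if solver.Value(chosen[rev])}

# Example: CVE-A affects r1 and r2, CVE-B affects r2, CVE-C affects r3.
# The optimal hitting set is {r2, r3}.
print(smallest_hitting_set({
    "CVE-A": {"r1", "r2"},
    "CVE-B": {"r2"},
    "CVE-C": {"r3"},
}))
```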
Interesting points
- CVE data quality is not always good; I've heard similar issues discussed on a podcast.
- The total repository size is very large (around 37GB). Running the full benchmark through Claude 3 Opus could cost over $150k (a rough estimate is sketched after this list).
- To use this benchmark via API at a reasonable cost, you can work with a subset of the data (e.g., filter revisions by date, language, or size).
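A rough back-of-envelope check of the cost claim above. The bytes-per-token ratio is an assumption, and this only counts input tokens; actual costs depend on file selection, chunking, and output length.

```python
# Hypothetical cost estimate for running ~37GB of code through Claude 3 Opus.
total_bytes = 37e9                 # ~37 GB of repository contents
bytes_per_token = 4                # rough average for source code (assumption)
price_per_million_input = 15.0     # Claude 3 Opus input pricing, USD

tokens = total_bytes / bytes_per_token
cost = tokens / 1e6 * price_per_million_input
print(f"~{tokens / 1e9:.1f}B input tokens, ~${cost:,.0f} before output tokens")
# -> roughly $139,000 for input alone, so easily over $150k once output tokens
#    and any repeated context are included.
```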
Phrase
Vulnerability detection is a dual-use capability that is sought by both defenders and attackers.