The benchmarks

How we measure GitLab Duo Code Review.

One independently scored. One internally scored based on a public methodology. Numbers converge across both.

Benchmarks: 2 public
Coverage: 150 MRs · 10 languages
Datasets: Open source on GitHub
The results

Two public benchmarks. Same result.

GitLab Duo Code Review has been measured on two public benchmarks. One is independently scored by the benchmark's publisher (Martian). The other is our own internal evaluation based on Qodo's published methodology. Different repos, different languages. Both report F1 in the same range: 48.3% F1 on Martian, and 50.5% Hit F1 (45.2% Unified F1) on PR-Review-Bench.

Self-run, public methodology · Open source

PR-Review-Bench

Published by Qodo · Snapshot: May 22, 2026

Hit F1: 50.5%
Unified F1: 45.2%

GitLab Duo Code Review isn't on Qodo's leaderboard. Our internal evaluation scored 50.5% Hit F1, which would put us in the top 3 of the 9 tools currently listed there.

Methodology

Static dataset of real merge requests from production open-source repositories, evaluated against human-annotated ground truth at file and line granularity. We ran our own evaluation against this dataset using the published methodology; Qodo did not score our results.

Scope

100 MRs · 580 annotated issues · 7 languages

Repos

prefect, dify, firefox-ios, ghost, cal.com, aspnetcore, tauri, redis

Independently scored · Open source

Martian Code Review Benchmark

Published by Martian · Snapshot: Jun 3, 2026

F1: 48.3%
Precision: 45.2%
Recall: 51.8%

Rank #5 of 20 tools tested.

Methodology

Static dataset of MRs from established open-source projects, evaluated against expert-verified golden comments tagged by category (bugs, security, performance, style) and severity. Martian's team runs every tool through the evaluation and publishes the leaderboard.

Scope

50 MRs · 5 languages

Repos

Sentry, Grafana, Cal.com, Discourse, Keycloak

Understanding the metric

What is F1 score?

When a code review tool scans an MR, it can fail in two different directions. F1 captures both, and punishes a tool equally for either failure. The numbers below are from the Martian Code Review Benchmark.

Precision

45.2% of GitLab Duo Code Review's comments are real issues

Of the comments GitLab Duo Code Review leaves, roughly this share are legitimate catches. The rest are false alarms: flagging something that isn't a problem.

Low precision = crying wolf. Reviewers learn to ignore the tool.

Recall

51.8% of real issues are caught by GitLab Duo Code Review

Of the real problems in an MR, GitLab Duo Code Review surfaces roughly this share. The remainder slip through unnoticed: false negatives that reach production unless a human reviewer catches them.

Low recall = false confidence. The tool says nothing, bugs ship.

F1 score

48.3%: harmonic mean of precision and recall

F1 combines both into a single number using the harmonic mean, so a tool with great precision but terrible recall still scores low. You can't game it by optimising one at the expense of the other.

F1 = 2 × P × R / (P + R)
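As a quick check, the formula can be applied to the Martian precision and recall quoted above. This is a minimal sketch; the function name `f1_score` is illustrative and not part of any benchmark tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Martian Code Review Benchmark figures from this page: P = 45.2%, R = 51.8%.
martian_f1 = f1_score(0.452, 0.518)
print(f"{martian_f1:.1%}")  # prints 48.3%

# A lopsided tool (made-up numbers): excellent precision, terrible recall.
lopsided_f1 = f1_score(0.95, 0.05)
print(f"{lopsided_f1:.1%}")  # prints 9.5%
```

The second call shows why the harmonic mean is used: the arithmetic mean of 95% and 5% would be 50%, while F1 stays under 10%.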

Why not just maximise precision?

A tool could reach near-perfect precision by commenting only when it is certain, and in the limit by leaving no comments at all, so that it never raises a false alarm. That's useless: it would miss almost every real issue, so its recall would be close to zero. F1 collapses to zero if either number is zero, so silence is never a winning strategy.
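The silent-tool case can be sketched from raw counts. The helper names and the example counts below are hypothetical. Writing F1 as 2·TP / (2·TP + FP + FN) is algebraically the same as the harmonic-mean formula, and it avoids dividing by zero when a tool leaves no comments.

```python
from typing import Optional


def precision(true_pos: int, false_pos: int) -> Optional[float]:
    """Share of comments that are real issues; None if no comments were left."""
    comments = true_pos + false_pos
    return true_pos / comments if comments else None


def recall(true_pos: int, false_neg: int) -> Optional[float]:
    """Share of real issues that were caught; None if there were no issues."""
    issues = true_pos + false_neg
    return true_pos / issues if issues else None


def f1_from_counts(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 computed directly from counts; 0.0 when nothing was found or expected."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


# Hypothetical dataset with 10 real issues.
silent_tool = f1_from_counts(true_pos=0, false_pos=0, false_neg=10)
cautious_tool = f1_from_counts(true_pos=1, false_pos=0, false_neg=9)

print(silent_tool)                # prints 0.0
print(precision(1, 0))            # prints 1.0
print(f"{cautious_tool:.3f}")     # prints 0.182
```

The cautious tool has perfect precision on its single comment, yet its F1 stays low because nine of ten issues go unreported.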

Two benchmarks, two evaluation methods

Martian uses semantic matching: does the tool's output describe the same issue an expert flagged? PR-Review-Bench uses hit matching: does the comment land on the right file and line? (It also reports a stricter unified-match score that adds LLM semantic verification.) Different lenses on the same underlying question: is the tool right when it speaks up?
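To make hit matching concrete, here is a minimal sketch of location-based scoring. It assumes a comment counts as a hit when it lands in the same file and within the annotated line range. The exact matching rules in PR-Review-Bench may differ, and all class names, function names, and file paths here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AnnotatedIssue:
    path: str
    start_line: int
    end_line: int


@dataclass(frozen=True)
class ReviewComment:
    path: str
    line: int


def is_hit(comment: ReviewComment, issue: AnnotatedIssue) -> bool:
    """Assumed rule: same file, and the comment line falls inside the issue range."""
    return comment.path == issue.path and issue.start_line <= comment.line <= issue.end_line


def score_hits(comments: List[ReviewComment], issues: List[AnnotatedIssue]) -> Tuple[int, int, int]:
    """Return (hit comments, false-alarm comments, missed issues)."""
    hit_comments = sum(1 for c in comments if any(is_hit(c, i) for i in issues))
    missed_issues = sum(1 for i in issues if not any(is_hit(c, i) for c in comments))
    return hit_comments, len(comments) - hit_comments, missed_issues


issues = [AnnotatedIssue("app/models.py", 40, 44), AnnotatedIssue("app/views.py", 10, 10)]
comments = [ReviewComment("app/models.py", 42), ReviewComment("app/utils.py", 7)]

print(score_hits(comments, issues))  # prints (1, 1, 1)
```

Semantic matching replaces the `is_hit` location test with a judgment about whether the comment describes the same underlying issue, which is why the two benchmarks can score the same comment differently.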

Can you measure F1 on your own MRs?

Not directly. Measuring F1 needs ground truth: every real issue that should have been caught on every MR. That's what these benchmarks provide for their respective datasets. What you can do is steer the review focus with custom review instructions, tuning it to your team's priorities while we improve the baseline for everyone.

How it works

How Martian scores tools

Real MRs, expert-verified comments

50 MRs · 5 repos · 5 languages · 20 tools scored

50 merge requests from 5 production open-source projects across 5 programming languages. Each MR has expert-curated golden comments identifying real issues a thorough code review should catch. Those golden comments are the ground truth every tool is scored against.

sentry · grafana · cal.com · discourse · keycloak

Severity-tagged golden comments

Each golden comment carries a severity label. A tool's score reflects whether it catches the issues the experts thought mattered, not just whether it surfaces anything at all.

Low · Medium · High · Critical

Semantic matching, not location matching

Comments are scored on what they describe, not where they land. Martian compares each tool's output to the golden comments semantically: does the tool flag the same underlying issue an expert flagged? Substantive observations get credit; vague position-based hits do not.

golden comments · semantic match

Publicly ranked

Tool authors run their tool against the public dataset and submit the outputs. Martian then runs their scoring pipeline (LLM judges checking each output against the expert-curated golden comments) and publishes the rank order. Tool authors don't score themselves, even though they generate the raw outputs.

scored by Martian · public leaderboard

Model independence

The model doesn't matter much.

We measured GitLab Duo Code Review across five AI models from two providers (Anthropic and OpenAI) on the PR-Review-Bench dataset. Hit F1 ranges from 49.9% to 52.0%. That 2.1-point spread is about the size of normal run-to-run noise. Whichever model is hot next year, the quality you get should be roughly the same.

Hit F1 by model

Across five models from two different providers, Hit F1 ranges from 49.9% to 52.0%, a spread of just 2.1 points. Normal variation between repeats of the same model is ~1 to 2 F1 points, so the cross-model spread is roughly the size of the run-to-run noise floor. Swapping the underlying LLM does not move the needle materially. The intelligence is in the pipeline.

What the data tells us

What stood out.

The pipeline does the heavy lifting

GitLab Duo Code Review's quality comes from a multi-stage review pipeline that processes the MR before any language model sees it. Prescan, context selection, deterministic checks, then synthesis. The LLM is one stage of many. That's why five different AI models from two providers all land within ~2 F1 points of each other on the PR-Review-Bench dataset. The quality you get today holds tomorrow, no matter which model is making headlines.

2.1 pts F1 spread across 5 models from 2 providers

Big MR, small MR. Same review depth.

A common question: do large MRs get the same review quality as small ones? We tested across the full size range on the PR-Review-Bench dataset, from small fixes of a few files up to 360-file refactors with 2,500+ lines changed. We found no meaningful correlation between MR size and how many real issues the reviewer caught. Whether you're shipping a one-line fix or a major rewrite, GitLab Duo Code Review applies the same review depth.

r ≈ 0
correlation between MR size and issues found
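For readers who want to run a similar check on their own review data, a correlation coefficient can be computed as sketched below. This assumes a Pearson correlation, which the page does not specify. The per-MR numbers are made up purely for illustration and are not benchmark data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a sample has zero variance")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical per-MR data: files changed vs. share of annotated issues caught.
files_changed = [2, 5, 12, 40, 120, 360]
share_caught = [0.5, 0.6, 0.4, 0.6, 0.4, 0.5]

print(f"r = {pearson_r(files_changed, share_caught):.2f}")
```

A value near 0 means size tells you little about how many issues are caught; values near +1 or -1 would indicate that review quality rises or falls with MR size.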

Review every MR for $0.25.

Flat-rate pricing on the GitLab Free tier. To see it on your own codebase, assign @GitLabDuo to your next MR or use /request_review. The review begins immediately.