How we measure GitLab Duo Code Review.
One independently scored. One internally scored based on a public methodology. Numbers converge across both.
#Two public benchmarks. Same result.
GitLab Duo Code Review has been measured on two public benchmarks. One is independently scored by the benchmark's publisher (Martian). The other is our own internal evaluation based on Qodo's published methodology. Different repos, different languages. Both report F1 in the same range.
PR-Review-Bench
Published by Qodo · Snapshot: May 22, 2026
GitLab Duo Code Review isn't on Qodo's leaderboard. Our internal evaluation scored 50.5% Hit F1, which would put us in the top 3 of the 9 tools currently listed there.
Static dataset of real merge requests from production open-source repositories, evaluated against human-annotated ground truth at file and line granularity. We ran our own evaluation against this dataset using the published methodology; Qodo did not score our results.
100 MRs · 580 annotated issues · 7 languages
prefect, dify, firefox-ios, ghost, cal.com, aspnetcore, tauri, redis
Martian Code Review Benchmark
Published by Martian · Snapshot: Jun 3, 2026
Rank #5 of 20 tools tested.
Static dataset of MRs from established open-source projects, evaluated against expert-verified golden comments tagged by category (bugs, security, performance, style) and severity. Martian's team scores every tool's output with the same evaluation pipeline and publishes the leaderboard.
50 MRs · 5 languages
Sentry, Grafana, Cal.com, Discourse, Keycloak
#What is F1 score?
When a code review tool scans an MR, it can fail in two different directions. F1 captures both, and punishes a tool equally for either failure. The numbers below are from the Martian Code Review Benchmark.
Precision
Of the comments GitLab Duo Code Review leaves, roughly this share are legitimate catches. The rest are false alarms: flagging something that isn't a problem.
Recall
Of the real problems in an MR, GitLab Duo Code Review surfaces roughly this share. The remainder slip through unnoticed: false negatives that reach production unless a human reviewer catches them.
F1 score
F1 combines both into a single number using the harmonic mean, so a tool with great precision but terrible recall still scores low. You can't game it by optimising one at the expense of the other.
Why not just maximise precision?
A tool could claim perfect precision by leaving zero comments: it is never wrong, because it never says anything. That's useless. A silent tool catches nothing, so its recall is zero, and F1 collapses to zero if either number is zero. Silence is never a winning strategy.
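As a rough illustration of the harmonic mean described above, here is a minimal Python sketch. The function name `f1_score` and the sample values are illustrative assumptions, not benchmark results.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        # Both inputs are zero: define F1 as 0.0 instead of dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (assumed, not taken from either benchmark).
print(f1_score(1.0, 0.0))            # a tool that is never wrong but catches nothing
print(round(f1_score(0.9, 0.1), 2))  # great precision, terrible recall
print(round(f1_score(0.5, 0.5), 2))  # balanced precision and recall
```

The first call returns 0.0 and the second stays far below the 0.9 precision, which is the behaviour that makes F1 hard to game from one side.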
Two benchmarks, two evaluation methods
Martian uses semantic matching: does the tool's output describe the same issue an expert flagged? PR-Review-Bench uses hit matching: does the comment land on the right file and line? (It also reports a stricter unified-match score that adds LLM semantic verification.) Different lenses on the same underlying question: is the tool right when it speaks up?
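To make hit matching concrete, here is a minimal Python sketch of location-based scoring. It is an illustration under stated assumptions, not Qodo's actual scoring code: the `Finding` type, the `count_hits` function, the `line_tolerance` parameter, and the greedy one-to-one matching rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str  # file the comment or annotated issue refers to
    line: int  # line number within that file


def count_hits(tool_comments, ground_truth, line_tolerance=0):
    """Return (true_positives, false_positives, false_negatives).

    Assumption: each annotated issue can be matched by at most one comment,
    and a match means same file and a line within `line_tolerance`.
    """
    unmatched = list(ground_truth)
    true_positives = 0
    for comment in tool_comments:
        for issue in unmatched:
            same_file = comment.path == issue.path
            close_enough = abs(comment.line - issue.line) <= line_tolerance
            if same_file and close_enough:
                unmatched.remove(issue)  # consume the issue so it is not double-counted
                true_positives += 1
                break
    false_positives = len(tool_comments) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives


# Made-up example data, not drawn from either benchmark.
truth = [Finding("app/models.py", 42), Finding("app/views.py", 10)]
comments = [Finding("app/models.py", 42), Finding("app/utils.py", 7)]
print(count_hits(comments, truth))  # one hit, one false alarm, one missed issue
```

Precision is then hits divided by comments left, and recall is hits divided by annotated issues. Semantic matching would replace the file-and-line test with a judgment about whether the comment describes the same issue.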
Can you measure F1 on your own MRs?
Not directly. Measuring F1 needs ground truth: every real issue that should have been caught on every MR. That's what these benchmarks provide for their respective datasets. What you can do is steer the review focus with custom review instructions, tuning it to your team's priorities while we improve the baseline for everyone.
#How Martian scores tools
Real MRs, expert-verified comments
50 merge requests from 5 production open-source projects across 5 programming languages. Each MR has expert-curated golden comments identifying real issues a thorough code review should catch. Those golden comments are the ground truth every tool is scored against.
Severity-tagged golden comments
Each golden comment carries a severity label. A tool's score reflects whether it catches the issues the experts thought mattered, not just whether it surfaces anything at all.
Semantic matching, not location matching
Comments are scored on what they describe, not where they land. Martian compares each tool's output to the golden comments semantically: does the tool flag the same underlying issue an expert flagged? Substantive observations get credit; vague position-based hits do not.
Publicly ranked
Tool authors run their tool against the public dataset and submit the outputs. Martian then runs its scoring pipeline (LLM judges checking each output against the expert-curated golden comments) and publishes the rank order. Tool authors don't score themselves, even though they generate the raw outputs.
#The model doesn't matter much.
We measured GitLab Duo Code Review across five AI models from two providers (Anthropic and OpenAI) on the PR-Review-Bench dataset. Hit F1 ranges from 49.9% to 52.0%. That 2.1-point spread is about the size of normal run-to-run noise. Whichever model is hot next year, the quality you get should be roughly the same.
Hit F1 by model
Across five models from two different providers, Hit F1 ranges from 49.9% to 52.0%, a spread of just 2.1 points. Normal variation between repeats of the same model is ~1 to 2 F1 points, so the cross-model spread is about the same size as the run-to-run noise. Swapping the underlying LLM does not move the needle materially. The intelligence is in the pipeline.
#What stood out.
The pipeline does the heavy lifting
GitLab Duo Code Review's quality comes from a multi-stage review pipeline that processes the MR before any language model sees it. Prescan, context selection, deterministic checks, then synthesis. The LLM is one stage of many. That's why five different AI models from two providers all land within ~2 F1 points of each other on the PR-Review-Bench dataset. The quality you get today holds tomorrow, no matter which model is making headlines.
Big MR, small MR. Same review depth.
A common question: do large MRs get the same review quality as small ones? We tested across the full size range on the PR-Review-Bench dataset, from small fixes of a few files up to 360-file refactors with 2,500+ lines changed. We found no correlation between MR size and how many real issues the reviewer caught. Whether you're shipping a one-line fix or a major rewrite, GitLab Duo Code Review applies the same review depth.
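One way to check a claim like this on any labelled dataset is a correlation coefficient between MR size and per-MR recall. The sketch below is a plain-Python Pearson correlation. It is not the analysis we ran, and the sample numbers are made up purely to show the calculation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / sqrt(var_x * var_y)


# Hypothetical toy data: files changed per MR and share of real issues caught.
files_changed = [1, 2, 3, 4]
recall_per_mr = [0.5, 0.7, 0.7, 0.5]
print(pearson(files_changed, recall_per_mr))  # close to zero for this toy data
```

A coefficient near zero means size does not linearly predict how many issues get caught. A rank-based measure such as Spearman's would be a reasonable alternative when MR sizes are heavily skewed.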
Review every MR for $0.25.
Flat-rate pricing on the GitLab Free tier. To see it on your own codebase, assign @GitLabDuo to your next MR or use /request_review. The review begins immediately.