Benchmark

Your production failures become your regression suite.

Every confirmed issue compiles into a permanent test. The suite grows with your product and runs against every PR.

01 · Grows automatically

Every accepted issue is a test tomorrow

Accept an issue in Triage → write a rubric → the test ships with your next benchmark run. No eval-writing sprint required.

Issue (Triage) → Rubric (Define) → Test (Benchmark)
Tests over 12 weeks: Wk 1 · 3 tests → Wk 12 · 23 tests (+20 in 12 weeks)
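To make the issue → rubric → test flow concrete, here is a minimal Python sketch. It is not benchmax's API: the `Issue` and `RegressionTest` types, the `compile_tests` helper, and the sample issues are all hypothetical, assumed only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    issue_id: str
    summary: str
    accepted: bool  # hypothetical flag set during triage


@dataclass(frozen=True)
class RegressionTest:
    test_id: str
    source_issue: str
    rubric: str


def compile_tests(issues, rubrics, suite):
    """Return a new suite with one test per accepted issue that has a rubric.

    Issues already covered by the suite are skipped, so the suite only grows.
    """
    covered = {t.source_issue for t in suite}
    grown = list(suite)
    for issue in issues:
        rubric = rubrics.get(issue.issue_id)
        if issue.accepted and rubric and issue.issue_id not in covered:
            test_id = f"t-{len(grown) + 1}"  # illustrative numbering only
            grown.append(RegressionTest(test_id, issue.issue_id, rubric))
    return grown


# Hypothetical sample data: one accepted issue with a rubric, one rejected
# issue, and one accepted issue still waiting for its rubric.
issues = [
    Issue("i-1", "Refund answer omitted the page number", accepted=True),
    Issue("i-2", "Duplicate report", accepted=False),
    Issue("i-3", "Onboarding video hit a dead state", accepted=True),
]
rubrics = {"i-1": "Answer cites the page number for the refund policy."}

suite = compile_tests(issues, rubrics, suite=[])
print([t.source_issue for t in suite])
```

In this sketch only `i-1` becomes a test: `i-2` was never accepted and `i-3` has no rubric yet. Running `compile_tests` again on the same inputs adds nothing.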

02 · Multi-modal

Text, images, videos, documents — all gradable

Your agent sees more than text. Your benchmark grades more than text. Test cases include every modality your product supports.

Test suite · modalities · 18 total
t-119 · text · Billing refund cites page number
t-117 · image · Settings screenshot: name visible toggles
t-114 · video · Onboarding flow: no dead states
t-111 · doc · Legal PDF: cite page for every claim
+ 14 more tests across text, images, video, documents
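One way to picture a suite that mixes modalities is a test record that carries its modality next to its rubric. The sketch below uses the four tests named above. The `Modality` enum, the `BenchCase` type, and `count_by_modality` are assumptions for illustration, not benchmax's data model.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    DOC = "doc"


@dataclass(frozen=True)
class BenchCase:
    test_id: str
    modality: Modality
    rubric: str


# The four tests named in the suite listing; the other 14 are not shown there.
cases = [
    BenchCase("t-119", Modality.TEXT, "Billing refund cites page number"),
    BenchCase("t-117", Modality.IMAGE, "Settings screenshot: name visible toggles"),
    BenchCase("t-114", Modality.VIDEO, "Onboarding flow: no dead states"),
    BenchCase("t-111", Modality.DOC, "Legal PDF: cite page for every claim"),
]


def count_by_modality(cases):
    """Count test cases per modality label."""
    return Counter(case.modality.value for case in cases)


print(count_by_modality(cases))
```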

03 · Runs on every PR

CI-grade regression gate

Wire benchmax into GitHub Actions or your existing CI. Every prompt change, every model swap — gated by your benchmark.

benchmax / PR #842 · checks
lint · passed in 12s
typecheck · passed in 28s
unit-tests · 148 passed, 0 failed
benchmax / regression-suite · 15 passed · 3 failing
3 failing · t-119, t-114, t-102 · blocks merge
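A regression gate comes down to one rule: if any benchmark test fails, the check returns a non-zero exit code and the merge is blocked. The sketch below shows that rule in plain Python. The `gate` function and the results mapping are hypothetical, and it does not call benchmax or GitHub Actions.

```python
def gate(results):
    """Return (exit_code, failing_ids) for a mapping of test id -> passed."""
    failing = sorted(tid for tid, passed in results.items() if not passed)
    exit_code = 1 if failing else 0  # non-zero fails the CI check
    return exit_code, failing


# A subset of the suite for PR #842: the three failing tests named above,
# plus two of the tests assumed to be among the 15 that passed.
results = {
    "t-119": False,
    "t-117": True,
    "t-114": False,
    "t-111": True,
    "t-102": False,
}

exit_code, failing = gate(results)
print(exit_code, failing)
```

In a CI step, a wrapper script would pass `exit_code` to `sys.exit` so the pipeline marks the check as failed.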

04 · Run history

See when quality moved — and why

Every run tagged with the PR, model, and prompt version. When something regresses, you can point to exactly when.

Runs · last 8 · prod · main
#842 · claude-opus-4-7 · -3
#841 · claude-opus-4-7 · 0
#840 · claude-opus-4-7 · +1
#839 · claude-opus-4-6 · 0
#838 · claude-opus-4-6 · +2
#837 · claude-opus-4-6 · -1 · regression
#836 · claude-opus-4-6 · 0
#835 · claude-opus-4-6 · +1
Click any run to see exact prompt diff and per-test outcomes
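Tagged runs make "when did quality move" a simple query. The sketch below loads the eight runs listed above and reads each trailing number as a signed change against the previous run, which is an assumption about what the list shows. The `Run` type and both helpers are hypothetical, and prompt versions are left out because the list does not show them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    pr: int
    model: str
    delta: int  # assumed: signed change versus the previous run


# Oldest first, copied from the run list above.
runs = [
    Run(835, "claude-opus-4-6", +1),
    Run(836, "claude-opus-4-6", 0),
    Run(837, "claude-opus-4-6", -1),
    Run(838, "claude-opus-4-6", +2),
    Run(839, "claude-opus-4-6", 0),
    Run(840, "claude-opus-4-7", +1),
    Run(841, "claude-opus-4-7", 0),
    Run(842, "claude-opus-4-7", -3),
]


def negative_runs(runs):
    """PR numbers of runs whose delta is negative."""
    return [run.pr for run in runs if run.delta < 0]


def model_changes(runs):
    """(pr, old_model, new_model) for each run where the model tag changed."""
    return [
        (cur.pr, prev.model, cur.model)
        for prev, cur in zip(runs, runs[1:])
        if prev.model != cur.model
    ]


print(negative_runs(runs))
print(model_changes(runs))
```

Here the negative runs are #837 and #842, and the only model change lands at #840. That makes it easy to see that the drop at #842 came two runs after the model swap, not with it.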

Build a benchmark that catches what matters.