Benchmark
Your production failures become your regression suite.
Every confirmed issue compiles into a permanent test. The suite grows with your product and runs against every PR.
01·Grows automatically
Every accepted issue is a test tomorrow
Accept an issue in Triage → write a rubric → the test ships with your next benchmark run. No eval-writing sprint required.
Issue (Triage) → Rubric (Define) → Test (Benchmark)
Tests over 12 weeks
Wk 1: 3 tests → Wk 12: 23 tests (+20 in 12 weeks)
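One way to picture the Issue → Rubric → Test flow is the minimal Python sketch below. It is illustrative only: the class names, fields, and the `add_from_issue` method are hypothetical and are not benchmax's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Issue:
    """A production failure reviewed in Triage (hypothetical shape)."""
    issue_id: str
    summary: str
    accepted: bool = False


@dataclass
class BenchmarkTest:
    """A permanent test compiled from one accepted issue."""
    test_id: str
    source_issue: str
    rubric: List[str]


@dataclass
class Suite:
    tests: List[BenchmarkTest] = field(default_factory=list)

    def add_from_issue(self, issue: Issue, rubric: List[str], test_id: str) -> BenchmarkTest:
        # Only issues accepted in Triage may become tests.
        if not issue.accepted:
            raise ValueError("only accepted issues become tests")
        # A rubric with no criteria cannot grade anything.
        if not rubric:
            raise ValueError("a rubric needs at least one criterion")
        test = BenchmarkTest(test_id, issue.issue_id, list(rubric))
        self.tests.append(test)
        return test
```

In this sketch, each accepted issue adds exactly one test, so the suite size tracks the number of accepted issues over time.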
02·Multi-modal
Text, images, videos, documents — all gradable
Your agent sees more than text. Your benchmark grades more than text. Test cases include every modality your product supports.
Test suite · modalities · 18 total
t-119 · text · Billing refund cites page number
t-117 · image · Settings screenshot: name visible toggles
t-114 · video · Onboarding flow: no dead states
t-111 · doc · Legal PDF: cite page for every claim
+ 14 more tests across text, images, video, documents
03·Runs on every PR
CI-grade regression gate
Wire benchmax into GitHub Actions or your existing CI. Every prompt change, every model swap — gated by your benchmark.
benchmax / PR #842 · checks
lint · passed in 12s
typecheck · passed in 28s
unit-tests · 148 passed, 0 failed
benchmax / regression-suite · 15 passed · 3 failing
3 failing · t-119, t-114, t-102 · blocks merge
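The gating logic can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not benchmax's real integration: the `results` mapping (test ID to pass/fail) and the function names `failing_tests` and `gate` are hypothetical, and how results are actually produced or exported is not covered here.

```python
from typing import Dict, List


def failing_tests(results: Dict[str, bool]) -> List[str]:
    """Return the IDs of failed tests, sorted for stable output."""
    return sorted(test_id for test_id, passed in results.items() if not passed)


def gate(results: Dict[str, bool]) -> int:
    """Return a process exit code: 0 lets the PR merge, 1 blocks it."""
    failed = failing_tests(results)
    passed = len(results) - len(failed)
    print(f"{passed} passed, {len(failed)} failing")
    if failed:
        print("blocks merge: " + ", ".join(failed))
        return 1
    return 0
```

A CI step would typically end with `sys.exit(gate(results))`, since most CI systems, GitHub Actions included, treat a non-zero exit code as a failed check.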
04·Run history
See when quality moved — and why
Every run tagged with the PR, model, and prompt version. When something regresses, you can point to exactly when.
Runs · last 8 · prod · main
#842 · claude-opus-4-7 · -3
#841 · claude-opus-4-7 · 0
#840 · claude-opus-4-7 · +1
#839 · claude-opus-4-6 · 0
#838 · claude-opus-4-6 · +2
#837 · claude-opus-4-6 · -1 · regression
#836 · claude-opus-4-6 · 0
#835 · claude-opus-4-6 · +1
Click any run to see exact prompt diff and per-test outcomes
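To show how tagged runs help pinpoint a regression, here is a small Python sketch over the eight runs listed above. It assumes the signed number on each run is the change in passing tests relative to the previous run; the `Run` type and `regressions` function are hypothetical names, and prompt version is left out because the list above does not show it.

```python
from typing import List, NamedTuple, Tuple


class Run(NamedTuple):
    pr: int      # PR number the run is tagged with
    model: str   # model the run is tagged with
    delta: int   # assumed: change in passing tests vs. the previous run


# Oldest first, copied from the run list above.
RUNS = [
    Run(835, "claude-opus-4-6", 1),
    Run(836, "claude-opus-4-6", 0),
    Run(837, "claude-opus-4-6", -1),
    Run(838, "claude-opus-4-6", 2),
    Run(839, "claude-opus-4-6", 0),
    Run(840, "claude-opus-4-7", 1),
    Run(841, "claude-opus-4-7", 0),
    Run(842, "claude-opus-4-7", -3),
]


def regressions(runs: List[Run]) -> List[Tuple[Run, bool]]:
    """Pair each run with a negative delta with a flag: did the model change?

    The first run is skipped because it has no predecessor to compare with.
    """
    found = []
    for prev, cur in zip(runs, runs[1:]):
        if cur.delta < 0:
            found.append((cur, cur.model != prev.model))
    return found
```

On this data, the sketch flags #837 and #842, and in both cases the model matches the previous run, which points to something other than a model swap, such as a prompt change.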