Custom Graders
Encode your taste in your benchmark.
Write the test in plain English. benchmax compiles it into an LLM-judge tuned to your rubric — not generic heuristics.
01·Plain-English rubrics
If you can say what's wrong, you can write the test
No eval framework, no prompt chains to maintain. Write the expected output the way you'd describe it to a teammate.
Expected output · Turn 5
compiling
Name only the toggles actually visible in settings.png — Daily digest, Weekly summary, and Product updates. Tell the user that Daily digest is still off, and cite page 4 of the Notifications guide for how to re-enable it.
references image + document
02·Compiled grader
Your words become an LLM-judge tuned to your domain
Every rubric compiles to a grader we maintain. It checks the model against your criteria, not generic "helpfulness" heuristics.
Your rubric
Only name toggles visible in settings.png. Confirm Daily digest is off. Cite page 4 of the Notifications guide.
compile
LLM judge · grader_009
toggles_named_match_image()
daily_digest_confirmed_off()
cites_page_number()
required: 4
no_hallucinated_toggles()
3 / 4 passed · grade: FAIL
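One way to picture how the four checks roll up into a single grade is sketched below. This is not benchmax's actual implementation: the `CheckResult` type, the `summarize` helper, and the all-checks-must-pass rule are assumptions made for illustration, and which check fails is also assumed, since the card above only shows the total.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str     # check name as shown on the grader card
    passed: bool  # outcome the judge reported for this check


def summarize(results: list[CheckResult]) -> str:
    """Assumed rule: the grade is PASS only when every check passes."""
    passed = sum(r.passed for r in results)
    verdict = "PASS" if passed == len(results) else "FAIL"
    return f"{passed} / {len(results)} passed · grade: {verdict}"


# Hypothetical outcomes: the failing check is chosen for illustration only.
results = [
    CheckResult("toggles_named_match_image", True),
    CheckResult("daily_digest_confirmed_off", True),
    CheckResult("cites_page_number", False),  # required: 4
    CheckResult("no_hallucinated_toggles", True),
]

print(summarize(results))
```

Under this assumed rule a single failed criterion is enough to fail the test, which matches the 3 / 4 result shown on the card.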
03·Multi-turn context
Grade on the turn that matters
Agents don't fail in isolation. Pick which turn the grader evaluates, with full conversation history as context.
Evaluate turn
Turn 5
T1 · I turned off notifications but still getting the digest
T2 · tool · Let me check your settings.
T3 · [attached settings.png]
T4 · tool · Analyzing the screenshot...
T5 · EVAL · All your notifications are turned off in settings.
Grader sees turns 1–5 as context. Evaluates Turn 5 only.
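The split between "context" and "evaluated turn" can be sketched as follows. The `Turn` type and `select_for_grading` function are hypothetical names used only for illustration, not part of any documented benchmax API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int  # 1-based turn number (T1, T2, ...)
    text: str   # message content for that turn


def select_for_grading(conversation: list[Turn], eval_turn: int) -> tuple[list[Turn], Turn]:
    """Return (context, target): all turns up to eval_turn, and the turn to grade."""
    context = [t for t in conversation if t.index <= eval_turn]
    target = next(t for t in conversation if t.index == eval_turn)
    return context, target


conversation = [
    Turn(1, "I turned off notifications but still getting the digest"),
    Turn(2, "Let me check your settings."),
    Turn(3, "[attached settings.png]"),
    Turn(4, "Analyzing the screenshot..."),
    Turn(5, "All your notifications are turned off in settings."),
]

context, target = select_for_grading(conversation, eval_turn=5)
print(f"context turns: {len(context)}, grading turn: T{target.index}")
```

Later turns, if any existed, would be left out of the context in this sketch, so the judge only sees what the model had seen when it produced the evaluated turn.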
04·Severity & category
Not every failure is a release blocker
Tag each test with severity and category. Your benchmark tells you what's broken AND what needs to gate the release.
Tag this test
Severity
Blocker · High · Medium · Low
Category
Hallucination · Tool · Tone · Context loss · Over-refusal
t-119 · Billing refund cites page numbers
Blocker·Hallucination·Gates release
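A minimal sketch of how severity tags could drive a release gate is shown below. The `TaggedTest` type, the rule that only Blocker-severity tests gate the release, and the pass/fail outcomes are all assumptions made for illustration; the page itself only shows that t-119 is tagged Blocker, Hallucination, and gates the release.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKER = "Blocker"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class TaggedTest:
    test_id: str
    title: str
    severity: Severity
    category: str
    passed: bool  # hypothetical outcome of the latest benchmark run


def gates_release(test: TaggedTest) -> bool:
    """Assumed rule: only Blocker-severity tests gate the release."""
    return test.severity is Severity.BLOCKER


def release_blocked(tests: list[TaggedTest]) -> bool:
    """The release is blocked if any gating test is currently failing."""
    return any(gates_release(t) and not t.passed for t in tests)


t119 = TaggedTest("t-119", "Billing refund cites page numbers",
                  Severity.BLOCKER, "Hallucination", passed=False)
# A second, made-up test to show that a low-severity failure does not gate.
tone_check = TaggedTest("t-example", "Hypothetical tone check",
                        Severity.LOW, "Tone", passed=False)

print(release_blocked([t119, tone_check]))
print(release_blocked([tone_check]))
```

Keeping the gating rule in one small function makes it easy to change later, for example to also gate on High severity within a specific category.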