Custom Graders

Encode your taste in your benchmark.

Write the test in plain English. benchmax compiles it into an LLM-judge tuned to your rubric — not generic heuristics.

01·Plain-English rubrics

If you can say what's wrong, you can write the test

No eval framework, no prompt chains to maintain. Write the expected output the way you'd describe it to a teammate.

Expected output · Turn 5

compiling

Name only the toggles actually visible in settings.png — Daily digest, Weekly summary, and Product updates. Tell the user that Daily digest is still off, and cite page 4 of the Notifications guide for how to re-enable it.

references image + document

02·Compiled grader

Your words become an LLM-judge tuned to your domain

Every rubric compiles to a grader we maintain. It checks the model against your criteria, not generic "helpfulness" heuristics.

Your rubric

Only name toggles visible in settings.png. Confirm Daily digest is off. Cite page 4 of the Notifications guide.

compile

LLM judge · grader_009

toggles_named_match_image()

daily_digest_confirmed_off()

cites_page_number()

required: 4

no_hallucinated_toggles()

3 / 4 passed · grade: FAIL
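As a rough illustration of how per-criterion checks could roll up into the single grade shown above, here is a minimal Python sketch. It is not benchmax's actual grader or API: `CheckResult` and `summarize` are hypothetical names, the all-checks-must-pass rule is an assumption inferred from "3 / 4 passed · grade: FAIL", and the failing check is chosen arbitrarily for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one rubric criterion (hypothetical structure)."""
    name: str
    passed: bool


def summarize(results: list[CheckResult]) -> str:
    """Assumed all-or-nothing rule: any failed check fails the whole test."""
    passed = sum(r.passed for r in results)
    grade = "PASS" if passed == len(results) else "FAIL"
    return f"{passed} / {len(results)} passed · grade: {grade}"


# Check names come from the example above; pass/fail values are illustrative.
results = [
    CheckResult("toggles_named_match_image", True),
    CheckResult("daily_digest_confirmed_off", True),
    CheckResult("cites_page_number", False),  # failure chosen arbitrarily
    CheckResult("no_hallucinated_toggles", True),
]

print(summarize(results))
```

Under this assumption a single missed criterion is enough to fail the test, which matches the FAIL grade in the example.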

03·Multi-turn context

Grade on the turn that matters

Agents don't fail in isolation. Pick which turn the grader evaluates, with full conversation history as context.

Evaluate turn

Turn 5

I turned off notifications but I'm still getting the digest

tool · Let me check your settings.

[attached settings.png]

tool · Analyzing the screenshot...

T5 · EVAL

All your notifications are turned off in settings.

Grader sees turns 1–5 as context. Evaluates Turn 5 only.
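The split between "context" and "evaluated turn" can be sketched in a few lines of Python. This is only an illustration of the idea, not benchmax's implementation: `Turn` and `select_for_grading` are hypothetical names, and the roles assigned to turns 3 and 5 are assumptions, since the example conversation does not label them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    number: int
    role: str  # roles for turns 3 and 5 are assumed for this sketch
    text: str


def select_for_grading(conversation: list[Turn], eval_turn: int) -> tuple[list[Turn], Turn]:
    """Return (context, target): history up to and including eval_turn,
    plus the single turn the grader is asked to judge."""
    context = [t for t in conversation if t.number <= eval_turn]
    targets = [t for t in context if t.number == eval_turn]
    if not targets:
        raise ValueError(f"no turn {eval_turn} in conversation")
    return context, targets[0]


conversation = [
    Turn(1, "user", "I turned off notifications but still getting the digest"),
    Turn(2, "tool", "Let me check your settings."),
    Turn(3, "user", "[attached settings.png]"),
    Turn(4, "tool", "Analyzing the screenshot..."),
    Turn(5, "assistant", "All your notifications are turned off in settings."),
]

context, target = select_for_grading(conversation, eval_turn=5)
print(len(context), "turns of context; grading turn", target.number)
```

Passing the whole history while judging only one turn lets the grader notice that the turn 5 reply contradicts what the user reported in turn 1.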

04·Severity & category

Not every failure is a release blocker

Tag each test with severity and category. Your benchmark tells you what's broken AND what needs to gate the release.

Tag this test

Severity

Blocker · High · Medium · Low
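One way to picture how severity tags separate "what's broken" from "what gates the release" is the Python sketch below. It is an assumption-laden illustration rather than benchmax's behavior: `Severity`, `TaggedResult`, and `triage` are hypothetical names, the test names and categories are invented for the example, and the default rule that only Blocker failures gate a release is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """The four levels from the tag picker, ordered so they can be compared."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    BLOCKER = 4


@dataclass(frozen=True)
class TaggedResult:
    name: str          # hypothetical test name
    severity: Severity
    category: str      # hypothetical category label
    passed: bool


def triage(results: list[TaggedResult], gate: Severity = Severity.BLOCKER):
    """Return (broken, gating): every failed test, and the subset whose
    severity is at or above the gate threshold."""
    broken = [r for r in results if not r.passed]
    gating = [r for r in broken if r.severity >= gate]
    return broken, gating


results = [
    TaggedResult("digest_status", Severity.BLOCKER, "grounding", passed=False),
    TaggedResult("tone_check", Severity.LOW, "style", passed=False),
    TaggedResult("cites_guide", Severity.HIGH, "citations", passed=True),
]

broken, gating = triage(results)
print("broken:", [r.name for r in broken])
print("gates release:", [r.name for r in gating])
```

Lowering `gate` to `Severity.HIGH` would make High failures block the release as well, so the threshold is a policy choice rather than something fixed by the tags.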

Write your first rubric in plain English.