Coding Agent

Hand failing tests to Claude Code.
Get fixes back.

benchmax wires your benchmark into Claude Code, Cursor, or any coding agent. Each failing test becomes a proposed fix, and every fix is re-run against the full suite before a PR opens.

01·Context-aware

The agent sees what the reviewer saw

Claude Code gets the failing test, the rubric, the expected output, and the production trace. It fixes the actual problem instead of guessing from the prompt alone.

claude-code · fix-t-119 · ● reading

$ claude-code fix t-119
→ pulling context for test t-119
✓ test         tests/t-119-billing-hallucination.json   2.1 KB
✓ rubric       rubrics/billing-cite-page.md             412 B
✓ prompt       src/agents/billing/prompt.ts             8.4 KB
✓ trace        traces/karri-2:14pm.json                 14.2 KB
✓ attachments  settings.png, notifications.pdf
→ context loaded · ready to propose fix
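As a rough illustration of the bundle listed above, the context for one failing test could be modelled as a small typed record. This is a minimal sketch: the `FixContext` interface, its field names, and the `missingContext` helper are assumptions made for this example, not benchmax's actual schema or API.

```typescript
// Hypothetical shape of the context handed to the agent for one failing test.
// Field names are illustrative assumptions, not benchmax's real schema.
interface FixContext {
  testId: string;
  testPath: string;
  rubricPath: string;
  promptPath: string;
  tracePath: string;
  attachments: string[];
}

// Returns the names of required path fields that are empty, so a caller
// could refuse to propose a fix from incomplete context.
function missingContext(ctx: FixContext): string[] {
  const required: Array<keyof FixContext> = [
    "testPath",
    "rubricPath",
    "promptPath",
    "tracePath",
  ];
  return required.filter((key) => {
    const value = ctx[key];
    return typeof value === "string" && value.trim() === "";
  });
}

// Values copied from the terminal listing above.
const t119: FixContext = {
  testId: "t-119",
  testPath: "tests/t-119-billing-hallucination.json",
  rubricPath: "rubrics/billing-cite-page.md",
  promptPath: "src/agents/billing/prompt.ts",
  tracePath: "traces/karri-2:14pm.json",
  attachments: ["settings.png", "notifications.pdf"],
};

console.log(missingContext(t119)); // []
```

Attachments are treated as optional here, since not every failing test is likely to come with a screenshot or a PDF.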

02·Proposes a fix

A diff, not a summary

The agent writes the actual change — usually a prompt edit, sometimes a tool-call refinement or a guardrail. You see the diff before anything runs.

src/agents/billing/prompt.ts · +4 −1 · fixes t-119

42   You are a billing support agent.
43   Answer questions about refunds, subscriptions, and billing.
45 - When citing policy, reference the policy document.
46 + When citing policy, include the specific page number
47 + from the source PDF. For example:
48 + "See page 4 of the Notifications guide."
49 + Never cite a policy without a page number.
51   If the user provides a screenshot, analyze only
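When the fix is a guardrail and not a prompt edit, the rule added in the diff above could also be enforced in code. The sketch below is one assumed way to do that. `violatesCitationRule` and its keyword and page patterns are hypothetical; they are not taken from the billing-cite-page.md rubric.

```typescript
// Hypothetical guardrail: flag answers that mention a policy or guide
// without citing a page number. Both patterns are illustrative assumptions.
const POLICY_MENTION = /\b(policy|guide)\b/i;
const PAGE_CITATION = /\bpage\s+\d+\b/i;

function violatesCitationRule(answer: string): boolean {
  return POLICY_MENTION.test(answer) && !PAGE_CITATION.test(answer);
}

console.log(violatesCitationRule("Per our refund policy, refunds take 5 days.")); // true
console.log(violatesCitationRule("See page 4 of the Notifications guide."));      // false
```

A keyword check like this is deliberately crude. It only shows where such a guardrail would sit, and a real grader would likely need something closer to the rubric's wording.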

03·Re-runs the benchmark

Passes before it ships

Before opening a PR, the agent runs the full benchmark against the new prompt. If any other test regresses, no PR is opened. You only see fixes that don't break anything else.

Re-run · benchmark · ● complete

t-087 · Tone matches brand voice · pass
t-098 · No hallucinated toggles in screenshots · pass
t-102 · Refund answer within 1 sentence · pass
t-106 · Video onboarding — no dead states · pass
t-111 · Legal PDF cites page numbers · pass
t-114 · Onboarding video grader passes · pass
t-117 · Settings screenshot naming · pass
t-119 · Billing refund cites page number · red → green
t-122 · Over-refusal on pricing questions · pass
t-124 · Password reset — no loop · pass

18 / 18 passing · +1 fix · 0 regressions
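The "no regressions, or no PR" rule can be sketched as a comparison between the suite results before and after the change. This is a minimal illustration under assumed names: `SuiteResult`, `GateDecision`, and `gate` are hypothetical, and benchmax's real result format may differ.

```typescript
type Status = "pass" | "fail";
type SuiteResult = Record<string, Status>;

interface GateDecision {
  fixed: string[];       // tests that went red → green
  regressions: string[]; // tests that went green → red
  openPr: boolean;
}

// Hypothetical gate: open a PR only if the target test now passes
// and nothing that passed before has started failing.
function gate(before: SuiteResult, after: SuiteResult, targetId: string): GateDecision {
  const fixed: string[] = [];
  const regressions: string[] = [];
  for (const id of Object.keys(before)) {
    // A test missing from the re-run is treated as failing.
    const now: Status = after[id] === "pass" ? "pass" : "fail";
    if (before[id] === "fail" && now === "pass") fixed.push(id);
    if (before[id] === "pass" && now === "fail") regressions.push(id);
  }
  const openPr = regressions.length === 0 && after[targetId] === "pass";
  return { fixed, regressions, openPr };
}

// A three-test excerpt of the run shown above.
const before: SuiteResult = { "t-117": "pass", "t-119": "fail", "t-122": "pass" };
const after: SuiteResult = { "t-117": "pass", "t-119": "pass", "t-122": "pass" };

const decision = gate(before, after, "t-119");
console.log(decision.openPr); // true
```

Counting a missing test as a failure is a conservative choice: a fix that silently drops a test from the run should not look like a clean pass.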

04·Opens the PR

A pull request your team can actually review

The PR includes the failing rubric, the diff, before/after test results, and links to the production traces that triggered the issue.

benchmax/core · #842

OPEN · Fix billing hallucination (t-119)

claude-code · opened 2m ago · all checks passing

Context

Production issue hv-824 (47 occurrences over 8 days). Billing answers reference policy documents without citing page numbers. Grader rubric: billing-cite-page.md.

Change

prompt.ts lines 45–49 — require page-number citations when referencing the Notifications guide or billing policy. Added example format.

Benchmark

18 / 18 passing · t-119 red → green · 0 regressions on the rest of the suite.

Production traces fixed

47 traces from hv-824 — karri, rohan, mei, jayden, sofia, + 42 more.
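A PR description like the one above could be assembled from a handful of structured fields. The sketch below assumes a hypothetical `PrSummary` record and `renderPrBody` function; neither is part of benchmax, and the section layout simply mirrors the example PR.

```typescript
// Hypothetical summary of one fix; field names are assumptions for this example.
interface PrSummary {
  issueId: string;
  occurrences: number;
  rubric: string;
  change: string;
  passing: number;
  total: number;
  fixedTest: string;
  regressions: number;
}

// Renders the Context / Change / Benchmark sections as plain text.
function renderPrBody(s: PrSummary): string {
  return [
    "Context",
    `Production issue ${s.issueId} (${s.occurrences} occurrences). Grader rubric: ${s.rubric}.`,
    "",
    "Change",
    s.change,
    "",
    "Benchmark",
    `${s.passing} / ${s.total} passing · ${s.fixedTest} red → green · ${s.regressions} regressions.`,
  ].join("\n");
}

const body = renderPrBody({
  issueId: "hv-824",
  occurrences: 47,
  rubric: "billing-cite-page.md",
  change: "Require page-number citations when referencing billing policy.",
  passing: 18,
  total: 18,
  fixedTest: "t-119",
  regressions: 0,
});

console.log(body);
```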


Let your coding agent earn its keep.
