Coding Agent

Hand failing tests to Claude Code. Get fixes back.

benchmax wires your benchmark into Claude Code, Cursor, or any coding agent. Failing tests become fixes, and each fix is re-run against the full suite before a PR opens.

01·Context-aware

The agent sees what the reviewer saw

Claude Code gets the failing test, the rubric, the expected output, and the production trace. It fixes the actual problem rather than acting on a hunch about the prompt.

claude-code · fix-t-119 ● reading
$ claude-code fix t-119
pulling context for test t-119
test         tests/t-119-billing-hallucination.json   2.1 KB
rubric       rubrics/billing-cite-page.md             412 B
prompt       src/agents/billing/prompt.ts             8.4 KB
trace        traces/karri-2:14pm.json                 14.2 KB
attachments  settings.png, notifications.pdf
context loaded · ready to propose fix
$
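
As a rough illustration of the context bundle listed above, the sketch below models it as a TypeScript type with a completeness check. The `FixContext` interface, its field names, and `missingContext` are hypothetical, chosen to mirror the labels in the terminal output; they are not a documented benchmax schema.

```typescript
// Hypothetical shape of the context handed to the coding agent.
// Field names mirror the labels in the terminal output; they are assumptions.
interface FixContext {
  testId: string;
  test: string; // path to the failing test case
  rubric: string; // path to the grader rubric
  prompt: string; // path to the prompt under repair
  trace: string; // path to the production trace
  attachments: string[];
}

// Returns the names of required path fields that are empty, so a fix is
// never proposed from partial context.
function missingContext(ctx: FixContext): string[] {
  const required: (keyof FixContext)[] = ["test", "rubric", "prompt", "trace"];
  return required.filter((key) => {
    const value = ctx[key];
    return typeof value === "string" && value.trim() === "";
  });
}

const t119: FixContext = {
  testId: "t-119",
  test: "tests/t-119-billing-hallucination.json",
  rubric: "rubrics/billing-cite-page.md",
  prompt: "src/agents/billing/prompt.ts",
  trace: "traces/karri-2:14pm.json",
  attachments: ["settings.png", "notifications.pdf"],
};

console.log(missingContext(t119)); // []
```

Treating the rubric and trace as required alongside the test reflects the claim above that the agent sees everything the reviewer saw; a real integration may draw the line differently.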

02·Proposes a fix

A diff, not a summary

The agent writes the actual change — usually a prompt edit, sometimes a tool-call refinement or a guardrail. You see the diff before anything runs.

src/agents/billing/prompt.ts · +4 −1 · fixes t-119
42   You are a billing support agent.
43   Answer questions about refunds, subscriptions, and billing.
44
45 - When citing policy, reference the policy document.
46 + When citing policy, include the specific page number
47 + from the source PDF. For example:
48 + "See page 4 of the Notifications guide."
49 + Never cite a policy without a page number.
50
51   If the user provides a screenshot, analyze only
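
To make the intent of this prompt change concrete, here is a minimal sketch of a deterministic check for whether a reply cites a page number. The contents of the real rubric (billing-cite-page.md) are not shown here, so the `citesPage` function and its pattern are assumptions based only on the example phrasing in the diff.

```typescript
// Hypothetical stand-in for the page-citation rubric.
// The pattern is an assumption drawn from the example in the diff:
// "See page 4 of the Notifications guide."
const PAGE_CITATION = /\bpage\s+\d+\b/i;

// True when the reply mentions "page <number>" anywhere in its text.
function citesPage(reply: string): boolean {
  return PAGE_CITATION.test(reply);
}

console.log(citesPage("See page 4 of the Notifications guide.")); // true
console.log(citesPage("See the refund policy document.")); // false
```

A pattern match like this only confirms that some page number appears; checking that the cited page is the right one would need the source PDF or a model-based grader.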

03·Re-runs the benchmark

Passes before it ships

Before opening a PR, the agent runs the full benchmark against the new prompt. If anything regresses elsewhere, no PR is opened, so you only see fixes that don't break anything else.

Re-run · benchmark ● complete
t-087  Tone matches brand voice                 pass
t-098  No hallucinated toggles in screenshots   pass
t-102  Refund answer within 1 sentence          pass
t-106  Video onboarding — no dead states        pass
t-111  Legal PDF cites page numbers             pass
t-114  Onboarding video grader passes           pass
t-117  Settings screenshot naming               pass
t-119  Billing refund cites page number         red → green
t-122  Over-refusal on pricing questions        pass
t-124  Password reset — no loop                 pass
18 / 18 passing · +1 fix · 0 regressions
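
The gate described in this step can be sketched as a comparison of results before and after the change. The `Results` type, the `gate` function, and the rule that a PR opens only when every test passes are illustrative assumptions, not benchmax's actual implementation.

```typescript
type Status = "pass" | "fail";
type Results = Record<string, Status>;

interface GateDecision {
  fixed: string[]; // failed before, pass now
  regressions: string[]; // passed before, fail now
  stillFailing: string[]; // fail now without having passed before
  openPr: boolean;
}

// Compares two benchmark runs and decides whether a PR may be opened.
// Assumed rule: no regressions and nothing left failing.
function gate(before: Results, after: Results): GateDecision {
  const fixed: string[] = [];
  const regressions: string[] = [];
  const stillFailing: string[] = [];
  for (const id of Object.keys(after)) {
    const prev = before[id];
    const now = after[id];
    if (now === "pass" && prev === "fail") fixed.push(id);
    else if (now === "fail" && prev === "pass") regressions.push(id);
    else if (now === "fail") stillFailing.push(id);
  }
  const openPr = regressions.length === 0 && stillFailing.length === 0;
  return { fixed, regressions, stillFailing, openPr };
}

// Three rows from the table above, before and after the prompt change.
const before: Results = { "t-117": "pass", "t-119": "fail", "t-122": "pass" };
const after: Results = { "t-117": "pass", "t-119": "pass", "t-122": "pass" };

const decision = gate(before, after);
console.log(decision.fixed, decision.regressions.length, decision.openPr);
```

Here `decision.fixed` contains only "t-119", there are no regressions, and `openPr` is true, matching the "+1 fix · 0 regressions" summary.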

04·Opens the PR

A pull request your team can actually review

The PR includes the failing rubric, the diff, before/after test results, and links to the production traces that triggered the issue.

benchmax/core · #842
OPEN · Fix billing hallucination (t-119)
claude-code · opened 2m ago · all checks passing
Context
Production issue hv-824 (47 occurrences over 8 days). Billing answers reference policy documents without citing page numbers. Grader rubric: billing-cite-page.md.
Change
prompt.ts lines 45–49 — require page-number citations when referencing the Notifications guide or billing policy. Added example format.
Benchmark
18 / 18 passing · t-119 red → green · 0 regressions on the rest of the suite.
Production traces fixed
47 traces from hv-824 — karri, rohan, mei, jayden, sofia, + 42 more.
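
One way to picture how a PR description like this could be assembled is the sketch below. The `PrSummary` type and `renderPrBody` function are hypothetical; the section headings come from the PR card above, and the sample values are abbreviated from the same card.

```typescript
// Hypothetical container for the four sections shown in the PR card.
interface PrSummary {
  context: string;
  change: string;
  benchmark: string;
  traces: string;
}

// Renders each section as a heading line followed by its body,
// with a blank line between sections.
function renderPrBody(summary: PrSummary): string {
  const sections: Array<[string, string]> = [
    ["Context", summary.context],
    ["Change", summary.change],
    ["Benchmark", summary.benchmark],
    ["Production traces fixed", summary.traces],
  ];
  return sections.map(([heading, body]) => `${heading}\n${body}`).join("\n\n");
}

const body = renderPrBody({
  context: "Production issue hv-824 (47 occurrences over 8 days).",
  change: "prompt.ts lines 45–49: require page-number citations.",
  benchmark: "18 / 18 passing · t-119 red → green · 0 regressions.",
  traces: "47 traces from hv-824.",
});

console.log(body);
```

Keeping the sections in a fixed order means reviewers find the rubric context, the diff summary, and the benchmark result in the same place on every PR.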

Let your coding agent earn its keep.