Coding Agent
Hand failing tests to Claude Code.
Get fixes back.
benchmax wires your benchmark into Claude Code, Cursor, or any coding agent. Failing tests become fixes, re-run against the full suite before a PR opens.
01·Context-aware
The agent sees what the reviewer saw
Claude Code gets the failing test, the rubric, the expected output, and the production trace. It fixes the actual problem, not a prompt intuition.
02·Proposes a fix
A diff, not a summary
The agent writes the actual change — usually a prompt edit, sometimes a tool-call refinement or a guardrail. You see the diff before anything runs.
03·Re-runs the benchmark
Passes before it ships
Before opening a PR, the agent runs the full benchmark against the new prompt. Regressions elsewhere = no PR. You only see fixes that don't break anything else.
04·Opens the PR
A pull request your team can actually review
The PR includes the failing rubric, the diff, before/after test results, and links to the production traces that triggered the issue.