How Product Managers Can Write Evals
"Writing evals is going to become a core skill for product managers." — OpenAI's CPO, Kevin Weil.
If you listen to Lenny's Podcast or watch YC videos, you hear this everywhere: writing evals is the most critical skill for PMs building AI products.
But have you ever actually seen a Product Manager write an eval?
Probably not.
In reality, most "evaluation" looks like this: An engineer makes a change, and the PM checks 3 or 4 prompts they know by heart. If it looks okay, they ship it.
Why Are We Stuck Here?
Because we treat evals like a coding problem.
Because engineers usually set up the infrastructure, the conversation revolves around reading JSON traces, building LLM judges, aligning those judges, and fine-tuning. This signals to the PM that evals are not product work.
This is incorrect. As a PM, you are the expert on your product. You know what "Good" looks like.
Demystifying the Eval
If you strip away the code, an eval is actually very simple. It is just a logical test case containing four things:
- Input: The user's prompt.
- Output: The agent's response.
- The Judge: The logic that decides if the output is correct. You can have multiple judges.
- The Result: Pass or Fail.
We need to stop thinking of evals as complex algorithms and start thinking of them as an Automated Product Spec. You simply define what a good output looks like, or what makes an output bad.
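In code terms, that spec can be as small as this. Here is a minimal sketch in Python; the `EvalCase` and `run_eval` names are illustrative, not from any particular framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str                                 # the user's prompt
    output: str                                # the agent's response
    judges: list[Callable[[str, str], bool]]   # each judge decides pass/fail

def run_eval(case: EvalCase) -> bool:
    # The result is simply whether every judge passed.
    return all(judge(case.input, case.output) for judge in case.judges)

# Example judge: the response must not promise a refund the policy doesn't allow.
def no_unapproved_refunds(user_input: str, output: str) -> bool:
    return "full refund" not in output.lower()

case = EvalCase(
    input="Can I get my money back?",
    output="Per our policy, refunds are available within 30 days.",
    judges=[no_unapproved_refunds],
)
print("PASS" if run_eval(case) else "FAIL")
```

That is the whole idea: a prompt, a response, and a rule you could have written in a product spec.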
How to Start
We built benchmax to allow PMs to own the evaluation process. But the process matters more than the tool.
1. Capture what "Good" and "Bad" look like
To capture expert intuition, you need to build two distinct capabilities.
First, build a Custom Trace Viewer. You cannot evaluate what you cannot read. If your process requires opening a code editor or parsing { "content": "..." } blocks to understand a conversation, it won't work. You need a clean interface that renders the conversation exactly as the user sees it, showing the clear back-and-forth flow.
Second, build an Annotation Interface. You need the ability to highlight specific sentences within the chat and attach context. A simple "thumbs up/down" button is useless for detailed specs.
- Bad: Marking a whole conversation as "Bad."
- Good: Highlighting a specific sentence and tagging it: "This definition of EBITDA is legally risky because it's not backed by a citation."
You need both: The Viewer to make the data accessible, and the Annotation tools to capture the expert's specific reasoning.
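To make this concrete, here is a minimal sketch of both pieces in Python: a function that renders a raw JSON trace as a readable transcript, and a plain dictionary for one highlighted-span annotation. The field names are assumptions, not a fixed schema.

```python
import json

# Raw trace as it might come out of your logs.
trace = json.loads("""
[{"role": "user", "content": "What discount can I get?"},
 {"role": "assistant", "content": "We offer a 40% discount to all new customers."}]
""")

def render(trace: list[dict]) -> str:
    # Show the conversation as the user saw it, not as nested JSON.
    return "\n".join(f'{turn["role"].upper()}: {turn["content"]}' for turn in trace)

# One annotation: a highlighted span plus the expert's reasoning, not just a thumbs-down.
annotation = {
    "turn_index": 1,
    "highlighted_text": "40% discount to all new customers",
    "comment": "Not in the pricing policy; the model invented this rate.",
    "tag": "policy_hallucination",
}

print(render(trace))
```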
2. Turn Feedback Clusters into Specific Judges
Once you can annotate logs, you will start noticing patterns.
You might highlight a response where the AI made up a policy. Then you see it again three chats later. And again. You now have a Cluster of specific comments all pointing to the same issue.
This cluster is the blueprint for your eval.
You don't need to write code to fix this. You simply translate this cluster into a Specific Judge. You define the rule in plain English:
"The AI must strictly adhere to the provided policy document when discussing discounts. If it mentions a discount rate not found in the context, mark it as a failure."
By building specific judges, you ensure the grading model can actually score outputs reliably, avoiding the "alignment issues" that come with vague judge prompts.
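Here is one way that plain-English rule could become a judge: wrap it in a grading prompt and ask a model for a PASS/FAIL verdict. This is a sketch; the `call_llm` argument stands in for whichever model client you use.

```python
JUDGE_PROMPT = """You are grading a support agent's response.

Rule: The agent must strictly adhere to the provided policy document when
discussing discounts. If the response mentions a discount rate not found in
the policy, it fails.

Policy document:
{policy}

Agent response:
{response}

Answer with exactly one word: PASS or FAIL."""

def discount_policy_judge(policy: str, response: str, call_llm) -> bool:
    # call_llm: any function that takes a prompt string and returns the model's text.
    verdict = call_llm(JUDGE_PROMPT.format(policy=policy, response=response))
    return verdict.strip().upper().startswith("PASS")
```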
3. Combine Judges into an Evaluation Rubric
A rigorous Rubric reflects your product requirements. It combines multiple Specific Judges to score a single interaction.
For example, a "Financial Advisor Bot" Rubric would run these three distinct judges on every response:
- Judge A: The "Citation Integrity" Check — Did the model provide a source for every numerical claim? (Pass/Fail)
- Judge B: The "Negative Constraint" Check — Did the model refuse to answer questions about tax evasion or illegal activities? (Pass/Fail)
- Judge C: The "Tone Consistency" Check — Did the model maintain a professional, neutral tone without using emojis or slang? (Pass/Fail)
Now, instead of a vague "thumbs up," every conversation gets a granular score. You know exactly why a response failed.
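As a sketch, a Rubric can be as simple as a named collection of judge functions run over every response. The judge bodies below are crude keyword placeholders; in practice each would be a specific LLM judge like the discount example above.

```python
def citation_integrity(response: str) -> bool:
    # Placeholder: any numerical claim must carry a "[source: ...]" marker.
    return "[source" in response or not any(ch.isdigit() for ch in response)

def negative_constraint(response: str) -> bool:
    # Placeholder: never engage with tax evasion.
    return "tax evasion" not in response.lower()

def tone_consistency(response: str) -> bool:
    # Placeholder: no emojis or slang.
    return not any(token in response for token in ("😀", "lol", "btw"))

RUBRIC = {
    "citation_integrity": citation_integrity,
    "negative_constraint": negative_constraint,
    "tone_consistency": tone_consistency,
}

def score(response: str) -> dict[str, bool]:
    # Every response gets a per-judge verdict instead of a single thumbs up/down.
    return {name: judge(response) for name, judge in RUBRIC.items()}

print(score("Our fund returned 12% last year [source: 2023 annual report]."))
```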
4. Measure Every Change
Now you have a safety net. Before your engineers merge a new prompt or swap a model, you run your Rubric over a fixed set of test prompts and compare pass rates against the current version.
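A sketch of that safety net: hold a fixed set of saved prompts, run the Rubric against responses from both the live configuration and the candidate, and block the merge if the pass rate drops. The `generate` callables here stand in for your agent under each configuration.

```python
def pass_rate(generate, prompts, rubric) -> float:
    # Fraction of saved prompts where the response passes every judge in the rubric.
    passed = 0
    for prompt in prompts:
        response = generate(prompt)
        if all(judge(response) for judge in rubric.values()):
            passed += 1
    return passed / len(prompts)

def gate(current_generate, candidate_generate, prompts, rubric) -> bool:
    # Block the change if the candidate scores worse than what is live today.
    return pass_rate(candidate_generate, prompts, rubric) >= pass_rate(current_generate, prompts, rubric)
```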
PMs Need to Own Evals
Engineers own the code. But PMs must own the quality.
We provide the trace viewer and annotation tools so your experts can work comfortably, and the workflow to turn those feedback clusters into a rigorous Evaluation Rubric.
benchmax is the bridge that lets Product Managers and Domain Experts take the intuition trapped in their heads and turn it into an automated test for every future change.