How PMs Can Build and Maintain High-Quality AI Evaluation Sets
One of the most critical jobs of a product manager is to be the arbiter of "good" in an AI product. That means looking at real user queries, inspecting the current outputs, pinpointing the mistakes the model is making, and defining what a reference solution looks like.
Your evaluation set should contain a diverse mix of queries: some your agent already handles well, so you know you aren't regressing there, and some it struggles with today but that you want to measure and improve on.
Yet the evaluation set is one of the most misunderstood artifacts in AI product work. Here are the common mistakes I see AI PMs make when managing theirs:
1. The "Softball" Problem (Your Questions Are Too Easy)
The most common mistake is filling your dataset with polite, well-structured queries like "What is the pricing for the Pro plan?"
Of course the model will get that right. But that's not what breaks in production. Real failure comes from Hard Examples.
PMs sit in a room and guess what users might ask. They write perfectly grammatical queries like "Summarize this article." Real users don't type like that. Real users type "tl;dr this trash," paste broken URLs, or ask ambiguous follow-ups.
The Fix: Your Evaluation Set must be derived from real-world friction. It needs to contain the misspellings, the confused intent, and the adversarial attempts that appear in your actual logs. If your test data looks cleaner than your production data, it is useless.
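As a minimal sketch of what mining those Hard Examples can look like, here is one way to pull friction-heavy interactions out of a log export. The file name, field names, and feedback signals are placeholders for whatever your own logging schema uses:

```python
# Minimal sketch: mining "Hard Examples" from a production log export.
# Assumes a logs.jsonl file where each line records "query", "response",
# and a "feedback" signal -- adjust the names to your own schema.
import json

FRICTION_SIGNALS = {"thumbs_down", "retry", "escalated_to_support"}

hard_examples = []
with open("logs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Keep the messy, real-world queries, not the polite ones.
        if record.get("feedback") in FRICTION_SIGNALS:
            hard_examples.append({
                "input": record["query"],           # e.g. "tl;dr this trash"
                "observed_output": record["response"],
                "reference_solution": "",           # filled in by the PM during curation
            })

with open("eval_set.jsonl", "w") as f:
    for example in hard_examples:
        f.write(json.dumps(example) + "\n")
```

The point is that the eval set starts from logged friction, not from queries invented in a meeting room.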
2. Missing the "Ideal Answer" (Reference Solutions)
Many datasets are just a column of inputs. This is a half-built bridge.
You cannot improve a model if you don't define what "better" looks like. It's not enough to know the model failed; you need a Reference Solution.
Many teams skip this step because it's hard. They rely on gut feel to decide whether an output is right. But without a defined reference, you cannot automate evaluation.
The Fix: For every input in your set, define the success criteria. Is it a specific fact? A specific JSON format? A specific tone? You need to define the destination your agent will be measured against.
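Here is a sketch of what "defining the destination" can look like as machine-checkable criteria attached to each example. The criteria types (must_contain, valid_json, max_words) and the pricing figure are purely illustrative, not a prescribed format:

```python
# Minimal sketch: success criteria that can be checked automatically.
# The criteria types below are illustrative; define whatever "better"
# means for your product (a fact, a JSON schema, a tone, a length).
import json

def passes(output: str, criteria: dict) -> bool:
    if "must_contain" in criteria and criteria["must_contain"].lower() not in output.lower():
        return False
    if criteria.get("valid_json"):
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return False
    if "max_words" in criteria and len(output.split()) > criteria["max_words"]:
        return False
    return True

example = {
    "input": "What is the pricing for the Pro plan?",
    "reference_solution": "The Pro plan is $49/month, billed annually.",  # illustrative figure
    "criteria": {"must_contain": "$49", "max_words": 60},
}

print(passes("Pro costs $49 per month.", example["criteria"]))  # True
```

Even a crude check like this beats "feeling" whether the output is right, because it can be run on every example, every time.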
3. The "Launch and Abandon" Trap
Most teams treat the Evaluation Set as a pre-launch checklist. They curate 50 queries, pass them, ship the feature, and never look back.
But user behavior shifts. New types of failure modes emerge. If you are still testing against the same 50 questions from six months ago, you are optimizing for a ghost.
The Fix: Treat your dataset as a living document. Every time you spot a new type of failure in your logs, that interaction must be promoted to your Evaluation Set.
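One lightweight way to make that promotion a habit is a small helper that appends a logged failure to the eval file the moment you spot it. The function and field names here are illustrative, assuming the same eval_set.jsonl format as the earlier sketch:

```python
# Minimal sketch: promoting a newly spotted failure into the living eval set.
import json
from datetime import date

def promote_to_eval_set(logged_interaction: dict, path: str = "eval_set.jsonl") -> None:
    example = {
        "input": logged_interaction["query"],
        "reference_solution": logged_interaction.get("corrected_answer", ""),  # PM fills this in
        "criteria": {},                       # success criteria added during curation
        "source": "production_log",
        "added_on": date.today().isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(example) + "\n")

# Example: a new failure mode spotted in this week's logs.
promote_to_eval_set({"query": "tl;dr this trash http://broken.link"})
```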
4. The Regression Bottleneck
This is the silent killer of velocity. An engineer tweaks a prompt to fix one bug. How do you know they didn't just break 20 other things?
For most teams, "re-running the evaluation" is a manual, painful process involving spreadsheets and copy-pasting. Because it is high-friction, it rarely happens, leading to "Regression Roulette" every time you deploy.
The Fix: You must be able to run your entire set of Hard Examples against a new model candidate in the time it takes to grab a coffee.
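What that looks like in practice is a single command that replays every Hard Example against the candidate and reports regressions. A minimal sketch, assuming the eval_set.jsonl format above; call_candidate_model is a placeholder for however your team invokes the prompt or model under test:

```python
# Minimal sketch of a one-command regression run over the whole eval set.
import json

def call_candidate_model(query: str) -> str:
    # Placeholder: wire this to the new prompt or model candidate under test.
    raise NotImplementedError("replace with a real call to your candidate")

def passes(output: str, criteria: dict) -> bool:
    # Simplified stand-in for the success-criteria checker sketched earlier.
    return criteria.get("must_contain", "").lower() in output.lower()

def run_regression(path: str = "eval_set.jsonl") -> None:
    passed, failures = 0, []
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            output = call_candidate_model(example["input"])
            if passes(output, example.get("criteria", {})):
                passed += 1
            else:
                failures.append(example["input"])
    print(f"{passed} passed, {len(failures)} failed")
    for query in failures:
        print(f"  REGRESSION on: {query[:60]!r}")

run_regression()
```

When the run is this cheap, it happens on every prompt tweak instead of once a quarter.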
Bridging the Gap: The benchmax Excel Plugin
We built benchmax because we saw these broken loops everywhere. We believe PMs should own the definition of quality—defining those Hard Examples and Ideal Answers—without needing to write code.
We know that for many of you, your data currently lives in messy spreadsheets. That is why we are launching our Excel Plugin next week.
This plugin allows you to:
- Upload your existing spreadsheet directly into benchmax.
- Curate it by adding Hard Examples and defining Reference Solutions.
- Run that set against any new model or prompt change with a single click.
Stop testing on easy mode. Start measuring what matters.