Was the agent any good?
eval9 will bring offline evals and online quality to agents. Where owl9 tells you what happened, eval9 will tell you whether it was good — run a suite on every change, then score live traffic to catch regressions before your users do.
From "what happened" to "was it good."
Define your cases in a suite and eval9 will run them on every change, reporting pass and fail per case and breaking quality down by tag. Then point it at production and it will score live traffic for regressions. owl9 captures what an agent did; eval9 judges whether it was any good.
# preview: eval9 is coming soon
$ eval9 run suite.yaml
→ ran the suite · pass / fail per case
$ eval9 report
→ quality broken down by case and tag
$ eval9 watch --prod
→ scoring live traffic for regressions
Run. Report. Watch.
eval9 will close the loop between shipping an agent change and knowing if it helped.
Offline eval suites
Define cases in a suite and run them on every change, so a regression shows up before it ships instead of after a user hits it.
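To make the idea of an offline suite concrete, here is a minimal Python sketch of running a set of cases against an agent and recording pass or fail per case. eval9 has not published a suite format or API, so `Case`, `run_suite`, and `toy_agent` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Case:
    """One eval case (hypothetical shape, not eval9's format)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable
    tags: list[str] = field(default_factory=list)


def run_suite(cases, agent):
    """Run every case through the agent and record pass/fail per case."""
    results = {}
    for case in cases:
        try:
            output = agent(case.prompt)
            results[case.name] = bool(case.check(output))
        except Exception:
            # A crashing agent counts as a failed case, not a failed run.
            results[case.name] = False
    return results


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent call.
    return "4" if prompt == "What is 2 + 2?" else "I don't know"


SUITE = [
    Case("arithmetic", "What is 2 + 2?", lambda out: out.strip() == "4", ["math"]),
    Case("capital", "Capital of France?", lambda out: "Paris" in out, ["geography"]),
]

results = run_suite(SUITE, toy_agent)
for name, passed in results.items():
    print(f"{name}: {'pass' in ('pass',) and ('pass' if passed else 'fail')}")
```

Running the same suite on every change, for example in CI, is what lets a failing case surface before the change ships.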
Quality, broken down
See results by case and by tag rather than as one opaque score, so you know which behaviors improved and which slipped.
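As a rough illustration of a per-tag breakdown, the Python sketch below aggregates pass/fail results into a pass rate for each tag. The result tuples, tag names, and `pass_rate_by_tag` are assumptions made for this example; eval9's actual report format has not been published.

```python
from collections import defaultdict


def pass_rate_by_tag(results):
    """Aggregate (case_name, tags, passed) tuples into a pass rate per tag."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for _name, tags, passed in results:
        for tag in tags:
            totals[tag][0] += int(passed)
            totals[tag][1] += 1
    return {tag: passed / total for tag, (passed, total) in totals.items()}


# Hypothetical results from one suite run.
RESULTS = [
    ("refund-policy", ["support", "retrieval"], True),
    ("cancel-order", ["support", "tool-use"], False),
    ("order-status", ["tool-use"], True),
    ("greeting", ["support"], True),
]

rates = pass_rate_by_tag(RESULTS)
for tag, rate in sorted(rates.items()):
    print(f"{tag}: {rate:.0%}")
```

In this toy data the overall pass rate is 75%, but the breakdown shows the failures concentrated under `tool-use`, which a single score would hide.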
Online quality
Point eval9 at production and it will score live traffic, flagging regressions in real behavior, not just in the test set.
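One simple way to flag a regression in live traffic is to compare a rolling mean of recent quality scores against a baseline. The Python sketch below shows that technique under stated assumptions: scores are numbers in [0, 1], and `RegressionWatch`, its window size, and its tolerance are hypothetical choices, not eval9's method.

```python
from collections import deque


class RegressionWatch:
    """Flag a regression when the rolling mean drops below baseline - tolerance."""

    def __init__(self, baseline, window=50, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def observe(self, score):
        """Record one score; return True if the current window has regressed."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough traffic yet to judge
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance


# Hypothetical stream of per-request quality scores.
watch = RegressionWatch(baseline=0.9, window=4, tolerance=0.05)
flags = [watch.observe(s) for s in [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5]]
print(flags)
```

A larger window smooths out noise at the cost of slower detection; a single low score does not trip the flag here, but two in a row do.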
Quality on top of what you already capture.
eval9 pairs with observability, compute, and data to judge agent behavior end to end.
Observability
owl9 captures what happened; eval9 judges whether it was good.
owl9 →
run9 Sandboxes
Run eval suites across forkable sandboxes — many cases in parallel.
run9 →
db9 Postgres database
Keep eval cases and results in a branchable database you can query.
db9 →
mem9 Memory
Reference the memory an agent drew on when judging the answer it gave.
mem9 →
Know whether your agents are good.
eval9 is on the way. Request early access and we'll reach out when offline and online evals open up.