eval9 · Evals · Coming soon

Was the agent any good?

eval9 will bring offline evals and online quality to agents. Where owl9 tells you what happened, eval9 will tell you whether it was good — run a suite on every change, then score live traffic to catch regressions before your users do.

Request early access · See every product →

Offline evals · Online quality · Complements owl9 · Regression alerts

preview

From "what happened" to "was it good."

Define your cases in a suite and eval9 will run them on every change, reporting pass or fail per case and breaking quality down by tag. Then point it at production and it will score live traffic to flag regressions. owl9 captures what an agent did; eval9 judges whether it was any good.

offline suites · quality by tag · online scoring

eval9-preview

# preview — eval9 is coming soon
$ eval9 run suite.yaml
  → ran the suite · pass / fail per case

$ eval9 report
  → quality broken down by case and tag

$ eval9 watch --prod
  → scoring live traffic for regressions

what it'll do

Run. Report. Watch.

eval9 will close the loop between shipping an agent change and knowing if it helped.

01 · Run

Offline eval suites

Define cases in a suite and run them on every change, so a regression shows up before it ships instead of after a user hits it.
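To make the idea concrete, here is a minimal sketch of what "cases in a suite, pass or fail per case" can look like. eval9 has not published a suite format or API, so everything below is an assumption for illustration: `Case`, `run_suite`, `toy_agent`, and the sample cases are hypothetical names, not eval9 interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Case:
    """One eval case (hypothetical shape, not eval9's suite format)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable
    tags: list[str] = field(default_factory=list)


def run_suite(agent: Callable[[str], str], cases: list[Case]) -> dict[str, bool]:
    """Run every case against the agent and record pass/fail per case."""
    results: dict[str, bool] = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(agent(case.prompt)))
        except Exception:
            # A crashing agent or check counts as a failed case, not a failed run.
            results[case.name] = False
    return results


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent: it just upper-cases the prompt."""
    return prompt.upper()


SUITE = [
    Case("shouts-hello", "hello", lambda out: out == "HELLO", tags=["formatting"]),
    Case("keeps-digits", "room 42", lambda out: "42" in out, tags=["accuracy"]),
    Case("adds-greeting", "bob", lambda out: out.startswith("HI "), tags=["tone"]),
]

results = run_suite(toy_agent, SUITE)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Running a loop like this on every change, and failing the build when any case fails, is one way to surface a regression before it ships.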

02 · Report

Quality, broken down

See results by case and by tag rather than one opaque score — so you know which behaviors improved and which slipped.
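As a rough illustration of why a per-tag breakdown says more than a single score, the sketch below aggregates pass rates by tag. The function name, the result tuples, and the tag names are all made up for this example and do not reflect eval9's report format.

```python
from collections import defaultdict


def pass_rate_by_tag(results):
    """Aggregate (case_name, tags, passed) tuples into a pass rate per tag."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for _name, tags, passed in results:
        for tag in tags:
            totals[tag][1] += 1
            if passed:
                totals[tag][0] += 1
    return {tag: p / t for tag, (p, t) in totals.items()}


# Illustrative results only; a real run would produce these from a suite.
RESULTS = [
    ("refund-policy", ["retrieval", "billing"], True),
    ("cancel-order", ["tool-use", "billing"], False),
    ("lookup-invoice", ["retrieval"], True),
    ("book-meeting", ["tool-use"], False),
]

rates = pass_rate_by_tag(RESULTS)
overall = sum(passed for _, _, passed in RESULTS) / len(RESULTS)

print(f"overall: {overall:.0%}")
for tag, rate in sorted(rates.items()):
    print(f"{tag}: {rate:.0%}")
```

In this toy data the overall pass rate is 50%, which hides the fact that every retrieval case passes while every tool-use case fails.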

03 · Watch

Online quality

Point eval9 at production and it will score live traffic, flagging regressions in real behavior, not just in the test set.
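One simple way to flag a regression in live traffic is to compare a rolling mean of per-interaction scores against a baseline. The sketch below assumes each interaction has already been scored between 0 and 1; `RegressionWatch`, its parameters, and the threshold rule are hypothetical and say nothing about how eval9 will actually score or alert.

```python
from collections import deque


class RegressionWatch:
    """Flag when the rolling mean score drops below baseline minus tolerance."""

    def __init__(self, baseline: float, window: int = 50, tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def observe(self, score: float) -> bool:
        """Record one scored interaction; return True if a regression is flagged."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before judging
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance


# Tiny window so the behavior is easy to follow by hand.
watch = RegressionWatch(baseline=0.9, window=4, tolerance=0.05)
flags = [watch.observe(s) for s in [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]]
print(flags)
```

With these numbers the watch stays quiet until the sixth score, when the mean of the last four scores falls to 0.75, below the 0.85 threshold. A larger window trades slower detection for fewer false alarms.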

combines with

Quality on top of what you already capture.

eval9 pairs with observability, compute, and data to judge agent behavior end to end.

owl9

coming soon

Know whether your agents are good.

eval9 is on the way. Request early access and we'll reach out when offline and online evals open up.

Get notified · See every product →


Observability

Sandboxes

Postgres database

Memory
