eval9 · Evals · Coming soon

Was the agent any good?

eval9 will bring offline evals and online quality scoring to agents. Where owl9 tells you what happened, eval9 will tell you whether it was good: run a suite on every change, then score live traffic to catch regressions before your users do.

Offline evals · Online quality · Complements owl9 · Regression alerts
preview

From "what happened" to "was it good."

Define your cases in a suite and eval9 will run them on every change, reporting pass and fail per case and breaking quality down by tag. Then point it at production and it will score live traffic for regressions. owl9 captures what an agent did; eval9 judges whether it was any good.

offline suites · quality by tag · online scoring
eval9-preview
# preview — eval9 is coming soon
$ eval9 run suite.yaml
  → ran the suite · pass / fail per case

$ eval9 report
  → quality broken down by case and tag

$ eval9 watch --prod
  → scoring live traffic for regressions

what it'll do

Run. Report. Watch.

eval9 will close the loop between shipping an agent change and knowing if it helped.

01 · Run

Offline eval suites

Define cases in a suite and run them on every change, so a regression shows up before it ships instead of after a user hits it.
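
To make the idea concrete, here is a minimal Python sketch of running a suite and recording pass or fail per case. It is not eval9's API, which has not been published: the Case shape, run_suite, and toy_agent are all hypothetical names chosen for illustration, and the real suite.yaml schema may look quite different.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    # Hypothetical case shape; the real suite.yaml schema is not published.
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_suite(agent: Callable[[str], str], cases: list[Case]) -> dict[str, bool]:
    """Run every case through the agent and record pass or fail per case."""
    return {case.name: case.check(agent(case.prompt)) for case in cases}


def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent call (assumption for this sketch).
    return prompt.upper()


cases = [
    Case("shouts-hello", "hello", lambda out: out == "HELLO"),
    Case("keeps-length", "abc", lambda out: len(out) == 3),
    Case("adds-period", "done", lambda out: out.endswith(".")),
]
outcome = run_suite(toy_agent, cases)
print(outcome)
```

Running something like this on every change is what lets a failing case surface in review rather than in production.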

02 · Report

Quality, broken down

See results by case and by tag rather than one opaque score — so you know which behaviors improved and which slipped.
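
As a rough illustration of a per-tag breakdown, the sketch below groups case results by tag and computes a pass rate for each. The CaseResult record and pass_rate_by_tag function are assumptions made for this example, not eval9's actual report format.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CaseResult:
    # Hypothetical record shape; eval9's real result format is not published.
    name: str
    passed: bool
    tags: list[str] = field(default_factory=list)


def pass_rate_by_tag(results: list[CaseResult]) -> dict[str, float]:
    """Return the fraction of passing cases for each tag."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for result in results:
        for tag in result.tags:
            total[tag] += 1
            if result.passed:
                passed[tag] += 1
    return {tag: passed[tag] / total[tag] for tag in total}


results = [
    CaseResult("refund-happy-path", True, ["billing"]),
    CaseResult("refund-expired-card", False, ["billing", "edge-case"]),
    CaseResult("greeting", True, ["smalltalk"]),
]
rates = pass_rate_by_tag(results)
print(rates)
```

Here the overall pass rate is two out of three, which hides the fact that every "edge-case" case failed; the per-tag view makes that visible.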

03 · Watch

Online quality

Point eval9 at production and it will score live traffic, flagging regressions in real behavior, not just in the test set.
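
One simple way to think about flagging regressions in live traffic is a rolling average compared against a baseline. The sketch below assumes each production interaction has already been given a quality score between 0 and 1; the RegressionWatcher class, its thresholds, and the scoring itself are hypothetical and say nothing about how eval9 will actually do this.

```python
from collections import deque


class RegressionWatcher:
    """Flags when the rolling mean score falls below baseline minus tolerance."""

    def __init__(self, baseline: float, tolerance: float, window: int) -> None:
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the latest `window` scores

    def observe(self, score: float) -> bool:
        """Record one score; return True if the full window has regressed."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance


watcher = RegressionWatcher(baseline=0.9, tolerance=0.05, window=3)
flags = [watcher.observe(s) for s in [0.9, 0.95, 0.9, 0.6, 0.7]]
print(flags)
```

A windowed mean trades a little detection delay for fewer false alarms from a single bad response; a real system would likely need something more robust.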

coming soon

Know whether your agents are good.

eval9 is on the way. Request early access and we'll reach out when offline and online evals open up.