Evaluation is the systematic process of measuring how well your AI performs against known correct examples. Instead of relying on manual spot-checks or subjective assessments, evaluations provide quantitative, repeatable benchmarks that let you confidently improve your AI systems over time.

Why systematic evaluation matters

AI systems fail in non-deterministic ways: the same prompt can produce different results, and edge cases emerge unpredictably. As capabilities grow from simple single-turn interactions to complex multi-agent systems, manual testing no longer scales. Systematic evaluation addresses this by:
  • Establishing baselines: Measure current performance before making changes
  • Preventing regressions: Catch quality degradation before it reaches production
  • Enabling experimentation: Compare different models, prompts, or architectures
  • Building confidence: Deploy changes knowing they improve aggregate performance

The evaluation workflow

Axiom’s evaluation framework follows a simple pattern:
1. Create a collection

Build a set of test cases with inputs and expected outputs (ground truth). Start small with 10-20 examples and grow over time.
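As a rough sketch (the shapes below are illustrative assumptions, not Axiom's schema), a starter collection can be as simple as an array of input/expected pairs:

```typescript
// Hypothetical test-case shape: an input plus the expected output (ground truth).
interface TestCase {
  input: string;    // what the capability receives
  expected: string; // the known-correct answer
}

// A small starter collection; grow it as new edge cases surface.
const collection: TestCase[] = [
  { input: "What is the capital of France?", expected: "Paris" },
  { input: "2 + 2", expected: "4" },
  // ...add 10-20 representative examples over time
];
```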
2. Define scorers

Write functions that compare your capability’s output against the expected result. Use custom logic or prebuilt scorers from libraries like autoevals.
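A minimal scorer sketch, assuming the same string-in/string-out shape as the collection above; the signature here is illustrative rather than a prescribed interface:

```typescript
// Illustrative scorer shape: compare output to expected, return a score in [0, 1].
type Scorer = (output: string, expected: string) => number;

// Simplest case: exact match after trimming whitespace.
const exactMatch: Scorer = (output, expected) =>
  output.trim() === expected.trim() ? 1 : 0;

// A more forgiving variant: case-insensitive containment of the expected answer.
const contains: Scorer = (output, expected) =>
  output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
```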
3. Run evaluations

Execute your capability against the collection and score the results. Track metrics like accuracy, pass rate, and cost.
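A hand-rolled evaluation loop might look like the sketch below, reusing the illustrative TestCase and Scorer shapes; the stand-in capability, pass threshold, and metric names are assumptions for the example:

```typescript
// Same illustrative shapes as the sketches above.
type TestCase = { input: string; expected: string };
type Scorer = (output: string, expected: string) => number;

// Stand-in for the capability under test; in practice this calls your model or agent.
async function capability(input: string): Promise<string> {
  return `answer for: ${input}`;
}

// Run every test case, score it, and aggregate accuracy and pass rate.
async function runEvaluation(cases: TestCase[], scorer: Scorer) {
  let totalScore = 0;
  let passed = 0;

  for (const testCase of cases) {
    const output = await capability(testCase.input);
    const score = scorer(output, testCase.expected);
    totalScore += score;
    if (score >= 0.5) passed += 1; // pass threshold chosen arbitrarily for the example
  }

  return {
    accuracy: totalScore / cases.length, // mean score across the collection
    passRate: passed / cases.length,     // fraction of cases above the threshold
  };
}

// Usage: const results = await runEvaluation(collection, exactMatch);
```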
4. Compare and iterate

Review results in the Axiom Console. Compare against baselines. Identify failures. Make improvements and re-evaluate.

What’s next?