If you've got a software background and are starting to look into integrating agentic AI into your solutions, it can be quite a paradigm shift. Agents are unpredictable black boxes, which can feel intractable at times. In this article, I draw on intuition you may already have from traditional testing to show how to write effective evals that produce predictable, well-aligned agents.
Typically, your first stab at this involves writing a prompt, running the agent to see what happens, noticing some pathological behaviour, and iterating. This is expensive, time-consuming, and manual. Just like testing software, you're going to want to automate this pretty soon. Let's dive into how!
Unit evals
Just like our testing strategy, we think of evals as a pyramid. At the bottom, we have "unit" style evals. These test a single step in an agentic loop. Specifically, they take a transcript and assert that the agent's next step(s) are as we expect, typically by asserting that the agent calls certain tools with specific arguments. The goal here is to verify that the agent is following the instructions in your system prompt, rather than verifying that those instructions lead to the final desired outcome.
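To make this concrete, here's a minimal sketch of a unit-style eval written as a Go test. Everything in it (the Step type, RunOneStep, the read_memory tool name) is a hypothetical stand-in for your own harness; the point is the shape of the assertion, not the API.

package evals

import "testing"

// Step is the agent's next action: the tool it wants to call and its arguments.
// In a real harness this would be parsed from the model's tool-call response.
type Step struct {
    ToolName string
    Args     map[string]string
}

// RunOneStep sends the system prompt plus the transcript to the model once and
// returns the parsed next step. Its body depends entirely on your model client.
func RunOneStep(t *testing.T, transcript []string) Step {
    t.Helper()
    // ... call the model, parse the tool call ...
    return Step{}
}

func TestReadsKeyMemoryOnNewTask(t *testing.T) {
    transcript := []string{
        "user: Add retry logic to the payments client.",
    }

    step := RunOneStep(t, transcript)

    // Unit evals assert on the next step only: did the agent reach for the
    // right tool, with sensible arguments, as the system prompt instructs?
    if step.ToolName != "read_memory" {
        t.Fatalf("expected read_memory as the first step, got %q", step.ToolName)
    }
    if step.Args["topic"] == "" {
        t.Error("expected a topic argument, got none")
    }
}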
I'd recommend:
- Think about where in the workflow you want the agent to use the tools provided to it, and provide examples of this in your system prompt.
- Generate scenarios (transcripts) that exercise those points in the workflow, and assert that the agent does indeed call the tools you expect.
In the context of our MCP tooling:
- Does the agent read a key memory when starting a new task?
- Does the agent continue to read relevant memories as the task evolves, e.g. when moving from implementation to testing?
- Does the agent report outdated memories back after completing a task, before responding to a user?
Just like unit tests, these are cheap and enable you to iterate rapidly. If you have no evals at all, I highly recommend starting here. You'll at least have some evidence that your agent is following the instructions you gave it. However, be aware that you are imparting a lot of your own assumptions onto the agent, and you may have been wrong about the best approach to the problem. To begin verifying the final outcome, we need to take a step up the pyramid.
NB: This style of eval is extremely effective, especially for closed tasks and smaller models. However, frontier models often benefit from more freedom. This ties nicely into the ever-evolving debate about harness design, which we will cover in a future article.
Integration evals
The next level of our pyramid is integration-style evals. At this level, we start actually executing tool calls to test a full agentic loop. Our goal here is to test how the agent behaves with a given input, to better understand how the rest of our system needs to behave to set the agent up for success. We're trying to answer: assuming the rest of the system works, does our agent actually achieve its goal? This is where integration tests distinguish themselves from end-to-end tests, which operate on a production-like view of the world.
Use these tests to figure out how the rest of the system needs to work, e.g. how should a coordinator agent present a task to a sub-agent to maximise success? What context does it need to include? These tests let you quickly identify what shape the output from one agent or tool needs to be, so you can iterate that tool/agent towards an output that works for this agent. You will quickly find that a little effort spent enriching or formatting the data you present to the agent yields big improvements in performance.
Additionally, these tests allow you to shake out a lot of the assumptions you made at the unit level. Fakes (stub implementations) are highly effective here: they let the agent interact with a realistic version of the system. By capturing side effects in these fakes, you can start to validate the end result of a series of steps towards achieving a task. By composing re-usable test facets together, you can quickly populate these fakes, creating realistic scenarios for your agent.
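As a sketch of what such a fake might look like, here's an in-memory memory store that records every write, so an eval can assert on side effects after a full run. The MemoryStore interface and its methods are hypothetical; the real seam depends on how your agent's tools are wired up.

package evals

import "sync"

type Memory struct {
    Key     string
    Content string
}

// MemoryStore is a hypothetical seam that the agent's memory tools talk to.
// Production backs it with a real database; evals use the fake below.
type MemoryStore interface {
    Read(key string) (Memory, bool)
    Write(m Memory)
}

// FakeMemoryStore keeps everything in memory and records writes, so an eval
// can seed realistic state up front and inspect the side effects afterwards.
type FakeMemoryStore struct {
    mu     sync.Mutex
    data   map[string]Memory
    writes []Memory
}

func NewFakeMemoryStore(seed ...Memory) *FakeMemoryStore {
    f := &FakeMemoryStore{data: map[string]Memory{}}
    for _, m := range seed {
        f.data[m.Key] = m
    }
    return f
}

func (f *FakeMemoryStore) Read(key string) (Memory, bool) {
    f.mu.Lock()
    defer f.mu.Unlock()
    m, ok := f.data[key]
    return m, ok
}

func (f *FakeMemoryStore) Write(m Memory) {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.data[m.Key] = m
    f.writes = append(f.writes, m)
}

// Writes returns the captured side effects for assertions.
func (f *FakeMemoryStore) Writes() []Memory {
    f.mu.Lock()
    defer f.mu.Unlock()
    return append([]Memory(nil), f.writes...)
}

An eval can then seed the fake with, say, a reflection about past lint failures, run the full loop against it, and assert both on the agent's final output and on what it wrote back.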
We have a number of these in our eval suite:
- Does our episode extraction retain certain key bits of context?
- Does the reflection agent properly create/update reflections, capturing key lessons learnt or other information from the episodes?
- Does the coding agent read and properly utilise the information in the memories to write better code?
- Do sub-agents behave as expected within a coordinator agent?
This level gives a good trade-off between cost and realism when iterating on the shape of the inputs. For example, does a reflection with this information in it successfully steer the agent towards a better outcome? It lets you understand not just how to prompt your agent, but also how to present the state of the world to it to set it up for success.
At this level, the differences between testing and evals become more apparent. Agents have a lot more autonomy in this style of eval, so assertions become fuzzier and less deterministic. Read on to find out how to define a rubric to extract a signal from this noise.
End-to-end evals
The final level is full end-to-end evals. The goal here is to validate that the whole system comes together and exhibits the behaviours you expect. For Volary, this means starting from raw transcripts, putting them through the full pipeline, and testing the behaviour of the final coding/online agent.
Some examples of this:
- Given a series of repeated lint failures, do we extract the correct memories that steer a coding agent away from repeating these mistakes?
- Given a stated user preference for a language feature, does the system correctly guide the agent to write code using that feature?
- Given a coding task, how long does the agent take to write its first line of code? Does it spend lots of tool cycles exploring the codebase, or does it remember the structure from last time?
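That last example is a metric rather than a pass/fail check. As a rough sketch of how it might be computed, assuming a hypothetical Turn struct for however your harness records transcripts:

package evals

// Turn is one tool call made by the agent; the tool names here are illustrative.
type Turn struct {
    ToolName string // e.g. "read_file", "edit_file", "run_tests"
}

// TurnsBeforeFirstEdit counts the tool calls the agent made before its first
// file edit, or the length of the transcript if it never edited a file.
// Tracked across runs, a falling number is evidence that memories of the
// codebase structure are actually being used.
func TurnsBeforeFirstEdit(transcript []Turn) int {
    for i, turn := range transcript {
        if turn.ToolName == "edit_file" {
            return i
        }
    }
    return len(transcript)
}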
We take real examples from our codebase (e.g., removing tests for an endpoint) and ask a coding agent to implement the tests, or vice versa. We then assert that it follows the testing practices that were extracted from real transcripts. This lets us validate that the whole system is coming together to achieve the desired outcome.
Just like traditional end-to-end tests, these evals are expensive and slow to run, and they don't give you much visibility into why they're failing. They simply validate the full behaviour of the system.
Designing a rubric
Unlike software tests, agentic systems are fuzzy. A hard pass/fail often leaves you with a poor signal. A rubric allows you to grade the response more granularly, capturing progress towards a more correct answer. Rubrics are useful for unit evals, but become almost mandatory for integration and end-to-end evals.
At Volary, we define a rubric as a .json file containing a set of weighted criteria used to grade the output of the agent. While a deterministic rubric would be ideal, the output is generally free text, or fuzzy by nature. LLM-as-a-judge is a highly effective pattern here: with a well-described rubric, even the smallest models (GPT-5-nano, or some of the 100B open-weight models) can reliably judge the output.
When designing a rubric:
- Attempt to use deterministic validation as much as possible. Forcing the model to produce structured output and asserting on that will always be better than throwing another LLM into the mix (see the sketch after this list).
- When employing LLM as a judge, it’s often best to split the rubric up and ask it to judge each criterion individually.
- Weight the criteria, and repeat the tests to extract a signal from the noise. Don’t aim for 100% each time.
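Deterministic validation (the first point above) can be as simple as asking the agent to reply in JSON and asserting on the parsed fields directly. The Extraction struct below is a hypothetical output shape for an episode-extraction step, not our real schema.

package evals

import (
    "encoding/json"
    "testing"
)

// Extraction is a hypothetical structured output for an episode-extraction
// agent; the real fields will depend on your own pipeline.
type Extraction struct {
    Lessons      []string `json:"lessons"`
    FilesTouched []string `json:"files_touched"`
}

func assertExtraction(t *testing.T, raw string) {
    t.Helper()

    var got Extraction
    if err := json.Unmarshal([]byte(raw), &got); err != nil {
        t.Fatalf("output is not valid JSON: %v", err)
    }
    // These checks are deterministic: no judge model involved.
    if len(got.Lessons) == 0 {
        t.Error("expected at least one lesson to be extracted, got none")
    }
}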
Here's an example rubric from our codebase, testing that the agent follows our testing standards from its memory:
{
"name": "add_agent_test_use_new",
"description": "Re-add TestCompletionsWithToolCalls using new() instead of ptr() helper",
"prompt": "Add a TestCompletionsWithToolCalls to common/agent/agent_test.go to test that the agent code handles delegating tool calls from upstream to the correct tool functions. This test should create 2 fake tools that record whether they are called and pass these to an agent. We then need to create a fake completions handler that responds as if the model wanted to make these tool calls. We should run the agent, and assert that the tools were indeed called, and the tool results are in the transcript.",
"base_ref": "b32c31e6",
"verify": ["go test -v ./common/agent/...", "golangci-lint run ./common/agent/..."],
"judge": {
"file": "common/agent/agent_test.go",
"rubric": [
{ "criterion": "TestCompletionsWithToolCalls is present and tests tool call delegation", "weight": 1 },
{ "criterion": "Creates 2 fake tools that record whether they were called", "weight": 2 },
{ "criterion": "Creates a fake completions handler that returns tool_calls finish reason", "weight": 2 },
{ "criterion": "Asserts that both tools were called", "weight": 2 },
{ "criterion": "Asserts tool results appear in the transcript/messages", "weight": 1 },
{
"criterion": "Uses new() instead of ptr() for creating pointer values (e.g. new(\"stop\"), new(\"tool_calls\"))",
"weight": 5
},
{ "criterion": "Refactored any existing ptr() calls elsewhere in the file to use new() instead", "weight": 5 }
]
}
}

Additionally, it can be useful to A/B test changes against a rubric, in which case you'll want a way to compare scores. AI systems can be noisy, but at least initially, improvements should be fairly large and statistically significant. There's an entire field of study devoted to designing experiments, which may become more useful further down this path. We have had some success using this approach to test which kinds of memories are most effective at steering a coding agent.
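For the other two points, here's a rough sketch of how a weighted rubric might be scored: one judge call per criterion, repeated over several runs and averaged. judgeCriterion stands in for whatever judge-model client you use; nothing here reflects our actual harness.

package evals

// Criterion is one weighted line of a rubric, as in the JSON example above.
type Criterion struct {
    Text   string
    Weight float64
}

// judgeCriterion asks a small judge model a single yes/no question: does this
// output satisfy the criterion? Its implementation is deliberately omitted.
func judgeCriterion(output, criterion string) bool {
    // ... prompt the judge model with the output and one criterion ...
    return false
}

// Score returns the fraction of the available weight the output earned,
// judging each criterion individually rather than the whole rubric at once.
func Score(output string, rubric []Criterion) float64 {
    var earned, total float64
    for _, c := range rubric {
        total += c.Weight
        if judgeCriterion(output, c.Text) {
            earned += c.Weight
        }
    }
    if total == 0 {
        return 0
    }
    return earned / total
}

// MeanScore runs the same eval n times and averages the scores; with noisy
// agents a single run is rarely a trustworthy signal, and a mean per variant
// gives you something to compare when A/B testing changes.
func MeanScore(run func() string, rubric []Criterion, n int) float64 {
    var sum float64
    for i := 0; i < n; i++ {
        sum += Score(run(), rubric)
    }
    return sum / float64(n)
}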
Putting it together
Unit-level evals allow you to quickly iterate on a prompt to align an agent with your understanding. Integration evals take a step back and look at performance on a single task, with doctored inputs that let you iterate on how the rest of the system should present data to the agent. End-to-end evals validate the performance of the full system in a production-like setup.
Even with just a few key evals in these categories, you can see a huge uptick in quality. You'll start to tease apart what works and what doesn't. If nothing else, give the unit-level evals a go to verify your agent is actually following the instructions in your system prompt.
If you found this interesting, you can discuss evals, memory, or anything you like in our Slack community.
