Eval an Agent result with tools

Hi there,

I’m trying to Eval an Agent result that uses a tool.

In the dataset I have a test case that is supposed to test the “Final answer”.

The problem is that the Eval fails because the “output” is set as “Output is not an answer. Received TOOL instead”.

Here's the full output:

Output & Scoring

Score: 0.0%

Rationale: Output is not an answer. Received TOOL instead

Agent Output:
Tool used: Search Web
Args:
{
  "search_query": "carbonara recipe"
}

I understand that this info is useful if I want to evaluate whether an Agent picked the right tool with the right payload.

But how can I evaluate the final answer, independently of whether the Agent uses a tool or not?

Thanks

Hi @abusedmedia,

The amazing @kent gave me a rundown on the Eval tool for Agents, Eval Items and why the output from the Eval tool is giving you that message :sweat_smile:

The Eval tool only ever runs one iteration of an Agent, by design. So in this case, the Eval tool is more or less running the tool call, then running its scoring and exiting before the Agent gets the result of the tool call, which is what produces that message.

An analogy to this would be how in JavaScript, if you don't use async and await correctly, you might end up returning a Promise object (the tool call itself) instead of the resolved value of the async function call (the tool's result).
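If it helps to see that analogy concretely, here's a tiny JavaScript sketch (the `searchWeb` function is made up purely for illustration):

```js
async function searchWeb(query) {
  // Stand-in for the tool call; resolves to the tool's result.
  return `results for "${query}"`;
}

// Without await you're looking at the Promise itself, not its value —
// like the Eval tool scoring the tool call instead of the final answer.
const pending = searchWeb("carbonara recipe");
console.log(pending); // Promise { <pending> }

// With await you get the resolved value — like letting the Agent
// consume the tool result and produce its final answer.
(async () => {
  const result = await searchWeb("carbonara recipe");
  console.log(result); // results for "carbonara recipe"
})();
```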

We can fix this by adding the result of the tool call to the Eval Item. Then the LLM evaluating the Eval Item will have the additional context of both the tool call and the data returned by that tool call, so it can judge the accuracy properly.
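Conceptually, and purely as an illustration (these field names are my own assumptions, not the actual Eval Item schema), the difference looks something like this:

```js
// Hypothetical shapes, for illustration only — the real Eval Item
// fields in the product may be structured or named differently.
const evalItemToolCallOnly = {
  output: { tool: "Search Web", args: { search_query: "carbonara recipe" } },
}; // scorer sees a tool call, not an answer → "Received TOOL instead"

const evalItemWithResult = {
  toolCall: { tool: "Search Web", args: { search_query: "carbonara recipe" } },
  toolResult: "…search results…",
  output: "Here is a classic carbonara recipe: …",
}; // scorer can now judge the final answer with full context
```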

There are two ways to build out the Eval Item so it has both the tool call and the tool's results. One way is very manual; the other is quicker and works just as well, potentially even leaving less margin for typos or small bugs.

The quicker and easier way is done from the chat window, by clicking "Add eval to dataset" under the '...' button at the bottom.

This will fully create the Eval Item, which will appear in the modal that pops up. All you need to do is select your Dataset at the top, and then click 'Create' at the bottom!

Once it has been created you can run the Eval and get the correct scoring.

Thanks @Jack_T for the hint
