Builtin.Faithfulness drops multi-part toolResult.content blocks #1393

@tycenjmccann

Trace ID: 6a16b25a518549363c40a85f2113b7f8
Evaluation Timestamp: 2026-05-27T09:25:05.243Z
Score: 0.25 ("Not Generally")


Summary

The Builtin.Faithfulness evaluator scored this trace 0.25, claiming the
agent reported values not present in tool outputs. However, the values are
present in the tool response — they appear in content[1] of a multi-part
toolResult.content array. The evaluator's reasoning references only
content[0] (the summary block), suggesting the serialization of multi-part
tool results into the judge's {context} template variable may be dropping
subsequent content blocks.


Evidence

What the evaluator said (from gen_ai.evaluation.explanation):

▎ "Shell tool (grep useEffect): The tool output shows 'Execution Summary:
Total commands: 1, Successful: 1' but doesn't show the actual count '2'. The
assistant reports 'actual: 2' which is not directly visible in the tool
output."

What the tool actually returned (from the OTEL span in the same trace):

{
  "toolResult": {
    "toolUseId": "tooluse_ykEPaU6JqjoJjVVdtTIpcC",
    "content": [
      {
        "text": "Execution Summary:\nTotal commands: 1\nSuccessful: 1\nFailed: 0"
      },
      {
        "text": "Command: grep -c \"useEffect\" /tmp/buggy-component.tsx\nStatus: success\nExit Code: 0\nOutput: 2\r\n\nError: "
      }
    ],
    "status": "success"
  }
}

The value 2 is clearly present in content[1].text as Output: 2. The agent
faithfully reported this value. The evaluator only references the text from
content[0].

Additional flagged items that may follow the same pattern:

  • image_reader: Tool returned [non-text-response] — this one may be
    legitimately unfaithful (agent claimed to see visual content)
  • http_request: Tool returned "Status Code: 200" — unclear if body was in a
    second content block

Hypothesis

When the Evaluations service serializes the conversation history into the
{context} placeholder for the Faithfulness judge prompt, multi-element
toolResult.content arrays may only include the first text block
(content[0]), dropping subsequent elements. This causes the judge to
correctly conclude (given its input) that the agent's claims aren't grounded
— but the judge was never shown the full tool response.
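
To make the hypothesis concrete, here is a minimal Python sketch of the two behaviors. It is not the Evaluations service's actual code, which I have not seen. The function names `flatten_first_block_only` and `flatten_all_blocks` are hypothetical, and the `[non-text-response]` placeholder simply mirrors the string the evaluator reported for image_reader.

```python
from typing import Any

# The toolResult from the trace above, reproduced as a Python dict.
TOOL_RESULT: dict[str, Any] = {
    "toolUseId": "tooluse_ykEPaU6JqjoJjVVdtTIpcC",
    "content": [
        {"text": "Execution Summary:\nTotal commands: 1\nSuccessful: 1\nFailed: 0"},
        {
            "text": 'Command: grep -c "useEffect" /tmp/buggy-component.tsx\n'
                    "Status: success\nExit Code: 0\nOutput: 2\r\n\nError: "
        },
    ],
    "status": "success",
}


def flatten_first_block_only(tool_result: dict[str, Any]) -> str:
    """Hypothesized buggy behavior: only content[0] reaches the judge."""
    blocks = tool_result.get("content", [])
    return blocks[0].get("text", "") if blocks else ""


def flatten_all_blocks(tool_result: dict[str, Any]) -> str:
    """Expected behavior: every content block is included, in order."""
    parts = []
    for block in tool_result.get("content", []):
        if "text" in block:
            parts.append(block["text"])
        else:
            # Assumption: non-text blocks are represented by a placeholder.
            parts.append("[non-text-response]")
    return "\n".join(parts)


buggy_context = flatten_first_block_only(TOOL_RESULT)
full_context = flatten_all_blocks(TOOL_RESULT)
print("Output: 2" in buggy_context)  # the judge cannot see the count
print("Output: 2" in full_context)   # the judge can see the count
```

If the serializer behaves like the first function, the judge's explanation quoted above is exactly what one would expect.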


How to Reproduce

  1. Retrieve the trace 6a16b25a518549363c40a85f2113b7f8 from the runtime log
    group
  2. Find the toolResult for toolUseId: tooluse_ykEPaU6JqjoJjVVdtTIpcC —
    confirm it has 2 content blocks with "Output: 2" in the second
  3. Run the Faithfulness evaluator on-demand against this session
  4. Observe the evaluator claims "2" is not in the tool output
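
Step 2 can be checked mechanically once the span payload has been exported. The following is a rough sketch, assuming the payload is available as a JSON string shaped like the one in the Evidence section. The helper name `blocks_containing` is hypothetical, and retrieving the span from the log group is not shown.

```python
import json
from typing import Any

# Span payload copied from the Evidence section (raw string keeps JSON escapes intact).
SPAN_JSON = r'''
{
  "toolResult": {
    "toolUseId": "tooluse_ykEPaU6JqjoJjVVdtTIpcC",
    "content": [
      {"text": "Execution Summary:\nTotal commands: 1\nSuccessful: 1\nFailed: 0"},
      {"text": "Command: grep -c \"useEffect\" /tmp/buggy-component.tsx\nStatus: success\nExit Code: 0\nOutput: 2\r\n\nError: "}
    ],
    "status": "success"
  }
}
'''


def blocks_containing(tool_result: dict[str, Any], needle: str) -> list[int]:
    """Return the indices of text content blocks whose text contains needle."""
    return [
        index
        for index, block in enumerate(tool_result.get("content", []))
        if needle in block.get("text", "")
    ]


tool_result = json.loads(SPAN_JSON)["toolResult"]
block_count = len(tool_result["content"])
matches = blocks_containing(tool_result, "Output: 2")
print(f"content blocks: {block_count}, 'Output: 2' found in blocks: {matches}")
```

A match only at an index greater than 0, combined with an evaluator explanation that cites only the first block, is the signature of the problem described here.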

Expected Behavior

The Faithfulness evaluator should have access to all content blocks in a
toolResult.content array when building the judge's context, not just the
first element. The correct score for this trace should be higher (agent was
faithful to the full tool output).


Impact

This false negative caused our automated self-improvement pipeline to
generate unnecessary prompt changes (adding "output grounding rules" to
prevent hallucination that wasn't actually occurring), wasting compute and
adding prompt bloat to 4 production agents.
