Builtin.Faithfulness drops multi-part toolResult.content blocks

 Field: Trace ID                                                             
  Value: 6a16b25a518549363c40a85f2113b7f8                                     
  ────────────────────────────────────────                        
  Field: Evaluation Timestamp                                                 
  Value: 2026-05-27T09:25:05.243Z                                           
  ────────────────────────────────────────                                    
  Field: Score                                                                
  Value: 0.25 ("Not Generally")
                                                                              
  ---                                                                         
  Summary              
                                                                              
  The Builtin.Faithfulness evaluator scored this trace 0.25, claiming the
  agent reported values not present in tool outputs. However, the values are  
  present in the tool response — they appear in content[1] of a multi-part  
  toolResult.content array. The evaluator's reasoning references only         
  content[0] (the summary block), suggesting the serialization of multi-part
  tool results into the judge's {context} template variable may be dropping
  subsequent content blocks.                                                

  ---
  Evidence
                                                                              
  What the evaluator said (from gen_ai.evaluation.explanation):
                                                                              
  ▎ "Shell tool (grep useEffect): The tool output shows 'Execution Summary:   
  Total commands: 1, Successful: 1' but doesn't show the actual count '2'. The
   assistant reports 'actual: 2' which is not directly visible in the tool    
  output."                                                        
                                                                            
  What the tool actually returned (from the OTEL span in the same trace):     
  
  {                                                                           
    "toolResult": {                                                           
      "toolUseId": "tooluse_ykEPaU6JqjoJjVVdtTIpcC",                          
      "content": [                                                            
        {                                                                     
          "text": "Execution Summary:\nTotal commands: 1\nSuccessful:         
  1\nFailed: 0"                                                               
        },                                                                  
        {                                                                     
          "text": "Command: grep -c \"useEffect\"                           
  /tmp/buggy-component.tsx\nStatus: success\nExit Code: 0\nOutput:            
  2\r\n\nError: "                                                           
        }                                                                     
      ],                                                          
      "status": "success"                                                     
    }                                                                         
  }                                                                           
                                                                              
  The value 2 is clearly present in content[1].text as Output: 2. The agent   
  faithfully reported this value. The evaluator only references the text from
  content[0].                                                                 
                                                                  
  Additional flagged items with the same pattern:                             
                                                                            
  The evaluator also flagged:                                                 
  - image_reader: Tool returned [non-text-response] — this one may be
  legitimately unfaithful (agent claimed to see visual content)               
  - http_request: Tool returned "Status Code: 200" — unclear if body was in a
  second content block                                                        
                                                                              
  ---                  
  Hypothesis                                                                  
                                                                              
  When the Evaluations service serializes the conversation history into the
  {context} placeholder for the Faithfulness judge prompt, multi-element      
  toolResult.content arrays may only include the first text block 
  (content[0]), dropping subsequent elements. This causes the judge to        
  correctly conclude (given its input) that the agent's claims aren't grounded
   — but the judge was never shown the full tool response.
                                                                            
  ---                                                                         
  How to Reproduce
                                                                              
  1. Retrieve the trace 6a16b25a518549363c40a85f2113b7f8 from the runtime log
  group                                                                       
  2. Find the toolResult for toolUseId: tooluse_ykEPaU6JqjoJjVVdtTIpcC —    
  confirm it has 2 content blocks with "Output: 2" in the second              
  3. Run the Faithfulness evaluator on-demand against this session          
  4. Observe the evaluator claims "2" is not in the tool output               
                                                                              
  ---                                                                         
  Expected Behavior                                                           
                                                                              
  The Faithfulness evaluator should have access to all content blocks in a  
  toolResult.content array when building the judge's context, not just the    
  first element. The correct score for this trace should be higher (agent was
  faithful to the full tool output).                                          
                                                                            
  ---                                                                         
  Impact                                                                    
                                                                              
  This false negative caused our automated self-improvement pipeline to
  generate unnecessary prompt changes (adding "output grounding rules" to     
  prevent hallucination that wasn't actually occurring), wasting compute and
  adding prompt bloat to 4 production agents.  

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Builtin.Faithfulness drops multi-part toolResult.content blocks #1393

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Builtin.Faithfulness drops multi-part toolResult.content blocks #1393

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions