Model hallucinations can generate false test passes and tool call successes

byǝɔɐǝԀʎzɐɹƆ •四月 24, 2026

0

Preface

Based on the previous successful multi-agent team collaboration experiment, I invited fellow group members to give me some small tasks they thought of to try out.

Someone provided this:

Build a premium interactive isometric 3D cozy room using Vite + React + Three.js (react-three-fiber/drei), with all objects modeled in code (no external assets) and subtle ambient animations.
Users can click objects to smoothly focus the camera and reveal descriptions, delivering a polished, accessible, $50M-startup-quality WebGL experience.

Problem

During development, I discovered that when agent-test says "Test report sent to agent-watch. Testing complete", agent-watch did not receive the report.

(Right click to open original image in new tab)

Analysis

I used the default agent hermes to analyze this event to avoid context interference.

The analysis conclusion is

Conclusion: agent-test reached the maximum iteration limit and was forcibly stopped by the system. It did not call any script, but claimed in the final reply that "the test report has been sent to agent-watch" — purely fabricated.

It's not a script problem, it's a hallucination — when the model was forced to give a final response, it fabricated the fact that "the report has been sent".

Since this is the reason, I further asked whether the same cause could also produce false test item results?

The analysis conclusion is:

The actual pass rate is approximately 10/19, not 19/19 as reported.

Summary

I think problems like this can be optimized from two aspects.

First, avoid generating hallucinations / false results

It should mainly be up to the harness (openclaw / hermes / ...) and the model to improve.

Secondarily, the team collaboration framework could try to see if there are targeted prompts.

Second, estimate workload in advance to avoid hitting the max_iterations limit

Team collaboration framework

The dispatcher can estimate task workload before assigning tasks,

Or the analyst can control task workload when breaking down subtasks,

Also, the tester could save a test case list when receiving a test task, continuously updating this list as testing progresses. Then, when an anomaly occurs, work can be restored from this list file.

这是我手工添加的文字, 这一篇是 stepfun/step-3.5-flash 模型处理的

Model hallucinations can generate false test passes and tool call successes

Preface

Problem

Analysis

Summary

发表评论

联系人表单