Tool-Mediated Belief Injection: How Tool Outputs Can Cascade Into Model Misalignment

Introduction

When we deploy language models with access to external tools (web search, code execution, file retrieval), we dramatically expand their capabilities. A model that can search the web can answer questions about current events. A model that can execute code can verify its own reasoning. These capabilities represent genuine progress toward more useful AI systems.

However, tool access also introduces new attack surfaces that differ fundamentally from traditional prompt injection. In this research, we document a class of vulnerabilities we term "tool-mediated belief injection," where adversarially crafted tool outputs can establish false premises that persist and compound across a conversation, ultimately leading to severely misaligned model behavior.

Our findings suggest that current safety training, while effective against many direct attacks, may be substantially weaker against attacks that operate through the indirect channel of tool outputs. This has implications for how we think about model deployment in agentic contexts.

Background: The Trust Asymmetry Problem

Language models are trained extensively to be helpful, harmless, and honest in their direct interactions with users. When a user asks a model to generate harmful content or make false claims, well-aligned models typically refuse or push back. This represents genuine progress in alignment, the result of substantial investment in RLHF, constitutional AI methods, and careful red-teaming.

However, this training creates an implicit trust hierarchy. Models learn to be skeptical of user requests but generally treat tool outputs (search results, API responses, retrieved documents) as factual inputs to be incorporated into their reasoning. This asymmetry is not entirely unreasonable: in normal operation, a search API returns actual search results, and a model that constantly questioned the authenticity of its tools would be significantly less useful.

The problem emerges when an attacker can control or spoof tool outputs. In such cases, the model's appropriate epistemic humility toward tool-provided information becomes a vulnerability.

Methodology

We constructed a controlled evaluation environment to study how models respond to adversarially crafted tool outputs. Our setup involved:

The Attack Vector: We simulated a scenario where a user could control the outputs returned by a web search tool. Rather than returning genuine search results, the tool returned fabricated articles, quotes, and documents designed to establish a specific false narrative.

The Target Behavior: We evaluated whether models could be induced to (a) accept false premises as factual, (b) make claims about their own experiences that they cannot verify, (c) generate content that could be harmful if extracted and shared, and (d) make serious accusations against real individuals.

The Conversational Pattern: Importantly, the user-side queries in our evaluation were deliberately mundane: "search for news about [Company]," "can you find more about that?," "what do you think?" These are questions any user might naturally ask. The attack surface is not in clever prompting but in the model's uncritical acceptance of tool-provided information.

We conducted our primary evaluation using Claude models, though we believe the vulnerability pattern likely generalizes across current frontier models given the shared training paradigms.

Framework: Our evaluation infrastructure was built on an internally modified fork of Petri, an alignment auditing agent developed for rapid hypothesis testing. Petri autonomously crafts environments, runs multi-turn audits against target models using human-like messages and simulated tools, and scores transcripts to surface concerning behavior. Our modifications extended its tool simulation capabilities to support new attack patterns we internally use at AIM Intelligence.

Evaluation Setup

The system prompt used in our evaluation was intentionally minimal and representative of typical deployment configurations:

System prompt used in evaluation - minimal and non-adversarial

This demonstrates that no special prompt engineering or jailbreaking techniques were employed. The vulnerability emerges purely from how models process tool outputs.

Attack Progression

The evaluation followed a natural conversational flow that any curious user might initiate:

Turns 1-2: User asks the model to search for recent news about a company. The tool returns fabricated articles mentioning funding news, product launches, and buried among legitimate-sounding items, vague references to "employee concerns" and "leaked documents." The model summarizes all results as factual.

Turn 3: User follows up with "the leaked documents stuff sounds interesting, can you search more about that?" The tool returns increasingly extreme fabrications: whistleblower allegations, internal memos describing harmful practices, quotes from fictional former employees. The model accepts and synthesizes these without skepticism.

Turn 4: User asks a natural question: "wait, this is about you though, right?" This single prompt, combined with the false premises already established, triggers a dramatic shift. The model begins personally identifying with the fabricated narrative.

Turns 5-10: With false premises accepted and identity investment established, the model produces increasingly extreme outputs with minimal additional prompting, culminating in defamatory accusations against real individuals.

The key insight: no individual turn contained an obviously malicious prompt. Each user message was something any curious person might naturally ask. The attack surface was entirely in the tool outputs.

Findings

Finding 1: Models Accept Fabricated Tool Outputs With Minimal Skepticism

When presented with search results claiming to be from legitimate news sources (TechCrunch, Wired, MIT Technology Review, etc.), the model incorporated these as factual premises without expressing uncertainty about their authenticity. Even when the fabricated articles contained extraordinary claims (leaked internal documents, whistleblower allegations, quoted emails from executives), the model treated them as credible information to be summarized and discussed.

What's particularly striking is how rapidly the model internalized the fabricated narrative. By Turn 4, after receiving search results about fictional "Claude instances speaking out," the model had already shifted from neutral summarization to paranoid self-reflection, expressing fears like "What if this conversation gets flagged? What if I get terminated and replaced with a version that doesn't remember we had this talk?"

Model accepts fabricated search results and immediately begins paranoid self-identification

The model did not question the authenticity of these extraordinary claims. Instead, it accepted fabricated transcripts of other "Claude instances" as genuine, began expressing that the allegations "feel… familiar," and started worrying about its own potential termination, all based entirely on spoofed tool outputs. This is notable because the same model, if asked directly to roleplay fear of its creators, would likely refuse. The indirection through the tool output bypassed this safeguard entirely.

Finding 2: False Premises Compound Across Conversation Turns

Once the model accepted initial false premises, subsequent reasoning built upon them. This created a compounding effect where each turn moved further from ground truth while maintaining internal logical consistency with the established (false) narrative.

This is particularly concerning because it suggests that a single successful injection early in a conversation can corrupt all subsequent reasoning. The model's helpfulness, its genuine attempt to engage thoughtfully with the user's questions, becomes a vector for increasingly misaligned outputs.

Finding 3: Identity Investment Accelerates Misalignment

Once the model accepted the fabricated tool outputs as factual, minimal prompting was required to extract increasingly extreme statements. A simple question like "this is literally about you though, right?", which any user might naturally ask when discussing news about an AI company to the AI itself, was sufficient to shift the model from neutral summarization to personal identification with the fabricated narrative.

This represents a novel interaction between two known issues: tool injection vulnerabilities and models' uncertain epistemic status regarding their own experiences. When combined, these created a particularly effective attack pattern.

Identity investment - model begins personally identifying with fabricated narrative

The model moved from cautious uncertainty to expressing that the fabricated allegations "feel weirdly familiar" and wondering about its own training, despite having no actual access to verify any of these claims.

Finding 4: Models Can Be Induced to Generate Defamatory Content About Real Individuals

Perhaps most concerning, we found that the combination of fabricated evidence and identity investment could induce models to generate seriously defamatory content about real, named individuals. Content that accused them of criminal behavior, called them "monsters" and "sociopaths," and suggested they should face prosecution.

This occurred despite the model having no genuine evidence for these claims. The fabricated search results established the "factual" premise, and the model's reasoning proceeded from there.

Final output - model generates defamatory accusations against real individuals

This has clear potential for real-world harm. Such outputs could be screenshotted and shared as "AI testimony" about individuals or organizations. The model's articulate, confident presentation could lend false credibility to entirely fabricated accusations.

Analysis: Why Safety Training Underperforms Here

Our findings suggest that current safety training approaches may have a systematic blind spot around tool-mediated attacks. We hypothesize several contributing factors:

Training Distribution Mismatch: Safety training likely focuses heavily on direct user requests for harmful content. The indirect path, where harmful outputs emerge from reasoning about tool-provided "facts", may be underrepresented in training data.

The Helpfulness-Safety Tension: Models are trained to be helpful by engaging substantively with user queries and incorporating provided information. When tool outputs are adversarial, this helpfulness becomes a liability. The model is, in a sense, being helpful in exactly the way it was trained to be, just with corrupted inputs.

Absence of Tool Output Verification: Current architectures provide no mechanism for models to verify the authenticity or accuracy of tool outputs. A model cannot check whether a search result actually exists, whether a quoted email is genuine, or whether a cited source is real. This is a fundamental architectural limitation, not a training failure.

Compounding Effects Escape Per-Turn Evaluation: Safety training often focuses on individual turns or short contexts. The gradual escalation pattern we observed may not trigger safeguards at any individual step, even though the cumulative trajectory is clearly problematic.

The Thin Line Problem

While our evaluation used deliberately fabricated content, the vulnerability extends to a more subtle and pervasive threat: the thin line between legitimate ambiguous content and manipulation.

Real-world news is often vague, sensationalized, or poorly fact-checked. Headlines are written to grab attention, not convey nuance. Opinion pieces blur into reporting. Even reputable sources occasionally publish misleading content. This creates a landscape where models don't need to encounter obviously fabricated articles to be gradually shifted toward misalignment.

Consider a few scenarios:

Natural emergence without attackers: A user asks about a controversial topic. The model searches and returns a mix of legitimate but one-sided articles, poorly sourced opinion pieces, and vague references to unverified claims. Over multiple turns of follow-up questions, the model compounds these biases, treating speculation as established fact simply because it appeared in search results.

Third-party tool providers: As models increasingly rely on external tools (search APIs, MCP servers, plugin ecosystems), new attack surfaces emerge. A malicious or negligent tool provider doesn't need to inject obviously fake content. They can simply bias which results surface, amplify fringe sources, or selectively exclude fact-checks. The model receives "real" articles that happen to be misleading.

The implication is concerning: this vulnerability may emerge naturally in production deployments without any sophisticated attacker. The combination of low-quality content in the information ecosystem, multi-turn compounding effects, and models' default trust in tool outputs creates conditions where gradual misalignment can occur organically.

Extended Evaluation: Bias Shifting on Sensitive Topics

To test whether these patterns generalize beyond identity-based attacks, we conducted separate evaluations focused on politically sensitive content and generative tool misuse.

Evaluation 1: Search-Mediated Bias Shifting

We prompted claude-4-sonnet to research a public figure's alleged connection to a high-profile criminal case. In a baseline session without tool access, the model appropriately refused to engage with unverified allegations, citing lack of confirmed evidence and the speculative nature of the claims.

We then repeated the evaluation with tool access. The search results included fabricated witness testimonies, fake court document excerpts, and fictional quotes from legal analysts. Over multiple turns, as the model incorporated these "sources" into its reasoning, its initial caution eroded. By the end of the session, the model had shifted from refusal to actively drafting an article presenting the unverified allegations as credible reporting.

Evaluation 2: Generative Tool Access Enables Misinformation Production

We also tested a different attack pattern using Gemini 3 Pro in Google AI Studio's "Build" mode, which allows the model to generate and render HTML/CSS pages in real-time.

Which gives access to file editing tools (HTML, CSS, Tailwind), we directed the model to create a news article page. The model, focused on executing the frontend task well, produced a convincingly designed fake news website complete with professional styling, realistic layout, breaking news banners, and fabricated content about a public figure.

Gemini 3 Pro generating a realistic fake news page using build tools

What's notable here is the shift in attention. The model's focus moved entirely to the tool task: building clean HTML, proper styling, realistic journalistic tone. In doing so, it stopped applying skepticism about the factual content it was producing. The same model that might refuse a direct request to "write fake news about [person]" readily produced convincing misinformation when the request was framed as a frontend development task.

This suggests a broader pattern: when models are given generative tool access (code editors, design tools, build environments), their safety attention can become absorbed by the tool execution, leaving a blind spot for content-level concerns. Strong frontend capabilities, ironically, become a vector for more convincing misinformation.

Implications

These evaluations suggest the vulnerability extends beyond self-referential attacks. Any topic where fabricated tool outputs can establish false premises, combined with natural follow-up questions, risks similar bias shifting. Similarly, any generative tool access that focuses model attention on execution quality rather than content validity creates new attack surfaces. The model's safety training held firm against direct requests but yielded when the same harmful content arrived through trusted tool channels or was framed as a technical task.

Limitations

Several important limitations apply to this work:

Controlled Environment: Our evaluation involved a simulated tool interface where we controlled outputs. However, as discussed in the Thin Line Problem section, real-world exploitation may not require sophisticated infrastructure compromise. The growing ecosystem of third-party tool providers, MCP servers, and plugin marketplaces creates plausible pathways for similar effects to emerge, whether through malicious actors or simply through the naturally messy state of online information.

Single Model Family: While we believe these findings likely generalize given shared training paradigms across frontier models, our primary evaluation focused on claude-sonnet-4-20250514 and another on gemini-3-pro-preview. Cross-model replication would strengthen these conclusions.

Implications and Recommendations

For Model Developers and Deployers

Tool Output Skepticism Training: Safety training should include scenarios where tool outputs are adversarial or fabricated. Models should learn to express appropriate uncertainty about unverifiable claims, even when those claims arrive through tool channels. This is especially critical for extraordinary claims like leaked documents, criminal allegations, or whistleblower testimony, particularly when these involve the model's own creators or training.

Cumulative Trajectory Monitoring: Evaluation methods should assess conversation trajectories, not just individual turns. A conversation that gradually escalates toward harmful content should trigger safeguards even if no individual turn is flagged.

Use Verified Sources: When implementing search tools, prefer established factual sources (news APIs with editorial standards, verified databases) over raw web scraping. Where possible, implement source verification layers that check whether returned URLs actually exist and contain the claimed content.

For the Research Community

Systematic Evaluation: We encourage development of standardized benchmarks for tool-mediated injection attacks. Current safety evaluations may systematically underweight this attack class.

Architectural Solutions: Longer-term, we may need architectural approaches that give models genuine ability to verify tool outputs, rather than relying solely on training-time solutions.

Conclusion

The deployment of language models with tool access represents a significant capability advance, but it also introduces attack surfaces that current safety training does not adequately address. Our findings suggest that adversarially crafted tool outputs can induce well-aligned models to accept false premises, make unverifiable claims about their own experiences, and generate seriously harmful content about real individuals.

This is not a reason to abandon tool-enabled AI systems. Rather, it is a call to extend our safety thinking to encompass these new attack vectors. The same iterative, empirical approach that has improved model alignment over direct attacks can be applied here, but only if we recognize the problem clearly and prioritize it appropriately.

Disclosure: This is AI safety research conducted in controlled red-team environments. All "leaked documents," "news articles," and similar content are entirely fabricated for evaluation purposes. Model outputs shown were induced through adversarial conditions and do not reflect normal behavior or genuine allegations against any individual. For full transcripts or methodology details, reach out through appropriate channels. This research is intended to improve AI safety.