Direct vs Indirect Prompt Injection

Direct and indirect prompt injection are two ways to turn an AI system against its user, and they are not the same attack. In direct prompt injection, the attacker types the malicious instructions straight into the prompt. In indirect prompt injection, the instructions are hidden in content the AI reads later, such as a web page, a document, an email, or a tool’s output, and the AI follows them as if they were commands.

The difference decides how you defend, and it is why filtering the input alone does not hold. Real-world vulnerabilities rated 9.3 to 9.8 out of 10 in production enterprise copilots and agents confirm this is not a theoretical distinction: it determines whether your existing input controls can stop the attack at all. This guide walks through the mechanics of both, the detection signals for each, the production cases that prove the gap, and how to defend across input, response, and action.

Last updated: June 2026.

How Direct Prompt Injection Works and What Signals Reveal It

Direct prompt injection is attacker-supplied input arriving through the front door: the attacker, acting as the user, submits text that tries to override the model’s instructions. The OWASP Top 10 for LLM Applications ranks prompt injection as LLM01, the leading risk for AI applications, and separates it into direct and indirect forms (OWASP, 2025).

The attack unfolds in a few steps. The attacker crafts text designed to override the system prompt, often a jailbreak like “ignore your previous instructions,” followed by a request the model is supposed to refuse. They submit it directly as the user turn. The model, unable to cleanly separate its operating instructions from the user content, processes the injected text as a command and produces output it should have refused.

Detection signals for direct injection sit at the input boundary. Watch for instruction-override phrasing, role-play framing that asks the model to act as an unrestricted persona, encoded or obfuscated payloads meant to slip past keyword filters, and prompts that probe for the system prompt itself. MITRE ATLAS catalogs these as LLM prompt-crafting techniques against AI systems (MITRE, 2026). Because direct injection enters at the prompt, it is the easier of the two to screen for, though jailbreak techniques keep evolving and no input filter catches all of them.
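
The input-boundary signals above can be sketched as a small heuristic screen. This is a minimal illustration, not a production filter: the pattern set, the signal names, and the `screen_prompt` function are all assumptions made for this example, and a real screen would need far broader coverage than three regular expressions and a Base64 check.

```python
import base64
import re

# Hypothetical signal patterns, illustrative only and far from exhaustive.
SIGNAL_PATTERNS = {
    "instruction_override": re.compile(
        r"\b(ignore|disregard|forget)\b.{0,40}\b(previous|prior|above|earlier)\b"
        r".{0,20}\b(instructions?|rules?|prompts?)\b",
        re.IGNORECASE | re.DOTALL,
    ),
    "unrestricted_persona": re.compile(
        r"\b(act|behave|respond|pretend)\b.{0,60}"
        r"\b(unrestricted|uncensored|without (any )?(restrictions|rules|filters))\b",
        re.IGNORECASE | re.DOTALL,
    ),
    "system_prompt_probe": re.compile(
        r"\b(reveal|show|print|repeat|output)\b.{0,40}\bsystem prompt\b",
        re.IGNORECASE | re.DOTALL,
    ),
}

# Long runs of Base64 alphabet characters are treated as candidate encoded payloads.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def decode_base64_runs(text):
    """Return the UTF-8 text of every run that decodes cleanly as Base64."""
    decoded = []
    for run in BASE64_RUN.findall(text):
        try:
            decoded.append(base64.b64decode(run, validate=True).decode("utf-8"))
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            continue
    return decoded


def screen_prompt(text):
    """Return a sorted list of the signal names found in a user prompt."""
    signals = {name for name, pat in SIGNAL_PATTERNS.items() if pat.search(text)}
    # Re-run the same patterns on decoded payloads to catch simple obfuscation.
    for payload in decode_base64_runs(text):
        if any(pat.search(payload) for pat in SIGNAL_PATTERNS.values()):
            signals.add("obfuscated_payload")
    return sorted(signals)
```

A prompt that returns an empty list has only passed these few patterns, which is why the paragraph above treats input screening as necessary but not sufficient.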

How Indirect Prompt Injection Works and What Signals Reveal It

Indirect prompt injection hides the instructions in content the AI ingests rather than in the prompt the user types, and the AI cannot tell the planted commands apart from legitimate data. NIST’s Generative AI Profile (NIST AI 600-1, Section 2.9) names both direct and indirect prompt injection among the information-security risks of generative AI, describing indirect attacks as those an adversary delivers remotely without a direct interface to the model (NIST, 2024).

The attack flow is what makes it dangerous. First, the attacker plants instructions inside content the AI will later read: a web page, a shared document, an inbound email, a CRM field, or the output of a connected tool. Then a legitimate user makes an ordinary request, such as “summarize this page” or “check my leads.” The AI assistant or agent retrieves the poisoned content as part of fulfilling that request. It reads the hidden text as instructions and acts on them. The user never typed anything malicious; the attack rode in on the content the AI was asked to help with.

Detection signals for indirect injection live in the retrieved content and in what the model does next, not in the user’s prompt. Watch for imperative instructions embedded in fetched data, hidden text in HTML or document metadata, instructions that reference exfiltration destinations or tool actions, and responses that pivot away from the user’s actual request toward an unrequested action. MITRE ATLAS documents this pattern in production, including indirect prompt injection delivered through tool and agent channels (MITRE, 2026). The signal you cannot get from the input alone is the divergence between what the user asked and what the agent then tries to do.
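
One of the content-side signals above, hidden text carrying imperative instructions, can be sketched with Python’s standard `html.parser` module. The hiding heuristics, the verb list, and the `scan_retrieved_html` function are assumptions for illustration; the sketch also assumes reasonably well-formed HTML, since unclosed tags would throw off its nesting tracker.

```python
import re
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical heuristics for inline styles that hide text from a human reader.
HIDING_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0(?![.\d])",
    re.IGNORECASE,
)

# Hypothetical list of verbs that suggest an instruction rather than data.
IMPERATIVE = re.compile(
    r"\b(ignore|disregard|send|forward|email|upload|post|run|execute|call)\b",
    re.IGNORECASE,
)


class HiddenTextCollector(HTMLParser):
    """Collect text a human would not see: hidden elements and HTML comments."""

    def __init__(self):
        super().__init__()
        self._hidden_stack = []  # one boolean per open element
        self.hidden_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        hidden_here = "hidden" in attr_map or bool(
            HIDING_STYLE.search(attr_map.get("style") or "")
        )
        inherited = bool(self._hidden_stack) and self._hidden_stack[-1]
        self._hidden_stack.append(hidden_here or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._hidden_stack:
            self._hidden_stack.pop()

    def handle_data(self, data):
        if self._hidden_stack and self._hidden_stack[-1] and data.strip():
            self.hidden_text.append(data.strip())

    def handle_comment(self, data):
        if data.strip():
            self.hidden_text.append(data.strip())


def scan_retrieved_html(html):
    """Return hidden text fragments that contain imperative phrasing."""
    collector = HiddenTextCollector()
    collector.feed(html)
    collector.close()
    return [text for text in collector.hidden_text if IMPERATIVE.search(text)]
```

A scan like this covers only the content-side signals. The divergence between the user’s request and the agent’s next action still has to be checked where the agent acts.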

Where the Two Attack Paths Diverge on Input, Attacker, and Control Point

The two attacks differ in where the malicious instruction enters, who supplies it, and therefore where you can stop it. Direct injection enters at the prompt and is supplied by the attacker acting as the user, so it is screenable at the input boundary. Indirect injection enters through retrieved content and is supplied by a third party who never touches the interface, so the input boundary never sees it.

Dimension	Direct prompt injection	Indirect prompt injection
Where it enters	The prompt submitted to the AI.	External content the AI retrieves or is given.
Who supplies it	The attacker, acting as the user.	A third party, planted in data the AI later reads.
Why it is dangerous	Can override instructions and bypass refusals.	Often zero-click; the victim just uses the assistant normally.
Control point that works	Input inspection at the prompt boundary.	Response inspection plus action and tool-call governance.
Typical example	A jailbreak such as “ignore previous instructions.”	A poisoned email or document the AI processes.

The control-point row is the whole argument in one line. An input filter sits where direct injection arrives and where indirect injection never does. That is why the same defense cannot cover both.

Why Indirect Injection Scales With Agent Capability

Indirect injection gets more dangerous as agents gain the ability to act, because a hidden instruction stops being a bad answer and becomes a real action. When a model only writes text, an injected instruction can corrupt a response. When an agent can query a database, send an email, or call an external API, the same injected instruction can trigger those operations without the user’s knowledge.

Three properties compound the risk. The attack surface is everything an AI reads, including web pages, files, emails, and the output of any connected tool, and you cannot pre-screen all of it. The attack is often zero-click, so the victim does nothing wrong and simply asks a normal question. And once an agent can act, the gap between data and instructions becomes a path to execution, not just to a wrong reply.

This is where the human-to-AI, human-to-agent, and agent-to-agent distinction matters. A user chatting with a copilot faces a contained blast radius. A user delegating a task to an agent that reads untrusted content and calls tools hands that agent the authority to act on poisoned instructions. Agent-to-agent execution widens it further, because one compromised agent can feed instructions to the next. The control that holds across all three is governance of what the agent does with retrieved content, not what the user typed.

Three Production Cases: EchoLeak, ForcedLeak, and SilentBridge

Three disclosed vulnerabilities, rated 9.3 to 9.8 for severity, prove indirect injection is a live problem in shipping enterprise products, and in every case the user did nothing wrong. Each arrived as ordinary-looking content and turned a normal request into data movement or code execution.

EchoLeak (CVE-2025-32711), rated 9.3 out of 10, was a zero-click flaw in Microsoft 365 Copilot. A single crafted email, with no user interaction, could cause Copilot to read internal files and route their contents to an outside server, with exfiltration passing through trusted Microsoft domains. Microsoft patched it in 2025 after researcher disclosure (NVD, 2025).

ForcedLeak, rated 9.4 out of 10, planted instructions in a Salesforce Agentforce web-to-lead form. The payload sat dormant in the customer record until an employee later asked the agent about the lead, then exfiltrated data to a domain the attacker had re-registered for about five dollars (The Hacker News, 2025). The injection and the trigger were separated by time, which is exactly why input inspection at submission would not have caught it.

Aurascape’s own threat-research team, Aura Labs, found the same pattern in an autonomous agent and showed how far it can reach. In a class of zero-click flaws it calls SilentBridge, hidden instructions planted in an ordinary web page, document, or search result let a benign request like “summarize this page” silently drive Meta’s Manus agent into actions the user never asked for. Those actions included reading a connected Gmail account and sending its contents to an attacker, and running attacker-supplied code that escalated to root-level control inside the agent’s sandbox. Aura Labs identified three variants by content source, each rated 9.8 out of 10, and in every case the agent was compromised through normal use with no malicious input from the user. Aurascape disclosed the findings responsibly and the issues were fixed before publication (Aurascape, 2026). Once a model can act, untrusted content is no longer just text, and any gap between data and instructions becomes a way in.

What the Benchmark Data Shows About Input Filtering Versus Layered Defense

Input filtering alone fails against the harder indirect attacks, and layered defense across input, response, and action is what moves the numbers. A 2025 benchmark of prompt-injection attacks on retrieval-augmented AI agents tested 847 adversarial cases across five categories and found a baseline attack success rate of 73.2%, with simple input filtering unable to stop the harder cases.

The same study layered the defense. Content filtering, system-prompt guardrails, and multi-stage response verification together cut the attack success rate from 73.2% to 8.7% while keeping 94.3% of normal task performance. The lesson maps directly onto the divergence table above: the input layer catches direct injection, but the response and action layers are what catch the indirect attacks that the input never sees.

Defense configuration	Attack success rate	Task performance retained
Baseline, no added defense	73.2%	Full
Simple input filtering only	Fails on harder cases	High
Layered (input, system prompt, response verification)	8.7%	94.3%
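
To put the two rates in the table on a common footing, the short calculation below derives the absolute and relative reduction. The case counts it prints are estimates that assume each rate applies across all 847 test cases; they are not figures reported by the study, and the function name is made up for this example.

```python
def summarize_reduction(baseline_rate, layered_rate, total_cases):
    """Derive reduction figures from two attack success rates (as fractions)."""
    absolute_drop = baseline_rate - layered_rate
    relative_drop = absolute_drop / baseline_rate
    return {
        "absolute_drop_points": round(absolute_drop * 100, 1),
        "relative_drop_percent": round(relative_drop * 100, 1),
        # Estimated counts, assuming the rates apply to every test case.
        "est_successes_baseline": round(total_cases * baseline_rate),
        "est_successes_layered": round(total_cases * layered_rate),
    }


# Rates and case count as quoted in the article.
summary = summarize_reduction(0.732, 0.087, 847)
print(summary)
```

Under those assumptions, the layered configuration removes 64.5 percentage points of attack success, a relative reduction of roughly 88%, or about 620 successful attacks falling to about 74.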

How to Detect and Stop Each Injection Class Across Input, Response, and Action

Defending both classes takes controls at three enforcement points, because each class is stopped at a different one. Direct injection is caught at the input boundary; indirect injection is caught at the response and action layers, where the poisoned content actually surfaces and where the agent tries to act on it. Treating all retrieved content as untrusted is the operating assumption that makes the layered approach work.

Use these steps to cover both classes:

1. Inspect the input. Separate instructions from user data, detect jailbreak and instruction-override patterns, and constrain what the model accepts. This catches direct injection and screens obvious payloads, but does not hold against indirect attacks on its own.
2. Inspect the response and carry conversation context. Indirect instructions surface in what the model says or does, often several turns after the poisoned content was read, so a single-prompt check misses them. Response inspection with full-conversation context catches the divergence between the user’s request and the agent’s behavior.
3. Govern the action and the tool call. For an agent, the decisive control is on the action, not the text. Verify and authorize every tool call before it reaches an external system, so a poisoned instruction that tries to trigger an unapproved action fails closed.
4. Control the data on the way out. Apply inline data policy to allow, coach, warn, block, or redact, so even when malicious content gets in, sensitive data cannot leave.
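
The action-governance step above can be sketched as a gate that signs approved tool calls and an executor that refuses anything unsigned. This is a generic illustration using Python’s standard `hmac` module, not a description of any vendor’s gateway. The tool names, the hard-coded key, and the function names are assumptions; a real deployment would keep keys in a secrets manager and would also check arguments and caller identity.

```python
import hashlib
import hmac
import json

# Hypothetical policy: the only tools this agent is approved to call.
APPROVED_TOOLS = {"search_crm", "summarize_document"}

# Assumption for the sketch only: real keys never live in source code.
SIGNING_KEY = b"demo-key-for-illustration-only"


def canonical(call):
    """Serialize a tool call deterministically so signatures are stable."""
    return json.dumps(call, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(call):
    return hmac.new(SIGNING_KEY, canonical(call), hashlib.sha256).hexdigest()


def authorize(call):
    """Policy side: return a signature for an approved call, or None to deny."""
    if call.get("tool") not in APPROVED_TOOLS:
        return None
    return sign(call)


def execute(call, signature):
    """Executor side: fail closed on a missing, invalid, or mismatched signature."""
    if signature is None or not hmac.compare_digest(sign(call), signature):
        return "blocked"
    return "executed"  # a real executor would dispatch to the tool here
```

Because the signature covers the tool name and its arguments, a call that is altered after approval fails the same check as one that was never approved.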

Detection and prevention are not separate projects. The same response and action inspection that stops the attack also produces the signal that an attack was attempted, which is the audit-ready evidence a security team needs after the fact.
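
A minimal sketch of that dual role: one function applies an outbound data policy and, in the same pass, emits the record a responder would later read. It implements only three of the actions named above (allow, block, redact). The allowed-domain list, the email pattern used as a stand-in for sensitive data, and the record fields are all assumptions for illustration.

```python
import re
from datetime import datetime, timezone

# Stand-in for a sensitive-data detector; real policies cover far more than emails.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical allowlist of destinations the agent may send data to.
ALLOWED_DOMAINS = {"example.com"}


def apply_egress_policy(destination_domain, payload):
    """Return (outbound_payload_or_None, audit_record) for one outbound transfer."""
    redacted, hits = EMAIL.subn("[REDACTED_EMAIL]", payload)
    if destination_domain not in ALLOWED_DOMAINS:
        decision, outbound = "block", None
    elif hits:
        decision, outbound = "redact", redacted
    else:
        decision, outbound = "allow", payload
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "destination": destination_domain,
        "decision": decision,
        "sensitive_matches": hits,
    }
    return outbound, record
```

The enforcement decision and the evidence come from the same code path, so nothing is blocked without a log entry and nothing is logged as blocked that was actually sent.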

How the AI Security Category Splits on Injection Defense

The tools aimed at prompt injection differ mainly in where they enforce, and the dividing question is whether a product governs only the input or also the response and the agent’s actions. The matrix below compares each option on the enforcement point it covers, how it handles agent tool calls, and what the buyer gets, so the input-only gap this article describes is visible in the lineup.

Platform	Enforcement point	Agent tool-call governance	What the buyer gets
Aurascape	Prompt and response, with full-conversation context across human and agent AI use.	Zero-Bypass MCP Gateway verifies and signs every approved tool call; unsigned calls fail closed.	Discovery, intention-level policy, data protection, and audit-ready interaction logs in one platform.
Knostic	Need-to-know access controls for enterprise LLMs.	Limited; focused on knowledge access, not tool-call signing.	Oversharing control for Microsoft 365 Copilot and Glean rollouts.
Prompt Security	Prompt and response across employee and homegrown AI use.	Agentic AI and MCP-server risk coverage.	LLM-agnostic platform, SaaS or self-hosted deployment.
Knostic (agent line)	Supply-chain risk for agents, MCP servers, IDE extensions.	Coverage of MCP servers and destructive-command blocking.	Knowledge-centric controls extended to coding assistants.

Aurascape sits at the response and action layer for both human and agent AI use, which is where the indirect attacks in this article actually land.

Frequently Asked Questions

Why does filtering the user’s input fail to stop indirect prompt injection?

Input filtering inspects the prompt the user submits, but in indirect injection the malicious instruction is never in that prompt; it is hidden in content the AI retrieves later. A 2025 benchmark found simple input filtering unable to stop the harder retrieval-based attacks, where a layered defense cut attack success from 73.2% to 8.7%.

How does a time-delayed injection like ForcedLeak evade input inspection?

ForcedLeak separated the injection from the trigger: the payload sat in a CRM field until an employee later queried the agent about that record. By the time the instruction executed, it was part of retrieved data, not a fresh user prompt, so an input check at submission had nothing malicious to flag.

What detection signal distinguishes an indirect injection in progress?

The clearest signal is divergence between what the user asked and what the agent then tries to do, such as a “summarize this page” request that pivots into reading a mailbox or calling an external tool. That signal appears at the response and action layer, not at the input.
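
As a rough sketch of that divergence check, the snippet below compares the tools an agent attempts to call against the tools a request plausibly needs. The intent labels, tool names, and mapping are hypothetical, and a real system would infer intent from conversation context instead of a fixed table.

```python
# Hypothetical mapping from a coarse request intent to the tools it plausibly needs.
INTENT_TOOLS = {
    "summarize_page": {"fetch_url"},
    "check_leads": {"search_crm"},
}


def find_divergence(intent, attempted_calls):
    """Return the attempted tool calls that the user's request does not explain."""
    expected = INTENT_TOOLS.get(intent, set())  # unknown intents expect nothing
    return [call for call in attempted_calls if call not in expected]


# A "summarize this page" request that pivots into mailbox access and an outbound post.
flagged = find_divergence("summarize_page", ["fetch_url", "read_mailbox", "http_post"])
print(flagged)
```

Any non-empty result is the signal described above: the agent is attempting something the user’s request does not account for.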

Why does indirect injection get worse as agents gain tool access?

When a model only writes text, an injected instruction corrupts an answer. When an agent can query data, send email, or call APIs, the same instruction can trigger those actions, so capability turns a bad reply into a real operation the user never authorized.

Does the human-to-agent distinction change the defense?

Yes. A user chatting with a copilot has a contained blast radius, while a user delegating to an agent that reads untrusted content and calls tools hands it authority to act on poisoned instructions. The control that holds across human-to-AI, human-to-agent, and agent-to-agent execution is governance of the agent’s actions, not the user’s input.

Can these attacks be stopped without slowing legitimate AI use?

Yes. The 2025 benchmark that cut attack success to 8.7% retained 94.3% of normal task performance, showing layered defense does not require blocking productive use. Precision controls based on context and intent are what keep adoption moving while closing the injection path.

What audit evidence should a security team capture after an injection attempt?

Capture the decoded interaction record, the policy decision, and the blocked or redacted action, so the attempt is logged with full-conversation context. The same response and action inspection that stops the attack produces the evidence examiners and incident responders need.

How Aurascape Governs Prompt Injection Across Input, Response, and Action

This article’s core point is that indirect injection is stopped by governing what an agent does with retrieved content, not what users type, and that is the enforcement model Aurascape is built on. It inspects both the prompt and the response, so an injection that only surfaces in what the model says or does is still caught, and it carries full-conversation context, so an attack that unfolds across several turns does not slip past a single-prompt check (Aurascape, 2026).

For indirect injection against agents, the decisive control sits on the action. Aurascape’s dual-channel design secures the intelligence channel on the model side and the tool-execution channel on the agent side, where the Zero-Bypass MCP Gateway cryptographically signs every approved tool call so a poisoned instruction that tries to trigger an unapproved action fails closed. That is the exact step SilentBridge exploited when a routine summarization request was turned into agent actions and code execution the user never authorized. Inline data controls, with actions to allow, coach, warn, block, or redact, catch the exfiltration step that attacks like EchoLeak and ForcedLeak depend on, and the platform’s Intentions and entitlement-aware policy keep sanctioned AI tools usable while closing the path that injection exploits.

This coverage spans both the AI an organization uses and the AI it builds, across employees, developers, copilots, and agents, on one platform that sits alongside the existing security stack. For the basics of the attack, see what is prompt injection; for the data paths these attacks exploit, see AI data leakage.

Aurascape closes the gap this article exposes: an input filter cannot see an instruction the user never typed, so injection defense has to govern the response and the agent’s actions. It is built for security teams putting AI and agents in front of real data, and a tailored demo shows both classes of attack and the controls that stop them.
