Direct vs Indirect Prompt Injection
Direct and indirect prompt injection are two ways to turn an AI system against its user, and they are not the same attack. In direct prompt injection, the attacker types the malicious instructions straight into the prompt. In indirect prompt injection, the instructions are hidden in content the AI reads later, such as a web page, a document, an email, or a tool’s output, and the AI follows them as if they were commands. The difference decides how you defend, and it is why filtering the input alone does not hold. This guide explains both and how to stop them.
Last updated: June 2026.
What is direct prompt injection?
Direct prompt injection is the version most people picture. The attacker, acting as the user, submits malicious instructions directly to the AI and tries to override the rules it was given. Jailbreaks are the common form: text such as “ignore your previous instructions,” followed by a request the model is supposed to refuse. The OWASP Top 10 for LLM Applications ranks prompt injection as the leading risk for AI applications and separates it into direct and indirect forms (OWASP, 2025).
Direct injection is attacker-supplied input arriving through the front door. Because it enters at the prompt, it is the easier of the two to screen for, though jailbreak techniques keep evolving and no input filter catches all of them.
What is indirect prompt injection?
Indirect prompt injection hides the instructions in content the AI ingests rather than in the prompt the user types. An AI assistant or agent retrieves a web page, opens a document, reads an email, or calls a tool, and that external content carries commands the model cannot tell apart from legitimate data. It treats the hidden text as instructions and acts on them.
NIST’s Generative AI Profile names both direct and indirect prompt injection among the information-security risks of generative AI, and describes indirect attacks as those an adversary delivers remotely, without a direct interface to the model (NIST, 2024). MITRE ATLAS, a knowledge base of real-world attacks on AI systems, documents the pattern in production, including indirect prompt injection delivered through tool and agent channels (MITRE, 2026). The user never typed anything malicious. The attack rode in on the content the AI was asked to help with.
How direct and indirect prompt injection differ
The two attacks differ in where the malicious instruction enters, who supplies it, and therefore how you stop it.
| Dimension | Direct prompt injection | Indirect prompt injection |
|---|---|---|
| Where it enters | The prompt submitted to the AI. | External content the AI retrieves or is given. |
| Who supplies it | The attacker, acting as the user. | A third party, planted in data the AI later reads. |
| Why it is dangerous | Can override instructions and bypass refusals. | Often zero-click; the victim just uses the assistant normally. |
| Typical example | A jailbreak such as “ignore previous instructions.” | A poisoned email or document the AI processes. |
Why indirect prompt injection is the harder problem
Indirect injection is the more dangerous class, especially once agents are involved, for three reasons. The attack surface is everything an AI reads: web pages, files, emails, and the output of any tool or connected system, and you cannot pre-screen all of it. It is often zero-click, meaning the victim does nothing wrong and simply asks the assistant a normal question. And for an agent that can act, a hidden instruction does not just change an answer, it can trigger real actions such as querying a database or sending data out.
Two disclosed vulnerabilities show the pattern. EchoLeak (CVE-2025-32711), rated 9.3 out of 10 for severity, was a zero-click flaw in Microsoft 365 Copilot: a single crafted email, with no user interaction, could cause Copilot to read internal files and send their contents to an outside server (NVD, 2025). ForcedLeak, rated 9.4 out of 10, planted instructions in a Salesforce Agentforce web-to-lead form that sat dormant in the customer record until an employee later asked the agent about it, then exfiltrated data to a domain the attacker had re-registered for about five dollars (The Hacker News, 2025). In both cases the malicious instruction arrived as ordinary-looking content, and the user did nothing wrong.
Aurascape’s own threat-research team, Aura Labs, found the same pattern in an autonomous agent and showed how far it can go. In a class of zero-click flaws it calls SilentBridge, hidden instructions planted in an ordinary web page, document, or search result let a benign request like “summarize this page” or “research this topic” silently drive Meta’s Manus agent into actions the user never asked for, including reading a connected Gmail account and sending its contents to an attacker, and running attacker-supplied code that escalated to root-level control inside the agent’s sandbox. Aura Labs identified three variants by content source, each rated 9.8 out of 10 for severity, and in every case the agent was compromised through normal use, with no malicious input from the user. Aurascape disclosed the findings to the vendor responsibly, and the issues were fixed before publication (Aurascape, 2026). Once a model can act, untrusted content is no longer just text, and any gap between data and instructions becomes a way in.
How to defend against each
The defenses are not the same. Direct injection is fought at the input boundary: separate instructions from user data, detect jailbreak patterns, and constrain what the model will accept. That helps, but it is not enough on its own. Indirect injection cannot be solved by filtering input, because the malicious content is not in the user’s prompt and you cannot screen the entire internet. The defense has to treat all retrieved content as untrusted and control what the model and the agent are allowed to do with it: inspect the response, govern the actions and tool calls the agent makes, and stop data from leaving.
This is borne out in testing. A 2025 benchmark of prompt-injection attacks on retrieval-augmented AI agents found a baseline attack success rate of 73.2 percent and reported that simple input filtering failed against the harder attacks. A layered defense that combined content filtering, system-prompt guardrails, and multi-stage response verification cut the success rate to 8.7 percent while keeping 94.3 percent of normal task performance (a 2025 benchmark study). Defense in depth across input, response, and action is what moves the numbers.
How Aurascape defends against prompt injection
Aurascape addresses prompt injection across the full exchange, not just the input. It inspects both the prompt and the response, so an injection that only shows up in what the model says or does is still caught, and it carries full-conversation context, so an attack that unfolds across several turns does not slip past a single-prompt check (Aurascape, 2026).
For indirect injection against agents, the decisive control is on the action, not the text. The Zero-Bypass MCP Gateway governs what an agent does through tools by cryptographically signing every approved tool call, so a poisoned instruction that tries to trigger an unapproved action cannot reach the tool or the model and fails closed (Aurascape, 2026). That is the exact step SilentBridge exploited, when a routine summarization request was turned into agent actions and code execution the user never authorized. Inline data controls, with actions to allow, coach, warn, block, or redact, catch the exfiltration step that attacks like EchoLeak and ForcedLeak depend on (Aurascape, 2026). Even when malicious content gets in, it cannot make the agent act or move data out.
| Prompt-injection vector | How Aurascape contains it |
|---|---|
| Direct injection in the prompt | Inspects the prompt and applies policy with actions to allow, coach, warn, block, or redact. |
| Indirect injection in retrieved content | Inspects the response and carries full-conversation context, so hidden instructions that surface later are caught. |
| Injection that triggers an agent action | Zero-Bypass MCP Gateway signs every approved tool call; an unapproved action fails closed. |
| Injection aimed at exfiltrating data | Inline data controls block or redact sensitive data before it can leave. |
For the basics of the attack, see what is prompt injection; for the data paths these attacks exploit, see AI data leakage.
Frequently asked questions
What is the difference between direct and indirect prompt injection?
In direct prompt injection, the attacker puts the malicious instructions straight into the prompt, usually as a jailbreak. In indirect prompt injection, the instructions are hidden in content the AI reads later, such as a web page, document, email, or tool output, and the AI follows them. One enters through the prompt; the other rides in on the data the AI processes.
Is indirect prompt injection more dangerous than direct?
Often, yes. Indirect injection can be zero-click, meaning the victim does nothing wrong and simply uses the assistant normally. Its attack surface is every piece of content the AI reads, which you cannot fully pre-screen. And for an agent that can act, a hidden instruction can trigger real actions such as querying data or sending it out, not just a bad answer.
Can input filtering stop indirect prompt injection?
No. Filtering the user’s input does not help when the malicious instructions are in external content the AI retrieves, not in the prompt. Testing shows simple input filtering fails against the harder attacks. Defending indirect injection requires treating retrieved content as untrusted and controlling what the model and agent do with it, including the response and any tool calls.
How do you prevent prompt injection in AI agents?
Treat all retrieved content as untrusted, inspect both the prompt and the response, and govern the actions the agent takes so a poisoned instruction cannot trigger an unapproved tool call. Add data controls that stop sensitive information from leaving. Defense in depth across input, response, and action is what reduces attack success in practice.
Aurascape treats prompt injection as a problem of the whole exchange, not just the prompt. By inspecting prompts and responses, carrying full-conversation context, governing the tool calls an agent makes, and applying inline data controls, it stops both direct and indirect injection from turning an AI assistant or agent into a path for data to leave. For an enterprise putting AI and agents in front of real data, that is the difference between an attack that fails quietly and one that succeeds. A short demo shows how Aurascape contains prompt injection across the full AI exchange.
See how Aurascape stops prompt injection, direct and indirect →
Aurascape Solutions
- Discover and monitor AI Get a clear picture of all AI activity.
- Safeguard AI use Secure data and compliancy in AI usage.
- Secure Agentic AI Secure how your teams use AI and build AI agents.
- Copilot readiness Prepare for and monitor AI Copilot use.
- Coding assistant guardrails Accelerate development, safely.
- Frictionless AI security Keep users and admins moving.