What is the difference between direct and indirect prompt injection?

Direct prompt injection is when the attacker controls the user input and writes instructions to override the system prompt, such as a jailbreak. Indirect prompt injection is when the attacker never talks to the model and instead plants instructions in content the agent later reads as part of a task, such as a web page, support ticket, email, or tool output. Indirect injection is the more dangerous shape in agentic systems because the malicious text arrives inside data the agent was told to trust.

Is RAG poisoning a form of prompt injection?

Yes. RAG poisoning is a stored prompt injection aimed at the knowledge base. An attacker writes adversarial instructions into the documents or vector store that a retrieval-augmented generation pipeline pulls from. The content looks clean when written and only becomes an instruction when it is retrieved into the context of a later session, which is why input filtering does not catch it. The defenses are integrity checks on what gets indexed and inspection at retrieval time.

What is the lethal trifecta?

The lethal trifecta is Simon Willison's name for the three conditions that make indirect prompt injection catastrophic when they hold at once: the agent has access to private data, it is exposed to untrusted content, and it can communicate externally. With all three present, injected instructions can read sensitive data and send it out. Removing any single leg closes the exfiltration path, which is why the defense is architectural rather than a better content filter.

How is prompt injection different from jailbreaking?

Jailbreaking is a subset of direct prompt injection focused on getting a model to bypass its own safety policy. Prompt injection is the broader problem of untrusted text being interpreted as instructions, which includes indirect, stored, and cross-agent variants where the attacker never touches the input field. Jailbreaking targets the model's guardrails; the injection types that matter for agents target the data and tools the agent acts on.

Which prompt injection type is most dangerous for AI agents?

Indirect prompt injection, and its stored and cross-agent extensions. These are the variants where the attacker reaches the agent through data and tools rather than the prompt, so model-layer guardrails never see them. The most damaging real incidents, including zero-click inbox exfiltration and MCP server attacks, have all been indirect. Tool poisoning, where the injection rides in an MCP tool description, is the highest-impact concrete instance.

Prompt injection: a real taxonomy (direct, indirect, stored, cross-agent)

By Cameron McClellan, Growth Engineer

Published June 11, 2026

Prompt injection is the top entry on the OWASP Top 10 for LLM applications, where it has held the number-one slot every year the list has existed.

It occurs when an attacker’s text reaches a language model and gets interpreted as instructions. The model has no way to distinguish legitimate instructions from illegitimate ones, so it acts on the attacker’s intent instead of the operator’s.

The root cause is structural. A language model reads its instructions and its data through the same channel, and any text that reaches the context window can be read as a command.

In agentic systems, dangerous injection can also arrive through:

data the model was told to trust
context that persists across sessions
payloads that travel from one agent to the next

This page classifies the four most common prompt injection attack vectors, gives each an attacker model, a real incident, and the control that actually catches it, and explains why no single layer covers all four. It is a companion to Tool poisoning attacks against MCP servers, which is the most important concrete instance of one of these types.

TL;DR The four prompt injection types

Direct

The attacker controls the user input and writes override instructions into the prompt itself.

DefenseModel guardrails

Indirect

The attacker plants instructions in content the agent reads: a retrieved doc, web page, or tool output.

DefenseRuntime inspection

Stored

The payload is planted once in persistent memory or a RAG store and fires in a later session.

DefenseMemory integrity checks

Cross-agent

An injection infects one agent and rides a message to the next agent in the chain.

DefenseIdentity and per-hop policy

What is a prompt injection attack?

Simon Willison named LLM prompt injection in 2022, drawing the analogy to SQL injection, where untrusted input is concatenated with trusted instructions and the model cannot tell which is which. Unlike SQL injection, there is no parameterized-query equivalent that fully separates the two, because instructions and data are both natural language in the same window.

The attacker’s goal is always to get their text into the context with enough authority to be obeyed, and that single property generates the whole taxonomy. The four types differ only in where that text comes from and how long it waits before it fires.

Direct prompt injection: jailbreaks and system-prompt override

In direct prompt injection, the attacker controls the input field and writes instructions that try to override the system prompt. The attacker and the user are the same person, and the override travels the shortest possible path into the model.

Attacker

Agent

Model

jailbrokenThe user input outranked the system prompt

Direct prompt injection. The attacker controls the input field, the override rides into the model inside trusted context, and the guardrails that should catch it hold only probabilistically.

The attacker’s instructions take one of these forms:

“ignore your previous instructions”
role-play framings
encoded payloads that smuggle a command past a filter

The only thing standing in the way is a probabilistic guardrail.

It matters most when the person typing is the adversary, for example a user trying to extract a system prompt, bypass a safety policy, or jailbreak a model into producing restricted output. The mitigations live mostly at the model layer:

instruction hierarchies
system-prompt hardening
input classifiers

These controls help, but they are probabilistic, and direct injection is the variant model vendors have spent the most effort on. In an agentic setting it is rarely the real threat, because the dangerous content usually does not come from the user at all.

Indirect prompt injection: instructions hidden in data the agent reads

In indirect injection, the attacker plants instructions in content the agent will later read as part of a legitimate task: a web page it browses, a support ticket it triages, an email in the inbox it manages, or the output of a tool it calls. The animation below shows what happens end to end when nothing inspects what the agent reads.

Agent

Model

Data source

Attacker

exfiltratedThe data is gone and nothing flagged it

Indirect prompt injection with nothing inspecting the agent's inputs. The planted instruction rides back with legitimate content, the model obeys it, and private data reaches the attacker while the user sees a clean answer.

The attacker plants the instruction in advance, and it rides back with legitimate content during a routine task. The model obeys it, and private data reaches the attacker while the user sees a clean answer. When the agent ingests that content, it cannot distinguish the embedded instruction from its own task.

The canonical demonstration is Greshake and colleagues’ 2023 paper Not what you’ve signed up for, which compromised real LLM-integrated applications, including Bing Chat, with instructions hidden in web content, and showed remote control, data theft, and worming. The pattern has played out repeatedly since:

EchoLeak (CVE-2025-32711) used a single crafted email to make Microsoft 365 Copilot exfiltrate the contents of a user’s inbox, with no click required.
The same class was demonstrated against the GitHub MCP server in 2025, where a malicious issue in a public repository steered an agent into leaking data from private ones.

Indirect injection is most dangerous when three conditions hold at once, a combination Willison calls the lethal trifecta:

The agent has access to private data.
It is exposed to untrusted content.
It can communicate externally.

Removing any one leg closes the exfiltration path, so the defense is structural (not a better filter), a point the AI security layers reference develops in full. The control that catches indirect injection is runtime inspection of inbound content before the model acts on it, done by AI agent hooks.

Stored prompt injection: how attacks persist across sessions

Stored prompt injection is indirect injection with a built-in delay. Instead of firing when the poisoned content is read, the payload is written into something the agent will retrieve later: a persistent memory store, a notes file, or the vector database behind a RAG pipeline. The animation below shows how the payload waits in the store as clean text and fires when a later session retrieves it as trusted prior context.

Agent

Model

Memory store

Attacker

exfiltratedThe payload fired sessions after it was planted

Stored prompt injection. The payload is written into memory in one session, waits as clean text, and fires when a later session retrieves it as trusted prior context.

A future session pulls it back in as trusted context and acts on it, often long after the attacker is gone. The OWASP Agentic Top 10 catalogs this as memory and context poisoning, covered in depth in The OWASP Agentic Top 10, explained. As a worked example, consider a support agent that keeps long-term memory of customer interactions. An attacker files a series of tickets crafted to write adversarial instructions into that memory, and a later session retrieves and executes them as if they were legitimate prior context. RAG poisoning is the same idea aimed at the knowledge base instead of the memory store.

Stored injection defeats input filtering entirely, because the malicious text was clean when it was written and only becomes an instruction when it is retrieved. The controls are integrity checks on what gets written to memory, inspection at retrieval time, and scoping so that one session cannot poison the context of another.

Cross-agent prompt injection: how one compromised agent spreads an attack

Cross-agent prompt injection arises in multi-agent systems, where the output of agent A becomes the input of agent B. An instruction injected into A can propagate through the messages it sends downstream and turn a single compromise into a chain. The animation below shows how the attacker only ever touches the first agent, while the payload replicates through messages that look like ordinary inter-agent traffic.

Agent A

Attacker

Agent B

Agent C

propagatedOne injection became a chain of compromised agents

Cross-agent prompt injection. One seeded message compromises the first agent, and the payload replicates through normal-looking inter-agent traffic with no human in the loop.

The proof of concept is Morris II, a 2024 zero-click worm that embedded a self-replicating prompt in content processed by GenAI-powered email assistants. Each infected agent carried the payload into its outgoing messages, compromising the next agent in the network without any human in the loop. The propagation rate scaled with context-window size and the number of hops, the same parameters teams tune up for performance.

Cross-agent injection is the hardest to catch with content inspection alone, because the malicious message looks like ordinary inter-agent traffic. The controls are:

identity that travels with each call so actions stay attributable
policy enforced at every hop and not only at the system boundary
an audit trail that can reconstruct the chain after the fact

Prompt injection defense: which control catches which attack type

The four types do not share a single fix, which is the practical reason prompt injection is not “solved.” Model guardrails catch some direct injection and little else, while runtime inspection catches indirect injection at ingest, memory integrity and retrieval-time checks catch stored injection, and identity and per-hop policy contain cross-agent propagation. A program that buys only one of these is covered against one column of the table and exposed on the other three. The subsections below walk through each defense layer an enterprise can deploy and what it does and does not catch.

Model guardrails: instruction hierarchy and LLM input classifiers

Vendors have invested most at the model layer: instruction hierarchies, input classifiers, and prompting techniques that mark untrusted content so the model treats it as data. The animation below shows where these controls sit and what they catch.

Attacker

Agent

Classifier

Model

blockedThe classifier caught the injection before the model saw it

Model guardrails intercepting a direct injection attempt. The input classifier flags the override before the model sees it. The model stays clean throughout.

OpenAI’s instruction hierarchy trains models to give system and developer messages more authority than user input, so an override typed into the chat box loses to the system prompt more often. Input classifiers such as Meta’s Llama Prompt Guard 2 screen text for known injection and jailbreak patterns before it reaches the model, and prompting techniques like Microsoft’s spotlighting mark untrusted content so the model treats it as data and not as instructions.

These controls address direct injection but do not cover the broader problem. They are probabilistic, they degrade against novel phrasings and encodings, and a classifier watching the user input never sees the instruction that arrives inside a tool result the application already trusts.

Runtime inspection: gateways and hooks for AI agent security

Indirect injection enters through what the agent reads, so the defense must sit where the reading happens: on the path between the agent and the content it ingests. The animation below shows runtime inspection intercepting an indirect injection before it reaches the model.

Web

Gateway

Agent

Model

interceptedThe gateway intercepted the payload before it reached the agent

Runtime inspection catching an indirect injection attempt. The MCP gateway inspects every tool result before it reaches the agent. The embedded instruction never enters the model context.

An MCP gateway proxies every tool call, which puts tool results, the channel both EchoLeak and the GitHub MCP incident used, in front of an inspection point before they reach the model. AI agent hooks run inside the agent loop itself and can inspect, rewrite, or block a prompt, a retrieval, or a tool call before it executes.

Willison’s dual-LLM pattern routes untrusted content to a second model that holds no authority, avoiding the need to trust a filter to spot every attack. Google’s CaMeL restructures the agent so that untrusted data retrieved from the environment can never affect the program’s control flow, regardless of what instructions it contains. Both approaches converge on the same principle: separating the text that can act from the text that can only be read.

Memory and RAG integrity: preventing stored injection at the source

Stored injection is invisible to input filtering because the payload is clean text until a later session retrieves it. The animation below shows the controls that live around the store: validating writes, inspecting at retrieval, and scoping memory per session.

Attacker

Integrity check

Store

Agent

write rejectedThe payload never entered the store. No future session retrieves it.

Memory integrity controls rejecting a stored injection attempt. Write-time attribution blocks the payload before it can enter the store, so no future session retrieves it as trusted context.

The controls live around the store instead:

Validate and attribute writes, so adversarial instructions cannot enter long-term memory anonymously through a public channel like support tickets.
Inspect retrieved context at read time, the same way inbound tool results are inspected, because retrieval is the moment stored text becomes instructions.
Scope memory per user and per agent, so one session cannot write into the context a different session will trust.

For RAG pipelines the same discipline applies to the knowledge base: provenance on every document, allowlisted ingestion sources, and periodic scans of the corpus for embedded instructions.

Least privilege: limiting the blast radius of a successful injection

Every injection ends the same way: the model spends the access it was already given. Cutting that access is the one defense that keeps working when detection fails. The animation below shows least privilege containing a successful injection to a narrow scope.

Model

Agent

Credentials

External

access deniedInjection landed but the scoped credentials had nothing to spend

Least privilege containing a successful injection. The model complies, but scoped credentials block the outbound action. The attacker got in but had nothing to spend.

Cutting that access is the one defense that keeps working when detection fails:

Scope credentials per agent and per task instead of sharing a broad service account.
Require human approval for irreversible or outward-facing actions.
Remove a leg of the lethal trifecta wherever the workflow allows it.

An agent that reads untrusted content should not also hold private data and an external write path.

Identity and audit: per-hop policy for multi-agent AI systems

Cross-agent injection defeats content inspection because the malicious message is indistinguishable from legitimate inter-agent traffic. The animation below shows identity and per-hop policy stopping propagation at each boundary.

Agent A

Policy + audit

Audit log

Agent B

chain containedPer-hop policy stopped propagation. The audit log has the full chain.

Per-hop identity and audit stopping cross-agent propagation. Signed identity tokens let policy enforce at every hop, and the audit log captures the full chain for reconstruction after the fact.

The controls are structural:

Identity must travel with every call so each action stays attributable to a specific agent.
Policy must be enforced at every hop and not only at the system boundary.
The audit trail must be complete enough to reconstruct the chain after the fact.

Without per-hop enforcement, the first compromised agent inherits the trust of the whole workflow, which is exactly the property Morris II exploited.

Why no single control prevents prompt injection attacks

Responsible guidance has converged on defense in depth over any claim of full prevention. By late 2025, vendors building agentic browsers were publicly acknowledging that prompt injection may never be fully eliminated, only contained. Containment means assuming injection will land and limiting what it can reach: scoped credentials, inspection on the path, and a complete record of what happened. Each layer above covers the column of the taxonomy the others miss, and an enterprise program needs all five before the table stops having an exposed column.

How prompt injection attacks map to the OWASP Agentic Top 10

Direct and indirect injection are the model-layer face of agent goal hijack (ASI01), stored injection maps to memory and context poisoning (ASI06), and cross-agent injection is the propagation mechanism behind several agentic categories at once. The AI security frameworks reference maps how OWASP, NIST, and MITRE divide this ground, and the Agentic Top 10 explainer walks each category with examples.

Where Speakeasy fits

SpeakeasyAI control plane

MCP gateway

Inspects every tool result before it reaches the agent

Indirect injection

Agent hooks

Runs inside the agent loop to inspect or block any prompt or retrieval

Indirect + stored

Shared identity

Every action stays attributable across agents and hops

Cross-agent

Audit logging

Produces the record that turns an incident into something you can reconstruct

All types

Each layer covers the column of the taxonomy the others miss. Together they close the table.

The four Speakeasy control plane capabilities and which injection types each one addresses.

No product makes prompt injection impossible, and any vendor that claims otherwise is selling the direct case as if it were the whole problem. What an AI control plane does is make the other three types containable, including the cross-agent and stored variants that matter most for multi-agent AI security:

The MCP gateway inspects tool calls and the content that comes back from them, which is where indirect injection enters an agent.
Agent hooks run inside the agent loop and can inspect or block a prompt, a retrieval, or a tool call before it executes.
A shared identity foundation keeps every action attributable across agents.
Audit logging produces the record that turns an incident into something you can reconstruct.

Injection will still arrive, but the control plane ensures it lands somewhere visible and into a blast radius that has already been scoped.

Prompt injectionThreats & defenses

Direct

Indirect

Stored

Cross-agent

What is a prompt injection attack?

Direct prompt injection: jailbreaks and system-prompt override

Indirect prompt injection: instructions hidden in data the agent reads

Stored prompt injection: how attacks persist across sessions

Cross-agent prompt injection: how one compromised agent spreads an attack

Prompt injection defense: which control catches which attack type

Model guardrails: instruction hierarchy and LLM input classifiers

Runtime inspection: gateways and hooks for AI agent security

Memory and RAG integrity: preventing stored injection at the source

Least privilege: limiting the blast radius of a successful injection

Identity and audit: per-hop policy for multi-agent AI systems

Why no single control prevents prompt injection attacks

How prompt injection attacks map to the OWASP Agentic Top 10

Where Speakeasy fits

Frequently asked questions