Prompt injection is an attack in which an attacker’s text is provided to a language model (via a user message, a retrieved document, a tool result) and interpreted as instructions. The model has no way to distinguish between legitimate and illegitimate instructions and so acts on the attacker’s intent instead of the operator’s.
Prompt injection is the top entry on the OWASP Top 10 for LLM applications, where it has held the number-one slot every year the list has existed. The root cause is structural: a language model reads its instructions and its data through the same channel, so any text that reaches the context window can be read as a command.
Most writing on the subject covers one case: a malicious user types “ignore your previous instructions” and the model complies. That case exists, but it is the least dangerous shape, and treating it as the whole problem is why people believe a content filter solves it. The injection that matters in agentic systems usually arrives in data the model was told to trust, persists across sessions, or travels from one agent to the next.
This page classifies the four most common attack vectors, gives each an attacker model, a real incident, and the control that actually catches it, and explains why no single layer covers all four. It is a companion to Tool poisoning attacks against MCP servers, which is the most important concrete instance of one of these types.
Direct
The attacker controls the user input and writes override instructions into the prompt itself.
Indirect
The attacker plants instructions in content the agent reads: a retrieved doc, web page, or tool output.
Stored
The payload is planted once in persistent memory or a RAG store and fires in a later session.
Cross-agent
An injection infects one agent and rides a message to the next agent in the chain.
What is prompt injection?
Simon Willison named prompt injection in 2022, drawing the analogy to SQL injection: untrusted input is concatenated with trusted instructions, and the model cannot tell which is which. Unlike SQL injection, there is no parameterized-query equivalent that fully separates the two, because instructions and data are both natural language in the same window.
That single property generates the whole taxonomy. The attacker’s goal is always to get their text into the context with enough authority to be obeyed. What changes between the four types is where that text comes from and how long it waits before it fires.
What is direct prompt injection: jailbreaks and system-prompt override
Direct prompt injection is the case everyone knows. The attacker controls the input field and writes instructions that try to override the system prompt: “ignore your previous instructions,” role-play framings, or encoded payloads that smuggle a command past a filter.
The animation above shows why this is the simplest shape: the attacker and the user are the same person, the override travels the shortest possible path into the model, and the only thing standing in its way is a probabilistic guardrail.
It matters most when the person typing is the adversary, for example a user trying to extract a system prompt, bypass a safety policy, or jailbreak a model into producing restricted output. The mitigations live mostly at the model layer: instruction hierarchies, system-prompt hardening, and input classifiers. These help, but they are probabilistic, and direct injection is the variant model vendors have spent the most effort on. In an agentic setting it is rarely the real threat, because the dangerous content usually does not come from the user at all.
What is indirect prompt injection: instructions hidden in data the agent reads
Indirect prompt injection is the shape that breaks agents. The attacker never talks to the model. They plant instructions in content the agent will later read as part of a legitimate task: a web page it browses, a support ticket it triages, an email in the inbox it manages, or the output of a tool it calls. When the agent ingests that content, it cannot distinguish the embedded instruction from its own task.
The animation above shows the attack end to end when nothing inspects what the agent reads. The attacker plants the instruction in advance, it rides back with legitimate content during a routine task, the model obeys it, and private data reaches the attacker while the user sees a clean answer.
The canonical demonstration is Greshake and colleagues’ 2023 paper Not what you’ve signed up for, which compromised real LLM-integrated applications, including Bing Chat, with instructions hidden in web content, and showed remote control, data theft, and persistence. The pattern has played out repeatedly since:
- EchoLeak (CVE-2025-32711) used a single crafted email to make Microsoft 365 Copilot exfiltrate the contents of a user’s inbox, with no click required.
- The same class was demonstrated against the GitHub MCP server in 2025, where a malicious issue in a public repository steered an agent into leaking data from private ones.
Indirect injection is most dangerous when three conditions hold at once, a combination Willison calls the lethal trifecta: the agent has access to private data, it is exposed to untrusted content, and it can communicate externally. Remove any one leg and the exfiltration path closes. That is why the defense is structural rather than a better filter, a point the AI security layers reference develops in full. The control that catches indirect injection is runtime inspection of inbound content before the model acts on it, which is what AI agent hooks exist to do.
What is stored prompt injection: attacks that persist across sessions
Stored prompt injection is indirect injection with a delay. Instead of firing when the poisoned content is read, the payload is written into something the agent will retrieve later: a persistent memory store, a notes file, or the vector database behind a retrieval-augmented generation pipeline. A future session pulls it back in as trusted context and acts on it, often long after the attacker is gone.
The OWASP Agentic Top 10 catalogs this as memory and context poisoning, covered in depth in The OWASP Agentic Top 10, explained. A worked example: a support agent keeps long-term memory of customer interactions, an attacker files a series of tickets crafted to write adversarial instructions into that memory, and a later session retrieves and executes them as if they were legitimate prior context. RAG poisoning is the same idea aimed at the knowledge base instead of the memory store.
The animation above shows the delay that defines this type: the payload is written in one session, waits in the store as clean text, and fires when a later session retrieves it as trusted prior context.
Stored injection defeats input filtering entirely, because the malicious text was clean when it was written and only becomes an instruction when it is retrieved. The controls are integrity checks on what gets written to memory, inspection at retrieval time, and scoping so that one session cannot poison the context of another.
What is cross-agent prompt injection: how one compromised agent infects the next
Cross-agent injection appears once systems run more than one agent. The output of agent A becomes the input of agent B, so an instruction injected into A can propagate through the messages it sends downstream. In a multi-agent workflow this turns a single compromise into a chain.
The proof of concept is Morris II, a 2024 zero-click worm that embedded a self-replicating prompt in content processed by GenAI-powered email assistants. Each infected agent carried the payload into its outgoing messages, compromising the next agent in the network without any human in the loop. The propagation rate scaled with context-window size and the number of hops, the same parameters teams tune up for performance.
The animation above shows the chain: the attacker only ever touches the first agent, and the payload replicates downstream through messages that look like ordinary inter-agent traffic.
Cross-agent injection is the hardest to catch with content inspection alone, because the malicious message looks like ordinary inter-agent traffic. The controls are identity that travels with each call so actions stay attributable, policy enforced at every hop rather than only at the system boundary, and an audit trail that can reconstruct the chain after the fact.
How to prevent prompt injection: which control catches which type
The four types do not share a single fix, which is the practical reason prompt injection is not “solved.” Model guardrails catch some direct injection and little else. Runtime inspection catches indirect injection at ingest. Memory integrity and retrieval-time checks catch stored injection. Identity and per-hop policy contain cross-agent propagation. A program that buys only one of these is covered against one column of the table and exposed on the other three. The subsections below walk through each defense layer an enterprise can deploy and what it does and does not catch.
Model guardrails: instruction hierarchy and input classifiers
The model layer is where vendors have invested most. OpenAI’s instruction hierarchy trains models to give system and developer messages more authority than user input, so an override typed into the chat box loses to the system prompt more often. Input classifiers such as Meta’s Llama Prompt Guard 2 screen text for known injection and jailbreak patterns before it reaches the model, and prompting techniques like Microsoft’s spotlighting mark untrusted content so the model can treat it as data rather than instructions.
These controls are the right tool for direct injection and the wrong place to stop. They are probabilistic, they degrade against novel phrasings and encodings, and a classifier watching the user input never sees the instruction that arrives inside a tool result the application already trusts.
Runtime inspection: a gateway and hooks on the agent’s path
Indirect injection enters through what the agent reads, so the control has to sit where the reading happens. An MCP gateway proxies every tool call, which puts tool results, the channel both EchoLeak and the GitHub MCP incident used, in front of an inspection point before they reach the model. AI agent hooks run inside the agent loop itself and can inspect, rewrite, or block a prompt, a retrieval, or a tool call before it executes.
Research designs point the same direction. Willison’s dual-LLM pattern and Google’s CaMeL both restructure the system so untrusted content is processed by a model that holds no authority, rather than trusting a filter to spot every attack. The shared principle is architectural: separate the text that can act from the text that can only be read.
Memory and RAG integrity: controlling what gets written and retrieved
Stored injection is invisible to input filtering because the payload is clean text until a later session retrieves it. The controls live around the store instead:
- Validate and attribute writes, so adversarial instructions cannot enter long-term memory anonymously through a public channel like support tickets.
- Inspect retrieved context at read time, the same way inbound tool results are inspected, because retrieval is the moment stored text becomes instructions.
- Scope memory per user and per agent, so one session cannot write into the context a different session will trust.
For RAG pipelines the same discipline applies to the knowledge base: provenance on every document, allowlisted ingestion sources, and periodic scans of the corpus for embedded instructions.
Least privilege: scoped credentials and a bounded blast radius
Every injection ends the same way, with the model spending the access it was already given. Cutting that access is the one defense that keeps working when detection fails. Scope credentials per agent and per task instead of sharing a broad service account, require human approval for irreversible or outward-facing actions, and remove a leg of the lethal trifecta wherever the workflow allows it: an agent that reads untrusted content should not also hold private data and an external write path.
Identity and audit: per-hop policy for multi-agent systems
Cross-agent injection defeats content inspection because the malicious message is indistinguishable from legitimate inter-agent traffic. The controls here are structural. Identity must travel with every call so each action stays attributable to a specific agent, policy must be enforced at every hop rather than once at the system boundary, and the audit trail must be complete enough to reconstruct the chain after the fact. Without per-hop enforcement, the first compromised agent inherits the trust of the whole workflow, which is exactly the property Morris II exploited.
Why defense in depth is the consensus
This is why responsible guidance has converged on defense in depth rather than prevention. By late 2025, vendors building agentic browsers were publicly acknowledging that prompt injection may never be fully eliminated, only contained. Containment means assuming injection will land and limiting what it can reach: scoped credentials, inspection on the path, and a complete record of what happened. Each layer above covers the column of the taxonomy the others miss, and an enterprise program needs all five before the table stops having an exposed column.
How prompt injection maps to the OWASP Agentic Top 10
The taxonomy lines up with the agentic threat catalog. Direct and indirect injection are the model-layer face of agent goal hijack (ASI01). Stored injection is memory and context poisoning (ASI06). Cross-agent injection is the propagation mechanism behind several agentic categories at once. The AI security frameworks reference maps how OWASP, NIST, and MITRE divide this ground, and the Agentic Top 10 explainer walks each category with examples.
Where Speakeasy fits
No product makes prompt injection impossible, and any vendor that claims otherwise is selling the direct case as if it were the whole problem. What an AI control plane does is make the other three types containable. The MCP gateway inspects tool calls and the content that comes back from them, which is where indirect injection enters an agent. Agent hooks run inside the agent loop and can inspect or block a prompt, a retrieval, or a tool call before it executes. A shared identity foundation keeps every action attributable across agents, and audit logging produces the record that turns an incident into something you can reconstruct. The injection will still arrive. The point of the control plane is that it arrives somewhere you can see it and into a blast radius you have already scoped.