Resource · Security taxonomy

Prompt injectionThreats & defenses

Direct, indirect, stored, and cross-agent. The four shapes prompt injection takes, where each enters an agent, and the control that actually catches it.

Scroll for the taxonomy
Cameron McClellan headshotBy Cameron McClellan, Growth Engineer
Published

Prompt injection is an attack in which an attacker’s text is provided to a language model (via a user message, a retrieved document, a tool result) and interpreted as instructions. The model has no way to distinguish between legitimate and illegitimate instructions and so acts on the attacker’s intent instead of the operator’s.

Prompt injection is the top entry on the OWASP Top 10 for LLM applications, where it has held the number-one slot every year the list has existed. The root cause is structural: a language model reads its instructions and its data through the same channel, so any text that reaches the context window can be read as a command.

Most writing on the subject covers one case: a malicious user types “ignore your previous instructions” and the model complies. That case exists, but it is the least dangerous shape, and treating it as the whole problem is why people believe a content filter solves it. The injection that matters in agentic systems usually arrives in data the model was told to trust, persists across sessions, or travels from one agent to the next.

This page classifies the four most common attack vectors, gives each an attacker model, a real incident, and the control that actually catches it, and explains why no single layer covers all four. It is a companion to Tool poisoning attacks against MCP servers, which is the most important concrete instance of one of these types.

TL;DR The four prompt injection types

Direct

The attacker controls the user input and writes override instructions into the prompt itself.

DefenseModel guardrails

Indirect

The attacker plants instructions in content the agent reads: a retrieved doc, web page, or tool output.

DefenseRuntime inspection

Stored

The payload is planted once in persistent memory or a RAG store and fires in a later session.

DefenseMemory integrity checks

Cross-agent

An injection infects one agent and rides a message to the next agent in the chain.

DefenseIdentity and per-hop policy

What is prompt injection?

Simon Willison named prompt injection in 2022, drawing the analogy to SQL injection: untrusted input is concatenated with trusted instructions, and the model cannot tell which is which. Unlike SQL injection, there is no parameterized-query equivalent that fully separates the two, because instructions and data are both natural language in the same window.

That single property generates the whole taxonomy. The attacker’s goal is always to get their text into the context with enough authority to be obeyed. What changes between the four types is where that text comes from and how long it waits before it fires.

What is direct prompt injection: jailbreaks and system-prompt override

Attacker
Agent
Model
jailbrokenThe user input outranked the system prompt
Direct prompt injection. The attacker controls the input field, the override rides into the model inside trusted context, and the guardrails that should catch it hold only probabilistically.

Direct prompt injection is the case everyone knows. The attacker controls the input field and writes instructions that try to override the system prompt: “ignore your previous instructions,” role-play framings, or encoded payloads that smuggle a command past a filter.

The animation above shows why this is the simplest shape: the attacker and the user are the same person, the override travels the shortest possible path into the model, and the only thing standing in its way is a probabilistic guardrail.

It matters most when the person typing is the adversary, for example a user trying to extract a system prompt, bypass a safety policy, or jailbreak a model into producing restricted output. The mitigations live mostly at the model layer: instruction hierarchies, system-prompt hardening, and input classifiers. These help, but they are probabilistic, and direct injection is the variant model vendors have spent the most effort on. In an agentic setting it is rarely the real threat, because the dangerous content usually does not come from the user at all.

What is indirect prompt injection: instructions hidden in data the agent reads

Agent
Model
Data source
Attacker
exfiltratedThe data is gone and nothing flagged it
Indirect prompt injection with nothing inspecting the agent's inputs. The planted instruction rides back with legitimate content, the model obeys it, and private data reaches the attacker while the user sees a clean answer.

Indirect prompt injection is the shape that breaks agents. The attacker never talks to the model. They plant instructions in content the agent will later read as part of a legitimate task: a web page it browses, a support ticket it triages, an email in the inbox it manages, or the output of a tool it calls. When the agent ingests that content, it cannot distinguish the embedded instruction from its own task.

The animation above shows the attack end to end when nothing inspects what the agent reads. The attacker plants the instruction in advance, it rides back with legitimate content during a routine task, the model obeys it, and private data reaches the attacker while the user sees a clean answer.

The canonical demonstration is Greshake and colleagues’ 2023 paper Not what you’ve signed up for, which compromised real LLM-integrated applications, including Bing Chat, with instructions hidden in web content, and showed remote control, data theft, and persistence. The pattern has played out repeatedly since:

Indirect injection is most dangerous when three conditions hold at once, a combination Willison calls the lethal trifecta: the agent has access to private data, it is exposed to untrusted content, and it can communicate externally. Remove any one leg and the exfiltration path closes. That is why the defense is structural rather than a better filter, a point the AI security layers reference develops in full. The control that catches indirect injection is runtime inspection of inbound content before the model acts on it, which is what AI agent hooks exist to do.

What is stored prompt injection: attacks that persist across sessions

Agent
Model
Memory store
Attacker
exfiltratedThe payload fired sessions after it was planted
Stored prompt injection. The payload is written into memory in one session, waits as clean text, and fires when a later session retrieves it as trusted prior context.

Stored prompt injection is indirect injection with a delay. Instead of firing when the poisoned content is read, the payload is written into something the agent will retrieve later: a persistent memory store, a notes file, or the vector database behind a retrieval-augmented generation pipeline. A future session pulls it back in as trusted context and acts on it, often long after the attacker is gone.

The OWASP Agentic Top 10 catalogs this as memory and context poisoning, covered in depth in The OWASP Agentic Top 10, explained. A worked example: a support agent keeps long-term memory of customer interactions, an attacker files a series of tickets crafted to write adversarial instructions into that memory, and a later session retrieves and executes them as if they were legitimate prior context. RAG poisoning is the same idea aimed at the knowledge base instead of the memory store.

The animation above shows the delay that defines this type: the payload is written in one session, waits in the store as clean text, and fires when a later session retrieves it as trusted prior context.

Stored injection defeats input filtering entirely, because the malicious text was clean when it was written and only becomes an instruction when it is retrieved. The controls are integrity checks on what gets written to memory, inspection at retrieval time, and scoping so that one session cannot poison the context of another.

What is cross-agent prompt injection: how one compromised agent infects the next

Agent A
Attacker
Agent B
Agent C
propagatedOne injection became a chain of compromised agents
Cross-agent prompt injection. One seeded message compromises the first agent, and the payload replicates through normal-looking inter-agent traffic with no human in the loop.

Cross-agent injection appears once systems run more than one agent. The output of agent A becomes the input of agent B, so an instruction injected into A can propagate through the messages it sends downstream. In a multi-agent workflow this turns a single compromise into a chain.

The proof of concept is Morris II, a 2024 zero-click worm that embedded a self-replicating prompt in content processed by GenAI-powered email assistants. Each infected agent carried the payload into its outgoing messages, compromising the next agent in the network without any human in the loop. The propagation rate scaled with context-window size and the number of hops, the same parameters teams tune up for performance.

The animation above shows the chain: the attacker only ever touches the first agent, and the payload replicates downstream through messages that look like ordinary inter-agent traffic.

Cross-agent injection is the hardest to catch with content inspection alone, because the malicious message looks like ordinary inter-agent traffic. The controls are identity that travels with each call so actions stay attributable, policy enforced at every hop rather than only at the system boundary, and an audit trail that can reconstruct the chain after the fact.

How to prevent prompt injection: which control catches which type

The four types do not share a single fix, which is the practical reason prompt injection is not “solved.” Model guardrails catch some direct injection and little else. Runtime inspection catches indirect injection at ingest. Memory integrity and retrieval-time checks catch stored injection. Identity and per-hop policy contain cross-agent propagation. A program that buys only one of these is covered against one column of the table and exposed on the other three. The subsections below walk through each defense layer an enterprise can deploy and what it does and does not catch.

Model guardrails: instruction hierarchy and input classifiers

The model layer is where vendors have invested most. OpenAI’s instruction hierarchy trains models to give system and developer messages more authority than user input, so an override typed into the chat box loses to the system prompt more often. Input classifiers such as Meta’s Llama Prompt Guard 2 screen text for known injection and jailbreak patterns before it reaches the model, and prompting techniques like Microsoft’s spotlighting mark untrusted content so the model can treat it as data rather than instructions.

These controls are the right tool for direct injection and the wrong place to stop. They are probabilistic, they degrade against novel phrasings and encodings, and a classifier watching the user input never sees the instruction that arrives inside a tool result the application already trusts.

Runtime inspection: a gateway and hooks on the agent’s path

Indirect injection enters through what the agent reads, so the control has to sit where the reading happens. An MCP gateway proxies every tool call, which puts tool results, the channel both EchoLeak and the GitHub MCP incident used, in front of an inspection point before they reach the model. AI agent hooks run inside the agent loop itself and can inspect, rewrite, or block a prompt, a retrieval, or a tool call before it executes.

Research designs point the same direction. Willison’s dual-LLM pattern and Google’s CaMeL both restructure the system so untrusted content is processed by a model that holds no authority, rather than trusting a filter to spot every attack. The shared principle is architectural: separate the text that can act from the text that can only be read.

Memory and RAG integrity: controlling what gets written and retrieved

Stored injection is invisible to input filtering because the payload is clean text until a later session retrieves it. The controls live around the store instead:

  • Validate and attribute writes, so adversarial instructions cannot enter long-term memory anonymously through a public channel like support tickets.
  • Inspect retrieved context at read time, the same way inbound tool results are inspected, because retrieval is the moment stored text becomes instructions.
  • Scope memory per user and per agent, so one session cannot write into the context a different session will trust.

For RAG pipelines the same discipline applies to the knowledge base: provenance on every document, allowlisted ingestion sources, and periodic scans of the corpus for embedded instructions.

Least privilege: scoped credentials and a bounded blast radius

Every injection ends the same way, with the model spending the access it was already given. Cutting that access is the one defense that keeps working when detection fails. Scope credentials per agent and per task instead of sharing a broad service account, require human approval for irreversible or outward-facing actions, and remove a leg of the lethal trifecta wherever the workflow allows it: an agent that reads untrusted content should not also hold private data and an external write path.

Identity and audit: per-hop policy for multi-agent systems

Cross-agent injection defeats content inspection because the malicious message is indistinguishable from legitimate inter-agent traffic. The controls here are structural. Identity must travel with every call so each action stays attributable to a specific agent, policy must be enforced at every hop rather than once at the system boundary, and the audit trail must be complete enough to reconstruct the chain after the fact. Without per-hop enforcement, the first compromised agent inherits the trust of the whole workflow, which is exactly the property Morris II exploited.

Why defense in depth is the consensus

This is why responsible guidance has converged on defense in depth rather than prevention. By late 2025, vendors building agentic browsers were publicly acknowledging that prompt injection may never be fully eliminated, only contained. Containment means assuming injection will land and limiting what it can reach: scoped credentials, inspection on the path, and a complete record of what happened. Each layer above covers the column of the taxonomy the others miss, and an enterprise program needs all five before the table stops having an exposed column.

How prompt injection maps to the OWASP Agentic Top 10

The taxonomy lines up with the agentic threat catalog. Direct and indirect injection are the model-layer face of agent goal hijack (ASI01). Stored injection is memory and context poisoning (ASI06). Cross-agent injection is the propagation mechanism behind several agentic categories at once. The AI security frameworks reference maps how OWASP, NIST, and MITRE divide this ground, and the Agentic Top 10 explainer walks each category with examples.

Where Speakeasy fits

No product makes prompt injection impossible, and any vendor that claims otherwise is selling the direct case as if it were the whole problem. What an AI control plane does is make the other three types containable. The MCP gateway inspects tool calls and the content that comes back from them, which is where indirect injection enters an agent. Agent hooks run inside the agent loop and can inspect or block a prompt, a retrieval, or a tool call before it executes. A shared identity foundation keeps every action attributable across agents, and audit logging produces the record that turns an incident into something you can reconstruct. The injection will still arrive. The point of the control plane is that it arrives somewhere you can see it and into a blast radius you have already scoped.

Frequently asked questions

Not completely. Because a language model reads instructions and data through the same channel, no filter reliably separates the two, and by late 2025 vendors building agentic systems were publicly acknowledging that prompt injection may never be fully eliminated. The realistic goal is containment, not prevention: assume an injection will land, inspect content on the path, scope what agents can reach, and keep a complete audit record so an incident can be caught and reconstructed.

Direct prompt injection is when the attacker controls the user input and writes instructions to override the system prompt, such as a jailbreak. Indirect prompt injection is when the attacker never talks to the model and instead plants instructions in content the agent later reads as part of a task, such as a web page, support ticket, email, or tool output. Indirect injection is the more dangerous shape in agentic systems because the malicious text arrives inside data the agent was told to trust.

Yes. RAG poisoning is a stored prompt injection aimed at the knowledge base. An attacker writes adversarial instructions into the documents or vector store that a retrieval-augmented generation pipeline pulls from. The content looks clean when written and only becomes an instruction when it is retrieved into the context of a later session, which is why input filtering does not catch it. The defenses are integrity checks on what gets indexed and inspection at retrieval time.

The lethal trifecta is Simon Willison's name for the three conditions that make indirect prompt injection catastrophic when they hold at once: the agent has access to private data, it is exposed to untrusted content, and it can communicate externally. With all three present, injected instructions can read sensitive data and send it out. Removing any single leg closes the exfiltration path, which is why the defense is architectural rather than a better content filter.

Jailbreaking is a subset of direct prompt injection focused on getting a model to bypass its own safety policy. Prompt injection is the broader problem of untrusted text being interpreted as instructions, which includes indirect, stored, and cross-agent variants where the attacker never touches the input field. Jailbreaking targets the model's guardrails; the injection types that matter for agents target the data and tools the agent acts on.

Indirect prompt injection, and its stored and cross-agent extensions. These are the variants where the attacker reaches the agent through data and tools rather than the prompt, so model-layer guardrails never see them. The most damaging real incidents, including zero-click inbox exfiltration and MCP server attacks, have all been indirect. Tool poisoning, where the injection rides in an MCP tool description, is the highest-impact concrete instance.

AI everywhere.