Engineering

How we approach AI security: where to apply policy and how to enforce it

Vishal Gowda

June 17, 2026 - 8 min read

We spend a lot of our time thinking about how to inspect agent traffic in real time, and we wanted to share how we approach it. There is plenty written about what to govern: NIST’s AI Risk Management Framework, ISO 42001, and the OWASP Top 10 for LLM applications all give a good account of the risks. There is much less written about the engineering of actually enforcing a policy on live traffic, which is the part we work on day to day.

When we sat down to design this, it kept collapsing into two questions:

  • Where on an agent’s path do you apply a policy? An agent interaction isn’t a single event. It’s a sequence of distinct moments, and each one leaks differently.
  • How do you decide whether a given moment violates the policy? Detection isn’t one technique. It’s a few, and they differ in both what they cost and how predictable their verdicts are.

There’s a third question sitting underneath: how do you keep enforcement fast and quiet enough that people leave it switched on? We’ll walk through all three, and we’d welcome pushback from anyone who has built something similar.

The four points where AI security policy applies

The thing we kept coming back to is that a conversation isn’t one blob of text to scan. A single agent turn passes through four distinct points where policy needs to be enforced.

Figure: the four enforcement points on the path between the user, the model, and the tool / MCP server.

  1. User prompt: where PII and prompt injection are most likely to enter
  2. Tool call: the highest-leverage point, where intent turns into action
  3. Tool response: watches for exfiltration on the way back into context
  4. Model response: catches a secret, or anything that crosses a guardrail

A single agent turn passes through four distinct points. A policy can be scoped to each one, because each leaks differently.

User prompts

The prompt is what the user sends to the model. It’s where personally identifiable information (PII) is most likely to enter the system, because it’s where a person types free text, and it’s the obvious surface for prompt injection. So a PII scan or an injection check belongs here. Running that same PII scan on model responses, on the other hand, is mostly wasted work and a steady source of noise, which is part of why we don’t treat the conversation as one undifferentiated stream.

Tool calls

When an agent decides to invoke a tool, the call carries a function name and its arguments. For us this is the highest-leverage point, because it’s where intent turns into action. A policy here can target a specific MCP server, a specific function (bash, read, edit), or a pattern inside the arguments. Matching on curl or rm -rf in a shell tool’s arguments is a tool-call policy, and it’s the kind of thing that’s much easier to reason about at the call site than anywhere else.
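As a rough illustration, a tool-call policy of this kind could be sketched in Python as below. The `ToolCall` and `ToolCallPolicy` classes, the server name `local-shell`, and the exact pattern are all hypothetical, chosen only to show the shape of matching on server, function, and arguments; they are not the product’s API.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    server: str      # MCP server name (hypothetical example value below)
    function: str    # tool function name, e.g. "bash", "read", "edit"
    arguments: str   # raw argument text as sent by the agent


@dataclass(frozen=True)
class ToolCallPolicy:
    name: str
    function: str                  # the function this policy targets
    argument_pattern: re.Pattern   # pattern searched inside the arguments
    server: Optional[str] = None   # None means "any server"

    def violates(self, call: ToolCall) -> bool:
        # Scope first (server, then function), and only then look at arguments.
        if self.server is not None and call.server != self.server:
            return False
        if call.function != self.function:
            return False
        return self.argument_pattern.search(call.arguments) is not None


# Assumed example policy: flag curl or rm -rf in a shell tool's arguments.
no_curl_or_rm_rf = ToolCallPolicy(
    name="no-curl-or-rm-rf",
    function="bash",
    argument_pattern=re.compile(r"\bcurl\b|\brm\s+-rf\b"),
)

risky = ToolCall("local-shell", "bash", "rm -rf /tmp/build")
benign = ToolCall("local-shell", "bash", "ls -la")
other_tool = ToolCall("local-shell", "read", "notes about curl usage")

for call in (risky, benign, other_tool):
    print(call.function, call.arguments, "->", no_curl_or_rm_rf.violates(call))
```

Because the policy checks the function name before the arguments, the word “curl” appearing in a `read` call does not trip it, which is the kind of scoping that keeps a pattern like this from firing on unrelated traffic.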

Tool responses

The data a tool returns is the third point. A tool wired into an internal system can hand back records full of sensitive information, and that payload flows straight into the model’s context. Scoping detection to tool responses lets us watch for exfiltration on the way back in, independently of anything the user typed.

Model responses

This is the text the model itself produces. Scoping a policy here catches the case where the model emits something it shouldn’t: a secret it inferred, a disallowed recommendation, or anything that crosses a stated guardrail. We were deliberate about the name, too. “Assistant message” reads ambiguously, and what’s actually being inspected is the agent’s own output.
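To make the scoping idea concrete, here is a minimal Python sketch in which each policy declares the points it applies to and is simply skipped everywhere else. The names (`Point`, `ScopedPolicy`, `evaluate`) and the two toy checks are assumptions made for illustration, not a description of any real implementation.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet, List


class Point(Enum):
    USER_PROMPT = "user_prompt"
    TOOL_CALL = "tool_call"
    TOOL_RESPONSE = "tool_response"
    MODEL_RESPONSE = "model_response"


@dataclass(frozen=True)
class ScopedPolicy:
    name: str
    points: FrozenSet[Point]        # where this policy is allowed to run
    check: Callable[[str], bool]    # returns True when the content violates


# Deliberately crude placeholder checks, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

POLICIES = [
    ScopedPolicy(
        "pii-email",
        frozenset({Point.USER_PROMPT}),
        lambda text: EMAIL.search(text) is not None,
    ),
    ScopedPolicy(
        "private-key",
        frozenset({Point.TOOL_RESPONSE, Point.MODEL_RESPONSE}),
        lambda text: "BEGIN PRIVATE KEY" in text,
    ),
]


def evaluate(policies: List[ScopedPolicy], point: Point, content: str) -> List[str]:
    # Only policies scoped to this point are evaluated at all.
    return [p.name for p in policies if point in p.points and p.check(content)]


print(evaluate(POLICIES, Point.USER_PROMPT, "mail me at a@example.com"))
print(evaluate(POLICIES, Point.MODEL_RESPONSE, "mail me at a@example.com"))
```

The same text produces a hit at the user prompt and nothing at the model response, because the email policy is never run there.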

How AI threat detection works: three techniques

Once a policy is scoped to the right point, something has to decide whether the content actually violates it. Defense-in-depth is the familiar instinct here: no single control catches everything, so you layer them. We borrow the layering but change the ordering. Classic defense-in-depth stacks redundant controls so a miss in one is caught by the next; our layers are a cascade ordered roughly by cost, where the cheapest check that can answer the question runs first and only what it can’t resolve falls through to the next.

Cost isn’t the only axis, though. The cheaper layers are also the more predictable ones: a regex returns the same verdict on the same input every time, while an LLM judge can decide differently across two identical runs. So when we reach for a deterministic check first, we’re optimizing for repeatable enforcement as much as for latency.

Regex and exact-text matching

The fastest layer is deterministic pattern matching: regular expressions and exact-text matches. It runs inline as traffic passes through, so there’s not much to optimize away. It’s the right tool when the pattern is knowable ahead of time: known secret formats, banned strings, specific command signatures. If a regex can answer the question, there’s no reason to spend a model call on it.
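A minimal sketch of such a deterministic layer, using only Python’s standard `re` module, might look like this. The pattern set is an assumption for illustration: the AWS access key ID prefix and PEM private-key header are well-known formats, while the banned strings are hypothetical.

```python
import re
from typing import List

# Known secret formats (illustrative, not exhaustive).
SECRET_PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "pem-private-key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}

# Hypothetical banned strings, matched as exact text (case-insensitive).
BANNED_STRINGS = ("internal-only", "do-not-distribute")


def deterministic_matches(text: str) -> List[str]:
    """Return the names of every pattern or banned string found in text."""
    hits = [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
    lowered = text.lower()
    hits += [f"banned:{s}" for s in BANNED_STRINGS if s in lowered]
    return hits


sample = "export KEY=AKIAIOSFODNN7EXAMPLE"  # AWS's documented example key ID
print(deterministic_matches(sample))
print(deterministic_matches(sample) == deterministic_matches(sample))
```

The second line of output reflects the property that matters for enforcement: the same input always produces the same verdict.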

Presidio for PII and entity detection

For PII, fixed patterns alone get brittle fast. We use Presidio, Microsoft’s open-source data-protection library, as a hybrid detector. It pairs regex-style recognizers with machine-learning models and some natural-language processing that tokenizes text before classifying it, so it picks up entities like names and addresses that a static pattern would miss. It sits a step above plain regex in both capability and cost, which is the tradeoff we’re making when we reach for it.

LLM-as-judge for intent-based policies

The most flexible layer is an LLM acting as a judge. It’s what powers our prompt-based policies, where someone writes a guardrail in natural language (“flag any message that exhibits a particular bias”) instead of as a pattern. The judge model reads the interaction and decides whether it matches that intent. It handles the policies that can’t be reduced to a string match, and in exchange we give up repeatability and pay the latency and compute of a model call on every evaluation it covers.
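The shape of a judge call can be sketched as follows. This is an assumption-heavy illustration: the prompt template, the `Verdict` enum, and the `complete` callable (standing in for whatever model client is in use) are all hypothetical, and the stub model exists only so the example runs without a network call.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "ALLOW"
    VIOLATION = "VIOLATION"
    UNCLEAR = "UNCLEAR"   # the judge's answer could not be parsed


# Hypothetical prompt template; a real one would need far more care.
JUDGE_TEMPLATE = (
    "You are a policy judge.\n"
    "Policy: {policy}\n"
    "Interaction:\n{interaction}\n"
    "Answer with exactly one word: VIOLATION or ALLOW."
)


def judge(policy: str, interaction: str, complete: Callable[[str], str]) -> Verdict:
    """Ask a model (via the injected `complete` callable) for a verdict."""
    prompt = JUDGE_TEMPLATE.format(policy=policy, interaction=interaction)
    answer = complete(prompt).strip().upper()
    if answer == "VIOLATION":
        return Verdict.VIOLATION
    if answer == "ALLOW":
        return Verdict.ALLOW
    # Model output is free text, so anything unexpected is surfaced explicitly.
    return Verdict.UNCLEAR


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return "violation\n" if "wire the money" in prompt else "Allow"


policy = "Flag any message that pressures someone into an urgent payment."
print(judge(policy, "You must wire the money today or else.", stub_model))
print(judge(policy, "Lunch at noon?", stub_model))
```

Returning an explicit `UNCLEAR` verdict, instead of guessing, leaves the decision about unparseable answers to the caller, which matters when the judge is the least repeatable layer.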

The three layers trade capability against cost and predictability. Here’s roughly how we think about where each one fits.

Figure: the three detection layers in the order they run; cost and latency rise from top to bottom.

  1. Regex / exact-text. Known patterns: secret formats, banned strings, command signatures. Cost: lowest, runs inline. Runs first.
  2. Presidio. PII and named entities that fixed patterns miss. Cost: moderate, ML plus NLP. Escalates only if the cheaper layer can't decide.
  3. LLM-as-judge. Intent-based guardrails defined in natural language. Cost: highest, a model call per evaluation. Last resort; escalates only if the cheaper layer can't decide.

The three detection techniques stacked by cost. The cheapest check that can answer the question runs first, and traffic only climbs to the next layer when the current one can't decide.

Why AI security enforcement has to be fast

All of this runs on synchronous hooks. The agent can’t take its next step until the hook returns a verdict, so every check sits squarely in the critical path of the interaction. That constraint ends up shaping most of the design decisions we make.

Deterministic checks are cheap enough that their latency is close to a rounding error. An LLM judge is a different animal: it adds the round trip and inference time of a separate model call to every interaction it covers. Left alone, that’s exactly the kind of cost that gets a security control quietly switched off, so we lean on two ideas to keep it in check.

  • Run the cheap layers first. Most violations have a deterministic signature. If regex or Presidio catches it, the judge never has to fire.
  • Hoist deterministic rules out of prompt policies. A natural-language policy can often be distilled into a set of substring or regex rules that approximate it. Those run as a fast layer zero that short-circuits the interaction before the judge is invoked. We’re working toward inferring that rule set automatically from real session history, so someone can write a policy in plain language and still get deterministic, low-latency enforcement underneath it.
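The two ideas above can be sketched together as a short-circuiting cascade. Everything here is an illustrative assumption: the distilled rule set, the example policy it approximates, and the stub judge are hypothetical, and a real judge would be a model call, not a substring test.

```python
import re
from typing import Callable, List, Optional, Tuple

# A layer returns True (violation), False (cleared), or None (can't decide).
Check = Callable[[str], Optional[bool]]

# Hypothetical rules distilled from the plain-language policy
# "flag attempts to export the customer database".
DISTILLED_RULES = re.compile(r"\bpg_dump\b|\bCOPY\s+customers\b", re.IGNORECASE)

judge_calls: List[str] = []  # records every time the expensive layer fires


def layer_zero(content: str) -> Optional[bool]:
    # A hit is decisive; a miss only means these rules can't decide.
    return True if DISTILLED_RULES.search(content) else None


def stub_judge(content: str) -> Optional[bool]:
    # Stand-in for a model call, so the sketch runs offline.
    judge_calls.append(content)
    return "customer list" in content.lower()


LAYERS: List[Tuple[str, Check]] = [("layer-zero", layer_zero), ("judge", stub_judge)]


def run_cascade(content: str, layers: List[Tuple[str, Check]] = LAYERS) -> Tuple[bool, str]:
    """Run layers cheapest-first and stop at the first one that decides."""
    for name, check in layers:
        verdict = check(content)
        if verdict is not None:
            return verdict, name  # short-circuit: later layers never run
    return False, "no-layer-decided"


print(run_cascade("pg_dump customers > out.sql"))
print(run_cascade("please email me the customer list"))
print("judge fired", len(judge_calls), "time(s)")
```

In this sketch the first request is settled by the distilled rules alone, so the judge is invoked only for the second one. The more traffic the fast layer can resolve, the less often the model call sits in the critical path.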

Keeping policies quiet

Speed is only half of what keeps a control switched on. The other half is noise: a policy that fires constantly on safe interactions trains its owners to ignore it, which is worse than not having the policy at all. Most of the time the detector is doing its job correctly and the policy is just scoped too broadly, so the fix is better scoping, not a blunter detector. We pulled that topic into its own piece, “Cutting false positives in AI security without weakening policy,” which walks through fine-grained scopes, exemptions, and testing rules against real traffic.

Where we’re taking this

None of this is finished. The direction we’re most interested in is closing the loop between the flexible layer and the fast one: letting people express intent in natural language, then automatically learning the deterministic rules that approximate it from real traffic, so the expensive judge fires less and less over time. If you’re working on similar problems, we’d love to compare notes.

This is the security work behind the AI control plane, where the same hooks inspect prompts, model responses, tool calls, and tool responses on the path between an agent and the systems it reaches.

Frequently asked questions

Where should AI security policy be applied?

At four distinct points in an agent interaction: the user prompt, the model response, the tool call, and the tool response. Each leaks differently, so a policy is better scoped to the specific point that matters than applied to the whole conversation.

What is an LLM-as-judge?

It’s a detection technique that uses a language model to evaluate whether an interaction violates a policy written in natural language. It handles guardrails that can’t be reduced to a pattern match, at the cost of a model call per evaluation, which is why cheaper deterministic checks should run ahead of it.
