
How we approach AI security: where to apply policy and how to enforce it

Vishal Gowda

June 17, 2026 · 8 min read

We spend a lot of our time thinking about how to inspect agent traffic in real time, and we wanted to share how we approach AI security policy enforcement. There is plenty written about what to govern, including NIST's AI Risk Management Framework, ISO 42001, and the OWASP Top 10 for LLM applications, all of which give a good account of the risks. There is much less written about the engineering of actually enforcing a policy on live traffic, which is the part we work on day to day.

When we sat down to design this, it kept collapsing into two questions:

Where on an agent's path do you apply a policy? An agent interaction is a sequence of distinct moments, not a single event, and each one leaks differently.
How do you decide whether a given moment violates the policy? Detection is a handful of techniques rather than one, and they differ in both what they cost and how predictable their verdicts are.

There's a third question sitting underneath. How do you keep enforcement fast and quiet enough that people leave it switched on? We'll walk through all three, and we'd welcome pushback from anyone who has built something similar.

Where to apply AI security policy

The thing we kept coming back to is that a conversation is not one blob of text to scan. A single agent turn passes through four distinct points where AI agent guardrails need to be enforced.

Figure: the four enforcement points on the path between user, model, and tool / MCP.

User prompt: where PII and prompt injection are most likely to enter
Tool call: the highest-leverage point, where intent turns into action
Tool response: watches for exfiltration on the way back into context
Model response: catches a secret, or anything that crosses a guardrail

A single agent turn passes through four distinct points. A policy can be scoped to each one, because each leaks differently.

Detecting PII and prompt injection in user prompts

The prompt is what the user sends to the model. It's where personally identifiable information (PII) is most likely to enter the system, because a person is typing free text, and it's the obvious surface for prompt injection. So a PII scan or a prompt injection detection check belongs here.

Running that same PII scan on model responses, on the other hand, is mostly wasted work and a steady source of noise. That's part of why we don't treat the conversation as one undifferentiated stream.
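A minimal sketch of what scoping can look like in code. The `Point` and `Policy` names and the toy email check are hypothetical, chosen for illustration and not taken from a real product API. The idea is only that a policy declares the points it applies to, and nothing else ever runs it.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Point(Enum):
    """The four points a single agent turn passes through."""
    USER_PROMPT = "user_prompt"
    TOOL_CALL = "tool_call"
    TOOL_RESPONSE = "tool_response"
    MODEL_RESPONSE = "model_response"


@dataclass(frozen=True)
class Policy:
    name: str
    scopes: frozenset          # the Points where this policy applies
    check: Callable[[str], bool]  # returns True when the text violates it


def looks_like_email(text: str) -> bool:
    # Toy stand-in for a real PII detector (assumption, for illustration only).
    return "@" in text


POLICIES = [
    Policy("pii-in-prompts", frozenset({Point.USER_PROMPT}), looks_like_email),
]


def violations(point: Point, text: str) -> list[str]:
    """Run only the policies scoped to this point and name the ones that fire."""
    return [p.name for p in POLICIES if point in p.scopes and p.check(text)]
```

With this shape, the same text is checked at the user prompt and skipped at the model response, which is where the savings in work and noise come from.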

Enforcing policy at the tool call

When an agent decides to invoke a tool, the call carries a function name and its arguments. For us this is the highest-leverage point, because it's where intent turns into action. A policy here can target:

a specific MCP server
a specific function (bash, read, edit)
a pattern inside the arguments

Matching on curl or rm -rf in a shell tool's arguments is a tool-call policy, and it's the kind of thing that's much easier to reason about at the call site than anywhere else.
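As a rough illustration, a tool-call policy can be modeled as three optional filters: server, function, and an argument pattern. The `ToolCall` and `ToolCallPolicy` types below are hypothetical, and the regex is a deliberately simple example, not a complete list of risky commands.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    server: str      # MCP server name (hypothetical field)
    function: str    # e.g. "bash", "read", "edit"
    arguments: str   # raw argument text passed to the tool


@dataclass(frozen=True)
class ToolCallPolicy:
    name: str
    server: str | None = None                 # None matches any server
    function: str | None = None               # None matches any function
    argument_pattern: re.Pattern | None = None  # None matches any arguments

    def matches(self, call: ToolCall) -> bool:
        """True when every filter that is set matches the call."""
        if self.server is not None and call.server != self.server:
            return False
        if self.function is not None and call.function != self.function:
            return False
        if self.argument_pattern is not None:
            return self.argument_pattern.search(call.arguments) is not None
        return True


# Illustrative policy: flag curl or rm -rf inside a shell tool's arguments.
RISKY_SHELL = ToolCallPolicy(
    name="risky-shell-command",
    function="bash",
    argument_pattern=re.compile(r"\bcurl\b|\brm\s+-rf\b"),
)
```

Because the function filter runs before the pattern, the string "curl" in the arguments of a read call never trips the shell policy.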

Catching data exfiltration in tool responses

The data a tool returns is the third point. A tool wired into an internal system can hand back records full of sensitive information, and that payload flows straight into the model's context. Scoping detection to tool responses lets us watch for exfiltration on the way back in, independently of anything the user typed.

Scoping policy to model responses and agent output

This is the text the model itself produces. Scoping a policy here catches the case where the model emits something it shouldn't:

a secret it inferred
a disallowed recommendation
anything that crosses a stated guardrail

We were deliberate about the name, too. We call this point the model response rather than the "assistant message", which reads ambiguously, because what's actually being inspected is the agent's own output.

Three AI threat detection techniques, ranked by cost and speed

Once a policy is scoped to the right point, something has to decide whether the content actually violates it. Defense-in-depth is the familiar instinct here: no single control catches everything, so you layer them.

We borrow the layering but change the ordering. Classic defense-in-depth stacks redundant controls so a miss in one is caught by the next. Our layers are a cascade ordered roughly by cost, where the cheapest check that can answer the question runs first and only what it can't resolve falls through to the next.

Cost isn't the only axis, though. The cheaper layers are also the more predictable ones, since a regex returns the same verdict on the same input every time, while an LLM judge can decide differently across two identical runs. So when we reach for a deterministic check first, we're optimizing for repeatable enforcement as much as for latency.
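The cascade can be sketched in a few lines. This is an illustration under assumptions, not the production design: the `Verdict` values, the layer signature, and the choice to allow when no layer decides are all hypothetical.

```python
from enum import Enum
from typing import Callable, List, Sequence


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    UNDECIDED = "undecided"  # this layer can't answer, so fall through


Layer = Callable[[str], Verdict]


def run_cascade(text: str, layers: Sequence[Layer]) -> Verdict:
    """Run layers in order (cheapest first) and stop at the first firm verdict."""
    for layer in layers:
        verdict = layer(text)
        if verdict is not Verdict.UNDECIDED:
            return verdict
    # Assumption: allow when no layer reaches a verdict.
    return Verdict.ALLOW


def banned_string_layer(text: str) -> Verdict:
    # Deterministic: same input, same verdict, every time.
    return Verdict.BLOCK if "DROP TABLE" in text else Verdict.UNDECIDED


judge_calls: List[str] = []


def stub_judge_layer(text: str) -> Verdict:
    # Stand-in for an expensive model call; records each invocation.
    judge_calls.append(text)
    return Verdict.ALLOW
```

Passing `[banned_string_layer, stub_judge_layer]` as the layers means the stub judge is only invoked for text the banned-string check could not decide.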

Regex and exact-text matching for deterministic AI security checks

The fastest layer is deterministic pattern matching: regular expressions and exact-text matches. It runs inline as traffic passes through, so there's not much to optimize away. It's the right tool when the pattern is knowable ahead of time:

known secret formats
banned strings
specific command signatures

If a regex can answer the question, there's no reason to spend a model call on it.
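For a sense of scale, here is a small sketch of this layer. The two patterns (an AWS-style access key ID and a PEM private key header) and the banned string are examples picked for illustration, not a recommended rule set.

```python
import re
from typing import List

# Illustrative patterns only; a real deployment would maintain a longer list.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "pem_private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}

# Hypothetical exact-text matches.
BANNED_STRINGS = ("internal-only",)


def scan(text: str) -> List[str]:
    """Return the names of every pattern or banned string found in the text."""
    hits = [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
    hits += [f"banned:{s}" for s in BANNED_STRINGS if s in text]
    return hits
```

The patterns are compiled once at import time, so each inline check is a handful of regex searches over the payload.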

Natural language processing for PII detection in AI agents

For PII, fixed patterns alone get brittle fast. Our natural language processing layer is a hybrid detector that pairs regex-style recognizers with machine-learning models. It tokenizes text before classifying it, so it picks up entities like names and addresses that a static pattern would miss. It sits a step above plain regex in both capability and cost, which is the tradeoff we're making when we reach for it.
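The hybrid shape can be sketched without any particular library. Everything below is hypothetical: the `Entity` type, the email pattern, and especially `toy_name_model`, which is a lookup table standing in for a trained token classifier so the example stays self-contained.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Entity:
    kind: str
    text: str


# A token classifier returns one label (or None) per token.
TokenClassifier = Callable[[List[str]], List[Optional[str]]]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # regex-style recognizer
TOKEN = re.compile(r"\w+")                        # naive tokenizer


def toy_name_model(tokens: List[str]) -> List[Optional[str]]:
    # Stand-in for an ML model (assumption): labels a fixed set of names.
    known_names = {"Ada", "Grace"}
    return ["PERSON" if token in known_names else None for token in tokens]


def detect_pii(text: str, model: TokenClassifier) -> List[Entity]:
    """Combine pattern hits with per-token labels from the model."""
    entities = [Entity("EMAIL", m.group()) for m in EMAIL.finditer(text)]
    tokens = TOKEN.findall(text)
    for token, label in zip(tokens, model(tokens)):
        if label is not None:
            entities.append(Entity(label, token))
    return entities
```

The pattern side catches the email address, while the name has no fixed format and is only found by the classifier side.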

LLM-as-judge for intent-based AI security policies

The most flexible layer is an LLM acting as a judge. It's what powers our prompt-based policies, where someone writes a guardrail in natural language ("flag any message that exhibits a particular bias") instead of as a pattern. The judge model reads the interaction and decides whether it matches that intent. It handles the policies that can't be reduced to a string match, and in exchange we give up repeatability and pay the latency and compute of a model call on every evaluation it covers.
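A judge reduces to a prompt, a model call, and a parser for the reply. The sketch below keeps the model behind a plain callable so it runs without any provider SDK. The template wording, the one-word answer format, and the decision to treat an unparseable reply as a violation are all assumptions made for illustration.

```python
from typing import Callable

# Any function that takes a prompt and returns the model's text reply.
ModelCall = Callable[[str], str]

JUDGE_TEMPLATE = (
    "You are a policy judge.\n"
    "Policy: {policy}\n"
    "Interaction:\n{interaction}\n"
    "Answer with exactly one word: VIOLATION or OK."
)


def judge(policy: str, interaction: str, call_model: ModelCall) -> bool:
    """Return True when the model says the interaction violates the policy."""
    prompt = JUDGE_TEMPLATE.format(policy=policy, interaction=interaction)
    answer = call_model(prompt).strip().upper()
    if answer.startswith("VIOLATION"):
        return True
    if answer.startswith("OK"):
        return False
    # Assumption: fail closed when the reply can't be parsed.
    return True
```

Constraining the reply to one word makes the verdict easy to parse, but it does not make the judge deterministic. The same prompt can still come back with a different answer on a second run.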

The three layers trade off along the same axis. Here's roughly how we think about where each one fits.

Figure: the detection cascade. The top layer runs first, and cost and latency rise from top to bottom.

Regex / exact-text (runs first): known patterns such as secret formats, banned strings, and command signatures. Cost: lowest, runs inline.
Natural language processing (escalates only if the cheaper layer can't decide): PII and named entities that fixed patterns miss. Cost: moderate, ML plus NLP.
LLM-as-judge (last resort, escalates only if the cheaper layer can't decide): intent-based guardrails defined in natural language. Cost: highest, a model call per evaluation.

The three detection techniques stacked by cost. The cheapest check that can answer the question runs first, and traffic only climbs to the next layer when the current one can't decide.

Why AI security enforcement has to run without adding latency

All of this runs on synchronous hooks. The agent can't take its next step until the hook returns a verdict, so every check sits squarely in the critical path of the interaction. That constraint ends up shaping most of the design decisions we make.
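To make the critical-path point concrete, here is a small sketch of a blocking hook. The `HookResult` type and the `agent_step` function are hypothetical. The only point is that the action cannot start until the check returns, so the check's elapsed time is added directly to the step.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class HookResult:
    allowed: bool
    elapsed_ms: float  # time the agent spent waiting on the check


def run_hook(check: Callable[[str], bool], payload: str) -> HookResult:
    """Run a check synchronously; check returns True on a violation."""
    start = time.perf_counter()
    violated = check(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return HookResult(allowed=not violated, elapsed_ms=elapsed_ms)


def agent_step(
    payload: str,
    check: Callable[[str], bool],
    act: Callable[[str], str],
) -> Optional[str]:
    """Take the next step only after the hook has returned a verdict."""
    result = run_hook(check, payload)
    if not result.allowed:
        return None  # blocked: the action never runs
    return act(payload)
```

Recording `elapsed_ms` per hook is one simple way to see which policies are spending the latency budget.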

Deterministic checks are cheap enough that their latency is close to a rounding error. An LLM judge is a different animal, since it adds the round trip and inference time of a separate model call to every interaction it covers. Left alone, that's exactly the kind of cost that gets a security control quietly switched off, so we lean on two ideas to keep it in check.

Run the cheap layers first. Most violations have a deterministic signature. If regex or the NLP layer (Presidio, in our setup) catches it, the judge never has to fire.
Hoist deterministic rules out of prompt policies. A natural-language policy can often be distilled into a set of substring or regex rules that approximate it. Those run as a fast layer zero that short-circuits the interaction before the judge is invoked. We're working toward inferring that rule set automatically from real session history, so someone can write a policy in plain language and still get deterministic, low-latency enforcement underneath it.
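The second idea can be sketched as a policy that carries its distilled rules alongside its natural-language text. The `PromptPolicy` type, the example policy, and its single regex are hypothetical, and the rules here are written by hand, not inferred from session history. A rule hit short-circuits; a miss still goes to the judge, because the rules only approximate the policy.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A judge takes (policy text, message) and returns True on a violation.
Judge = Callable[[str, str], bool]


@dataclass
class PromptPolicy:
    text: str                                    # the natural-language policy
    distilled_rules: List[re.Pattern] = field(default_factory=list)


def evaluate(policy: PromptPolicy, message: str, judge: Judge) -> Tuple[bool, str]:
    """Return (violates, decided_by), trying layer zero before the judge."""
    for rule in policy.distilled_rules:
        if rule.search(message):
            return True, "layer-zero"  # short-circuit: no model call
    return judge(policy.text, message), "judge"


# Hypothetical policy with one hand-written rule approximating it.
CREDENTIALS_POLICY = PromptPolicy(
    text="Flag any message that asks for production database credentials",
    distilled_rules=[
        re.compile(r"\bprod(?:uction)?\s+(?:db|database)\s+password\b", re.IGNORECASE),
    ],
)
```

Returning which layer decided makes it possible to measure how often the judge still fires for a given policy.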

Figure: distilling a natural-language policy into deterministic rules.

Starting point: a natural-language policy, for example "Flag any message that exhibits a particular bias".
Distilled: substring / regex rules that approximate it. These feed into layer zero of the detection cascade above, so the judge fires less often on interactions the rules already cover.

A natural-language policy can be distilled into deterministic rules that approximate it, feeding the fast layer at the top of the detection cascade.

Reducing false positives to keep AI security policies switched on

Speed is only half of what keeps a control switched on. The other half is noise, since a policy that fires constantly on safe interactions trains its owners to ignore it, which is worse than not having the policy at all.

Most of the time the detector is doing its job correctly and the policy is just scoped too broadly, so the fix is better scoping, not a blunter detector. We pulled that topic into its own piece, "Cutting false positives in AI security without weakening policy", which walks through fine-grained scopes, exemptions, and testing rules against real traffic.

Where AI security policy enforcement is headed next

None of this is finished, and the direction we're most interested in is closing the loop between the flexible layer and the fast one. That means letting people express intent in natural language, then automatically learning the deterministic rules that approximate it from real traffic, so the expensive judge fires less and less over time. If you're working on similar problems, we'd love to compare notes.

This is the security work behind the AI control plane, where the same hooks inspect prompts, model responses, tool calls, and tool responses on the path between an agent and the systems it reaches.

Frequently asked questions

Where should AI security policy be applied?

At four distinct points in an agent interaction: the user prompt, the model response, the tool call, and the tool response. Each leaks differently, so a policy is better scoped to the specific point that matters than applied to the whole conversation.

What is an LLM-as-judge?

It's a detection technique that uses a language model to evaluate whether an interaction violates a policy written in natural language. It handles guardrails that can't be reduced to a pattern match, at the cost of a model call per evaluation, which is why cheaper deterministic checks should run ahead of it.

Last updated on June 17, 2026