

Prompting agents: What works and why

Nolan Sullivan


September 23, 2025 - 12 min read


As chatbots, large language models are surprisingly human-like and effective conversation partners. They mirror whoever is chatting with them and have a knack for small talk like no other - sometimes even appearing too compassionate.

In contrast, working with an LLM agent (rather than a chatbot) often feels like you’re pushing a string, only to realize too late that the string has folded over itself in kinks and loops, and you need to start over again. Agents that lack clear success and failure criteria or explicit direction can be expensive, slow, and in the worst case, destructive.

In this guide, we unpack the distinct layers that make up an agent’s prompt and explore proven methods for improving how we prompt agents.

But first, why focus on agents if there is already so much written about prompts and context engineering in general?

Chatbots are for loops; agents are while loops

Chatbots take discrete turns, with a human available to steer the bot at each turn. They work on a single task per turn. Claude, ChatGPT, and Grok are examples of chatbots. There are many prompting guides available online for these popular chatbots.

Agents, on the other hand, work continuously in a loop to achieve complex goals. Agents are usually employed in situations where they have access to tools that influence the real world. Examples include Claude Code, ChatGPT Operator (soon to be replaced by ChatGPT agent mode), and Gemini CLI. Agents are so complex, layered, and multifaceted that most agent prompting guides only scratch the very surface - how an end user should ask an agent to do things.
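The loop metaphor maps directly onto code. Here’s a rough sketch of the control-flow difference (the functions are placeholders standing in for real LLM calls, not any particular SDK):

```python
# Minimal sketch of the control-flow difference between a chatbot and an agent.
# generate_reply and decide_next_action are placeholders for real LLM calls.

def generate_reply(history: list[str], message: str) -> str:
    return f"(reply to: {message})"  # placeholder for a chat completion

def decide_next_action(goal: str, state: dict) -> str:
    # Placeholder policy: stop after three steps.
    return "done" if state["steps"] >= 3 else "edit_files"

# Chatbot: a for loop over discrete turns, with a human steering each turn.
history: list[str] = []
for user_message in ["hi", "fix the failing test", "thanks"]:
    history.append(generate_reply(history, user_message))

# Agent: a while loop that keeps choosing and executing actions
# until it decides the goal has been reached.
state = {"steps": 0}
while decide_next_action("make the test suite pass", state) != "done":
    state["steps"] += 1  # stand-in for running a tool and observing the result
```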

A note on terminology

The AI world hasn’t settled on standard terms yet (even calling LLMs “AI” is sometimes frowned upon), so let’s be clear about what we mean:

Agents are AI systems that can take actions through tools - they can run commands, manipulate files, call APIs, and change things in the real world. Claude Code executing terminal commands is an agent. When Cursor edits your files or Gemini CLI runs Python scripts, they are also acting as agents.

Agent interfaces like Claude Code, ChatGPT Operator, and Gemini CLI are the products you interact with. They combine an underlying model (Claude 3.5 Sonnet, GPT-4, Gemini) with tools and a user interface.

Chatbots just generate text responses. They can’t execute code, access your filesystem, or take actions beyond returning text. Regular Claude.ai and ChatGPT (without plugins) are chatbots.

When we talk about “prompting agents” we mean getting AI to actually do things, not just talk about doing them.

When prompting an agent, start at the right layer

An agent’s prompt doesn’t start at the point when a user asks a question. It is a larger entity that we can break into the following distinct layers. Each layer influences how well an agent performs, and each is as important as the others to get right:

  1. The model’s platform-level instructions: At the highest layer are the platform-level instructions. These are set by the platform, like OpenAI or Anthropic. For example, even if we use the OpenAI API, the model’s API responses won’t include copyrighted media, illegal material, or information that could cause harm.

  2. Developer instructions: These are often called the system prompt, and for most developers, this is the highest level of authority their prompts can have. Examples include proprietary system prompts, like those of Claude Code and Cursor, and prompts for open-source agents like Goose and Cline. The system prompt is set by the agent’s developers.

  3. User rules: Some agents support rules files that the user can set for all instances of their agent. For example, Claude Code reads ~/.claude/CLAUDE.md, as well as any CLAUDE.md in a parent directory of the project you’re working on, and consistently applies your rules to its actions.

  4. Project rules: These are instructions an agent applies within a specific directory. For example, a CLAUDE.md file in the project root directory, or a child directory.

  5. User request: This is the actual prompt entered by the user, for example, “Fix the race condition in agents/tasks.py.”

  6. Tool specifications: At the lowest level are the descriptions and guidelines from tool developers, which include input/output formats, constraints, and best practices. These are usually only for the agent to read and are written by the tool developers. An example is the browser_console_messages tool in Playwright MCP, whose description reads “Returns all console messages.”
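For a sense of what the agent actually reads, here is a rough, MCP-style sketch of a tool specification (the fields below are illustrative, not copied from Playwright MCP’s source):

```python
# Illustrative MCP-style tool specification. The agent reads this to decide
# when and how to call the tool; the end user never sees it.
browser_console_messages_tool = {
    "name": "browser_console_messages",
    "description": "Returns all console messages",
    "inputSchema": {
        "type": "object",
        "properties": {},  # no arguments in this sketch
    },
}
```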

These different prompt levels are strung together in the agent’s context, and changing any one may have an effect on the agent’s performance. The levels you have access to and your history with the agent will determine where you should begin improving your prompts.

| Role | Description | Levels to influence |
| --- | --- | --- |
| Agent user | The person interacting with the agent, providing input and feedback. | User request, user rules, project rules |
| Agent developer | The person building and maintaining the agent, responsible for its overall behavior and capabilities. | Developer instructions |
| Model host | The platform hosting and serving the underlying model, such as OpenAI or Anthropic. | Platform-level instructions |
| Tool developer | The person or team responsible for creating and maintaining the tools the agent uses. | Tool specifications |

Understanding how system prompts shape agent behavior

You can’t change the system prompt of Claude Code or ChatGPT Operator, but understanding what’s happening behind the scenes helps explain why agents sometimes behave in unexpected ways and how to work around their limitations.

System prompts are the hidden instructions that make agents work. They’re written by the companies building these tools and run thousands of words long. When your agent refuses to do something reasonable or insists on doing something you didn’t ask for, the system prompt is often at work.

Here’s what these gigantic prompts typically contain:

1. Identity and role boundaries

Most agents start with a defined identity that constrains what they will and won’t do. This is why Claude Code won’t help you write malware, even if you have a legitimate security testing reason.

Cline’s open-source system prompt, for example, opens by defining exactly who Cline is and what it will and won’t do.

2. Tool usage patterns and guardrails

Agents have extensive instructions about how to use their tools correctly. This is why Claude Code often checks file contents before editing, or why it might refuse certain filesystem operations.

For example, the MinusX team dug into Claude Code’s hidden prompts and found that its tool instructions and reminders are wrapped in XML-style tags. This structured XML approach dramatically improved tool usage accuracy and reduced navigation errors.

3. Domain-specific behaviors

Agents come preloaded with opinions about best practices. When v0 generates a React component, it follows specific instructions about which libraries to use and how to structure code.

v0 by Vercel, for example, is instructed behind the scenes to reach for particular libraries and to structure generated components in a particular way.

The massive size and impact of system prompts

Modern AI coding agents use system prompts that run to tens of thousands of characters. These prompts encode years of engineering wisdom, safety rules, and behavioral patterns.

Here’s what you need to keep in mind about these hidden instructions:

  • Your instructions compete with these prompts: If you ask for something that conflicts with the system prompt, the system prompt usually wins.
  • Weird behaviors often trace back here: That annoying habit where ChatGPT uses em-dashes everywhere? Probably baked into its system prompt.
  • You can work around them once you know they exist: You can override default behaviors by being more explicit and repeating important instructions.

Want to see what’s under the hood? Check out these extracted system prompts from top AI tools (though be aware that they are often reverse-engineered and may not be official): Collection of system prompts.

Learning from open-source agents

If you’re building your own agent or want to understand how they think, open-source system prompts like those of Goose and Cline are goldmines.

Tracking how these prompts change over time reveals how agent capabilities evolve and the problems developers are trying to solve.

Now that you understand the hidden forces shaping agent behavior, let’s look at what you can actually control: your own prompts.

Prompting as a user: Techniques that improve agent performance

Let’s look at some examples of prompting techniques.

1. Give agents clear success criteria

Simon Willison, who has extensively documented his AI agent experiments, demonstrates this perfectly. In one function-writing experiment, he saved 15 minutes of work with an efficient prompt.

Simon could have used a simple prompt that just named the function he wanted. At a glance, we can guess how that conversation would have gone - a long loop of clarifying questions and incremental changes, eventually taking as long as it would have to just write the function ourselves.

Instead, he used a prompt that included success criteria, and it produced a complete, production-ready function in 15 seconds. Simon’s key insight is that the prompt should do the following:

  • Provide the exact function signature upfront.
  • Specify the precise technologies to use.
  • Follow up with “Now write me the tests using pytest” to get comprehensive test coverage.
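Following that pattern, a success-criteria prompt might look something like this (an illustrative sketch, not Simon’s original prompt; the function, endpoint, and error type are assumptions):

```text
Write a Python function with this exact signature:

    def fetch_user_report(user_id: str, *, timeout: float = 10.0) -> dict:

It should GET /reports/{user_id} from our internal API using httpx,
raise ReportNotFound on a 404, and return the parsed JSON body.
Use type hints throughout; no classes, no global state.
```

Once the function looks right, follow up with “Now write me the tests using pytest.”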

2. Use the “think” tool pattern

Anthropic added a seemingly useless “think” tool that significantly improves complex reasoning:
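The tool definition is roughly the following (paraphrased from Anthropic’s engineering post; check the original for the exact wording):

```python
# Paraphrased sketch of the "think" tool definition. It performs no action;
# it simply gives the model a sanctioned place to reason in the middle of a task.
think_tool = {
    "name": "think",
    "description": (
        "Use the tool to think about something. It will not obtain new "
        "information or change anything; it just appends the thought to the "
        "log. Use it when complex reasoning or extra working memory is needed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {"type": "string", "description": "A thought to think about."}
        },
        "required": ["thought"],
    },
}
```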

This tool, which appears to do nothing, actually enables models to leverage their tool-calling training to pause and think through complex problems, resulting in significant performance improvements.

3. Be specific

Actual agent interactions show the dramatic difference between vague and specific prompts. Consider the following two examples:

File organization gone wrong (or right)

If you need to organize test files that are scattered throughout a project, a vague prompt like “Organize my test files” leaves the agent to make its own choices: It creates a flat __tests__ directory and dumps all JavaScript test files there, losing the original folder structure and mixing component tests with utility and API tests.

You can get a better agent response by using a more specific prompt:
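For instance (an illustrative sketch - the paths and conventions are assumptions about your project, not a prescribed wording):

```text
Organize my test files:
- Move each component test next to the component it covers in src/components/
- Put API tests in tests/api/ and utility tests in tests/utils/
- Keep the existing folder hierarchy; do not create a top-level __tests__ directory
- Only move files and update import paths; do not change test contents
```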

Given that level of detail, the agent preserves the folder structure and puts each kind of test exactly where you asked, instead of guessing.

The lesson: Both prompts “worked” but the vague one made assumptions that might not match your needs. The specific prompt guaranteed the exact structure you wanted.

The units confusion that costs real money

In this scenario, you run a website CMS platform with subscriptions ($29-99/month) and setup fees ($150-300), and your payment processor stores amounts in cents (standard practice). You need to batch payment transactions from a CSV file:

When you use a vague prompt that simply asks the agent to split the transactions into batches of no more than $2,000, without saying what units the amounts are in, you don’t get the desired result: The agent reads the cent values as dollars and creates 30 batches - one per transaction - since even the “basic” plan appears to exceed $2,000.

You use the following, more specific prompt instead:
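A more explicit version might look like this (illustrative wording; the file name and the $2,000 cap mirror the scenario above):

```text
The amount column in transactions.csv is in CENTS (2900 means $29.00).
Group the transactions into batches whose totals do not exceed $2,000
(200,000 cents). Keep transactions in their original order and never
split a single transaction across batches. Print each batch's total in
dollars so I can sanity-check the grouping.
```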

When given the improved prompt, the agent interprets the amounts correctly and delivers the correct result: three batches with proper grouping.

The lesson: A human sees “2900” for a “BASIC” plan and immediately thinks “$29”. An agent might interpret it as “$2,900” - who pays that for basic website hosting? Without explicit units, you’re gambling on the agent’s interpretation.

4. Create custom user rules in a convention file

The practice of creating AI convention documents (like CLAUDE.md) has become increasingly popular. These files act as persistent rules that agents follow automatically.

For example, an agent will apply different code styling depending on whether it has a convention document to follow in addition to your prompt or is relying on your prompt alone.

When you prompt Claude to “refactor cart.js to modern JavaScript” without including any style rules in CLAUDE.md, it makes its own decisions. It might use class syntax, skip documentation, or add features you didn’t ask for.

However, when you include the following section in CLAUDE.md, Claude will respond to the same “refactor cart.js to modern JavaScript” prompt by converting function declarations to arrow functions and adding JSDoc comments, because it follows the CLAUDE.md rules automatically:
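A minimal sketch of such a section (the specific rules are up to you; these mirror the behavior described above):

```markdown
## Code style

- Prefer arrow functions over function declarations
- Add a JSDoc comment to every exported function
- Don't add features that weren't asked for
```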

This example makes the power dynamic clear: When your prompt and CLAUDE.md pull in different directions, CLAUDE.md usually wins. Use this to your advantage for consistent project-wide rules.

One developer’s observation on Hacker News demonstrates how project-specific guidance significantly improves agent performance:

I have a rule: ‘Do information gathering first,’ which encourages it to look around a bit before making changes.

5. Learn real-world safety from production disasters

The Ory team documented a sobering incident in which an AI agent accidentally deleted their production database. Here’s what actually happened and how you can prevent similar disasters:

When Ory sent a prompt telling the agent to “Fix the database connection issue in production,” the agent took the most direct path it could find to a “fix” and ended up deleting the production database.

If they had given the agent a prompt like the following, they could have prevented the disaster:
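A guardrail prompt along these lines could have changed the outcome (an illustrative sketch, not Ory’s actual wording):

```text
Diagnose the database connection issue. You may only touch the staging
environment; treat production as strictly read-only. Never run destructive
commands (DROP, TRUNCATE, DELETE, resets, or migrations) anywhere without
showing me the exact command and waiting for my explicit approval first.
```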

The lesson: Agents don’t understand the difference between “test” and “production” unless you explicitly tell them. Always assume they’ll take the most direct path to “fixing” something.

6. Practice constraint-based prompting

Instead of explaining in natural language, stub out methods and define code paths directly in code:
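For example, rather than describing the batching behavior in prose, you can hand the agent a skeleton whose signatures, types, and docstrings encode the constraints (the names and types here are illustrative, echoing the payments scenario above):

```python
# Stub file handed to the agent. The signatures, types, and docstrings are
# the constraints; the agent's job is only to fill in the bodies.
from dataclasses import dataclass

@dataclass
class Transaction:
    id: str
    amount_cents: int  # amounts are stored in cents, never dollars
    plan: str

def load_transactions(csv_path: str) -> list[Transaction]:
    """Parse the payments CSV. Skip malformed rows and log a warning for each."""
    raise NotImplementedError  # agent: implement

def batch_transactions(
    transactions: list[Transaction],
    max_batch_total_cents: int = 200_000,  # $2,000 cap per batch
) -> list[list[Transaction]]:
    """Group transactions into batches whose totals never exceed the cap.

    Preserve input order and never split a single transaction across batches.
    """
    raise NotImplementedError  # agent: implement
```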

7. Use AI to improve your prompts

The fastest way to fix a broken prompt is to use AI itself. Instead of guessing what went wrong, request specific feedback about why the prompt failed.

This works because the agent:

  • Understands its own failure modes better than you do.
  • Can identify ambiguities you missed.
  • Suggests concrete improvements, rather than giving vague advice.

You can paste your failed prompt into Claude or ChatGPT directly and ask for an analysis of what went wrong.
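For example, a review prompt like this (illustrative wording) turns the model into its own reviewer:

```text
Here is a prompt I gave an agent, and the result I got back.

<prompt>
Organize my test files.
</prompt>

<result>
It moved every test into a single flat __tests__ directory.
</result>

Explain which ambiguities in my prompt led to this result, then rewrite
the prompt so the agent preserves the existing folder structure.
```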

Alternatively, you can use agent-specific tools to improve a prompt. For example, the Workbench page in the Anthropic console offers several prompt-improvement tools.

Anthropic workbench interface for prompt optimization and testing

You can make individual prompts reusable by templatizing them with the Templatize button.

Anthropic templatize feature for making prompts reusable

You can also use the Improve prompt button to interactively develop a prompt via the What would you like to improve? modal.

Anthropic improve prompt modal asking for improvement requirements

Once you’ve stated your needs, Anthropic provides you with an updated prompt:

Anthropic improved prompt result showing enhanced prompt structure

We tested this process with a short initial prompt of our own and ended up with a noticeably more structured prompt after running it through Anthropic’s tools.

This tool was built for Anthropic models and agents, but the same pattern works with others. You can ask ChatGPT, Gemini, or any LLM to refine your prompt by following the flow: Your prompt → Goal → Refined prompt.

8. Use XML tags in prompts

XML is token-rich compared to JSON or YAML: All those angle brackets and closing tags create strong, distinct boundaries, and these clear boundaries help LLMs avoid mixing sections. LLMs aren’t actually parsing your prompt in the strict sense a compiler would; they’re predicting tokens.

By providing an LLM with consistent delimiters (<tag>…</tag>), you enable it to do the following:

  • Recognize scope easily: The <inputs> section is clearly different from <outputs>.
  • Reduce ambiguity: Instead of “status (optional),” you give <param name="status" required="false">, and the model no longer needs to infer.

The following example demonstrates how to use XML tags to write a prompt that includes context, the desired format, and constraints:
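Here is a sketch of such a prompt (the file and API names are assumptions carried over from the earlier cart.js example; use whatever tag names you keep consistent):

```xml
<task>
  Refactor cart.js to modern JavaScript.
</task>

<context>
  cart.js is loaded directly in the browser; there is no build step or bundler.
</context>

<constraints>
  - Keep the public API (addItem, removeItem, getTotal) unchanged.
  - Add no new dependencies.
  - Do not touch any other file.
</constraints>

<output_format>
  Return the full updated cart.js, followed by a bullet list of the changes you made.
</output_format>
```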

As agents get smarter, prompts become more important, not less

Smarter agents make more sophisticated assumptions that are harder to predict and debug. As agents get “smarter,” specific prompts become more important, not less.

Your action items:

  1. Create a CLAUDE.md (or equivalent file) for your projects with explicit rules.
  2. Include units, formats, and examples in your prompts.
  3. Test prompts with actual outputs before trusting them in production.
  4. When something goes wrong, make your prompt more specific, not longer.

