
MCP tools: Less is more

While building MCP servers, it’s often tempting to include every single tool you think might be useful to someone someday. But researchers at Southern Illinois University have found that this (lack of) strategy confuses LLMs and leads to worse performance.

We recommend a more curated approach. By building smaller MCP servers per use case, you help your users install exactly the servers they need, without overwhelming the LLM or relying on users to curate their own tool selections for each task.

What curated MCP servers look like

The idea of curating MCP servers is to create smaller, focused servers that contain only the tools relevant to a specific use case or department. This reduces the cognitive load on users and helps LLMs perform better by limiting the number of tools they need to consider at any one time.

Take a hypothetical example of a company that provides an MCP server for its internal customers. This server should feature tools used by the sales, marketing, and customer support teams. Tools could include:

  • /getCustomer: A tool for retrieving customer information
  • /getProductInfo: A tool for retrieving product details
  • /getSalesData: A tool for retrieving sales statistics
  • /getMarketingCampaigns: A tool for retrieving marketing campaign details
  • /getSupportTickets: A tool for retrieving customer support tickets
  • /getFeedback: A tool for retrieving customer feedback

The naive approach would be to create a single MCP server that includes all these tools.

What we’re suggesting instead is creating separate MCP servers for each department:

  • Sales MCP server: Contains tools like /getCustomer, /getProductInfo, and /getSalesData.
  • Marketing MCP server: Contains tools like /getMarketingCampaigns and /getProductInfo.
  • Customer support MCP server: Contains tools like /getSupportTickets and /getFeedback.

This way, each MCP server is tailored to the specific needs of its users, reducing confusion and improving the performance of the LLMs that interact with these tools.

Here’s what that might look like in practice:

A diagram compares the two MCP server approaches. The top section, labeled Single Monolithic Server (Bad Approach), shows one large server containing 12 mixed tools and brief descriptions of their functions. The bottom section, labeled Curated Server Approach (Recommended), shows three smaller focused servers: a Sales MCP containing six sales-specific tools, a Marketing MCP containing six marketing-focused tools, and a Support MCP containing seven support-related tools. Green text below each curated server explains that teams get only the tools relevant to their specific contexts.
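
Here’s a minimal sketch of what the sales server might look like, using the MCP TypeScript SDK. The tool names come from the list above, but the handler logic and the `fetchCustomer`, `fetchProduct`, and `fetchSalesData` helpers are hypothetical placeholders; the marketing and support servers would follow the same pattern with their own small tool lists.

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical data-access stubs; replace with calls to your real backend.
async function fetchCustomer(customerId: string) {
  return { customerId, name: "Example Customer" };
}
async function fetchProduct(productId: string) {
  return { productId, name: "Example Product" };
}
async function fetchSalesData(from: string, to: string) {
  return { from, to, totalRevenue: 0 };
}

// Sales MCP server: only the three tools the sales team actually needs.
const server = new McpServer({ name: "sales-mcp", version: "1.0.0" });

server.tool(
  "getCustomer",
  "Retrieve customer information by customer ID",
  { customerId: z.string() },
  async ({ customerId }) => ({
    content: [{ type: "text", text: JSON.stringify(await fetchCustomer(customerId)) }],
  })
);

server.tool(
  "getProductInfo",
  "Retrieve product details by product ID",
  { productId: z.string() },
  async ({ productId }) => ({
    content: [{ type: "text", text: JSON.stringify(await fetchProduct(productId)) }],
  })
);

server.tool(
  "getSalesData",
  "Retrieve sales statistics for a date range",
  { from: z.string(), to: z.string() },
  async ({ from, to }) => ({
    content: [{ type: "text", text: JSON.stringify(await fetchSalesData(from, to)) }],
  })
);

// The marketing and support servers would be separate processes,
// each registering its own equally small set of tools.
await server.connect(new StdioServerTransport());
```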

Why curated MCP servers are better

Recent research on tool loadout, the practice of selecting only the relevant tool definitions for a given context, reveals concrete numbers about when LLMs start to struggle with too many options.

The research: When tool confusion kicks in

In the article How to Fix Your Context, Drew Breunig discusses two papers that found the following thresholds:

For large models (like DeepSeek-v3):

  • 30 tools: The critical threshold at which tool descriptions begin to overlap and create confusion.
  • 100+ tools: Models are virtually guaranteed to fail at tool selection when choosing from more than 100 tools.
  • 3x improvement: Using RAG techniques to keep the tool count under 30 resulted in dramatically better tool selection accuracy.

For smaller models (like Llama 3.1 8B):

  • 19 tools: The sweet spot at which models succeed at benchmark tasks.
  • 46 tools: The failure point at which the same models fail the same benchmarks.
  • 44% improvement: Dynamic tool selection improved performance when the tool count was reduced (see the sketch after this list).
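
To make the dynamic tool selection idea concrete, here’s a rough sketch of one way to keep the tool count low at request time: embed each tool description, embed the user’s request, and only hand the model the closest matches. The `embed` parameter is a hypothetical helper standing in for whatever embedding model you have available.

```typescript
interface ToolDef {
  name: string;
  description: string;
}

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep only the `limit` tools whose descriptions best match the user's request,
// staying well under the ~30-tool threshold reported for large models.
async function selectTools(
  query: string,
  tools: ToolDef[],
  embed: (text: string) => Promise<number[]>, // hypothetical embedding helper
  limit = 20
): Promise<ToolDef[]> {
  const queryVec = await embed(query);
  const scored = await Promise.all(
    tools.map(async (tool) => ({
      tool,
      score: cosine(queryVec, await embed(tool.description)),
    }))
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((s) => s.tool);
}
```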

Real-world testing: The Dog API example

To demonstrate this principle in action, we created a practical test using the Dog CEO API with Gram-hosted MCP servers. The results clearly show how tool count affects LLM performance.

We created an OpenAPI document for the Dog API, with an endpoint per dog breed. The full API covers 107 dog breeds, so our OpenAPI document has 107 GET operations.
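
As a rough illustration of how such a document can be put together (the operation names and response shapes here are simplified, not our exact document), this sketch pulls the breed list from the Dog CEO API’s public /breeds/list/all endpoint and emits one GET operation per breed:

```typescript
// Pull the breed list from the public Dog CEO API and emit an OpenAPI document
// with one GET operation per breed.
const res = await fetch("https://dog.ceo/api/breeds/list/all");
const { message: breeds } = (await res.json()) as {
  message: Record<string, string[]>;
};

const paths: Record<string, unknown> = {};
for (const breed of Object.keys(breeds)) {
  // e.g. GET /breed/hound/images/random
  paths[`/breed/${breed}/images/random`] = {
    get: {
      // Operation naming here is illustrative only.
      operationId: `fetchRandom${breed[0].toUpperCase()}${breed.slice(1)}Photo`,
      summary: `Get a random photo of a ${breed}`,
      responses: { "200": { description: `A random ${breed} image URL` } },
    },
  };
}

const openapi = {
  openapi: "3.1.0",
  info: { title: "Dog CEO API", version: "1.0.0" },
  servers: [{ url: "https://dog.ceo/api" }],
  paths,
};

console.log(JSON.stringify(openapi, null, 2));
```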

We then uploaded the OpenAPI document to Gram, created a single MCP server with all 107 tools, and installed this remote MCP server in Claude Desktop.

On our very first test using Claude 3.5 Sonnet, the LLM hallucinated an endpoint for Golden Retrievers, when in fact there is only a single Retriever endpoint. Although later tests using Claude Code and Claude Desktop with different models yielded better results, the initial confusion was evident.

Having 107 tools in one server frequently caused Claude Desktop to stop responding after one sentence, showing a generic error:

Claude’s response was interrupted. This can be caused by network problems or exceeding the maximum conversation length. Please contact support if the issue persists.

A screenshot of the Claude Desktop chat UI shows an error message and a brief conversation in which Claude agrees to get four images of dogs using a tool. The error message states: Claude's response was interrupted. This can be caused by network problems or exceeding the maximum conversation length. Please contact support if the issue persists.

Next, we tested the same MCP server with a smaller model in LM Studio: the qwen/qwen3-1.7b model, which is trained for tool calling.

With the same 107 tools, the model struggled to select a tool most of the time, hallucinating incorrect tool names based on patterns it recognized in the real ones.

A screenshot of the LM Studio UI shows a conversation where the qwen3-1.7b model fails to use correct tool names when presented with 107 dog breed API tools. The model attempts to call openapi_fetch_random_springer_photo and openapi_fetch_random_poodle_photo, but receives error messages stating it cannot find tools with those names. The interface shows multiple failed tool call attempts in red error text, demonstrating how having too many tools available causes smaller models to hallucinate incorrect function names.

We then created several smaller MCP servers with fewer tools, each containing only a subset of the dog breeds. We tested these servers with the same qwen/qwen3-1.7b model in LM Studio.

First, we created a server with 40 tools, which included a random selection of dog breeds. The model was able to successfully call three out of four tools, but still hallucinated one endpoint.

A screenshot of the LM Studio UI shows a conversation where the qwen3-1.7b model successfully uses the correct tool names for three of four tool calls. It hallucinates one endpoint.

Next, we created a server with 20 tools covering a random mix of dog breeds. The model got 19 out of 20 tool calls correct, with only one hallucinated tool call.

A screenshot of the LM Studio UI shows a successful conversation where the qwen3-1.7b model correctly uses almost all tool names when presented with only 20 tools. The interface displays only one hallucinated tool call, demonstrating improved performance when the number of available tools is reduced to an optimal range for smaller language models.

Finally, we created two servers with only 10 carefully selected tools each. One server included the most common dog breeds, and the other contained rare dog breeds. The model successfully retrieved images of four different dog breeds with correct tool names and no errors.

A screenshot of the LM Studio UI shows a conversation where the qwen3-1.7b model successfully retrieves images of four different dog breeds with the correct tool names when presented with only 10 carefully selected API tools. The interface displays clean, successful tool calls without any error messages, demonstrating optimal performance when the number of available tools is reduced to a manageable set for smaller language models.

We then created a new conversation with both the rare and common dog breed servers installed. These servers had ten tools each, but each was focused on a different set of dog breeds. The model successfully used tools from both servers without any errors.

A screenshot of the LM Studio UI shows a conversation where the qwen3-1.7b model successfully uses MCP tools from multiple curated servers. The interface shows clean tool calls to two specialized servers, one for rare breeds and one for common breeds. The conversation demonstrates how splitting tools across multiple focused MCP servers allows smaller models to maintain accuracy while accessing a broader range of functionality through targeted, domain-specific tool sets.

Our test results

We know this isn’t an exhaustive test, nor is it the most rigorous scientific study. But this method demonstrates how you can quickly set up tests to compare the performance of LLMs with different tool counts and configurations.

Here’s a summary of our findings:

  • With 107 tools, both large and small models struggled to select the correct tools, leading to frequent errors and hallucinations.
  • With 20 tools, the smaller model got 19 out of 20 tool calls correct, with only one hallucinated tool call.
  • With 10 tools, the smaller model successfully retrieved images of four different dog breeds with correct tool names and no errors.
  • Most surprisingly, when 20 tools were split across two focused servers, the model was able to successfully use tools from both servers without any errors.

This shows that by curating MCP servers to contain only the most relevant tools for a specific use case, we can significantly improve the performance of LLMs, especially smaller models.

Benefits beyond accuracy

From our testing with Qwen3 1.7B, we found that curating MCP servers not only improves accuracy but also dramatically speeds up the model’s response and thinking time. Parsing the prompt, selecting the right tools, and generating a response all happen much faster when the model has fewer tools to consider. This is especially important for real-time applications, where response time is critical.

How to implement curated MCP servers

We recommend following these steps to implement curated MCP servers:

1. Identify use cases

Start by identifying the specific use cases or departments that will benefit from their own MCP servers. This could be based on job roles, projects, or specific tasks.

2. Select relevant tools per use case

For each use case, select only the tools that are relevant to that specific context. Avoid including tools that are not directly applicable to the tasks at hand.

3. Create focused MCP servers

Create separate MCP servers for each use case, ensuring that each server contains only the tools selected in the previous step. This will help reduce confusion and improve performance.

4. Test and iterate

Use the approach demonstrated in our Dog API example (a rough test harness sketch follows this list):

  • Create test scenarios for your use cases.
  • Compare performance between monolithic and curated approaches.
  • Measure both accuracy and response time.
  • Gather user feedback on ease of use.
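
Here’s a rough sketch of how such a comparison might be scripted against LM Studio’s local OpenAI-compatible endpoint. The tool definitions, prompt, and breed list are placeholders; the harness simply checks whether the model calls a tool that actually exists in the list it was offered.

```typescript
interface ToolSpec {
  type: "function";
  function: { name: string; description: string; parameters: object };
}

// Build simple placeholder tool definitions from breed names.
const makeTools = (breedList: string[]): ToolSpec[] =>
  breedList.map((breed): ToolSpec => ({
    type: "function",
    function: {
      name: `fetch_random_${breed}_photo`,
      description: `Get a random photo of a ${breed}`,
      parameters: { type: "object", properties: {} },
    },
  }));

// Ask a local model (via LM Studio's OpenAI-compatible API) to answer a prompt
// with a given tool list, then check whether the tool it calls really exists.
async function runTrial(prompt: string, tools: ToolSpec[], model: string) {
  const res = await fetch("http://localhost:1234/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
      tools,
    }),
  });
  const data = await res.json();
  const call = data.choices?.[0]?.message?.tool_calls?.[0];
  const validNames = new Set(tools.map((t) => t.function.name));
  return {
    calledTool: call?.function?.name ?? null,
    // A hallucination is a call to a tool name the model was never offered.
    hallucinated: call ? !validNames.has(call.function.name) : false,
  };
}

// Placeholder breed list; in our test the monolithic server exposed 107 tools.
const breedNames = ["poodle", "hound", "boxer", "pug", "beagle", "husky"];
const experiments: Array<[string, ToolSpec[]]> = [
  ["curated", makeTools(breedNames.slice(0, 3))],
  ["monolithic", makeTools(breedNames)],
];

for (const [label, tools] of experiments) {
  const result = await runTrial("Get me a random photo of a poodle", tools, "qwen/qwen3-1.7b");
  console.log(label, result);
}
```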

For a real-world example of how tool proliferation affects MCP users, see our companion article: Why less is more: The Playwright proliferation problem with MCP. This article demonstrates how reducing the Playwright MCP server from 26 tools to just 8 essential ones dramatically improves agent performance and reduces decision paralysis in browser automation tasks.

We’re here to help

At Speakeasy, we understand the importance of curating MCP servers for optimal performance. With Gram’s Toolsets feature, creating curated MCP servers is straightforward: select the specific tools you need from any OpenAPI document, and Gram automatically generates a focused MCP server. You can create multiple toolsets from a single API, each tailored to different use cases or departments, making it easy to implement the curated approach we’ve described here.
