This section explains how to plan system capacity and tune performance so that operations stay reliable under varying workloads. It covers vertical scaling limits, horizontal scaling triggers, and strategies for resource management. Following these guidelines helps keep the system stable and allows it to scale for high availability.

Response and resource protection

The response guard is a critical mechanism that prevents excessive memory usage caused by large payloads. It keeps the system responsive and avoids crashes during high-volume operations. When tools consistently approach the guard limits, treat this as a signal that the underlying API should be redesigned to paginate its results. Latency hotspots often occur at token exchange endpoints, during downstream TLS handshakes, and when JWKS keys are retrieved at cache refresh intervals.
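
As an illustration of the guard-and-paginate idea, the following is a minimal sketch and not the server's actual implementation. The names `MAX_RESPONSE_BYTES`, `ResponseTooLargeError`, `guard_response`, and `paginate` are hypothetical, and the 1 MB limit is an assumed value because this section does not state the real threshold.

```python
import json

# Assumed limit for illustration only; the real guard threshold is not given here.
MAX_RESPONSE_BYTES = 1_000_000


class ResponseTooLargeError(Exception):
    """Raised when a serialized payload exceeds the guard limit."""


def guard_response(payload, max_bytes=MAX_RESPONSE_BYTES):
    """Serialize payload to JSON bytes and reject it if it exceeds max_bytes."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > max_bytes:
        raise ResponseTooLargeError(
            f"response is {len(body)} bytes, limit is {max_bytes}"
        )
    return body


def paginate(items, page_size, cursor=0):
    """Return one page of items and the cursor of the next page (None at the end)."""
    end = cursor + page_size
    next_cursor = end if end < len(items) else None
    return {"items": items[cursor:end], "next_cursor": next_cursor}
```

A tool that keeps hitting the guard could return `paginate(results, page_size)` instead of the full result list, so that each response stays well below the limit.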

Scaling architecture and performance tuning

The MCP server uses the asyncio concurrency model, an asynchronous I/O model, to achieve high throughput and low latency. This model is designed for I/O-bound workloads and enables the system to manage thousands of concurrent connections without blocking operations. The key characteristics of this architecture are:
  • The server uses single-threaded asynchronous I/O within each worker process.
  • It employs an event-loop-based concurrency model to handle thousands of simultaneous connections.
  • Non-blocking operations are used for HTTP requests and downstream API calls, which improves responsiveness.
  • This architecture is highly efficient for I/O-bound workloads, which are typical in MCP server operations.
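
The characteristics above can be illustrated with a small, self-contained sketch. It is not taken from the MCP server. The function `fake_downstream_call` is a hypothetical stand-in that simulates a non-blocking downstream call with `asyncio.sleep`, and the request count and delay are arbitrary.

```python
import asyncio
import time


async def fake_downstream_call(request_id, delay=0.05):
    """Hypothetical stand-in for a non-blocking HTTP or downstream API call."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return request_id


async def handle_many(count):
    """Run `count` simulated requests concurrently on one event loop."""
    tasks = [fake_downstream_call(i) for i in range(count)]
    return await asyncio.gather(*tasks)  # results come back in submission order


def run_demo(count=1000):
    """Time how long a single thread needs to finish all simulated requests."""
    start = time.perf_counter()
    results = asyncio.run(handle_many(count))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(f"{len(results)} requests finished in {elapsed:.2f}s")
```

Because each simulated call waits without blocking the thread, the total time stays close to a single call's delay instead of growing with the number of requests.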

Production scaling strategy

You can configure worker processes for different environments to achieve optimal performance and reliability. Worker configuration depends on workload characteristics such as I/O intensity, CPU requirements, and high-availability needs.

The following table and examples describe recommended worker configurations for development, production, and high-availability scenarios.

Environment                 Configuration                             Purpose
Development                 export UVICORN_WORKERS=1                  Use a single worker for simplicity during development
Production (I/O-heavy)      export UVICORN_WORKERS=2                  Recommended for workloads dominated by network I/O
Production (CPU-intensive)  export UVICORN_WORKERS=$(nproc)           Match the number of workers to available CPU cores
High-availability           export UVICORN_WORKERS=$(($(nproc) * 2))  Over-provision workers for redundancy and failover monitoring

Explanation:
  • In development, a single worker is sufficient because the workload is minimal and simplicity is preferred.
  • For I/O-heavy production environments, two workers are recommended to handle concurrent requests efficiently without overloading the system.
  • In CPU-intensive production environments, set the number of workers equal to the number of CPU cores to maximize parallel processing.
  • For high-availability scenarios, set the number of workers to twice the number of CPU cores. This approach provides redundancy and supports failover, but it requires active monitoring to avoid resource contention.
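
The recommendations above can also be expressed as a small helper, shown here as a sketch only. The function name `recommended_workers` and the profile labels are hypothetical. The sketch uses `os.cpu_count()` in place of `nproc`, and the two can report different values when CPU affinity restricts the process.

```python
import os


def recommended_workers(profile, cpu_count=None):
    """Map a deployment profile (hypothetical labels) to a worker count."""
    # os.cpu_count() can return None, so fall back to a single core.
    cores = cpu_count or os.cpu_count() or 1
    if profile == "development":
        return 1
    if profile == "production-io":
        return 2
    if profile == "production-cpu":
        return cores
    if profile == "high-availability":
        return cores * 2
    raise ValueError(f"unknown profile: {profile}")


if __name__ == "__main__":
    # Print the value that could be exported as UVICORN_WORKERS.
    print(recommended_workers("production-cpu"))
```

For example, on an 8-core host `recommended_workers("high-availability", cpu_count=8)` returns 16, which matches the doubled value in the high-availability configuration.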

Capacity planning

You must plan for system capacity to ensure optimal performance and scalability. Capacity planning establishes the limits of vertical scaling and identifies the triggers for horizontal scaling. Understanding these factors helps maintain stability and prevent resource exhaustion during peak loads.

Vertical scaling refers to increasing resources within a single server instance. The following limits apply:
  • Memory per worker is approximately 100–200 MB as a baseline, plus additional memory for request buffers.
  • Adding workers beyond twice the number of CPU cores yields diminishing returns in CPU efficiency.
  • Connection limits are approximately 1000 concurrent connections per worker when using asyncio.
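
To show how these figures combine, the following sketch does the arithmetic for a given worker count. The function names are hypothetical. The defaults use the upper baseline of 200 MB per worker and 1000 connections per worker from the list above, and the memory figure excludes request buffers, so treat the results as rough estimates only.

```python
import math


def estimate_capacity(workers, mem_per_worker_mb=200, conns_per_worker=1000):
    """Rough per-instance estimates from the baseline figures (no request buffers)."""
    return {
        "baseline_memory_mb": workers * mem_per_worker_mb,
        "max_concurrent_connections": workers * conns_per_worker,
    }


def workers_for_connections(target_connections, conns_per_worker=1000):
    """Smallest worker count whose combined connection limit covers the target."""
    return max(1, math.ceil(target_connections / conns_per_worker))


if __name__ == "__main__":
    print(estimate_capacity(4))
    print(workers_for_connections(2500))
```

With four workers, the estimate is about 800 MB of baseline memory and roughly 4000 concurrent connections. A target of 2500 concurrent connections needs at least three workers under the same assumptions.
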

Horizontal scaling involves adding more server instances to distribute load. Consider scaling horizontally when:
  • CPU utilization remains above 70 percent after tuning the worker count.
  • Memory pressure begins to affect response times.
  • Network I/O saturation occurs, which is rare for typical MCP loads.
  • There is a need for geographic distribution or high availability.
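
The triggers above could be checked in code along the following lines. This is a sketch under stated assumptions: the `HostMetrics` fields and the `scale_out_reasons` function are hypothetical, and how each metric is measured is left to your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class HostMetrics:
    """Hypothetical snapshot of one instance, taken after worker tuning."""
    cpu_utilization: float                   # sustained fraction from 0.0 to 1.0
    memory_pressure_slowing_responses: bool
    network_io_saturated: bool
    needs_geo_distribution_or_ha: bool


def scale_out_reasons(metrics, cpu_threshold=0.70):
    """Return the horizontal scaling triggers that currently apply."""
    reasons = []
    if metrics.cpu_utilization > cpu_threshold:
        reasons.append("CPU above threshold after worker tuning")
    if metrics.memory_pressure_slowing_responses:
        reasons.append("memory pressure affecting response times")
    if metrics.network_io_saturated:
        reasons.append("network I/O saturation")
    if metrics.needs_geo_distribution_or_ha:
        reasons.append("geographic distribution or high availability")
    return reasons
```

An empty list means that none of the listed triggers applies and the current instance count is likely sufficient.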