Six weeks is fast for anything in software. For building a production AI agent platform with twelve external integrations, evaluation infrastructure, and a multi-tenant architecture, it's a result most teams won't believe until they see the engineering decisions that made it possible.
The startup, a Series A company building workflow automation for mid-market operations teams, made a bet early: don't build a model layer, don't build an agent framework from scratch, don't build RAG infrastructure from first principles. Use Anthropic's stack: Claude API for the model layer, the Claude Agent SDK for the agent runtime, and MCP for all external integrations. Three engineers, six weeks, production.
About This Case Study
The company is a Series A workflow automation startup serving mid-market operations, finance, and procurement teams. Details anonymised. Metrics from the company's own tracking systems. Our team provided architecture advisory and code review during the build; we were not the primary development team.
The Build vs. Buy Decision
In early 2025, the founding team had a prior architecture based on OpenAI's API and a custom-built agent orchestration layer. The agent orchestration code was growing: 11,000 lines of Python that handled tool calling, context management, error recovery, and multi-step task decomposition. It was working but fragile. Every new integration required changes to the orchestration layer. Every model update required re-testing orchestration assumptions.
The decision to switch to Claude and rebuild on the Agent SDK wasn't primarily driven by model quality (though Claude's performance on the structured output tasks their platform relied on was a factor). It was driven by a build vs. buy analysis on the orchestration layer. The Claude Agent SDK replaced roughly 9,000 lines of their custom code with a single dependency. The remaining 2,000 lines became the product-specific logic that actually differentiated their platform.
If you're building on top of AI, every line of code you write that isn't product-specific logic is technical debt waiting to become a maintenance burden. Model layer, agent orchestration, tool calling protocols โ these are infrastructure problems that Anthropic has already solved. Use their stack.
The Architecture
The platform's agent architecture had four layers:
Layer 1: The Customer Interface
Customers defined workflows through a React frontend: a natural language task description ("every Monday morning, pull last week's sales data from Salesforce, compare against targets, and post a summary to the #sales-ops Slack channel"), plus a set of configuration parameters. The frontend was deliberately simple; the complexity was in the backend.
Layer 2: The Agent SDK Orchestrator
The Claude Agent SDK handled task decomposition, tool selection, multi-step execution, error recovery, and sub-agent delegation. The startup's orchestration logic was reduced to: define the available tools (via MCP), define the task goal, define the output schema, and let the SDK manage execution. For tasks that required parallelism, such as pulling data from multiple sources simultaneously, the team used the SDK's native sub-agent delegation.
Layer 3: The Model Layer
All agent reasoning ran on Claude Sonnet 4.6. The team evaluated using Haiku for simpler sub-tasks to reduce cost, but found the cost and latency gains weren't worth the accuracy reduction on tasks involving complex reasoning across multiple data sources. Prompt caching was implemented from day one: their system prompt (which encoded the customer's workflow configuration) was cached server-side, reducing token costs by approximately 60% on repeat task executions. The core task-definition pattern looked like this:
from anthropic.agents import AgentTask, MCPContext

# mcp_context: an MCPContext wired to the tenant's configured MCP servers,
# constructed at workflow start. WeeklyPipelineSummary is the structured
# output schema for this task type, defined elsewhere in the codebase.
task = AgentTask(
    goal="Pull Salesforce pipeline, compare against targets, generate summary",
    output_schema=WeeklyPipelineSummary,
    tools=mcp_context.available_tools,
    max_steps=15,
    model="claude-sonnet-4-6",
)
result = await task.execute()
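Prompt caching itself is a feature of the Messages API rather than of the SDK. A minimal sketch of a cached per-tenant system prompt using the standard anthropic Python SDK; the workflow_config placeholder and the user message are illustrative, and the model string follows the article's naming:

import anthropic

client = anthropic.Anthropic()

# Placeholder for the large, stable per-tenant system prompt that encodes
# the customer's workflow configuration (illustrative).
workflow_config = "...per-tenant workflow rules, schemas, and targets..."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": workflow_config,
            # Marks this prefix for server-side caching; repeat executions
            # with the same prefix are billed at the reduced cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Run the weekly pipeline summary."}],
)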
Layer 4: MCP Integrations
Every external system integration was built as an MCP server. Salesforce, NetSuite, HubSpot, Slack, Google Sheets, Gmail, Notion, Jira, Asana, Stripe, QuickBooks, and Airtable. Twelve MCP servers, each exposing a clean tool interface to the agent. The advantage: adding a new integration required building one MCP server with well-defined tool definitions. The agent layer required zero changes. The team estimates MCP reduced their per-integration build time from three days to less than one day once the pattern was established.
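For a sense of how small each server can be: with the reference Python MCP SDK, a tool is a decorated function. A minimal sketch of the shape, not the startup's code; the server name, the tool, and the stubbed Slack call are illustrative:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("slack")

@mcp.tool()
def post_message(channel: str, text: str) -> str:
    """Post a message to a Slack channel and return a delivery receipt."""
    # The real server would call Slack's chat.postMessage API here;
    # stubbed to keep the sketch self-contained.
    return f"posted to {channel}: {len(text)} chars"

if __name__ == "__main__":
    mcp.run()  # serves the tool to the agent runtime over stdio

Each server follows this shape: tool definitions in, typed results out, and no agent-layer knowledge anywhere in the integration code.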
What They Got Right
1. Structured Output from Day One
Every agent task returned strongly-typed structured output, not freeform text. This was a non-negotiable design principle from the first commit. Structured output via Claude's tool use made the platform's outputs deterministic, auditable, and directly usable in downstream systems without parsing. See our Claude structured output guide for implementation details.
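The mechanics: forcing a single tool call via tool_choice makes the model's only valid output a JSON object conforming to the tool's input schema. A minimal sketch with the anthropic SDK; the schema and prompt are illustrative, not the platform's actual types:

import anthropic

client = anthropic.Anthropic()

# Illustrative schema for a weekly pipeline summary.
summary_tool = {
    "name": "record_weekly_summary",
    "description": "Record the weekly pipeline summary.",
    "input_schema": {
        "type": "object",
        "properties": {
            "total_pipeline_usd": {"type": "number"},
            "delta_vs_target_pct": {"type": "number"},
            "highlights": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["total_pipeline_usd", "delta_vs_target_pct", "highlights"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[summary_tool],
    # Force the model to respond with exactly this tool call.
    tool_choice={"type": "tool", "name": "record_weekly_summary"},
    messages=[{"role": "user", "content": "Summarise last week's pipeline."}],
)

summary = response.content[0].input  # a dict conforming to the schema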
2. Evaluation Infrastructure Before Scale
Week two of the build was entirely dedicated to evaluation infrastructure โ before any customer-facing features were built. Every agent task type had a test suite with golden examples. Every code change was validated against these test suites before merge. This felt slow in week two and fast in week five, when they were shipping production changes confidently because they knew exactly what was covered.
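A golden-example suite of that kind is simple in shape. A sketch of one plausible harness; the directory layout and the run_agent_task placeholder are assumptions, not the startup's actual code:

import json
from pathlib import Path

import pytest

GOLDEN_DIR = Path("tests/golden/weekly_pipeline_summary")

def run_agent_task(task_input: dict) -> dict:
    """Placeholder for the platform's agent entry point (assumed)."""
    raise NotImplementedError  # wired to the Agent SDK in the real suite

@pytest.mark.parametrize(
    "case", sorted(GOLDEN_DIR.glob("*.json")), ids=lambda p: p.stem
)
def test_golden_example(case):
    # Each golden file pairs a task input with the exact structured
    # output the agent is expected to return.
    example = json.loads(case.read_text())
    assert run_agent_task(example["input"]) == example["expected"]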
3. Rate Limit and Cost Budgeting Per Tenant
Multi-tenant architecture requires per-tenant resource controls. The team built cost and rate limit budgeting at the tenant level from day one โ each customer's agent executions were capped at configurable token and API call limits per billing period. This prevented a single heavily-using customer from affecting others and gave the team cost visibility before launch.
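One plausible shape for that control is a pre-flight check before each agent step. The class and fields below are illustrative; the startup's actual implementation wasn't shared:

from dataclasses import dataclass

@dataclass
class TenantBudget:
    tokens_used: int
    token_limit: int       # configurable cap per billing period
    api_calls_used: int
    api_call_limit: int

    def allows(self, estimated_tokens: int) -> bool:
        """Reject a step that would push the tenant over either cap."""
        return (
            self.tokens_used + estimated_tokens <= self.token_limit
            and self.api_calls_used + 1 <= self.api_call_limit
        )

Checked before every agent step and updated after it, a budget like this both isolates noisy tenants and doubles as per-customer cost telemetry.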
What They Had to Revisit
Context Window Management
The first version of the orchestrator passed the full workflow history to every agent call. This worked fine on small workflows. On complex tasks with 10+ steps and large data payloads (pulling several MB from Salesforce, for example), it hit context window limits and produced degraded reasoning quality in the later steps.
The fix was a sliding context window with a summarisation step: after step 7 of any task, the full history was summarised by a lightweight Claude Haiku call, and the summary replaced the raw history in the context. Token costs dropped 40% on long tasks, and reasoning quality improved. See our token management guide for the pattern.
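A sketch of that compaction step, calling Haiku directly through the anthropic SDK; the threshold comes from the text, while the prompt wording and the Haiku model alias are assumptions:

import anthropic

client = anthropic.Anthropic()

SUMMARY_THRESHOLD = 7  # steps before history is compacted, per the text

def compact_history(history: list[dict]) -> list[dict]:
    """Replace raw step history with a cheap Haiku-generated summary."""
    if len(history) <= SUMMARY_THRESHOLD:
        return history
    transcript = "\n".join(str(step) for step in history)
    summary = client.messages.create(
        model="claude-3-5-haiku-latest",  # illustrative alias
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarise this agent task history, preserving every "
                       "figure and identifier needed by later steps:\n"
                       + transcript,
        }],
    )
    return [{"role": "user", "content": summary.content[0].text}]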
Error Recovery Granularity
The initial error recovery strategy was simple: retry the full task on any failure. This was too coarse. A Salesforce API timeout on step 8 of a 12-step task retried from step 1, wasting all the work already done and the associated cost. The revised architecture implemented checkpoint-based recovery: task state was persisted at each step, and retries resumed from the last successful checkpoint. This took a second engineering sprint in week four.
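The checkpoint pattern itself is small. A file-backed sketch for illustration; the startup persisted state to its own store rather than local JSON files:

import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")

def save_checkpoint(task_id: str, step: int, state: dict) -> None:
    """Persist task state after each successful step."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    path = CHECKPOINT_DIR / f"{task_id}.json"
    path.write_text(json.dumps({"step": step, "state": state}))

def resume_point(task_id: str) -> tuple[int, dict]:
    """Return (next_step, state); (0, {}) means start from scratch."""
    path = CHECKPOINT_DIR / f"{task_id}.json"
    if not path.exists():
        return 0, {}
    checkpoint = json.loads(path.read_text())
    return checkpoint["step"] + 1, checkpoint["state"]

On retry, the orchestrator calls resume_point and re-enters the task loop at the returned step, so a Salesforce timeout at step 8 costs one step, not twelve.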
The Six-Week Sprint
Week 1: Agent SDK integration and first working agent
First working agent executing a single Salesforce query via MCP. Core orchestration pattern established. Architecture document approved.
Week 2: Evaluation infrastructure and structured output
Built evaluation test suites for 8 core task types. Implemented structured output schemas. No customer-facing features: a deliberate investment.
Week 3: MCP servers 1–6 and multi-tenant architecture
Salesforce, HubSpot, Slack, Google Sheets, Gmail, Notion MCP servers live. Multi-tenant resource controls implemented.
Week 4: MCP servers 7–12 and error recovery overhaul
Remaining six MCP servers. Checkpoint-based error recovery replaces full-task retry. Sliding context window implemented.
Week 5: Customer interface and private beta with 5 customers
React frontend for workflow definition. Five beta customers using the platform. First real-world edge cases discovered and fixed.
Week 6: Security review, billing integration, first paying customer
Anthropic DPA finalised. Stripe billing integrated. First paying customer signed. $180K ARR within 90 days of this milestone.
Why Claude, Not GPT or Gemini
The founding team had used OpenAI previously. The switch to Claude for this build was based on three factors:
Structured output quality. On the complex multi-source data aggregation tasks their platform needed to do, Claude Sonnet 4.6 produced fewer schema violations in structured output than GPT-4o, particularly on nested objects with conditional fields. This matters enormously in a platform context where schema violations require retry logic and add latency.
Agent SDK maturity. Anthropic's Agent SDK is purpose-built for the orchestration patterns this platform needed. The equivalent OpenAI tooling required more custom code. For a three-engineer team on a six-week schedule, that gap was decisive.
MCP ecosystem. The startup's product required twelve integrations at launch. The MCP ecosystem had well-documented, community-tested servers for all twelve systems. Building custom tool use implementations for each would have taken weeks more than using MCP. See our MCP servers guide for coverage of the major enterprise integrations.
Building an AI Agent Product on Claude?
Our AI agent development service provides architecture design, MCP integration, evaluation framework setup, and code review for teams building on the Claude stack.
Lessons for Startup Teams Building on Claude
The pattern from this build is transferable. The startup moved fast not because they were exceptional engineers (they were good, not exceptional) but because they made disciplined infrastructure decisions early. Don't build what Anthropic already built. Invest in evaluation before features. Build structured output into every agent task from day one. Implement cost and rate limit controls before you have customers who can stress-test them.
For teams considering a similar build, our Claude API enterprise guide and multi-agent systems guide cover the architectural patterns in more depth. Our AI agent development service provides hands-on architecture support for teams building production platforms on Claude.