Six weeks is fast for anything in software. For building a production AI agent platform with twelve external integrations, evaluation infrastructure, and a multi-tenant architecture, it's a result most teams won't believe until they see the engineering decisions that made it possible.
The startup, a Series A company building workflow automation for mid-market operations teams, made a bet early: don't build a model layer, don't build an agent framework from scratch, don't build RAG infrastructure from first principles. Use Anthropic's stack: Claude API for the model layer, the Claude Agent SDK for the agent runtime, and MCP for all external integrations. Three engineers, six weeks, production.
About This Case Study
The company is a Series A workflow automation startup serving mid-market operations, finance, and procurement teams. Details anonymised. Metrics from the company's own tracking systems. Our team provided architecture advisory and code review during the build; we were not the primary development team.
The Build vs. Buy Decision
In early 2025, the founding team had a prior architecture based on OpenAI's API and a custom-built agent orchestration layer. The agent orchestration code was growing: 11,000 lines of Python that handled tool calling, context management, error recovery, and multi-step task decomposition. It was working but fragile. Every new integration required changes to the orchestration layer. Every model update required re-testing orchestration assumptions.
The decision to switch to Claude and rebuild on the Agent SDK wasn't primarily driven by model quality (though Claude's performance on the structured output tasks their platform relied on was a factor). It was driven by a build vs. buy analysis on the orchestration layer. The Claude Agent SDK replaced roughly 9,000 lines of their custom code with a single dependency. The remaining 2,000 lines became the product-specific logic that actually differentiated their platform.
If you're building on top of AI, every line of code you write that isn't product-specific logic is technical debt waiting to become a maintenance burden. Model layer, agent orchestration, tool calling protocols โ these are infrastructure problems that Anthropic has already solved. Use their stack.
The Architecture
The platform's agent architecture had four layers:
Layer 1: The Customer Interface
Customers defined workflows through a React frontend: a natural language task description ("every Monday morning, pull last week's sales data from Salesforce, compare against targets, and post a summary to the #sales-ops Slack channel"), plus a set of configuration parameters. The frontend was deliberately simple; the complexity was in the backend.
Layer 2: The Agent SDK Orchestrator
The Claude Agent SDK handled task decomposition, tool selection, multi-step execution, error recovery, and sub-agent delegation. The startup's orchestration logic was reduced to: define the available tools (via MCP), define the task goal, define the output schema, and let the SDK manage execution. For tasks that required parallelism, such as pulling data from multiple sources simultaneously, the team used the SDK's native sub-agent delegation.
Layer 3: The Model Layer
All agent reasoning ran on Claude Sonnet 4.6. The team evaluated using Haiku for simpler sub-tasks to reduce cost, but found the cost and latency gains weren't worth the accuracy reduction on tasks involving complex reasoning across multiple data sources. Prompt caching was implemented from day one: their system prompt (which encoded the customer's workflow configuration) was cached server-side, reducing token costs by approximately 60% on repeat task executions. The core task-definition pattern looked like this:
from anthropic.agents import AgentTask, MCPContext

# mcp_context: an MCPContext wired to the tenant's configured MCP servers,
# constructed at workflow start. WeeklyPipelineSummary is the structured
# output schema for this task type, defined elsewhere in the codebase.
task = AgentTask(
    goal="Pull Salesforce pipeline, compare against targets, generate summary",
    output_schema=WeeklyPipelineSummary,
    tools=mcp_context.available_tools,
    max_steps=15,
    model="claude-sonnet-4-6",
)
result = await task.execute()
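Prompt caching itself is a feature of the Messages API rather than of the SDK. A minimal sketch of a cached per-tenant system prompt using the standard anthropic Python SDK; the workflow_config placeholder and the user message are illustrative, and the model string follows the article's naming:

import anthropic

client = anthropic.Anthropic()

# Placeholder for the large, stable per-tenant system prompt that encodes
# the customer's workflow configuration (illustrative).
workflow_config = "...per-tenant workflow rules, schemas, and targets..."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": workflow_config,
            # Marks this prefix for server-side caching; repeat executions
            # with the same prefix are billed at the reduced cache-read rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Run the weekly pipeline summary."}],
)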
Layer 4: MCP Integrations
Every external system integration was built as an MCP server. Salesforce, NetSuite, HubSpot, Slack, Google Sheets, Gmail, Notion, Jira, Asana, Stripe, QuickBooks, and Airtable. Twelve MCP servers, each exposing a clean tool interface to the agent. The advantage: adding a new integration required building one MCP server with well-defined tool definitions. The agent layer required zero changes. The team estimates MCP reduced their per-integration build time from three days to less than one day once the pattern was established.
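For a sense of how small each server can be: with the reference Python MCP SDK, a tool is a decorated function. A minimal sketch of the shape, not the startup's code; the server name, the tool, and the stubbed Slack call are illustrative:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("slack")

@mcp.tool()
def post_message(channel: str, text: str) -> str:
    """Post a message to a Slack channel and return a delivery receipt."""
    # The real server would call Slack's chat.postMessage API here;
    # stubbed to keep the sketch self-contained.
    return f"posted to {channel}: {len(text)} chars"

if __name__ == "__main__":
    mcp.run()  # serves the tool to the agent runtime over stdio

Each server follows this shape: tool definitions in, typed results out, and no agent-layer knowledge anywhere in the integration code.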
What They Got Right
1. Structured Output from Day One
Every agent task returned strongly-typed structured output, not freeform text. This was a non-negotiable design principle from the first commit. Structured output via Claude's tool use made the platform's outputs deterministic, auditable, and directly usable in downstream systems without parsing. See our Claude structured output guide for implementation details.
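The mechanics: forcing a single tool call via tool_choice makes the model's only valid output a JSON object conforming to the tool's input schema. A minimal sketch with the anthropic SDK; the schema and prompt are illustrative, not the platform's actual types:

import anthropic

client = anthropic.Anthropic()

# Illustrative schema for a weekly pipeline summary.
summary_tool = {
    "name": "record_weekly_summary",
    "description": "Record the weekly pipeline summary.",
    "input_schema": {
        "type": "object",
        "properties": {
            "total_pipeline_usd": {"type": "number"},
            "delta_vs_target_pct": {"type": "number"},
            "highlights": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["total_pipeline_usd", "delta_vs_target_pct", "highlights"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[summary_tool],
    # Force the model to respond with exactly this tool call.
    tool_choice={"type": "tool", "name": "record_weekly_summary"},
    messages=[{"role": "user", "content": "Summarise last week's pipeline."}],
)

summary = response.content[0].input  # a dict conforming to the schema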
2. Evaluation Infrastructure Before Scale
Week two of the build was entirely dedicated to evaluation infrastructure โ before any customer-facing features were built. Every agent task type had a test suite with golden examples. Every code change was validated against these test suites before merge. This felt slow in week two and fast in week five, when they were shipping production changes confidently because they knew exactly what was covered.
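A golden-example suite of that kind is simple in shape. A sketch of one plausible harness; the directory layout and the run_agent_task placeholder are assumptions, not the startup's actual code:

import json
from pathlib import Path

import pytest

GOLDEN_DIR = Path("tests/golden/weekly_pipeline_summary")

def run_agent_task(task_input: dict) -> dict:
    """Placeholder for the platform's agent entry point (assumed)."""
    raise NotImplementedError  # wired to the Agent SDK in the real suite

@pytest.mark.parametrize(
    "case", sorted(GOLDEN_DIR.glob("*.json")), ids=lambda p: p.stem
)
def test_golden_example(case):
    # Each golden file pairs a task input with the exact structured
    # output the agent is expected to return.
    example = json.loads(case.read_text())
    assert run_agent_task(example["input"]) == example["expected"]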
3. Rate Limit and Cost Budgeting Per Tenant
Multi-tenant architecture requires per-tenant resource controls. The team built cost and rate limit budgeting at the tenant level from day one โ each customer's agent executions were capped at configurable token and API call limits per billing period. This prevented a single heavily-using customer from affecting others and gave the team cost visibility before launch.
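One plausible shape for that control is a pre-flight check before each agent step. The class and fields below are illustrative; the startup's actual implementation wasn't shared:

from dataclasses import dataclass

@dataclass
class TenantBudget:
    tokens_used: int
    token_limit: int       # configurable cap per billing period
    api_calls_used: int
    api_call_limit: int

    def allows(self, estimated_tokens: int) -> bool:
        """Reject a step that would push the tenant over either cap."""
        return (
            self.tokens_used + estimated_tokens <= self.token_limit
            and self.api_calls_used + 1 <= self.api_call_limit
        )

Checked before every agent step and updated after it, a budget like this both isolates noisy tenants and doubles as per-customer cost telemetry.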
What They Had to Revisit
Context Window Management
The first version of the orchestrator passed the full workflow history to every agent call. This worked fine on small workflows. On complex tasks with 10+ steps and large data payloads (pulling several MB from Salesforce, for example), it hit context window limits and produced degraded reasoning quality in the later steps.
The fix was a sliding context window with a summarisation step: after step 7 of any task, the full history was summarised by a lightweight Claude Haiku call, and the summary replaced the raw history in the context. Token costs dropped 40% on long tasks, and reasoning quality improved. See our token management guide for the pattern.
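A sketch of that compaction step, calling Haiku directly through the anthropic SDK; the threshold comes from the text, while the prompt wording and the Haiku model alias are assumptions:

import anthropic

client = anthropic.Anthropic()

SUMMARY_THRESHOLD = 7  # steps before history is compacted, per the text

def compact_history(history: list[dict]) -> list[dict]:
    """Replace raw step history with a cheap Haiku-generated summary."""
    if len(history) <= SUMMARY_THRESHOLD:
        return history
    transcript = "\n".join(str(step) for step in history)
    summary = client.messages.create(
        model="claude-3-5-haiku-latest",  # illustrative alias
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": "Summarise this agent task history, preserving every "
                       "figure and identifier needed by later steps:\n"
                       + transcript,
        }],
    )
    return [{"role": "user", "content": summary.content[0].text}]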
Error Recovery Granularity
The initial error recovery strategy was simple: retry the full task on any failure. This was too coarse. A Salesforce API timeout on step 8 of a 12-step task retried from step 1, wasting all the work already done and the associated cost. The revised architecture implemented checkpoint-based recovery: task state was persisted at each step, and retries resumed from the last successful checkpoint. This took a second engineering sprint in week four.
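The checkpoint pattern itself is small. A file-backed sketch for illustration; the startup persisted state to its own store rather than local JSON files:

import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")

def save_checkpoint(task_id: str, step: int, state: dict) -> None:
    """Persist task state after each successful step."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    path = CHECKPOINT_DIR / f"{task_id}.json"
    path.write_text(json.dumps({"step": step, "state": state}))

def resume_point(task_id: str) -> tuple[int, dict]:
    """Return (next_step, state); (0, {}) means start from scratch."""
    path = CHECKPOINT_DIR / f"{task_id}.json"
    if not path.exists():
        return 0, {}
    checkpoint = json.loads(path.read_text())
    return checkpoint["step"] + 1, checkpoint["state"]

On retry, the orchestrator calls resume_point and re-enters the task loop at the returned step, so a Salesforce timeout at step 8 costs one step, not twelve.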
The Six-Week Sprint
Week 1: Agent SDK integration and first working agent
First working agent executing a single Salesforce query via MCP. Core orchestration pattern established. Architecture document approved.
Week 2: Evaluation infrastructure and structured output
Built evaluation test suites for 8 core task types. Implemented structured output schemas. No customer-facing features: a deliberate investment.
Week 3: MCP servers 1–6 and multi-tenant architecture
Salesforce, HubSpot, Slack, Google Sheets, Gmail, Notion MCP servers live. Multi-tenant resource controls implemented.
Week 4: MCP servers 7–12 and error recovery overhaul
Remaining six MCP servers. Checkpoint-based error recovery replaces full-task retry. Sliding context window implemented.
Week 5: Customer interface and private beta with 5 customers
React frontend for workflow definition. Five beta customers using the platform. First real-world edge cases discovered and fixed.
Week 6: Security review, billing integration, first paying customer
Anthropic DPA finalised. Stripe billing integrated. First paying customer signed. $180K ARR within 90 days of this milestone.
Why Claude, Not GPT or Gemini
The founding team had used OpenAI previously. The switch to Claude for this build was based on three factors:
Structured output quality. On the complex multi-source data aggregation tasks their platform needed to do, Claude Sonnet 4.6 produced fewer schema violations in structured output than GPT-4o, particularly on nested objects with conditional fields. This matters enormously in a platform context where schema violations require retry logic and add latency.
Agent SDK maturity. Anthropic's Agent SDK is purpose-built for the orchestration patterns this platform needed. The equivalent OpenAI tooling required more custom code. For a three-engineer team on a six-week schedule, that gap was decisive.
MCP ecosystem. The startup's product required twelve integrations at launch. The MCP ecosystem had well-documented, community-tested servers for all twelve systems. Building custom tool use implementations for each would have taken weeks more than using MCP. See our MCP servers guide for coverage of the major enterprise integrations.
Building an AI Agent Product on Claude?
Our AI agent development service provides architecture design, MCP integration, evaluation framework setup, and code review for teams building on the Claude stack.
Lessons for Startup Teams Building on Claude
The pattern from this build is transferable. The startup moved fast not because they were exceptional engineers (they were good, not exceptional) but because they made disciplined infrastructure decisions early. Don't build what Anthropic already built. Invest in evaluation before features. Build structured output into every agent task from day one. Implement cost and rate limit controls before you have customers who can stress-test them.
For teams considering a similar build, our Claude API enterprise guide and multi-agent systems guide cover the architectural patterns in more depth. Our AI agent development service provides hands-on architecture support for teams building production platforms on Claude.