Agent Engineering

Building AI agents for production

Most agent failures are not model failures. They are context failures. The model is maybe twenty percent of the work. The other eighty is the architecture between the model and the user.

What context the agent sees, how it decides which tool to call, what happens when it's wrong, and how you know the difference. I build the systems that handle that 80%.

audio brief · NotebookLM-style

24:54

Why AI agents fail in production

Host A · Host B

architecture, not model

context budget on prompts

context is everything else

failure rate for autonomy

Agent Development Pipeline

Architecture

Pattern selection · Framework choice · State design

Choose pattern for problem → Design state graph

Context

System prompts · Memory tiers · Token budgets

Engineer what the agent sees → Compress and prioritize

Tools & MCP

MCP servers · Tool definitions · Permission scoping

Connect to live systems → Scope access per agent

Knowledge

Hybrid RAG · Graph RAG · Contextual retrieval

Build retrieval pipeline → Evaluate accuracy

Voice & Multimodal

STT · LLM · TTS · Telephony

Design latency budget → Build cascading pipeline

Evaluation

Simulation · Benchmarks · Golden datasets

Test before production → Measure trajectory quality

Production

Observability · Guardrails · Graceful degradation

Deploy with tracing → Monitor and iterate

The methodology

Every agent I build follows this pipeline. The order matters: architecture before code, context before tools, evaluation before production. Most failed agent projects skip straight to implementation and discover their architecture was wrong after three months of engineering. I start with the hard decisions.

01Architecture & pattern selection

LangGraphClaude Agent SDKOpenAI Agents SDK

The first decision is the architecture pattern, and getting it wrong costs months. Plan-and-Execute achieves 92% task completion with 3.6× speedup over ReAct for structured workflows. ReAct is better for dynamic, exploratory tasks. Multi-agent orchestration adds value only when complexity justifies coordination overhead. At a 5% per-action failure rate, a 20-action agent fails frequently; production agents need sub-1% end-to-end failure rates.

# LangGraph — stateful agent with checkpointing and crash recovery
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver

class AgentState(TypedDict):
    messages: list[BaseMessage]
    plan: list[str]
    current_step: int
    context: dict

graph = StateGraph(AgentState)
graph.add_node("plan", create_plan)
graph.add_node("execute", execute_step)
graph.add_node("evaluate", evaluate_result)
graph.add_node("replan", replan_if_needed)

graph.add_edge("plan", "execute")
graph.add_edge("execute", "evaluate")
graph.add_conditional_edges("evaluate", should_continue, {
    "continue": "execute",
    "replan": "replan",
    "done": END,
})

# Persistent checkpointing — survives crashes, enables time-travel debugging
checkpointer = PostgresSaver.from_conn_string(DATABASE_URL)
agent = graph.compile(checkpointer=checkpointer)

# Claude Agent SDK — native shell + filesystem tools, hooks for interception
from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient, HookMatcher

async def block_destructive_bash(input_data, tool_use_id, context):
    cmd = input_data["tool_input"].get("command", "")
    if any(p in cmd for p in ("rm -rf", "DROP TABLE", "TRUNCATE")):
        return {"hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "deny",
            "permissionDecisionReason": "Destructive command blocked.",
        }}
    return {}

options = ClaudeAgentOptions(
    system_prompt="""You are a customer operations agent.
    Query the database, analyze patterns, take action.
    Always explain what you're about to do before doing it.""",
    # Read/Write/Edit/Bash are NATIVE — no MCP server needed for filesystem
    allowed_tools=["Read", "Bash", "mcp__customer_db__lookup"],
    mcp_servers={
        "customer_db": {"type": "stdio", "command": "customer-db-server"},
    },
    # Hooks intercept every tool call before execution — distinctive to this SDK
    hooks={"PreToolUse": [HookMatcher(matcher="Bash", hooks=[block_destructive_bash])]},
    permission_mode="acceptEdits",
    max_turns=10,
)

async with ClaudeSDKClient(options=options) as client:
    await client.query("Investigate why churn spiked 15% last week")
    async for msg in client.receive_response():
        print(msg)

# OpenAI Agents SDK — handoffs as the primary primitive for routing
from agents import Agent, Runner, function_tool

@function_tool
def lookup_invoice(customer_id: str) -> str:
    """Lookup invoice by customer ID."""
    return fetch_invoice(customer_id)

@function_tool
def restart_service(service_name: str) -> str:
    """Restart a customer's service."""
    return restart(service_name)

@function_tool
def update_account_email(customer_id: str, new_email: str) -> str:
    """Update the account holder's email."""
    return update_email(customer_id, new_email)

billing_agent = Agent(
    name="Billing Agent",
    handoff_description="Handles billing, invoices, refunds, payment methods.",
    instructions="Handle billing inquiries. Access invoices and payment history.",
    tools=[lookup_invoice],
    input_guardrails=[detect_prompt_injection],
    output_guardrails=[no_pii_in_output],
)

technical_agent = Agent(
    name="Technical Agent",
    handoff_description="Handles outages, errors, restarts, debugging.",
    instructions="Diagnose technical issues. Restart services if needed.",
    tools=[restart_service],
)

account_agent = Agent(
    name="Account Agent",
    handoff_description="Handles account changes — email, password, profile.",
    instructions="Handle account management requests.",
    tools=[update_account_email],
)

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route the customer to the right specialist based on their question.",
    handoffs=[billing_agent, technical_agent, account_agent],  # typed Agent objects
)

# Each handoff becomes a typed tool call with full tracing built in
result = await Runner.run(triage_agent, "I was charged twice this month")
print(f"Resolved by: {result.last_agent.name}")  # → "Billing Agent"
print(result.final_output)

MCP is universal now: LangGraph (via langchain-mcp-adapters), Claude Agent SDK, and OpenAI Agents SDK (via MCPServerStdio / MCPServerStreamableHttp) all consume the same MCP ecosystem. Same goes for delegation: every framework can route between agents, just with different primitives. I pick based on what the framework makes natively first-class, not what's possible. LangGraph when the workflow has cycles, retries, and human-in-the-loop checkpoints; its persistent state, conditional edges, and time-travel debugging are unmatched. Claude Agent SDK when the agent needs to read, write, and execute on the file system out of the box; Read/Write/Edit/Bash come pre-wired because it runs Claude Code as a subprocess, plus hooks let you intercept every tool call before execution. OpenAI Agents SDK when routing between specialists is the core pattern; handoffs as a typed primitive with built-in tracing make this the cleanest expression. The framework is a means, not an end. I've shipped production systems on all of them.

02Context engineering

System promptsMemory tiersToken budgetsContext compression

“Most agent failures are not model failures. They are context failures” (Google DeepMind). System prompt and user prompt together consume only ~5% of a production agent's context budget. The other 95% is retrieved knowledge, memory, tool definitions, and output schemas. Most teams burn 40%+ of their context window before the agent does any real work. Anthropic's data shows 75% utilization produces higher-quality output than pushing to the limit.

# Context budget allocation — a well-engineered agent
context_window: 200k tokens

allocation:
  system_prompt:     2%    # ~4K tokens — role, boundaries, output format
  tool_definitions:  8%    # ~16K tokens — MCP tools, schemas, examples
  memory:           15%    # ~30K tokens — conversation + long-term
  retrieved_context: 35%   # ~70K tokens — RAG results, documents
  working_space:    40%    # ~80K tokens — reasoning, tool outputs

compression_thresholds:
  tool_output: 20k         # Offload to filesystem with 10-line preview
  total_window: 85%        # Trigger structured summarization
  stale_context: 3_turns   # Compress conversation older than 3 turns

# Critical: place high-priority context at the END of the window
# Transformer attention has recency bias — middle content gets ignored

I design four-tier memory systems: working memory (current context), short-term (session-persistent via LangGraph checkpoints), long-term (cross-session via Mem0 or Zep), and permanent (compliance logs). Graph-enhanced memory captures entity relationships through directed knowledge graphs, outperforming vector-only approaches on complex multi-hop reasoning. The key insight: strategic forgetting is as important as remembering. LangChain's compression strategy offloads tool responses over 20K tokens to filesystem with file path and 10-line preview.

03Tool & MCP integration

MCP serversSkillsCLI toolsTool definitionsPermission scoping

MCP won the protocol war for third-party integrations. Slack, Linear, GitHub, Stripe, Sentry, Datadog, Notion, and Atlassian all ship official MCP servers; the integration cost is a one-time investment that pays back across every agent framework. Now under the Linux Foundation with 200+ server implementations. Use MCP when someone else maintains the surface and multiple agents need it.

But MCP has a token tax. Every tool definition loads into the agent's context window every turn; an Anthropic engineer confirmed this in GitHub issue #3406. Real-world Claude Code sessions with 6 MCP servers consume 50-98K tokens (25-49% of a 200K window) before the first prompt. Skills are the opposite: ~100 tokens of metadata per skill at startup (Anthropic's progressive disclosure), full body loaded only when triggered. For project-scoped tools (your deploy script, your validation pipeline, your doc-fetcher), the CLI + Skill pattern (like Andrew Ng's Context Hub) is cheaper, more transparent, and zero-setup. The two are complementary: MCP for the SaaS products your agent integrates with, CLI + Skills for the workflows your team owns.

// MCP server — connecting an agent to your database (SDK v2 API)
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

const server = new McpServer({ name: "customer-db", version: "1.0.0" });

server.registerTool(
  "lookup_customer",
  {
    description:
      "Find customer by email or ID. Returns profile, plan, and recent activity.",
    inputSchema: {
      email: z.string().email().optional(),
      id: z.string().optional(),
    },
  },
  async ({ email, id }) => {
    const customer = await db.customers.findOne(
      email ? { email } : { id }
    );
    // Return only what the agent needs — not the full record
    return {
      content: [{
        type: "text",
        text: JSON.stringify({
          name: customer.name,
          plan: customer.plan,
          status: customer.status,
          recentTickets: customer.tickets.slice(0, 5),
        }),
      }],
    };
  }
);

Tool definitions consume ~8% of context budget. Every tool description competes with RAG results and conversation history for attention. I keep definitions precise: name, one-line description, typed parameters, one example. Verbose tool descriptions are one of the most common causes of agent confusion.

04RAG & knowledge systems

Hybrid searchRe-rankingContextual retrievalGraph RAGAgentic RAG

Hybrid RAG (vector + BM25 + re-ranking) is the 2026 production baseline. Anthropic's contextual retrieval, prepending chunk-specific context before embedding, achieves 49% fewer retrieval failures. Graph RAG improves reasoning accuracy by 22% in complex domains (finance, medical, enterprise hierarchies). The best production systems are not pipelines but loops: an agentic orchestrator routes simple queries to naive retrieval, relational queries to graph, and complex research through multi-step reasoning.

# Agentic RAG — the agent decides retrieval strategy, not a fixed pipeline
from langgraph.graph import StateGraph

class RAGState(TypedDict):
    query: str
    documents: list[Document]
    answer: str
    confidence: float
    retrieval_attempts: int

def route_query(state: RAGState) -> str:
    """Agent decides retrieval strategy based on query type."""
    if is_entity_lookup(state["query"]):
        return "graph_rag"       # Entity relationships → knowledge graph
    if is_simple_factual(state["query"]):
        return "naive_retrieval" # Simple facts → fastest pipeline
    return "hybrid_retrieval"    # Complex → vector + BM25 + reranking

def evaluate_sufficiency(state: RAGState) -> str:
    """Agent evaluates whether retrieved docs answer the query."""
    if state["confidence"] > 0.85:
        return "generate_answer"
    if state["retrieval_attempts"] < 3:
        return "rewrite_and_retry"  # Query transformation loop
    return "escalate_to_human"

The production pattern: retrieve broadly (top-20), re-rank precisely (top-5 via ColBERT v2), send only the best to the LLM. Semantic chunking splits documents at natural boundaries using embeddings; fixed 512-token splits break paragraphs and separate questions from answers. I evaluate with RAGAS: faithfulness, answer relevancy, context precision, recall.

05Voice & multimodal agents

DeepgramElevenLabsTavusLiveKitVapi

Two architecture choices define voice agent design. Cascading (STT → LLM → TTS) runs at 2-4 second latency with full transcript auditability; right for compliance-sensitive industries like finance and healthcare. Speech-to-speech runs at ~500ms but with limited inspection. For production, I build cascading pipelines: Deepgram Nova-3 for STT (6.84% WER, sub-300ms), the fastest available LLM for reasoning, and ElevenLabs Flash v2.5 for TTS (~75ms). For video agents I use Tavus; at INFINIT I built a LangGraph agent on Tavus that cut KYC from hours to minutes. For volumes above 50K minutes/month, custom LiveKit builds save ~80% versus managed platforms.

# Voice agent latency budget — cascading architecture
pipeline:
  target_total: 2000ms     # Maximum acceptable end-to-end

  stt:                      # Speech-to-Text
    provider: deepgram-nova-3
    latency: 250ms
    wer: 6.84%

  llm:                      # Reasoning + tool use
    provider: gemini-2.5-flash  # Fastest for voice
    ttft: 300ms
    max_tokens: 200         # Short, conversational responses

  tts:                      # Text-to-Speech
    provider: elevenlabs-flash-v2.5
    latency: 75ms
    streaming: true         # Start playing before full generation

  overhead:                 # Network + processing
    budget: 375ms

# Memory is critical for voice — users cannot scroll back
# Mem0 integration for cross-session context persistence

Memory is the hardest problem in voice AI. Users cannot manually retrieve context from an audio conversation; the agent must remember everything. I integrate Mem0 for cross-session persistence with episodic (what happened), semantic (what is known), and procedural (how to do things) memory types. Voice agents also need specialized evaluation: not just task completion but turn-level latency, interruption handling, and conversation naturalness.

06Evaluation & testing

SimulationGolden datasetsRAGASDeepEval

Agent evaluation differs fundamentally from model evaluation: you must assess the entire decision-making trajectory, not just the final output. Did the agent call the right tool with the right arguments? Did it remember context from three turns ago? Did it stay in character? I test with simulation: hundreds of synthetic customer conversations before any production exposure. For multi-turn agents, conversation completeness replaces task completion as the primary metric.

# Agent evaluation — testing the full trajectory, not just the answer
from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import (
    ConversationCompletenessMetric,  # User intentions satisfied across dialogue?
    RoleAdherenceMetric,             # Agent stays in character?
)

# Golden dataset: multi-turn conversation with expected behavior
convo_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="I need to cancel my subscription"),
        Turn(role="assistant",
             content="I've cancelled it. Your refund will process in 3–5 days."),
        Turn(role="user", content="Actually, can I downgrade instead?"),
        Turn(role="assistant",
             content="Done — downgraded to Basic and the cancellation is reversed."),
    ],
)

# Keyword signature: evaluate(test_cases=[...], metrics=[...])
evaluate(
    test_cases=[convo_test_case],
    metrics=[
        ConversationCompletenessMetric(threshold=0.7),
        RoleAdherenceMetric(
            threshold=0.8,
            role="You are a helpful customer support agent.",
        ),
    ],
)

The testing stack: DeepEval for code-first agent evaluation with typed metrics. RAGAS for RAG-specific quality (faithfulness, answer relevancy, context precision). LangSmith for LangChain-native tracing and evaluation. I build golden datasets from real conversations, not synthetic prompts; the edge cases that matter are the ones your users actually hit.

07Observability & production

LangFuseGuardrailsGraceful degradationHuman-in-the-loop

Production agents need three-layer defense: rule-based guardrails (sub-10ms, regex patterns, blocklists), ML classifiers (50-200ms, topic detection, sentiment), and LLM semantic checks (300-2000ms, complex policy evaluation). Route by risk level: low-risk actions get fast checks, high-risk actions get all three layers. Human-in-the-loop reduces hallucination errors by 96%, but only 14.4% of production agents have full security approval. The gap between demo and production is almost entirely observability and guardrails.

# Three-layer guardrail architecture
guardrails:
  layer_1_rules:          # < 10ms — runs on every action
    - block_pii_in_output
    - enforce_response_length
    - validate_tool_arguments
    - check_rate_limits

  layer_2_classifiers:    # 50-200ms — runs on user-facing output
    - topic_boundary_check
    - sentiment_detection
    - toxicity_filter

  layer_3_semantic:       # 300-2000ms — runs on high-risk actions
    - policy_compliance_check
    - factual_grounding_verification
    - escalation_decision

  routing:
    low_risk:  [layer_1]                    # Fast path
    medium:    [layer_1, layer_2]           # Standard
    high_risk: [layer_1, layer_2, layer_3]  # Full verification

  fallback:
    on_uncertainty: escalate_to_human
    on_failure: return_safe_default
    on_timeout: retry_once_then_escalate

I deploy with LangFuse for open-source tracing (self-hosted for data privacy) with OpenTelemetry integration. Every agent action gets a trace: the prompt, the tool calls, the model output, the latency, the token cost. Semantic caching via Redis reduces costs by 50-80% on repeated queries. The most important production feature is graceful degradation: agents that know when they're uncertain, escalate correctly, and fail without taking down the system.

Error handling needs four tiers, not one. Transient errors (API timeouts, rate limits) get retry with backoff. LLM-recoverable errors (wrong tool arguments, malformed output) get returned as a ToolMessage so the model self-corrects. User-fixable errors (missing permissions, ambiguous intent) interrupt for human input. Unexpected errors (crashes, data corruption) bubble up immediately. Most agent projects only implement the first tier.

The stack

Tools I reach for, organized by what they do.

agent frameworksLangGraph, Claude Agent SDK, OpenAI Agents SDK

integrationMCP Servers, A2A Protocol, Tool Definitions

knowledgeHybrid RAG, Graph RAG, Contextual Retrieval, ColBERT

voice & videoDeepgram, ElevenLabs, Tavus, LiveKit, Vapi, Twilio

memoryLangGraph Checkpoints, Mem0, Zep, Redis

observabilityLangFuse, LangSmith, OpenTelemetry

evaluationDeepEval, RAGAS, LangSmith, Simulation

guardrailsThree-layer defense, Human-in-the-loop, Graceful degradation

Where this is going

A controversial take, with some evidence.

All agents will eventually be coding agents. The most capable agents (Claude Code, Codex, Devin) already operate by reading files, writing code, running commands, and iterating on results. Customer service agents will query databases by writing SQL. Analytics agents will generate and execute Python. Operations agents will modify infrastructure through code. The agent that can read, write, and execute code in a sandbox has access to every capability a computer offers.

This is why sandboxes, skills, and filesystem access matter now. Vercel Sandbox gives agents ephemeral Firecracker microVMs for safe code execution (28ms from snapshot). Claude Code's Skills system packages domain expertise into composable capabilities that agents load contextually. Long-running agents in sandboxed environments (running overnight on migration tasks, test coverage sprints, or performance audits) are already standard at companies like Stripe and Anthropic. The agent doesn't need to be fast. It needs to be correct, sandboxed, and observable.

The other trend worth watching: harnesses should get thinner over time, not thicker. Anthropic regularly deletes planning steps as models improve; with Opus 4.6 they eliminated sprint decomposition and dropped context resets entirely. Manus rebuilt their harness five times in six months, each time removing complexity. OpenAI's Codex team argues that “if you rely on complex scaffolding you aren't scaling, you're coping.” Design every component for deletion: if it's still needed in six months, it earned its place.

# Long-running agent in sandbox — overnight task execution
$ claude -p "Migrate all API handlers from Express to Hono. Run tests after each file." \
    --sandbox                     # Isolated execution environment
    --permission-mode default     # Ask before destructive actions
    --output-pr                   # Create PR on completion

# Skills — domain expertise loaded contextually
# .claude/skills/database-migrations/SKILL.md activates when
# the agent touches migration files, injecting project-specific
# patterns without bloating base context

# Agent Teams — parallel sandboxed agents
$ claude --agent test-writer "Write integration tests for auth module"
$ claude --agent security-audit "Scan for OWASP Top 10 vulnerabilities"
# Each agent runs in isolation with restricted tool access

What doesn't work

Failure modes I've seen repeatedly. Every one of them looked fine in the demo.

Demo-driven architecture

Choosing ReAct because it looks impressive in a notebook, then discovering it burns 10x the tokens and fails unpredictably in production. Plan-and-Execute handles 90% of structured workflows better.

Context window stuffing

Loading the entire knowledge base into context instead of engineering what the agent sees. 40% of context budget wasted before the agent does real work. Less context, more carefully selected, always wins.

Self-testing agents

Having the agent write its own tests produces a "self-congratulation machine"; it verifies its own assumptions rather than user intent. Independent evaluation is non-negotiable.

No graceful degradation

Agents that crash on unexpected input instead of escalating. Production needs three tiers: handle confidently, handle with caveats, escalate to human. Most agents only implement tier one.

Tool overload

Vercel had a text-to-SQL agent with 15 specialized tools. They deleted 80% and replaced everything with bash + SQL. Accuracy went from 80% to 100%, speed improved 3.5x, tokens dropped 37%. More tools means worse reasoning; expose the minimum set for the current step.

Skipping observability

Deploying without tracing means you cannot distinguish good output from confidently wrong output. By the time users report problems, the damage is done.

I also help teams adopt AI-native development workflows (SDLC 2.0). See my consulting engagements or email me at raman.shrivastava.7@gmail.com.