Karpathy coined "vibe coding" in February 2025 — fully giving in to the vibes, embracing exponentials, forgetting the code exists. A year later he declared it passé and proposed "agentic engineering" instead — "agentic because you are not writing the code directly 99% of the time, engineering to emphasize there is an art & science to it." I think he's right, but for a slightly different reason than most people assume.
The problem with vibe coding isn't that it doesn't work. It works great. I vibe-coded a patient briefing agent in a weekend and it was genuinely impressive. The problem is that "works on my laptop" and "handles 1,000 customer interactions per month without someone watching it" are two completely different engineering problems, and the second one has almost nothing to do with the model.
The 80/20 nobody talks about
When I built the AI platform for a Lightspeed-backed neo-bank last year, the voice agent demo took a week. Getting it to handle real customer calls — interruptions, context switches, calendar access via MCP servers, graceful degradation when the model hallucinates about car inventory — took months. The model was maybe 20% of the work. The other 80% was context engineering, state management, and the systems that keep an agent reliable when it's handling someone's banking.
I see this pattern everywhere. The gap between demo and production is almost entirely engineering, not AI. This is good news if you're an engineer. It means the hard problems are tractable — they're our problems, not "waiting for GPT-6" problems.
Context engineering > prompt engineering
The most important shift of 2025-2026 isn't a new model. It's the recognition that what surrounds the request matters more than how you phrase it.
Prompt engineering is crafting the right words. Context engineering is designing dynamic systems that deliver the right information, tools, memory, and retrieved documents at the right time. The LangChain State of Agent Engineering survey (n=1,340) found that 32% of production agent failures trace to poor context management, not model limitations. Gartner now calls context engineering the systematic practice that keeps agents effective over time. I suspect they're underselling it — in my experience it's closer to 50% of failures.
In practice this means CLAUDE.md files treated like infrastructure config (~80% adherence, with a roughly 150-200 instruction budget before compliance drops), dynamic retrieval that gives the agent the right 2,000 tokens instead of dumping everything, persistent memory that knows when to forget, and restricted tool sets per agent instead of giving one agent access to 50 tools and hoping for the best. Projects with well-maintained context files see 40% fewer agent errors. This is not a small effect.
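The "right 2,000 tokens" idea is simple enough to sketch. This is a minimal, dependency-free illustration of the discipline, not any framework's API; `Chunk`, its relevance score, and `build_context` are all hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float   # relevance, from whatever retriever you use (hypothetical)
    tokens: int    # pre-computed token count for this chunk

def build_context(chunks: list[Chunk], budget: int = 2000) -> str:
    """Greedy selection: most relevant chunks first, hard stop at the budget.
    The point is the discipline, not the algorithm -- the agent gets the best
    2,000 tokens, never the whole corpus."""
    picked, used = [], 0
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + c.tokens > budget:
            continue  # skip anything that would blow the budget
        picked.append(c)
        used += c.tokens
    return "\n\n".join(c.text for c in picked)
```

The same budgeting logic applies to tool sets: a fixed allowance forces you to rank what the agent actually needs.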
Three frameworks, three philosophies
I use three multi-agent frameworks, and they solve genuinely different problems.
LangGraph gives you explicit state machines — typed state, conditional edges, checkpointed execution you can replay. The graph IS the documentation. I used this for a video KYC system: face detection → document extraction → verification → compliance check, with human-in-the-loop at the compliance stage. Each node independently testable. More upfront design, but you get predictability. Worth it for anything where "the agent did something weird" is not an acceptable outcome.
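The shape of that graph is easy to show without the library. What follows is not LangGraph's API, just a dependency-free sketch of the pattern it gives you: typed state, nodes that name the next edge, and a log you can replay when something goes wrong. The KYC node bodies are stand-ins:

```python
from typing import Callable, TypedDict

class KYCState(TypedDict):
    doc_ok: bool
    approved: bool
    log: list[str]  # checkpoint trail -- replayable after a failure

def extract(s: KYCState) -> str:
    s["log"].append("extract")
    s["doc_ok"] = True          # stand-in for real document extraction
    return "verify"

def verify(s: KYCState) -> str:
    s["log"].append("verify")
    return "compliance" if s["doc_ok"] else "reject"  # conditional edge

def compliance(s: KYCState) -> str:
    s["log"].append("compliance")  # the real system pauses here for a human
    s["approved"] = True
    return "END"

def reject(s: KYCState) -> str:
    s["log"].append("reject")
    s["approved"] = False
    return "END"

NODES: dict[str, Callable[[KYCState], str]] = {
    "extract": extract, "verify": verify,
    "compliance": compliance, "reject": reject,
}

def run(state: KYCState, start: str = "extract") -> KYCState:
    node = start
    while node != "END":
        node = NODES[node](state)  # every hop is recorded, so it's testable
    return state
```

Each node is an ordinary function, which is exactly why "independently testable" is not a slogan here.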
OpenAI Agents SDK is built around handoffs — literally a tool call that returns another Agent, carrying conversation context with it. Three built-in primitives: Handoffs, Guardrails, Tracing. Python-first and opinionated. I use it for delegation trees (triage → billing, triage → support). Less flexible than LangGraph for complex branching, faster to ship for straightforward patterns.
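The handoff pattern itself fits in a few lines of plain Python. This is an illustration of the idea, not the OpenAI Agents SDK's actual API; every name below is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    instructions: str

@dataclass
class Conversation:
    messages: list[str] = field(default_factory=list)

billing = Agent("billing", "Handle invoices and refunds.")
support = Agent("support", "Handle product questions.")

def transfer_to_billing() -> Agent:
    # A handoff is literally a tool that returns the next agent;
    # the conversation object travels with it unchanged.
    return billing

def triage(convo: Conversation, user_msg: str) -> Agent:
    convo.messages.append(user_msg)
    # Stand-in for the model choosing to call a handoff tool:
    return transfer_to_billing() if "refund" in user_msg else support
```

The key property is that the conversation is never copied or summarized at the boundary; the next agent inherits it as-is.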
Claude Agent SDK is architecturally different from the other two. Under the hood it spawns Claude Code CLI as a subprocess, which means your agent gets file I/O, bash execution, browser automation, MCP servers — the same tools Claude Code uses in the terminal. You're orchestrating an agent that can actually interact with a codebase and system, not just call an API. Within Claude Code, subagents run in isolated context windows with restricted tools and persistent memory. I use this for development workflows where agents need to read code, run tests, and write files.
Personally I suspect the framework choice matters less than people think. What actually matters: can you replay a failed agent run? Can you see exactly which tool call went wrong? Does the system degrade gracefully or silently produce garbage?
MCP + skills + sandboxes
MCP grew from ~1,000 servers in early 2025 to over 10,000 by March 2026. It's now adopted by Anthropic, OpenAI, and Google. But the real power comes from combining MCP (tool connectivity) with skills (methodology) and sandboxes (safe execution).
Skills are modular markdown files in .claude/skills/ that give agents domain knowledge — not what to connect to, but how to work within your project's conventions. In build-ai-agents I've open-sourced skills for staff-level architecture review, code review, frontend development, and runtime skill discovery. The staff-architect skill is read-only (Read, Grep, Glob). The code-reviewer gets Bash for running tests. Each skill has its own tool restrictions and execution context.
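A skill file is small. Here is a sketch of the shape with hypothetical content; the frontmatter field names follow Claude Code's skill format as I understand it, so check the current docs before copying:

```markdown
---
name: code-reviewer
description: Review a diff against project conventions, then run the test suite.
allowed-tools: Read, Grep, Glob, Bash
---

## Process
1. Read the diff and every file it touches.
2. Grep for existing patterns the change should follow.
3. Run the test suite with Bash; report failures verbatim.
```

The `allowed-tools` line is the restriction mechanism: the staff-architect variant would list only Read, Grep, and Glob, making it structurally incapable of mutating anything.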
Sandboxes complete the picture. Vercel Sandbox uses Firecracker microVMs with millisecond startup. E2B is open-source with ~90ms cold starts. When your agent needs to run tests or try a migration, it runs untrusted code without risking the host system. The combination of skills + sandboxes is what makes fully autonomous sessions viable: structured capabilities AND a safe execution environment.
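The execution-isolation contract can be illustrated with nothing but the standard library. This toy uses a subprocess with an empty environment, an isolated working directory, and a hard timeout; a subprocess is not a real security boundary the way a Firecracker microVM is, so treat it as a sketch of the contract, not a sandbox:

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: float = 5.0) -> tuple[int, str]:
    """Toy stand-in for a real sandbox (Vercel Sandbox / E2B use microVMs):
    fresh interpreter, empty environment, throwaway cwd, hard timeout.
    NOT a security boundary -- it only demonstrates the shape of the API."""
    with tempfile.TemporaryDirectory() as tmp:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated interpreter mode
            cwd=tmp, env={}, capture_output=True,
            text=True, timeout=timeout,
        )
    return proc.returncode, proc.stdout

rc, out = run_untrusted("print(2 + 2)")
```

The shape is what real sandbox SDKs give you too: submit code, get back an exit status and captured output, and let the host survive whatever the agent tried.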
Agent engineering vs. agentic engineering
I think two distinct disciplines are emerging and most people haven't separated them yet.
Agent engineering is building AI agents as products — the voice agent for the neo-bank, the research agent, the customer service bot. Multi-agent orchestration, context engineering, production observability. This is the work I've been doing for years.
Agentic engineering is using AI agents to improve how your team writes software. Addy Osmani's four-step workflow captures the core: plan before prompting, review with peer-review rigor, test relentlessly (with tests in place, agents iterate until they pass; without them, agents cheerfully declare broken code done), own the codebase. Spec-driven approaches reduce logic errors by 23-37% compared to direct generation.
The cutting-edge pattern here: running 3-5 Claude Code agents simultaneously on different features in isolated git worktrees. No file conflicts. Teams are seeing 3-4x throughput. Autonomous coding agents are still early but the results are getting hard to ignore — Anthropic built a C compiler with 16 agents producing 100K lines of Rust across ~2,000 sessions, hitting 99% pass rate on the GCC torture test. Cursor's Planners/Workers built a web browser from scratch (1M+ lines, ~1 week). I would not trust either for production code without human review, but the trajectory is clear. Personally I'm experimenting with agent teams and I think this becomes standard within a year.
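The worktree setup is a few git commands. A sketch, with hypothetical branch and directory names; the first three lines only exist to create a throwaway repo for the demo:

```shell
# Throwaway demo repo (in a real project, run the worktree commands at your repo root)
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m init

# One isolated checkout per agent: same history, separate working directories,
# separate branches -- parallel agents can never edit the same files.
git worktree add -q ../agent-auth    -b feature/auth
git worktree add -q ../agent-billing -b feature/billing
git worktree list
```

Each agent gets pointed at its own directory; merging back is ordinary branch review, which keeps the human in exactly the loop Osmani's workflow requires.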
TLDR
Vibe coding was the prototype phase. Agentic engineering is the production phase. The interesting problems — context engineering, state management, observability, graceful degradation — have almost nothing to do with the model and everything to do with the architecture between the model and the user. Two disciplines are emerging: agent engineering (building agents as products) and agentic engineering (using agents to build software). The combination of MCP, skills, and sandboxes is what makes autonomous agent sessions viable. The gap between demo and production is 80% engineering. This is our home turf.
If you're building production agent systems or setting up agentic development workflows for your team, reach out at raman.shrivastava.7@gmail.com.