Agent infrastructure and benchmarks — building outcome systems

Claude Code as a Daily Driver: Claude.md, Skills, Subagents, Plugins, and MCPs shows how Claude Code turns CLAUDE.md, skills, and subagents into a programmable daily driver, moving teams from hand-crafted prompts to delegated, verifiable agent workflows. Outcome engineers get a practical blueprint for composing subagents, managing context, and producing executable artifacts: a direct playbook for Principle 03 and Principle 13 work.

Building self-improving tax agents with Codex describes a production Tax AI that learns from practitioner feedback and production traces to improve accuracy and cut accountant review time. This is a concrete example of closing the feedback loop and instrumenting agents for continuous validation and retraining — essential for Principle 16 (audit the outcomes).

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks introduces the first agentic SRE benchmark and finds that frontier models score below 50% on enterprise IT tasks. That gap warns outcome engineers not to assume competence in ops workflows and underscores the need for harnesses, sandboxed testing, and orchestration before scaling agents into production.

Docker Sandboxes and microVMs, explained describes microVM-based sandboxes that give agents VM-level isolation with container-like speed. Use this pattern to contain agent actions, limit blast radius, and run realistic integration tests. It is a practical infrastructure piece for Principle 07 and Principle 14 safety systems.

Warp’s big bet on building open source with GPT-5.5 shows Warp using GPT-5.5 and Oz to coordinate agent-driven open-source development, automating PRs while humans set intent and review. Treat this as a case study in agentic coordination and developer tooling: it reveals how to embed agents into CI/CD, review gates, and human-in-the-loop delivery (Principle 09).