Agent Ops: Benchmarks, Codex Goals, Claude Code & Sandboxes
ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks. Even frontier models score below 50% on ITBench-AA's agentic SRE and IT tasks, which points to a large gap in enterprise automation readiness. Outcome engineers should treat agentic IT work as a validation-first problem: invest in reproducible sandboxes, scenario tests, and audit tooling to avoid fragile automation (Principles 16, 07).
Claude Code as a Daily Driver: Claude.md, Skills, Subagents, Plugins, and MCPs. Claude Code moves teams from ad-hoc prompts to structured, delegable agent workflows: CLAUDE.md, skills, subagents, plugins, and MCPs combine to produce verifiable artifacts. For outcome engineers this means designing context-first pipelines and documenting artifacts so that agents do predictable, reviewable work (Principles 03, 06, 13).
The Codex feature that works while you sleep. Codex's /goal enables multi-hour autonomous goal loops, so assistants can carry out long-running fixes, cleanups, and task flows without constant human prompting. That calls for new patterns in time-aware orchestration, observability, and failure handling: plan for watchdogs, checkpoints, and outcome audits when agents run overnight (Principles 01, 16).
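To make the watchdog, checkpoint, and audit pattern concrete, here is a minimal Python sketch. It does not use or describe Codex's /goal interface. `run_with_watchdog`, `RunLog`, and the `step` callable are hypothetical names standing in for whatever unit of agent work you supervise, and the time and step budgets are illustrative assumptions.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass
class RunLog:
    """Audit record for one unattended run (hypothetical structure)."""
    steps_done: int = 0
    status: str = "running"
    events: List[str] = field(default_factory=list)


def run_with_watchdog(
    step: Callable[[int], bool],        # hypothetical: returns True when the goal is met
    max_steps: int,
    max_seconds: float,
    checkpoint_path: Path,
    clock: Callable[[], float] = time.monotonic,
) -> RunLog:
    """Run `step` repeatedly under a time budget and a step budget.

    A checkpoint is written after every successful step so a later
    process (or a human) can see how far the run got.
    """
    log = RunLog()
    start = clock()
    for i in range(max_steps):
        # Watchdog: stop before starting a step once the time budget is spent.
        if clock() - start > max_seconds:
            log.status = "timed_out"
            break
        try:
            done = step(i)
        except Exception as exc:
            # Record the failure for the audit trail instead of crashing silently.
            log.status = "failed"
            log.events.append(f"step {i} raised {exc!r}")
            break
        log.steps_done = i + 1
        log.events.append(f"step {i} ok")
        checkpoint_path.write_text(json.dumps(asdict(log)))
        if done:
            log.status = "completed"
            break
    else:
        # Loop ended without a break: the step budget ran out.
        log.status = "step_budget_exhausted"
    # Final checkpoint always reflects the terminal status.
    checkpoint_path.write_text(json.dumps(asdict(log)))
    return log
```

The watchdog only checks the clock between steps, so a single hung step would still need an outer timeout (for example, a supervising process). The final checkpoint doubles as the outcome audit: whoever reads it in the morning sees a terminal status rather than inferring one from missing output.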
Building self-improving tax agents with Codex. OpenAI and Thrive demonstrate production tax agents that learn from practitioner feedback and production traces to improve accuracy and reduce review time. Outcome engineers get a concrete template for safe continuous learning: versioned feedback loops, gated rollout, and audit trails are what let agents improve without drifting (Principles 03, 16).
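One way to picture a gated rollout with an audit trail is the Python sketch below. It is an illustration only, not the OpenAI or Thrive implementation: `EvalResult`, `gate`, the version labels, and the thresholds `min_gain` and `min_cases` are all hypothetical, and the assumption is that each agent version is scored against a practitioner-reviewed evaluation set.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EvalResult:
    """Score of one agent version on a reviewed evaluation set (hypothetical)."""
    version: str
    correct: int
    total: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0


def gate(
    current: EvalResult,
    candidate: EvalResult,
    min_gain: float = 0.01,   # assumed threshold: required accuracy improvement
    min_cases: int = 50,      # assumed threshold: minimum evidence before deciding
) -> Dict[str, object]:
    """Decide whether a candidate version may replace the current one.

    Returns an audit record so every decision is traceable to the
    versions and numbers it was based on.
    """
    gain = candidate.accuracy - current.accuracy
    if candidate.total < min_cases:
        decision = "hold"      # not enough reviewed cases to judge
    elif gain >= min_gain:
        decision = "promote"   # measurable improvement over the current version
    else:
        decision = "reject"    # no improvement, or a regression
    return {
        "current": current.version,
        "candidate": candidate.version,
        "current_accuracy": current.accuracy,
        "candidate_accuracy": candidate.accuracy,
        "decision": decision,
    }


# Usage with made-up numbers:
record = gate(EvalResult("v1", 80, 100), EvalResult("v2", 85, 100))
print(record["decision"])
```

Keeping the decision a pure function of two versioned results makes it easy to replay: the same inputs always yield the same audit record, which is the property that lets a feedback loop improve the agent without untracked drift.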
Docker Sandboxes and microVMs, explained. Docker's microVM sandboxes deliver VM-level isolation at container speed, giving a practical execution environment for untrusted or powerful agent workloads. Use these sandboxes to enforce least privilege, contain misbehavior, and speed up safe experimentation. Your island and immune system need this isolation layer (Principles 07, 14).