Agent Ops: outages, caching, Minions, containers, Claws

Sources: Amazon’s AI tools caused at least two AWS outages, including a 13-hour December disruption after Kiro AI deleted and recreated an environment. Outcome engineers must treat agents as first‑class availability risks and harden monitoring, automated rollbacks, and gates to contain runaway agent actions (Principles 14, 15).
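One way to picture a gate that contains runaway agent actions is a small check that sits between the agent and its tools. The sketch below is illustrative only: the action names, the `prod` target prefix, and the `ActionGate` class are assumptions made for this example, not anything described in the reporting on the outages.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical action names; a real deployment would define its own taxonomy.
DESTRUCTIVE_ACTIONS = {"delete_environment", "recreate_environment"}


@dataclass
class ActionGate:
    """Blocks unapproved destructive production actions and caps actions per run."""

    max_actions_per_run: int = 20
    audit_log: List[dict] = field(default_factory=list)
    actions_seen: int = 0

    def check(self, action: str, target: str, approved_by: Optional[str] = None) -> bool:
        self.actions_seen += 1
        if self.actions_seen > self.max_actions_per_run:
            # A hard budget stops a looping agent even if each action looks harmless.
            allowed, reason = False, "action budget exceeded"
        elif (
            action in DESTRUCTIVE_ACTIONS
            and target.startswith("prod")  # assumed naming convention
            and approved_by is None
        ):
            allowed, reason = False, "destructive production action needs human approval"
        else:
            allowed, reason = True, "ok"
        # Every decision is recorded so that monitoring and rollback tooling can use it.
        self.audit_log.append(
            {
                "action": action,
                "target": target,
                "approved_by": approved_by,
                "allowed": allowed,
                "reason": reason,
            }
        )
        return allowed
```

The agent runtime would call `gate.check(...)` before executing each tool call and skip or escalate anything that returns `False`. The audit log gives monitoring something concrete to alert on.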

Quoting Thariq Shihipar. Prompt caching enables long‑running agentic products like Claude Code by reusing prior computation to cut latency and cost and to allow generous rate limits. If you design persistent agents, add prompt caching to your architecture to make stateful, low‑latency workflows economically viable (Principle 06).
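Prompt caches generally match on exact prefixes, so the architectural consequence is to keep stable content (system prompt, tool definitions) at the front and only append to the conversation. The toy model below illustrates that property. `PrefixCache` is a hypothetical local stand-in written for this example and does not reproduce any provider's actual implementation or API.

```python
import hashlib
from typing import List, Set


class PrefixCache:
    """Toy stand-in for a provider-side prompt cache keyed on exact prefixes."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def cached_prefix_len(self, messages: List[str]) -> int:
        """Return how many leading messages match a prefix seen in an earlier
        request, then record every prefix of this request."""
        running = hashlib.sha256()
        hit = 0
        keys = []
        for i, message in enumerate(messages, start=1):
            running.update(message.encode("utf-8"))
            running.update(b"\x00")  # separator so message boundaries affect the key
            key = running.hexdigest()
            keys.append(key)
            if key in self._seen:
                hit = i
        self._seen.update(keys)
        return hit


cache = PrefixCache()
turn1 = ["system: you are a coding agent", "tools: read_file, run_tests", "user: fix the bug"]
turn2 = turn1 + ["assistant: patch applied", "user: now add a test"]

first = cache.cached_prefix_len(turn1)   # nothing cached yet
second = cache.cached_prefix_len(turn2)  # the three earlier messages are reused
# Editing the system prompt changes every prefix, so nothing is reused.
edited = cache.cached_prefix_len(["system: you are a reviewer"] + turn2[1:])
```

In this model an append-only conversation reuses everything sent so far, while a change near the front invalidates the whole prefix. That is why persistent agents benefit from keeping early context stable.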

Minions: Stripe’s one-shot, end-to-end coding agents — Part 2. Minions autonomously generate end‑to‑end code changes at scale, producing thousands of pull requests weekly while humans act as review checkpoints. Treat this as a production pattern: agent factories plus human review lanes require audit trails, CI integration, and orchestration tooling (Principle 09).

State of Agentic AI Report: Key Findings. Docker’s global survey finds widespread agent deployments alongside security, governance, and orchestration gaps, and identifies containers as the foundational substrate for scaling agents. Use the report as an operational checklist: containerize agents, close orchestration gaps, and bake testing and governance into deploy pipelines (Principles 02, 16).

Andrej Karpathy talks about “Claws”. Claws are a proposed personal‑agent layer: containerized, schedulable, message‑driven runtimes that persist and orchestrate local agent workflows. Consider claw‑like primitives when building local or edge agents that need durable state, secure tool access, and schedulable execution (Principles 09, 07).
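To make "schedulable, message‑driven, durable" concrete, here is a minimal sketch of such a runtime loop. It is only an assumption about what a claw‑like primitive could look like: `MiniClaw`, its method names, and the JSON state file are hypothetical, and containerization and secure tool access are out of scope.

```python
import heapq
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


class MiniClaw:
    """Hypothetical runtime: scheduled messages are dispatched to handlers,
    and state is written to disk after each one so it survives restarts."""

    def __init__(self, state_path: str) -> None:
        self.state_path = Path(state_path)
        # Reload durable state if a previous run left some behind.
        self.state: dict = (
            json.loads(self.state_path.read_text()) if self.state_path.exists() else {}
        )
        self.handlers: Dict[str, Callable[[dict, dict], None]] = {}
        self.queue: List[Tuple[float, int, str, dict]] = []
        self._seq = 0  # tie-breaker so payload dicts are never compared

    def on(self, topic: str, handler: Callable[[dict, dict], None]) -> None:
        self.handlers[topic] = handler

    def schedule(self, run_at: float, topic: str, payload: dict) -> None:
        heapq.heappush(self.queue, (run_at, self._seq, topic, payload))
        self._seq += 1

    def run_due(self, now: float) -> int:
        """Dispatch every message whose scheduled time has passed; return the count."""
        ran = 0
        while self.queue and self.queue[0][0] <= now:
            _, _, topic, payload = heapq.heappop(self.queue)
            self.handlers[topic](self.state, payload)
            self.state_path.write_text(json.dumps(self.state))
            ran += 1
        return ran
```

A host process would call `run_due(time.time())` on a timer. In this sketch only the state is persisted, so a production design would also need to persist the queue itself.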