Agent Ops: Intercept, sandbox, debug, evaluate, and test agents

"Announcing Genkit Middleware: Intercept, extend, and harden your agentic apps": Genkit Middleware adds interceptable hooks to generation, enabling retries, fallbacks, and human approvals. This gives outcome engineers a runtime control plane for approvals, failover logic, and audit trails, directly supporting Gate and Law (Principles 15 & 10).
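The interception pattern described above can be sketched in a few lines. This is a minimal, generic illustration in Python, not Genkit's actual API: every name here (`with_retries`, `with_fallback`, `with_approval`, `compose`) is hypothetical, and the "models" are plain functions standing in for real generation calls.

```python
from typing import Callable, List

# A handler takes a prompt and returns generated text (stand-in for a model call).
Handler = Callable[[str], str]
# A middleware wraps one handler and returns a new handler.
Middleware = Callable[[Handler], Handler]


def with_retries(max_attempts: int) -> Middleware:
    """Retry the wrapped handler up to max_attempts times on RuntimeError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def middleware(next_handler: Handler) -> Handler:
        def handler(prompt: str) -> str:
            for attempt in range(max_attempts):
                try:
                    return next_handler(prompt)
                except RuntimeError:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
            raise AssertionError("unreachable")
        return handler
    return middleware


def with_fallback(fallback: Handler) -> Middleware:
    """Call a fallback handler if the wrapped handler fails."""
    def middleware(next_handler: Handler) -> Handler:
        def handler(prompt: str) -> str:
            try:
                return next_handler(prompt)
            except RuntimeError:
                return fallback(prompt)
        return handler
    return middleware


def with_approval(approve: Callable[[str], bool]) -> Middleware:
    """Block the request unless the approval callback returns True."""
    def middleware(next_handler: Handler) -> Handler:
        def handler(prompt: str) -> str:
            if not approve(prompt):
                raise PermissionError("request was not approved")
            return next_handler(prompt)
        return handler
    return middleware


def compose(base: Handler, middlewares: List[Middleware]) -> Handler:
    """Wrap base so that the first middleware in the list runs outermost."""
    handler = base
    for mw in reversed(middlewares):
        handler = mw(handler)
    return handler


# Hypothetical stand-ins for a primary and a backup model.
calls = {"primary": 0}


def flaky_primary(prompt: str) -> str:
    calls["primary"] += 1
    raise RuntimeError("primary model unavailable")


def backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"


generate = compose(
    flaky_primary,
    [
        with_approval(lambda p: "delete" not in p),  # gate runs first
        with_fallback(backup),                       # then failover
        with_retries(3),                             # retries wrap the primary
    ],
)
```

Calling `generate("summarize the report")` tries the primary three times and then returns the backup's answer, while `generate("delete all records")` raises `PermissionError` before any model is called. The ordering is the design choice worth noticing: the approval gate sits outermost so a rejected request never reaches retry or failover logic.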

"CoreWeave launches Sandboxes for secure AI runs": CoreWeave ships isolated, stateful execution environments for RL and agent evaluations that run on CKS or serverless via Weights & Biases. Outcome engineers can use these sandboxes to run high-fidelity evaluations, reproduce agent behavior, and contain exploits without touching production, aligning with Build the Island and Legible Landscapes (Principles 07 & 06).

"Developers can now debug and evaluate AI agents locally with Raindrop's open-source Workshop": Workshop is an MIT-licensed tool that brings local, real-time agent debugging and self-healing evaluation. Local observability and automated checks shorten the feedback loop for agent failures and let teams iterate toward safer, more testable agents, matching Legible Landscapes and Validation (Principles 06 & 16).

"Claude Code's '/goals' separates the agent that works from the one that decides it's done": the feature splits execution from evaluation by introducing an independent evaluator that decides when a task is complete. That separation creates an explicit verifier that prevents premature success claims and supports auditability and automated shutdowns, core needs for the Immune System and Validation (Principles 14 & 16).
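The worker/evaluator split can be illustrated with a small loop. This is a sketch of the general pattern only, not how Claude Code implements it: `run_until_verified`, `RunLog`, and the toy worker and evaluator below are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# The worker sees the goal and its earlier attempts, and returns a new output.
Worker = Callable[[str, List[str]], str]
# The evaluator sees only the goal and the output, and returns a verdict.
Evaluator = Callable[[str, str], bool]


@dataclass
class RunLog:
    """Audit trail: every attempt and the evaluator's verdict on it."""
    attempts: List[str] = field(default_factory=list)
    verdicts: List[bool] = field(default_factory=list)


def run_until_verified(
    worker: Worker, evaluator: Evaluator, goal: str, max_steps: int
) -> Tuple[bool, RunLog]:
    """Loop the worker until the evaluator accepts or the budget runs out."""
    log = RunLog()
    for _ in range(max_steps):
        output = worker(goal, log.attempts)
        log.attempts.append(output)
        done = evaluator(goal, output)  # the worker never decides this itself
        log.verdicts.append(done)
        if done:
            return True, log
    return False, log  # budget exhausted: stop instead of claiming success


# Toy stand-ins: the worker adds one item per attempt.
FRUITS = ["apple", "banana", "cherry", "date"]


def toy_worker(goal: str, history: List[str]) -> str:
    return ", ".join(FRUITS[: len(history) + 1])


def toy_evaluator(goal: str, output: str) -> bool:
    # Accept only when the output lists at least three items.
    return len(output.split(", ")) >= 3


ok, log = run_until_verified(toy_worker, toy_evaluator, "list three fruits", 5)
```

Here `ok` is `True` after three attempts, and `log.verdicts` records the two rejections that preceded acceptance. With `max_steps=2` the same call returns `False`, which is the automated-shutdown case: the run ends without the worker ever being able to declare itself finished.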

"TestMu AI Launches Test.md For Kane CLI": TestMu AI releases a markdown-first, replayable test format for the Kane CLI that turns live exploratory sessions into executable tests readable by both humans and agents. Executable, replayable tests let outcome engineers capture real sessions as artifacts for continuous testing, provenance, and proof of outcome, applying Artifacts and Documentation practices (Principles 08 & 13).