Agent Tooling & Evaluation: Code, Models, and Artifacts

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments shows Hugging Face and Meta exposing agents to production-grade tools via a Calendar Gym to surface failures in permissions, temporal reasoning, and multi-step coordination. Outcome engineers must run the same kind of sandboxed, instrumented evaluations to catch permission gaps and coordination failures before agents touch production — Principle 16.
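
A minimal sketch of what a sandboxed, instrumented evaluation can look like, assuming a toy in-memory calendar. This is not the OpenEnv or Calendar Gym API; `ToyCalendarEnv`, `call`, and `permission_failures` are hypothetical names chosen for illustration. Every tool call is checked against a permission table and written to a trace, so permission gaps show up as data rather than as production incidents.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCalendarEnv:
    """Hypothetical in-memory calendar sandbox (not the OpenEnv API)."""
    permissions: dict                            # user -> set of allowed actions
    events: list = field(default_factory=list)   # sandboxed state, never a real calendar
    trace: list = field(default_factory=list)    # one record per attempted tool call

    def call(self, user, action, **args):
        """Attempt a tool call, log it, and apply it only if permitted."""
        allowed = action in self.permissions.get(user, set())
        record = {"user": user, "action": action, "args": args, "ok": allowed}
        self.trace.append(record)
        if not allowed:
            record["error"] = "permission_denied"
            return record
        if action == "create_event":
            self.events.append({"owner": user, **args})
        return record


def permission_failures(env):
    """Return every traced call that was rejected by the permission check."""
    return [r for r in env.trace if not r["ok"]]


# Example episode: alice may create events, bob may not.
env = ToyCalendarEnv(permissions={"alice": {"create_event"}, "bob": set()})
env.call("alice", "create_event", title="Standup", start="2025-01-06T09:00")
env.call("bob", "create_event", title="Retro", start="2025-01-06T10:00")
failures = permission_failures(env)
```

After the episode, `env.events` holds only the permitted event and `failures` holds the rejected call, which is the kind of record an evaluator would aggregate across many runs.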

CodeRLM — Tree-sitter-backed code indexing for LLM agents introduces a tree-sitter-powered code index so agents can retrieve precise, structured code context for reasoning and edits. Reliable, structured code retrieval reduces context noise and makes automated edits and code reasoning tractable in agent pipelines — Principle 11.
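
To make "structured code retrieval" concrete, here is a small sketch that swaps in Python's standard-library `ast` module for tree-sitter, so it only handles Python source. It is not CodeRLM's implementation; `build_symbol_index` and `get_symbol_source` are hypothetical helpers. The idea it illustrates is the same: index symbols by name and line span, then hand the agent exactly one definition instead of a whole file.

```python
import ast


def build_symbol_index(source: str) -> dict:
    """Map each function/class name to its kind and 1-based line span.

    Simplification: symbols that share a name overwrite each other.
    """
    index = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            index[node.name] = {
                "kind": kind,
                "start": node.lineno,
                "end": node.end_lineno,  # available in Python 3.8+
            }
    return index


def get_symbol_source(source: str, index: dict, name: str) -> str:
    """Return only the lines that define the named symbol."""
    span = index[name]
    lines = source.splitlines()
    return "\n".join(lines[span["start"] - 1 : span["end"]])


SAMPLE = '''\
def add(a, b):
    return a + b


class Greeter:
    def hello(self):
        return "hi"
'''

index = build_symbol_index(SAMPLE)
snippet = get_symbol_source(SAMPLE, index, "add")
```

Here `snippet` contains just the two lines of `add`, which is the reduced-noise context the summary refers to. A tree-sitter-backed index would extend the same lookup to many languages.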

Introducing Showboat and Rodney — agents demo what they’ve built presents tools that force agents to produce executable Markdown demos and CLI-driven browser artifacts so overseers can verify generated code actually works. Requiring executable artifacts moves oversight from claims to evidence and is a practical step toward auditable outcomes — Principle 08.
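
One way to picture "executable Markdown" as evidence is a checker that pulls the fenced Python blocks out of a demo document and actually runs them. This sketch is an assumption about the general technique, not Showboat's implementation or CLI; `run_markdown_demo` is a hypothetical helper. Each block runs in a separate interpreter process and its exit code and output are recorded for the overseer.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block
BLOCK_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_markdown_demo(markdown: str, timeout: float = 10.0) -> list:
    """Run every fenced Python block in its own process and collect the evidence."""
    results = []
    for code in BLOCK_RE.findall(markdown):
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        results.append({
            "code": code,
            "returncode": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,
        })
    return results


# A two-block demo document: the first block succeeds, the second fails.
demo = (
    "# Demo\n\n"
    + FENCE + "python\nprint(2 + 3)\n" + FENCE + "\n\n"
    + FENCE + "python\nraise SystemExit(1)\n" + FENCE + "\n"
)
results = run_markdown_demo(demo)
all_passed = all(r["returncode"] == 0 for r in results)
```

A reviewer can then gate acceptance on `all_passed` and read the captured output for any block that failed, rather than trusting the agent's description of its own work.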

Harness engineering: leveraging Codex in an agent-first world documents OpenAI’s approach of redirecting engineers to design agent-ready environments and feedback loops, enabling large products with minimal human-written code. If you build agent systems, prioritize harnesses, edit tooling, and tight feedback channels as the delivery surface for agents rather than prompts alone — Principle 07.

Z.ai launches GLM-5, flagship open-weight model for reasoning, coding, and agentic tasks covers the release of an open-weight LLM optimized for long-horizon reasoning, coding, and persistent agentic workflows. Open-weight agentic models let outcome engineering teams self-host, iterate on orchestration and safety layers, and avoid vendor lock-in when composing long-running agent systems — Principle 09.