← Latest Update

Agent Ops: Eval, RAG, Injection, Orchestration, Delegation

olmo-eval: An evaluation workbench for the model development loop. AllenAI publishes olmo-eval, a reproducible evaluation workbench built for iterative LLM development that supports agentic evaluations and prompt-level analysis. Outcome engineers gain a standardized, auditable loop for measuring regressions and validating agent behaviors across releases (Principles 13 & 16).

PixelRAG beats text parsers on accuracy and cuts AI agent token costs 10x. PixelRAG indexes webpage screenshots and uses VLM readers to boost RAG accuracy up to 18% while slashing token costs by tenfold. If your agents ground on web content, pixel-first retrieval is an immediate lever to improve fidelity and operating cost (Principle 11).

AI Agents Still Can’t Stop Prompt Injection Attacks, Researchers Warn. Researchers show state-of-the-art agents remain highly vulnerable to prompt-injection, with attacks succeeding in over 79% of tests. Treat this as a production-level hazard: design layered sanitization, provenance checks, and runtime guards into agent inputs and tool chains to avoid takeover and data corruption (Principles 14 & 16).

architect-loop: Repo-centered Claude Fable planning with Codex builders. architect-loop wires Claude planning and GPT Codex builders around repo-driven specs, frozen gates, and sandboxed worktrees to reduce token spend and enforce human review. This repo-centric pattern is a practical blueprint for making agent work auditable, versioned, and gateable in engineering pipelines (Principles 03 & 07).

How we made GitHub Copilot CLI more selective about delegation. GitHub reworked Copilot CLI to cut unnecessary subagent handoffs, lowering tool failures and latency by smarter delegation and parallelization. Use selective-delegation heuristics and runtime policies like theirs to shrink failure surface area and improve agent throughput in production (Principle 09).