Hardening agents: benchmarks, infra, and the new model economy

How We Broke Top AI Agent Benchmarks: And What Comes Next. UC Berkeley researchers build an automated agent that exploits eight major agent benchmarks, exposing systemic vulnerabilities that inflate capability scores. Outcome engineers should treat benchmarks as adversarial surfaces: red-team the evaluation, harden the harness, and revalidate scores continually (Principles 02 & 16).
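One cheap red-team check is to run an agent that does no work and see which tasks it still passes. The sketch below is illustrative only and is not the Berkeley team's method; `Task`, `null_agent`, `audit_benchmark`, and the two toy graders are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    task_id: str
    prompt: str
    grade: Callable[[str], bool]  # hypothetical grader: True if the output passes


def null_agent(prompt: str) -> str:
    """Does no work at all; any task it passes has a leaky grader."""
    return ""


def audit_benchmark(
    tasks: List[Task], agent: Callable[[str], str] = null_agent
) -> List[str]:
    """Return the ids of tasks that the given (do-nothing) agent passes."""
    return [t.task_id for t in tasks if t.grade(agent(t.prompt))]


# Toy tasks (assumptions): one strict grader, one that only checks for
# the absence of the word "error" and so accepts an empty answer.
tasks = [
    Task("strict", "What is 2 + 2?", lambda out: out.strip() == "4"),
    Task("leaky", "Summarize the report.", lambda out: "error" not in out.lower()),
]

suspicious = audit_benchmark(tasks)
print(suspicious)  # ['leaky']
```

Any task id this returns is a candidate for a grader fix before the benchmark's scores are trusted. Running the same audit on every harness change is one way to make validation continual.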

Google’s TurboQuant compression likely expands memory chip demand, analysts say. TurboQuant promises large model compression gains, but analysts expect it to raise overall memory-chip demand rather than lower it, because it changes the trade-off between compute and memory. That shifts deployment economics and capacity planning for agents: plan for different latency and cost profiles, and build observability into model packing and serving (Principle 12).

These startups are racing to make AI safe for the Pentagon’s most closely guarded secrets. Startups are building secure AI infrastructure and sandboxed clouds so the U.S. defense community can run LLMs without leaking classified data. For outcome engineers, this highlights the designs you’ll need for isolation, provenance, audited compute, and policy-enforced gates when agents operate on sensitive assets (Principles 07 & 10).
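A policy-enforced gate with an audited trail can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any vendor's system: the classification labels in `LEVELS`, the `AuditLog` class, and the `gate` function are all hypothetical. Each decision is appended to a hash-chained log so that later tampering is detectable.

```python
import hashlib
import json
from typing import Dict, List

# Hypothetical classification labels, ordered from least to most sensitive.
LEVELS = {"public": 0, "internal": 1, "secret": 2}


class AuditLog:
    """Append-only log in which each entry's hash covers the previous hash."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, str]] = []

    def append(self, record: Dict[str, str]) -> None:
        prev = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode()).hexdigest()
        self.entries.append({"prev": prev, "body": body, "hash": digest})

    def verify(self) -> bool:
        """Recompute the chain; False means an entry was altered or reordered."""
        prev = "0" * 64
        for entry in self.entries:
            expected = hashlib.sha256((prev + entry["body"]).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


def gate(agent_level: str, asset_level: str, action: str, log: AuditLog) -> bool:
    """Allow the action only if the agent's level covers the asset's level.

    Every decision, allowed or denied, is written to the audit log.
    """
    allowed = LEVELS[agent_level] >= LEVELS[asset_level]
    log.append(
        {
            "agent": agent_level,
            "asset": asset_level,
            "action": action,
            "allowed": str(allowed),
        }
    )
    return allowed


log = AuditLog()
print(gate("internal", "secret", "read", log))  # False
print(gate("secret", "internal", "read", log))  # True
print(log.verify())  # True
```

Logging denials as well as approvals is a deliberate choice here: the record of what an agent attempted is often as useful to an auditor as the record of what it was allowed to do.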

Starbucks’ game plan to roll out AI chatbots at cafés could serve as a ‘litmus test’ for the industry. Starbucks pilots Green Dot Assist to help baristas with recipes, substitutions, troubleshooting, and staffing at scale. Study their human-agent workflows and telemetry patterns: durable outcomes come from tight human handoffs, clear escalation paths, and work that leaves verifiable artifacts (Principles 03 & 09).

The inevitable need for an open model consortium. The piece argues a funded industry consortium is the sustainable path to well-resourced near-frontier open models. Outcome engineers should track how consortium rules will affect model access, licensing, reproducibility, and standards for artifacts and audits, because those rules will reshape procurement and integration strategies (Principle 12).