Agent Infrastructure: Models, Benchmarks, and Orchestration
Gemini 3.1 Pro: A smarter model for your most complex tasks debuts with boosted reasoning and a 1M‑token context across the Gemini API, Vertex AI, the Gemini app, and NotebookLM. This shifts what outcome engineers can automate: longer plans, richer state, and multistep agentic workflows that change orchestration and evaluation requirements (Principle 09).
Step 3.5 Flash: Fast Enough to Think. Reliable Enough to Act ships an MoE-powered, agent-ready model (11B active / 196B total parameters) built for fast, long‑context local deployments. That matters for teams building islands of compute: you can push substantive agent logic to on‑prem or edge hosts, lowering latency, cost, and dependency on centralized API providers (Principle 07).
Introducing EVMbench: Benchmarking AI agents for detecting, exploiting, and patching high-severity smart contract vulnerabilities launches a targeted benchmark that measures agents’ ability to find, exploit, and fix critical smart contract bugs. Outcome engineers get a concrete evaluation surface for security‑critical agent behaviors, making validation, red‑teaming, and automated patching repeatable (Principles 14 & 16).
Cogent Security raises $42M to scale AI agents for enterprise vulnerability remediation announces funding for governed agents that autonomously remediate vulnerabilities at scale. This is a live example of agent orchestration plus governance: teams must design audit trails, escalation paths, and human‑in‑the‑loop controls before agents touch production systems (Principles 09 & 15).
Anthropic’s Agent Autonomy study publishes telemetry showing how Claude Code behaves in the wild, including user approval patterns and divergence from idealized autonomy estimates. Those operational signals are exactly what outcome engineers need to instrument and validate agent behavior in production: they feed metrics, guardrails, and incident response for audited outcomes (Principle 16).
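As a rough illustration of the governance and telemetry themes in the last two items, here is a minimal sketch of a human-in-the-loop approval gate that writes an audit trail and reports how often actions were escalated. Everything in it is hypothetical: the `ApprovalGate` and `AuditEvent` names, the two-level risk scheme, and the `ask_human` callback are assumptions made for the example, not the design of any product or study mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditEvent:
    action: str
    risk: str      # "low" or "high" (hypothetical two-level scheme)
    decision: str  # "auto_approved", "human_approved", or "human_rejected"


@dataclass
class ApprovalGate:
    # Hypothetical callback that asks a human reviewer; True means approve.
    ask_human: Callable[[str], bool]
    audit_log: List[AuditEvent] = field(default_factory=list)

    def request(self, action: str, risk: str) -> bool:
        """Return True if the agent may run the action; log every decision."""
        if risk == "low":
            decision, allowed = "auto_approved", True
        else:
            # Anything not explicitly low risk is escalated to a human.
            allowed = self.ask_human(action)
            decision = "human_approved" if allowed else "human_rejected"
        self.audit_log.append(AuditEvent(action, risk, decision))
        return allowed

    def escalation_rate(self) -> float:
        """Fraction of logged actions that were escalated to a human."""
        if not self.audit_log:
            return 0.0
        escalated = sum(
            1 for event in self.audit_log if event.decision != "auto_approved"
        )
        return escalated / len(self.audit_log)


def demo_gate() -> ApprovalGate:
    # Stand-in reviewer for the demo: rejects anything that mentions "prod".
    gate = ApprovalGate(ask_human=lambda action: "prod" not in action)
    gate.request("open ticket for triage", risk="low")
    gate.request("patch staging dependency", risk="high")
    gate.request("restart prod database", risk="high")
    return gate
```

In a real deployment the reviewer callback would be a ticketing or chat integration, and the audit log would go to durable storage rather than an in-memory list. The escalation rate is one example of an approval-pattern metric a team might track over time.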