Agent Skills, Sandboxes, and Agent‑Assisted Discovery

Custom Kernels for All from Codex and Claude shows agents using a ‘cuda-kernels’ skill to generate production-grade CUDA kernels, integrate them with PyTorch, benchmark them on H100, and publish them to the Hub. Outcome engineers get a pattern for shipping executable, performance-critical artifacts from agents, which in turn calls for artifact versioning, benchmarked CI, and artifact provenance (Principles 08, 16).
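One way to picture the versioning, benchmarking, and provenance requirement is a small record attached to every published kernel, plus a CI gate that compares it to a baseline. The following is a minimal sketch, not the workflow from the post: the `KernelRecord` fields, the `passes_gate` threshold, the kernel name "rmsnorm", and the latency numbers are all hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class KernelRecord:
    """Hypothetical provenance record for one agent-generated kernel build."""
    name: str
    version: str
    sha256: str          # content hash of the built artifact
    generated_by: str    # free-form label for the agent or skill that produced it
    latency_ms: float    # benchmark result measured in CI (placeholder values below)


def make_record(name, version, artifact_bytes, generated_by, latency_ms):
    """Hash the artifact bytes and bundle them with version and benchmark data."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return KernelRecord(name, version, digest, generated_by, latency_ms)


def passes_gate(candidate, baseline, max_regression=0.05):
    """Allow publishing only if latency is within max_regression of the baseline."""
    return candidate.latency_ms <= baseline.latency_ms * (1 + max_regression)


# Illustrative data only: the bytes and timings are made up for the sketch.
baseline = make_record("rmsnorm", "0.1.0", b"old-binary", "cuda-kernels skill", 1.00)
candidate = make_record("rmsnorm", "0.2.0", b"new-binary", "cuda-kernels skill", 0.80)

print(json.dumps(asdict(candidate), indent=2))
print("publish allowed:", passes_gate(candidate, baseline))
```

Keeping the content hash next to the benchmark number means a later reader can tell both which binary was measured and which agent run produced it.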

cloudrouter: Skill letting Claude Code/Codex spin up VMs and GPUs. This skill lets agents start cloud sandboxes and GPUs, run commands, and automate browsers directly from the CLI. That turns infrastructure provisioning into an agent skill, so teams must design sandboxing, cost controls, and approval gates for agents that can self-provision (Principles 07, 09).
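The cost-control and approval-gate idea can be sketched as a policy check that sits between the agent and whatever actually provisions resources. This is an illustrative sketch only and does not reflect cloudrouter's interface: `ProvisionRequest`, `ProvisionPolicy`, the resource labels, and the dollar figures are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProvisionRequest:
    """Hypothetical request an agent submits before anything is provisioned."""
    resource: str        # e.g. "cpu-vm" or "gpu-vm" (made-up labels)
    hourly_usd: float
    hours: float

    @property
    def cost(self):
        return self.hourly_usd * self.hours


class ProvisionPolicy:
    """Tracks a shared budget and decides allow / needs_approval / deny."""

    def __init__(self, budget_usd, auto_approve_below_usd):
        self.remaining = budget_usd
        self.auto_limit = auto_approve_below_usd

    def decide(self, req, human_approved=False):
        # Hard stop: never exceed the remaining budget, even with approval.
        if req.cost > self.remaining:
            return "deny"
        # Expensive requests wait for a human unless one has already signed off.
        if req.cost > self.auto_limit and not human_approved:
            return "needs_approval"
        # Only an allowed request consumes budget.
        self.remaining -= req.cost
        return "allow"


policy = ProvisionPolicy(budget_usd=100.0, auto_approve_below_usd=5.0)
small = ProvisionRequest("cpu-vm", hourly_usd=0.5, hours=2)
large = ProvisionRequest("gpu-vm", hourly_usd=4.0, hours=10)

print(policy.decide(small))                       # cheap request
print(policy.decide(large))                       # expensive, no sign-off yet
print(policy.decide(large, human_approved=True))  # same request after sign-off
```

Separating the decision from the provisioning call keeps the gate testable on its own and makes it harder for an agent to bypass the budget by calling the cloud API directly.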

IronClaw: Rust-based assistant that runs tools in isolated WASM sandboxes. IronClaw runs untrusted tools safely in Rust-backed WASM sandboxes while keeping all data local and encrypted. It provides a concrete architecture for secure agent tool execution and local-first deployments, informing immune-system patterns for runtime isolation and threat containment (Principles 14, 07).

GPT-5.2 derives a new result in theoretical physics reports GPT-5.2 conjecturing and helping prove a new nonzero single-minus gluon tree amplitude, which the authors then confirmed analytically. This shows that agents can produce verifiable research artifacts, raising the bar for audit trails, reproducible artifacts, and verification pipelines in outcome engineering (Principles 03, 16).

Scaling Social Science Research introduces GABRIEL, a system that uses GPT to turn unstructured text and images into consistent quantitative measurements, scaling up qualitative analysis. Outcome engineers can use similar context-engineering and measurement pipelines to create dependable metrics from messy data, feeding more reliable outcome graphs and validation workflows (Principles 11, 16).