Agents in the Wild: Proven Models, Micro‑Deployments, NVMe GPUs & ASICs

How Tinfoil Proves Exactly What Model Is Running. Modelwrap cryptographically binds published weights to a running server, so attestation and kernel-level verification can prove exactly which model is being served. Outcome engineers can then assert model identity for audits, reproducibility, and compliance: a Ground Truth and Law play that closes trust gaps in inference pipelines (Principles 02 & 10).
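To make "binding published weights to a running server" concrete, here is a minimal sketch of the simplest building block: comparing a local weight file against a published SHA-256 digest. It is not Modelwrap's actual mechanism, which the summary describes as relying on attestation and kernel-level verification. The function names and the idea of a published hex digest are assumptions for illustration.

```python
import hashlib
import hmac


def weights_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a weight file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large weight files never sit fully in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex):
    """True if the file's digest equals the published one (hypothetical input)."""
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(weights_digest(path), published_hex.lower())
```

A check like this only shows that a file on disk matches a digest. Proving that a remote server is actually serving that file is the harder part, and that is what the attestation layer is for.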

NTransformer: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU. It streams model layers from NVMe storage directly to the GPU, bypassing the CPU, so the full model does not need to be resident in GPU memory at once. That shifts on-prem cost/performance tradeoffs and forces new deployment patterns for outcome engineers building compact inference islands and scheduling layer-by-layer I/O (Principles 07 & 12).
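A quick back-of-the-envelope check shows why streaming is needed at all. The sketch below assumes a nominal 70 billion parameters, 80 transformer layers, and the RTX 3090's 24 GiB of memory. It ignores embeddings, activations, and the KV cache, and it says nothing about which precision NTransformer actually uses.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_param):
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_param // 8


def fits_in_vram(n_params, bits_per_param, vram_bytes):
    """True if the weights alone fit in the given amount of GPU memory."""
    return weight_bytes(n_params, bits_per_param) <= vram_bytes


N_PARAMS = 70 * 10**9  # nominal count implied by the "70B" name
N_LAYERS = 80          # assumption: transformer block count for the 70B model
VRAM = 24 * GIB        # RTX 3090 memory

if __name__ == "__main__":
    for bits in (16, 8, 4):
        total = weight_bytes(N_PARAMS, bits)
        per_layer = total / N_LAYERS  # rough: treats all layers as equal
        print(f"{bits:>2}-bit: {total / GIB:6.1f} GiB total, "
              f"{per_layer / GIB:5.2f} GiB per layer, "
              f"fits: {fits_in_vram(N_PARAMS, bits, VRAM)}")
```

Even at 4 bits per weight the parameters alone come to 35 GB, more than the card holds, while a single layer is small enough to load on demand. That gap is what layer streaming exploits.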

zclaw: personal AI assistant in under 888 KiB for ESP32. zclaw fits a personal AI assistant with Telegram chat, GPIO control, and persistent memory into under 888 KiB on an ESP32. It shows that agents can live at the extreme edge, letting outcome engineers design privacy-first, offline interactions and legible device landscapes (Principles 06 & 07).

How Taalas ‘prints’ an LLM onto a chip. Taalas embeds Llama 3.1 weights as fixed silicon, reaching a reported ~17,000 tokens/sec with large gains in power and cost efficiency. Outcome engineers must consider fixed-function ASICs as a deployment tier that trades update flexibility for operational scale, which changes artifact, graph, and validation strategies (Principles 07 & 11).

Elixir/BEAM Doesn’t Solve Everything for AI Agents — Addressing the Criticisms. The author argues BEAM alone doesn’t provide durable execution for long-lived agents and recommends pairing it with persistent state or workflow systems (Temporal, durable_object, Oban). That matters for orchestration designers: durable state and workflow guarantees are necessary if agents must survive restarts, ensure ordered effects, and meet operational SLAs (Principles 09 & 12).
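The durable-execution point can be illustrated with a minimal checkpointing sketch. It is written in Python rather than Elixir and does not use the Temporal or Oban APIs. The file-based store, the function names, and the `crash_after` test hook are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_state(path):
    """Load the last checkpoint, or start fresh if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "results": []}


def save_state(path, state):
    """Write the checkpoint atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def run_agent(path, steps, crash_after=None):
    """Run steps in order, resuming from the checkpoint after a restart."""
    state = load_state(path)
    executed = 0
    while state["next_step"] < len(steps):
        if crash_after is not None and executed == crash_after:
            raise RuntimeError("simulated crash")  # hypothetical test hook
        i = state["next_step"]
        state["results"].append(steps[i](state["results"]))
        state["next_step"] = i + 1
        save_state(path, state)  # checkpoint after every completed step
        executed += 1
    return state["results"]
```

A crash between running a step and saving its checkpoint would cause that step to run again on restart, so this gives at-least-once execution and steps need to be idempotent. Workflow systems exist largely to handle that gap and the ordering guarantees around it.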