Production AI

12-factor Agents - Patterns of reliable LLM applications // Dexter Horthy

Drawing from conversations with top AI builders, Dex argues that production-grade AI agents are not magical loops but well-architected software. This talk introduces "12-Factor Agents," a methodology centered on "Context Engineering" to build reliable, high-performance LLM-powered applications by applying rigorous software engineering principles.
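The core of the "Context Engineering" idea is that the application, not a framework, owns what goes into the model's context window. A minimal sketch of that pattern is below; all names and the event format are illustrative assumptions, not taken from the talk.

```python
# Sketch of explicit context assembly: the app owns its state (a list of
# events) and deterministically serializes only what the model should see,
# rather than letting a framework accumulate chat history implicitly.

def build_context(system_rules, events, max_items=10):
    """Assemble the prompt from owned state: a system section plus
    only the most recent events, serialized in a fixed format."""
    lines = [f"SYSTEM: {system_rules}"]
    for event in events[-max_items:]:
        lines.append(f"{event['type'].upper()}: {event['content']}")
    return "\n".join(lines)

events = [
    {"type": "user", "content": "Deploy the staging branch"},
    {"type": "tool_result", "content": "deploy queued: job-123"},
]
prompt = build_context("You are a deployment assistant.", events)
```

Because the serialization is a plain function over application state, it can be unit-tested and versioned like any other code path.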

Practical tactics to build reliable AI apps — Dmitry Kuchin, Multinear

Moving an AI PoC from 50% to 100% reliability requires a new development paradigm. This talk introduces a practical, evaluations-first approach, reverse-engineering tests from real-world user scenarios and business outcomes to build a robust benchmark, prevent regressions, and enable confident optimization.
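The evaluations-first loop can be sketched as a benchmark of scenario-derived test cases scored for pass rate. The case format and checker below are illustrative assumptions, not Multinear's actual API.

```python
# Sketch of an evals-first benchmark: each case is reverse-engineered from
# a real user scenario, with the expected business outcome encoded as
# required phrases. The aggregate pass rate becomes the regression metric.

def run_benchmark(cases, app):
    """Run every scenario through the app and return the pass rate."""
    passed = 0
    for case in cases:
        output = app(case["input"])
        if all(term.lower() in output.lower() for term in case["must_include"]):
            passed += 1
    return passed / len(cases)

# A stand-in "app" plus two reverse-engineered scenarios.
def toy_app(question):
    return ("Refunds are processed within 5 business days "
            "via the original payment method.")

cases = [
    {"input": "How long do refunds take?", "must_include": ["5 business days"]},
    {"input": "How is a refund paid out?", "must_include": ["original payment method"]},
]
score = run_benchmark(cases, toy_app)  # 1.0 here; a real PoC starts far lower
```

Running this benchmark on every change is what turns "it seems better" into a measurable move from 50% toward 100% without regressions.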

Building Agents at Cloud Scale — Antje Barth, AWS

A deep dive into building and scaling production-ready AI agents, detailing a model-driven approach using the open-source 'Strands' SDK and a cloud-native architecture for deploying remote tools with MCP and AWS Lambda.

Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

Traditional benchmarks and leaderboards are insufficient for production AI. This summary details a practical, multi-layered evaluation strategy, moving from foundational system performance to factual accuracy and finally to safety and bias, using open-source tools like GuideLLM, lm-eval-harness, and Promptfoo.
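The layered strategy can be sketched as a gated pipeline: a model must clear system-performance targets before accuracy is measured, and accuracy before safety. The thresholds and result fields below are illustrative assumptions, not outputs of GuideLLM or lm-eval-harness.

```python
# Sketch of a multi-layered eval gate: performance -> accuracy -> safety.
# Each layer only runs if the previous one passed, mirroring the order in
# which the talk moves from foundational to higher-level evaluations.

def evaluate(results, latency_budget_ms=500, accuracy_floor=0.9):
    """Return a per-layer pass/fail report, stopping at the first failure."""
    report = {"performance": results["p95_latency_ms"] <= latency_budget_ms}
    if not report["performance"]:
        return report
    report["accuracy"] = results["factual_accuracy"] >= accuracy_floor
    if not report["accuracy"]:
        return report
    report["safety"] = results["unsafe_outputs"] == 0
    return report

report = evaluate({"p95_latency_ms": 420,
                   "factual_accuracy": 0.93,
                   "unsafe_outputs": 0})
```

Gating keeps the expensive or subjective layers (safety, bias) from being run on models that already fail basic serving requirements.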

AI Agent Development Tradeoffs You NEED to Know

Sherwood Callaway of 11X discusses the architecture of "Alice," an AI Sales Development Representative. He covers the practical decision to use LangGraph for its reliability in production, the challenges of infrastructure and observability when using hosted agent platforms, and their methodology for running Evals to mitigate hallucinations by comparing generated content against source data.
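A grounding-style eval of this kind can be sketched as flagging content words in the generated text that never appear in the source data the agent was given. This token-overlap heuristic is an illustrative stand-in for whatever comparison 11X actually runs.

```python
# Sketch of a hallucination check: any substantive term in the generated
# output that is absent from the source data is flagged as ungrounded.

import re

STOPWORDS = {"which", "that", "this", "with", "from", "have", "been", "into"}

def ungrounded_terms(generated, source, min_len=4):
    """Return content words in the generated text absent from the source."""
    def tokenize(text):
        return {w for w in re.findall(r"[a-z0-9']+", text.lower())
                if len(w) >= min_len and w not in STOPWORDS}
    return sorted(tokenize(generated) - tokenize(source))

source = "Acme Corp raised a $12M Series A in 2023 and sells logistics software."
generated = "Acme Corp, which raised $12M, recently acquired FreightBot."
flags = ungrounded_terms(generated, source)
```

In practice a check like this is a cheap first filter; flagged outputs can then be escalated to a stricter (e.g. model-graded) comparison before anything reaches a prospect.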