MLOps

Evals Aren't Useful? Really?

A deep dive into the critical importance of robust evaluation for building reliable AI agents. The discussion covers bootstrapping evaluation sets, advanced testing techniques such as multi-turn simulations and red teaming, and the necessity of integrating traditional software engineering and MLOps practices into the agent development lifecycle.
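As a rough illustration of the multi-turn simulation idea mentioned above, here is a minimal, generic sketch of an evaluation loop that replays scripted user turns against an agent and scores each response. The `run_agent` stub, the scripted conversation, and the keyword-based check are all hypothetical stand-ins, not the approach presented in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedConversation:
    """A scripted multi-turn test case: user turns plus expected keywords per turn."""
    user_turns: list[str]
    expected_keywords: list[list[str]]
    history: list[dict] = field(default_factory=list)


def run_agent(history: list[dict]) -> str:
    """Hypothetical agent under test; replace with a real model or agent call."""
    return "I've checked your order and it ships tomorrow."


def simulate(case: SimulatedConversation) -> float:
    """Replay the scripted user turns and return the fraction of responses that pass a simple check."""
    passed = 0
    for turn, keywords in zip(case.user_turns, case.expected_keywords):
        case.history.append({"role": "user", "content": turn})
        reply = run_agent(case.history)
        case.history.append({"role": "assistant", "content": reply})
        if all(k.lower() in reply.lower() for k in keywords):
            passed += 1
    return passed / len(case.user_turns)


if __name__ == "__main__":
    case = SimulatedConversation(
        user_turns=["Where is my order?", "Can you expedite it?"],
        expected_keywords=[["order"], ["expedite", "ship"]],
    )
    print(f"turn pass rate: {simulate(case):.2f}")
```

Bootstrapping an evaluation set can start from exactly this kind of scripted case: a handful of conversations written by hand, later expanded with real traces that failed in production.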

Why Your Cloud Isn't Ready for Production AI

Zhen Lu, CEO of Runpod, discusses the shift from Web 2.0 architectures to an "AI-first" cloud. The conversation covers the unique hardware and software requirements for production AI, key use cases like generative media and enterprise agents, and the critical challenges of reliability and operationalization in the new AI stack.

Beyond Chatbots: How to build Agentic AI systems with Google Gemini // Philipp Schmid

A deep dive into the evolution from static chatbots to dynamic, agentic AI systems. Philipp Schmid of Google DeepMind explores how to design, build, and evaluate AI agents that leverage structured outputs, function calling, and workflow orchestration with Google Gemini, covering key agentic patterns and the future of AI development.
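To make the function-calling pattern concrete, here is a minimal sketch using the google-genai Python SDK. It assumes a `GEMINI_API_KEY` (or `GOOGLE_API_KEY`) environment variable; the `get_weather` tool and the model name are illustrative choices, not something prescribed by the talk.

```python
from google import genai
from google.genai import types


def get_weather(city: str) -> str:
    """Toy tool: return a canned weather report for a city (stand-in for a real API)."""
    return f"It is sunny and 22°C in {city}."


# Assumes the API key is set in the environment; the model name may differ in your setup.
client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Should I bring an umbrella in Berlin today?",
    config=types.GenerateContentConfig(tools=[get_weather]),
)
print(response.text)
```

Passing a plain Python function as a tool lets the SDK handle the call loop, which is the building block the talk layers agentic patterns and workflow orchestration on top of.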

Why You Should Care About Observability in LLM Workflows

An inside look at AlwaysCool.ai's journey from simple GPT wrappers to a production-ready agentic infrastructure. This talk covers the evolution from synchronous tools to asynchronous, multi-step flows orchestrated by LangGraph, the critical role of OpenTelemetry for compliance and observability, and the architectural patterns for serving centralized AI agents with FastAPI.
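Here is a minimal, hypothetical sketch of that serving pattern: a FastAPI endpoint running an asynchronous two-step flow with each step wrapped in an OpenTelemetry span. The step functions are placeholders, and exporter configuration (and the LangGraph orchestration the talk describes) is omitted.

```python
import asyncio

from fastapi import FastAPI
from opentelemetry import trace

app = FastAPI()
tracer = trace.get_tracer("agent-service")  # spans go to whatever exporter is configured


async def retrieve_context(question: str) -> str:
    """Placeholder retrieval step; a real service would query a vector store or API."""
    await asyncio.sleep(0)
    return f"context for: {question}"


async def generate_answer(question: str, context: str) -> str:
    """Placeholder generation step; a real service would call an LLM here."""
    await asyncio.sleep(0)
    return f"answer to '{question}' using '{context}'"


@app.post("/ask")
async def ask(question: str) -> dict:
    """Run an async multi-step flow, tracing each step so it shows up in observability tooling."""
    with tracer.start_as_current_span("agent-request"):
        with tracer.start_as_current_span("retrieve"):
            context = await retrieve_context(question)
        with tracer.start_as_current_span("generate"):
            answer = await generate_answer(question, context)
    return {"answer": answer}
```

Because every step emits a span, compliance and debugging questions ("what did the agent do, in what order, and how long did each step take?") can be answered from traces rather than ad hoc logs.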

How to Optimize AI Agents in Production

Engineers building AI agents face a combinatorial explosion of configuration choices (prompts, models, parameters), leading to guesswork and suboptimal results. This talk introduces a structured, data-driven approach using multi-objective optimization to systematically explore this vast design space. Learn how the Traigent SDK helps engineers efficiently identify optimal tradeoffs between cost, latency, and accuracy, yielding significant quality improvements and cost reductions without manual trial-and-error.
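As a generic illustration of the multi-objective idea (not the Traigent SDK, whose API is not shown here), the sketch below enumerates a small configuration grid, scores each configuration with stand-in cost, latency, and accuracy numbers, and keeps only the Pareto-optimal tradeoffs.

```python
import itertools
import random

# Hypothetical configuration space; a real search would also cover prompts, tools, etc.
MODELS = ["small-model", "large-model"]
TEMPERATURES = [0.0, 0.7]
MAX_TOKENS = [256, 1024]


def evaluate(config: dict) -> dict:
    """Stand-in for running an eval suite; returns cost ($), latency (s), accuracy (0-1)."""
    random.seed(str(config))  # deterministic fake numbers for the sketch
    scale = 2.0 if config["model"] == "large-model" else 1.0
    return {
        "cost": round(0.01 * scale * config["max_tokens"] / 256, 4),
        "latency": round(random.uniform(0.5, 1.5) * scale, 2),
        "accuracy": round(random.uniform(0.6, 0.8) * (1.1 if scale > 1 else 1.0), 3),
    }


def dominates(a: dict, b: dict) -> bool:
    """True if result a is at least as good as b on every objective and strictly better on one."""
    better_or_equal = (a["cost"] <= b["cost"] and a["latency"] <= b["latency"]
                       and a["accuracy"] >= b["accuracy"])
    strictly_better = (a["cost"] < b["cost"] or a["latency"] < b["latency"]
                       or a["accuracy"] > b["accuracy"])
    return better_or_equal and strictly_better


results = []
for model, temp, max_tokens in itertools.product(MODELS, TEMPERATURES, MAX_TOKENS):
    config = {"model": model, "temperature": temp, "max_tokens": max_tokens}
    results.append((config, evaluate(config)))

# Keep only Pareto-optimal configs: no other config beats them on all three objectives.
pareto = [(c, r) for c, r in results
          if not any(dominates(r2, r) for _, r2 in results if r2 is not r)]
for config, metrics in pareto:
    print(config, metrics)
```

Even this toy version shows the payoff over guesswork: instead of one "best" configuration, you get a frontier of defensible tradeoffs to choose from based on budget and latency targets.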

Evaluating AI Agents: Why It Matters and How We Do It

Annie Condon and Jeff Groom from Acre Security detail their practical approach to robustly evaluating non-deterministic AI agents. They share their philosophy that evaluations are critical for quality, introduce their "X-ray machine" analogy for observability, and walk through their evaluation stack, including versioning strategies and the use of tools like Logfire for tracing and Confident AI (DeepEval) for systematic metric tracking.
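For a rough sense of what systematic metric tracking with DeepEval (the open-source library behind Confident AI) looks like, here is a minimal sketch. The test case content is invented, the exact API may differ between DeepEval versions, and LLM-as-judge metrics such as `AnswerRelevancyMetric` assume a configured judge model (e.g. an `OPENAI_API_KEY`); this is not the speakers' actual test suite.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Invented example: check that the agent's answer is relevant to the question asked.
test_case = LLMTestCase(
    input="Which doors can badge group A open after hours?",
    actual_output="After hours, badge group A opens the lobby and the server room only.",
)

# LLM-as-judge metric; requires a judge model to be configured (e.g. OPENAI_API_KEY).
relevancy = AnswerRelevancyMetric(threshold=0.7)

# Runs the metric over the test case and reports pass/fail scores that can be tracked over time.
evaluate(test_cases=[test_case], metrics=[relevancy])
```

Running a suite like this on every agent or prompt version is what turns evaluation from a one-off check into the kind of regression tracking the talk argues for.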