LLMOps

Build and monitor multi-agent contact centers using Weights & Biases

This post explores the shift from costly legacy contact center software to multi-agent AI systems. It demonstrates how to build, monitor, and evaluate these complex agentic systems using the Weights & Biases AI Developer Platform, with a focus on tracing, quality assessment, and ensuring consistent customer support.

Production monitoring for AI applications using W&B Weave

Learn how W&B Weave's online evaluations enable real-time monitoring of AI applications in production, allowing teams to track performance, catch failures, and iterate on quality over time using LLM-as-a-judge scores.

Trust at Scale: Security and Governance for Open Source Models // Hudson Buzby // MLOps Podcast #338

Hudson Buzby from JFrog discusses the critical security, governance, and legal challenges enterprises face when adopting open-source AI models. He highlights the risks lurking in repositories like Hugging Face and argues for a centralized, curated AI gateway as the essential framework for enabling safe, scalable, and cost-effective AI development.

Streamline evaluation, monitoring, optimization of AI data flywheel with NVIDIA and Weights & Biases

A walkthrough of the NVIDIA Data Flywheel Blueprint, demonstrating how to use production data and Weights & Biases to systematically fine-tune AI agents. This process enhances model accuracy and efficiency by creating a continuous improvement cycle, moving beyond the limitations of prompt engineering.

LLMOps for eval-driven development at scale

Mercari's engineering team shares their practical, evaluation-centric approach to LLMOps. Learn how they use tiered evaluations, strategic observability tooling, and rapid iteration to productionize LLM features for over 23 million users, emphasizing that good evals are often more critical than model fine-tuning or RAG.