vLLM

The CEO Behind the Fastest-Growing AI Inference Company | Tuhin Srivastava

Tuhin Srivastava, CEO of Baseten, joins Gradient Dissent to discuss the core challenges of AI inference, from infrastructure and runtime bottlenecks to the practical differences between vLLM, TensorRT-LLM, and SGLang. He shares how Baseten navigated years of searching for a market before the explosion of large-scale models, emphasizing a company-building philosophy focused on avoiding premature scaling and "burning the boats" to chase the biggest opportunities.

Serving Voice AI at $1/hr: Open-source, LoRAs, Latency, Load Balancing - Neil Dwyer, Gabber

An in-depth look at Gabber's experience deploying the Orpheus text-to-speech model to production, covering latency optimization, high-fidelity LoRA-based voice cloning, and a cost-effective inference stack using vLLM with a consistent hash ring for load balancing.
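
The talk does not publish Gabber's actual routing code, but the core idea of consistent-hash load balancing is easy to sketch: hash each voice/LoRA ID onto a ring of backend replicas so the same voice always routes to the same server, keeping its LoRA weights warm in that replica's cache. A minimal Python sketch follows; the node addresses and key names are hypothetical, not taken from the talk.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Route each key (e.g. a voice/LoRA ID) to a stable backend replica."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node gets `vnodes` virtual points on the ring
        # to smooth out the load distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        # First ring point clockwise of the key's hash, wrapping around.
        idx = bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["gpu-0:8000", "gpu-1:8000", "gpu-2:8000"])
print(ring.get_node("voice-alice"))  # the same voice always lands on the same replica
```

A nice property for this use case: adding or removing a replica only remaps the keys adjacent to its ring points, so most voices keep their warm LoRA cache through scaling events.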

Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

Traditional benchmarks and leaderboards are insufficient for production AI. This summary details a practical, multi-layered evaluation strategy, moving from foundational system performance to factual accuracy and finally to safety and bias, using open-source tools like GuideLLM, lm-eval-harness, and Promptfoo.
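
As one concrete example of the factual-accuracy layer, lm-eval-harness exposes a Python entry point (simple_evaluate) alongside its CLI. A minimal sketch; the model checkpoint and task choice here are illustrative, not taken from the workshop.

```python
# pip install lm-eval
import lm_eval

# Run a zero-shot factual-accuracy benchmark against a Hugging Face model.
# The checkpoint and task are illustrative placeholders.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["truthfulqa_mc2"],
    num_fewshot=0,
)
print(results["results"])  # per-task metric dict
```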

Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

A deep dive into SGLang, an open-source LLM serving framework. This summary covers its core features, history, performance optimization techniques such as CUDA graphs and EAGLE-3 speculative decoding, and how to contribute to the project.
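
For a feel of SGLang's surface area, it ships an offline Engine API in addition to its OpenAI-compatible server (launched via python -m sglang.launch_server). A minimal sketch of the offline path, assuming an illustrative Llama checkpoint; CUDA graph capture for decoding is enabled by default, which is one of the optimizations the talk covers.

```python
# pip install "sglang[all]"
import sglang as sgl

# Offline engine; the server equivalent is:
#   python -m sglang.launch_server --model-path <model>
llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")  # illustrative checkpoint

prompts = ["Explain speculative decoding in one sentence."]
sampling_params = {"temperature": 0.7, "max_new_tokens": 64}

outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
```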