TensorRT-LLM

How We Cut LLM Latency 70% With TensorRT in Production

An engineering leader details the journey of self-hosting LLMs at enterprise scale, covering how his team slashed latency by 70% with TensorRT-LLM, optimized GPU costs through counterintuitive scaling, and built a verticalized AI platform for HR tech. The summary explores practical solutions for cold starts, KV cache optimization, and managing the cultural adoption of AI coding agents in engineering teams.

The CEO Behind the Fastest-Growing AI Inference Company | Tuhin Srivastava

Tuhin Srivastava, CEO of Baseten, joins Gradient Dissent to discuss the core challenges of AI inference, from infrastructure and runtime bottlenecks to the practical differences between vLLM, TensorRT-LLM, and SGLang. He shares how Baseten navigated years of searching for a market before the explosion of large-scale models, emphasizing a company-building philosophy focused on avoiding premature scaling and "burning the boats" to chase the biggest opportunities.