How We Cut LLM Latency 70% With TensorRT in Production

An engineering leader describes self-hosting LLMs at enterprise scale: how his team cut latency by 70% with TensorRT-LLM, optimized GPU costs through counterintuitive scaling, and built a verticalized AI platform for HR tech. The summary also covers practical solutions for cold starts, KV cache optimization, and managing the cultural adoption of AI coding agents in engineering teams.