Embeddings

May 19, 2026

Personalization in the Era of LLMs - Shivam Verma, Spotify

Spotify is personalizing open-weight LLMs without full fine-tuning by combining three key components: foundational user embeddings from streaming history, 'Semantic IDs' that tokenize its 100M+ item catalog, and a 'soft tokenization' layer that projects a user's embedding directly into the LLM's context. This allows the model to autoregressively generate the next song or podcast as the next token in a sequence.

May 05, 2026

The Small Model Infrastructure Nobody Built (So We Did) — Filip Makraduli, Superlinked

Filip Makraduli from Superlinked discusses the common infrastructure gaps and profiling mistakes encountered when deploying small embedding and transformer models. He introduces the Superlinked Inference Engine (SIE), an open-source solution designed for dynamic model loading, hot-swapping, and memory-aware eviction to maximize GPU utilization and streamline the path from development to production.

Feb 02, 2026

Real-time features, AI search, Agentic similarities

Varant Zanoyan and Nikhil Simha Raprolu of Zipline AI explain why traditional feature stores are the wrong abstraction. They detail the journey of Chronon, the open-source engine born at Airbnb and battle-tested at Stripe, which focuses on compute, orchestration, and real-time correctness to solve the hardest data engineering challenges in ML, from fraud detection to powering modern AI agents with features and embeddings.

Aug 06, 2025

How to look at your data — Jeff Huber (Chroma) + Jason Liu (567)

A detailed summary of a talk by Jeff Huber (Chroma) and Jason Liu on systematically improving AI applications. The talk covers using fast, inexpensive evaluations for retrieval systems (inputs) and applying structured data analysis and clustering to conversational logs (outputs) to derive actionable product insights.

Jul 29, 2025

Layering every technique in RAG, one query at a time - David Karam, Pi Labs (fmr. Google Search)

David Karam, formerly of Google Search, presents a pragmatic framework for enhancing RAG systems, advocating a "quality engineering" approach. The talk progresses through a ladder of techniques, from in-memory retrieval and BM25 to custom embeddings, re-ranking, and advanced orchestration, emphasizing that the choice of technique should be driven by empirical analysis of system failures ("loss analysis") and balanced by a "complexity-adjusted impact" mindset.

Jul 24, 2025

Inside GPT – The Maths Behind the Magic • Alan Smith • GOTO 2024

A deep dive into the internal workings of Large Language Models like GPT, explaining the journey from a text prompt through tokenization, embeddings, and the attention mechanism to generate a response.
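
The attention step mentioned in the summary above computes softmax(QK^T / sqrt(d)) V: each token's query is compared with every key, the scaled scores are normalized into weights, and the output is the weighted average of the value vectors. Below is a minimal single-head sketch in plain Python. It is not taken from the talk. The names `softmax` and `attention` and the `causal` flag are illustrative assumptions, and production models compute this as batched matrix multiplications across many heads. The `causal` option applies the mask that GPT-style decoders use so that each position attends only to itself and earlier positions.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    q: list[list[float]],
    k: list[list[float]],
    v: list[list[float]],
    causal: bool = False,
) -> list[list[float]]:
    """Single-head scaled dot-product attention (illustrative sketch).

    q: n_q x d queries, k: n_k x d keys, v: n_k x d_v values.
    If causal is True (self-attention, n_q == n_k), query i may only
    attend to keys j <= i; later keys get a score of -inf.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)  # keeps dot products from growing with d
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: future position
            else:
                scores.append(scale * sum(a * b for a, b in zip(qi, kj)))
        weights = softmax(scores)
        # Weighted average of the value rows, one output column at a time.
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 0.0], [0.0, 1.0]]
    print(attention(q, k, v))
    print(attention(q, k, v, causal=True))
```

With identity values, each output row is simply that query's attention weights, which makes it easy to see that the first token under the causal mask can only attend to itself.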