Benchmarking

Aug 25, 2025

Using LongMemEval to Improve Agent Memory

Sam Bhagwat of Mastra details the team's process for optimizing AI agent memory using the LongMemEval benchmark. He breaks memory down into subtasks such as temporal reasoning and knowledge updates, and shares how targeted improvements (tailored templates, targeted data updates, and structured message formatting) led to state-of-the-art performance, emphasizing the importance of iterative evaluation.

Jul 27, 2025

Strategies for LLM Evals (GuideLLM, lm-eval-harness, OpenAI Evals Workshop) — Taylor Jordan Smith

Traditional benchmarks and leaderboards are insufficient for production AI. This summary details a practical, multi-layered evaluation strategy, moving from foundational system performance to factual accuracy and finally to safety and bias, using open-source tools like GuideLLM, lm-eval-harness, and Promptfoo.
