The developer of Bastra Recall, an MIT-licensed memory server for Claude, revised the project's benchmark after realizing that the initial figure of 98.3% recall was a tautology. The new benchmark uses six distinct persona agents to generate paraphrased queries for 30 memories each, simulating real-world usage in which users describe the same situation differently over time. On these paraphrased queries, lexical search alone achieved 63.1% recall, while adding local embeddings raised recall to 79.6%, with the largest gains on queries that differed from the stored memory in language or experience level. Trigger phrases and write-time paraphrases offered no measurable lift on these harder queries, suggesting that the remaining gap is in ranking rather than retrieval.
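To make the metric concrete, here is a minimal sketch of how recall over a paraphrased query set could be computed. It is not Bastra Recall's code: the `recall_at_k` helper, the cutoff `k`, the toy `MEMORIES` store, and the word-overlap `lexical_search` are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical toy memory store: id -> stored text (not from the project).
MEMORIES = {
    "m1": "prefers dark roast coffee in the morning",
    "m2": "deploys the api with docker compose",
}


def lexical_search(query: str) -> List[str]:
    """Rank memory ids by shared-word count (a stand-in for lexical search)."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), mid) for mid, text in MEMORIES.items()]
    # Highest overlap first, ties broken by id; drop memories with no overlap.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [mid for score, mid in ranked if score > 0]


def recall_at_k(
    cases: Iterable[Tuple[str, str]],
    search: Callable[[str], List[str]],
    k: int = 5,
) -> float:
    """Fraction of (query, target_id) cases whose target is in the top k results."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(1 for query, target in cases if target in search(query)[:k])
    return hits / len(cases)


# Queries copied from the stored text versus queries reworded by a "persona".
verbatim = [(text, mid) for mid, text in MEMORIES.items()]
paraphrased = [
    ("what hot drink do I like when I wake up", "m1"),
    ("how do I ship the api using containers", "m2"),
]

print(recall_at_k(verbatim, lexical_search, k=1))
print(recall_at_k(paraphrased, lexical_search, k=1))
```

In this toy setup, queries that reuse the stored wording are always found, while a reworded query that shares no words with its memory is missed, which is the kind of gap a paraphrase-based benchmark is designed to expose.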
IMPACT Highlights the importance of robust, real-world benchmarking for AI memory systems, especially when dealing with paraphrased or varied user inputs.
RANK_REASON Developer's revised benchmark of an open-source AI memory tool.
AI-generated summary · Google Gemini · from 1 source.