PulseAugur

AI memory benchmark revised after initial metric found to be misleading

The developer of Bastra Recall, an MIT-licensed MCP memory server for Claude, has revised the project's benchmark after concluding that the initial figure of 98.3% recall was a tautology. The new benchmark uses six distinct persona agents, each generating paraphrased queries for 30 memories, to simulate real-world usage in which users describe the same situation differently over time. On these paraphrased queries, lexical search alone reached 63.1% recall; adding local embeddings raised that to 79.6%, with the largest gains on queries that differed from the stored memory in language or experience level. Trigger phrases and write-time paraphrases produced no measurable lift on these harder queries, which the developer takes to mean the remaining gap lies in ranking rather than retrieval.
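To make the difference between the two kinds of test concrete, here is a minimal sketch of a recall@k measurement over verbatim and paraphrased queries. It is not the Bastra Recall benchmark: the memories, the queries, the token-overlap scorer, and every function name are hypothetical, and it assumes the "tautology" means the original queries reused the stored wording, which the summary implies but does not spell out.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a toy stand-in for a real lexical index."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_score(query: str, doc: str) -> int:
    """Number of distinct query tokens that also appear in the document."""
    return len(tokens(query) & tokens(doc))

def search(query: str, memories: dict[str, str], k: int) -> list[str]:
    """Return the ids of the k memories with the highest overlap score."""
    ranked = sorted(memories,
                    key=lambda mid: lexical_score(query, memories[mid]),
                    reverse=True)
    return ranked[:k]

def recall_at_k(queries: list[tuple[str, str]],
                memories: dict[str, str], k: int) -> float:
    """Fraction of (query, target_id) pairs whose target is in the top k."""
    hits = sum(1 for q, target in queries if target in search(q, memories, k))
    return hits / len(queries)

# Hypothetical memory store (illustrative data only).
memories = {
    "m1": "Use pnpm instead of npm in the monorepo",
    "m2": "Deploy staging with the blue green script every Friday",
    "m3": "Postgres connection pool limit is twenty per service",
}

# Verbatim queries: each query is the stored text itself.
verbatim = [(text, mid) for mid, text in memories.items()]

# Paraphrased queries: same intent, different wording.
paraphrased = [
    ("which package manager do we use in the repo", "m1"),
    ("how do I ship a build to the test environment in the monorepo", "m2"),
    ("how many database connections can each service open", "m3"),
]

print("verbatim recall@1:   ", recall_at_k(verbatim, memories, k=1))
print("paraphrased recall@1:", recall_at_k(paraphrased, memories, k=1))
```

On this toy data the verbatim set scores perfectly because each query shares every token with its target, while the second paraphrase is pulled toward the wrong memory by incidental word overlap. The numbers carry no weight of their own; the sketch only shows why a score measured on stored wording says little about retrieval under paraphrase.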

IMPACT Highlights the importance of robust, real-world benchmarking for AI memory systems, especially when dealing with paraphrased or varied user inputs.

RANK_REASON Developer's revised benchmark of an open-source AI memory tool.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. dev.to (MCP tag) · TIER_1 · English (EN) · Daniel Nevoigt

    My AI memory benchmark said 98.3%. The number was true — and worthless.

    In my last post I introduced Bastra Recall, an MIT-licensed MCP memory server that gives Claude persistent memory as plain Markdown in a local Obsidian vault. I promised a follow-up on retrieval and benchmarking. Here it is. It starts with me being wrong. The 98.3…