Everyone building memory for AI assumes they need embeddings. We did too. Then we benchmarked it.
FTS5, SQLite’s built-in full-text search, hits 95.40% R@5 on LongMemEval-S. Adding a 384-dimension ONNX vector pipeline via weighted fusion dropped that to 82.40%. We ran this on 500 questions from a public ICLR 2025 dataset. The numbers are independently verifiable in about 10 to 15 minutes.
One more surprise: when we fused FTS5 and vectors via max-fusion instead of weighted average, performance stayed exactly at 95.40%. Diffing the per-question hit sets shows that Mode A and Mode B return the same 477 question IDs in their R@5 hit sets. Vectors contributed zero additional top-5 hits over FTS5 alone.
| Mode | Description | R@5 | Elapsed |
|---|---|---|---|
| A | FTS5 only | 95.40% | 10s |
| B | FTS5 + ONNX MiniLM-L6 (max fusion) | 95.40% | ~25 min |
| C | FTS5 + ONNX (60/40 weighted) | 82.40% | ~13 min |
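The hit-set comparison between Mode A and Mode B can be sketched as a simple set difference. This is a minimal illustration, not the benchmark's own code: the function name `diffHitSets` and the input shape (one array of question IDs per mode) are assumptions, and the IDs below are toy values.

```javascript
// Hypothetical input: for each mode, the IDs of the questions whose answer
// session appeared in that mode's top 5.
function diffHitSets(hitsA, hitsB) {
  const a = new Set(hitsA);
  const b = new Set(hitsB);
  return {
    onlyInA: [...a].filter((id) => !b.has(id)), // hits that B lost
    onlyInB: [...b].filter((id) => !a.has(id)), // hits that B gained
    shared: [...a].filter((id) => b.has(id)),
  };
}

// Toy data, not taken from the benchmark results.
const modeAHits = ["q1", "q2", "q4"];
const modeBHits = ["q4", "q1", "q2"];
const hitDiff = diffHitSets(modeAHits, modeBHits);
console.log(hitDiff.onlyInA.length, hitDiff.onlyInB.length, hitDiff.shared.length);
```

Two modes with identical hit sets produce empty `onlyInA` and `onlyInB` arrays, which is the situation described for Modes A and B.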
memesh ships as a 300 KB npm package. It runs entirely on the user’s machine — no cloud round-trip, no API key, no vector index. Mode A is within 1.2 percentage points of MemPalace’s vendor-reported ceiling of 96.6%, which uses a vector + reranker stack.
How we tested it
LongMemEval simulates a real scenario: you’ve been talking to an AI for weeks, and you ask a question that depends on something said months ago. Each of the 500 questions comes with a haystack of about 50 past conversation sessions — some are the user’s own, some are generic public Q&A distractors (ultrachat_*, sharegpt_*).
The metric is R@5: what fraction of questions have the correct answer session in the top-5 ranked retrievals. Dataset SHA256: 08d8dad4be43ee2049a22ff5674eb86725d0ce5ff434cde2627e5e8e7e117894.
This is a retrieval test, not an end-to-end LLM test. We’re measuring whether the memory layer surfaces the right session. What an LLM does with it afterwards is out of scope.
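As a rough sketch of the metric, R@5 can be computed as follows. The record shape (`ranked`, `answerIds`) and the function name `recallAtK` are hypothetical, and the sketch assumes a question counts as a hit when any of its answer sessions appears in the top k.

```javascript
// Each record pairs a ranked list of retrieved session IDs with the IDs of
// the sessions that contain the answer (field names are assumptions).
function recallAtK(records, k = 5) {
  if (records.length === 0) return 0;
  const hits = records.filter(({ ranked, answerIds }) =>
    ranked.slice(0, k).some((id) => answerIds.includes(id))
  ).length;
  return hits / records.length;
}

// Toy data: the answer is ranked 2nd, ranked 6th, and missing entirely.
const toyRecords = [
  { ranked: ["s1", "ans_a", "s2", "s3", "s4"], answerIds: ["ans_a"] },
  { ranked: ["s1", "s2", "s3", "s4", "s5", "ans_b"], answerIds: ["ans_b"] },
  { ranked: ["s1", "s2"], answerIds: ["ans_c"] },
];
console.log(recallAtK(toyRecords, 5)); // 1 hit out of 3 questions
```

On the benchmark itself, 477 hits out of 500 questions gives the reported 95.40%.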
Mode A pipeline — the whole thing:
- Index every haystack session into a fresh SQLite database. Each session becomes one entity; full text stored as a single observation, truncated at 8,000 characters. FTS5 builds the full-text index with unicode61 remove_diacritics tokenization.
- Tokenize the question. Strip non-alphanumeric characters, drop tokens ≤ 2 characters, take up to 20 tokens, OR-join as quoted terms.
- FTS5 MATCH with the default BM25 ranker. Take up to 20 top results.
- Score by rank position: score = 1 − (rank / n_results).
- Return ranked list, compute R@5.
No embeddings. No API calls. No vector store.
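The query-building and rank-scoring steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code in the repository: the function names are hypothetical, stripped characters are replaced with spaces, and rank is taken as zero-based so that the top result scores 1.0.

```javascript
// Build an FTS5 MATCH string: strip non-alphanumerics, drop short tokens,
// cap the token count, and OR-join the rest as quoted terms.
function buildFtsQuery(question, maxTokens = 20) {
  return question
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // assumption: replace with a space
    .split(/\s+/)
    .filter((token) => token.length > 2) // drop tokens of 2 characters or fewer
    .slice(0, maxTokens)
    .map((token) => `"${token}"`)
    .join(" OR ");
}

// Convert a ranked list of IDs into position-based scores:
// score = 1 - (rank / n_results), with rank assumed to start at 0.
function rankScores(rankedIds) {
  const n = rankedIds.length;
  return rankedIds.map((id, rank) => ({ id, score: 1 - rank / n }));
}

console.log(buildFtsQuery("What degree did I graduate with?"));
// → "What" OR "degree" OR "did" OR "graduate" OR "with"
console.log(rankScores(["s1", "s2", "s3", "s4"]).map((r) => r.score));
```

The resulting string would be passed as the right-hand side of an FTS5 MATCH expression; the database layer is left out here to keep the sketch dependency-free.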
Why FTS5 alone gets 95%
The common assumption — “you need vectors for semantic search” — is right for cross-vocabulary retrieval and wrong for personal memory.
In a public corpus, asking “how do I treat anxiety?” might need to match “managing panic disorder.” Different vocabulary, same intent. Vectors bridge that gap.
In personal memory, the haystack is the user’s own past writing. The vocabulary they use to ask the question now is statistically very close to the vocabulary they used to record the memory then. BM25 on keywords is, surprisingly often, the right tool.
Here’s a concrete example from the dataset:
- Question (e47becba): “What degree did I graduate with?”
- Haystack: 54 sessions covering productivity apps, hiking, dating, side projects, plus 26 generic public Q&A distractors.
- Answer session (answer_280352e9, haystack index 52): an 8-turn conversation about task-management apps. The degree appears once, casually: “I graduated with a degree in Business Administration, which has definitely helped me in my new role.” The rest of the session is about Todoist, Trello, meal-prep.
- Mode A result: answer session ranks 2nd. Top-5 returned: 02bd2b90_3, answer_280352e9, sharegpt_Cr2tc1f_0, ultrachat_214101, f6859b48_2.
FTS5 surfaced a single keyword match in a long off-topic session and ranked it above 52 of the 53 other sessions, including distractor sessions that discuss education in general. That’s the lexical fingerprint advantage: the user’s specific word “graduated” is a much stronger signal than semantic similarity to “higher education.”
Where it still fails — the 23 questions (4.6%) cluster clearly:
- Temporal reasoning (8 fails): “What’s the order of the three trips I took in the past three months?” FTS5 ranks by lexical relevance; it has no concept of timestamps.
- Multi-session aggregation (7 fails): “How many different doctors did I visit?” These require combining evidence across multiple sessions, not retrieving one answer.
- Single-session preference (5 fails): The relevant past session uses domain vocabulary that the current question doesn’t echo directly.
- Other (3 fails): Two abstention questions, one knowledge-update case.
About two-thirds of the failures (15 of 23) are temporal reasoning or multi-session aggregation. The fix there is not a better embedder but a different retrieval mode: timestamp-aware ranking, or a graph/aggregation layer above raw retrieval.
Why adding vectors made it worse
Mode C is the more interesting result. The intuition: “FTS5 gets 95.40%; vectors give an independent signal; blending them 60/40 should match or beat 95.40%.” Empirically: 82.40%. Down 13 percentage points.
The root cause: the LongMemEval haystack includes generic public Q&A sessions (ultrachat_*, sharegpt_*) that have high cosine similarity to almost any question. They’re broad chatbot training conversations — full of generic vocabulary that overlaps semantically with everything. They are not the user’s personal memory.
Mode A correctly ranks them low: they don’t share the user’s specific keywords with the question.
Mode B (max-fusion) preserves FTS5 dominance: if FTS5 ranks the correct answer high, the vector signal can’t displace it. Hit set is identical to Mode A.
Mode C (weighted average) dilutes the strong FTS5 signal with the weak vector signal. The generic Q&A distractors get pushed up, displacing personal sessions.
The lesson: fusion mechanism matters more than feature inclusion. A weak signal added via max-fusion is harmless. The same signal via weighted average is actively harmful when the haystack contains adversarial distractors. This isn’t a memesh-specific finding — it’s a general architecture lesson for any retrieval system built on top of personal data.
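A toy comparison makes the difference between the two fusion rules concrete. The scores below are made up for illustration and are not benchmark data. The sketch assumes max-fusion means taking the larger of the two scores and that the 60/40 blend weights FTS5 at 0.6 and vectors at 0.4; the benchmark's actual normalization may differ.

```javascript
// Made-up scores in [0, 1] for four sessions; not benchmark data.
const candidates = [
  { id: "personal_a", fts: 1.0, vec: 0.2 },
  { id: "answer", fts: 0.8, vec: 0.1 },
  { id: "distractor_1", fts: 0.6, vec: 0.75 },
  { id: "distractor_2", fts: 0.55, vec: 0.78 },
];

// Assumed fusion rules (see the caveats in the text above this block).
const maxFusion = ({ fts, vec }) => Math.max(fts, vec);
const weightedFusion = ({ fts, vec }) => 0.6 * fts + 0.4 * vec;

// Rank candidates by a fused score and return the top-k IDs.
function topK(items, scoreFn, k) {
  return items
    .map((item) => ({ id: item.id, score: scoreFn(item) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.id);
}

console.log("max:", topK(candidates, maxFusion, 2));
console.log("weighted:", topK(candidates, weightedFusion, 2));
```

With these numbers, max-fusion keeps the answer session in the top 2, while the weighted blend lets a distractor with a moderate FTS5 score and a high vector score overtake it. Note that max-fusion only protects the FTS5 ordering while vector similarities stay below the FTS5 scores of the top results, which holds in this toy data.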
We published the Mode C regression alongside the positive results, not because it’s embarrassing, but because the regression is the finding.
What this means for building memory
For personal-memory retrieval — one user, their own conversation history, low thousands of entities:
Start with FTS5. It’s already in your SQLite. No embedding pipeline, no vector store, no API costs at write time. It will likely get you to 90%+ R@5 on personal memory.
Add vectors as a tie-breaker, not a primary signal. Max-fusion with FTS5 dominance is safe. Weighted blending is not — not when your haystack contains anything that didn’t come from the specific user.
Be skeptical of vendor benchmarks without methodology. MemPalace’s 96.6% is a self-report on a possibly-cleaned dataset variant. Supermemory’s ~82% is a vendor estimate. The Mem0 and Zep numbers are published in the original LongMemEval paper on the same dataset variant we use, so those two are directly comparable. The others are not.
| System | R@5 | Source |
|---|---|---|
| MemPalace | 96.6% | Vendor self-report; uses reranker; possibly on longmemeval-cleaned |
| memesh (Mode A) | 95.40% | This benchmark, longmemeval_s |
| Supermemory | ~82% | Vendor estimate |
| Zep | 63.8% | LongMemEval paper, arXiv:2410.10813, longmemeval_s |
| Mem0 | 49.0% | LongMemEval paper, arXiv:2410.10813, longmemeval_s |
For corpus-scale retrieval — millions of entries, many users, cross-language — the trade-off shifts. Vector search becomes essential. memesh isn’t trying to replace pgvector or Pinecone. It’s making a different bet: for personal memory, the simpler architecture is the better one.
What the 95.40% doesn’t include
The benchmark runs only the retrieval pipeline. The production plugin includes features that are off in the benchmark to keep comparisons apples-to-apples:
- Multi-factor scoring — weighting recency, frequency, confidence, and impact alongside relevance.
- LLM query expansion (“Smart Mode”) — synonym-expands the question before the FTS5 query.
- Cross-entity graph traversal — for queries that require following relationships between entities.
- Auto-decay — old or unused memories fade in ranking without being deleted.
Production R@5 is likely higher than 95.40%. We haven’t isolated those features in a published number and won’t claim them here.
Reproduce in 8 steps
npm install takes 30–60 seconds depending on network speed. The full process including dataset download runs in about 10–15 minutes.
git clone https://github.com/PCIRCLE-AI/memesh-llm-memory.git
cd memesh-llm-memory
git checkout bench/longmemeval-public-r1
npm install
curl -L "https://huggingface.co/datasets/xiaowu0162/longmemeval/resolve/main/longmemeval_s" \
-o /tmp/longmemeval_s.json
shasum -a 256 /tmp/longmemeval_s.json # expect 08d8dad4be43ee2049a22ff5674eb86725d0ce5ff434cde2627e5e8e7e117894
node benchmarks/longmemeval/run.mjs --mode A --dataset /tmp/longmemeval_s.json
node -e "const d=require('./benchmarks/longmemeval/results/mode-A-2026-05-03T12-31-26.json'); console.log('R@5:', (d.overall_metrics.r_at_5 * 100).toFixed(2) + '%')"
Expected output: R@5: 95.40%. Modes B and C take longer (~25 min and ~13 min respectively) because they download and run the ONNX MiniLM-L6 model.
Repository: github.com/PCIRCLE-AI/memesh-llm-memory
Benchmark branch: bench/longmemeval-public-r1
License: MIT
Further questions
- How does the FTS5-only configuration scale beyond ~50 sessions per haystack to ~5,000+ entities per user?
- Would a smaller, domain-tuned embedder (rather than 384-dim MiniLM-L6) actually contribute additional top-5 hits, or is the FTS5 ceiling structural for personal memory?