LongMemEval tests whether a memory system retrieves the right evidence for questions about past conversations, across six categories. We report Recall@K at chunk granularity — the fraction of questions whose ground-truth evidence lands in the top K.
| Stack | R@5 | Notes |
|---|---|---|
| PLUR hybrid + reranker | 97.6% | fully local cross-encoder — no API |
| PLUR hybrid (openai-3-large) | 97.0% | optional cloud embedder |
| PLUR BM25 only | 92.2% | no embedder — fully airgapped |
97.6% R@5 on LongMemEval, fully local, with zero API calls. Retrieval recall (finding the right memory) and end-to-end answer accuracy are different axes; PLUR reports them separately.
| Category | R@5 | R@10 |
|---|---|---|
| Single-session assistant | 100.0% | 100.0% |
| Knowledge updates | 98.7% | 100.0% |
| Single-session user | 98.6% | 100.0% |
| Multi-session | 97.7% | 99.2% |
| Temporal reasoning | 94.0% | 97.0% |
| Single-session preference | 93.3% | 96.7% |
Per-category figures from the openai-3-large control run (hybrid, chunk granularity), produced by the reproducible plur-bench harness on LongMemEval-S (N=500).
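For readers who want to check the metric itself, here is a minimal sketch of Recall@K as defined above. It assumes each question has a set of ground-truth chunk IDs and a ranked list of retrieved chunk IDs, and that a question counts as a hit when at least one ground-truth chunk appears in the top K. That hit criterion, the function name `recall_at_k`, and the toy data are illustrative assumptions, not the plur-bench implementation.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]], k: int) -> float:
    """Fraction of questions with at least one gold chunk in the top-k results."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have one entry per question")
    if not ranked:
        return 0.0
    hits = sum(1 for r, g in zip(ranked, gold) if g & set(r[:k]))
    return hits / len(ranked)


# Toy data (made up for illustration): three questions, hypothetical chunk IDs.
ranked = [["c1", "c7", "c3"], ["c9", "c2", "c4"], ["c5", "c6", "c8"]]
gold = [{"c3"}, {"c2"}, {"c0"}]

r_at_1 = recall_at_k(ranked, gold, k=1)
r_at_3 = recall_at_k(ranked, gold, k=3)
print(f"R@1={r_at_1:.3f}  R@3={r_at_3:.3f}")
```

Raising K can only add hits, never remove them, which is why every R@10 figure in the table is at least as high as the matching R@5.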
We gave the same task to Claude, with and without memory. Without it, the agent got house rules right 10–38% of the time, depending on model. With PLUR: 12–0. Every model. Every run.
Not general intelligence. Not coding ability. Just: can your agent apply knowledge that only exists in your organization's memory? Tag conventions, file routing, deployment servers, which of your 100 tools handles trading. The answer is either in an engram or nowhere.
28 scenarios tested across Haiku 4.5, Sonnet 4.6, and Opus 4.5. Each scenario runs the same prompt through the agent twice: once with PLUR memory, once vanilla. Ties removed — only decisive contests count. 19 scenarios produced clear winners.
| Knowledge type | PLUR wins | Losses | Win rate | What it tests |
|---|---|---|---|---|
| House rules | 12 | 0 | 100% | Project conventions, tag formats, file placement |
| Tool routing | 10 | 2 | 83% | Finding the right tool among 100+ options |
| Past experience | 4 | 0 | 100% | API quirks, debugging insights, infrastructure |
| Learned style | 5 | 2 | 71% | Communication tone, design preferences |
| General tasks | 0 | 0 | — | Zero penalty (control group) |
House rules: 12–0 across every model, every run. Without memory, agents guessed right 10–38% of the time. With it: zero losses. When your agent needs to know how things work here — tag conventions, file routing, DIP format, deployment patterns — memory is the difference between guessing and knowing.
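The win rates follow directly from the win and loss counts once ties are dropped. The sketch below recomputes them from the table's figures; the helper name `win_rate` and the choice to report "n/a" when no contest was decided are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

# (wins, losses) per knowledge type, copied from the table above.
results: Dict[str, Tuple[int, int]] = {
    "House rules": (12, 0),
    "Tool routing": (10, 2),
    "Past experience": (4, 0),
    "Learned style": (5, 2),
    "General tasks": (0, 0),
}


def win_rate(wins: int, losses: int) -> Optional[float]:
    """Win rate over decisive contests only; None when nothing was decided."""
    decided = wins + losses
    return wins / decided if decided else None


per_category = {name: win_rate(w, l) for name, (w, l) in results.items()}
total_wins = sum(w for w, _ in results.values())
total_losses = sum(l for _, l in results.values())
overall = win_rate(total_wins, total_losses)

for name, rate in per_category.items():
    label = "n/a" if rate is None else f"{rate:.0%}"
    print(f"{name}: {label}")
print(f"Overall: {total_wins}W / {total_losses}L = {overall:.0%}")
```

Summed over all categories this gives 31 wins and 4 losses, or 89% over 35 decided contests.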
PLUR helps every model, but for different reasons. Cheaper models cannot explore — memory gives them navigation. Expensive models can explore — memory gives them things they cannot discover.
| Model | Win rate | Record | Notes |
|---|---|---|---|
| Haiku 4.5 | 90% | 9W / 1L | Cheapest model benefits most |
| Sonnet 4.6 | 91% | 10W / 1L | Most popular coding model |
| Opus 4.5 | 86% | 12W / 2L | Most capable model |
| Category | Haiku (PLUR / vanilla) | Advantage | Sonnet (PLUR / vanilla) | Advantage | Opus (PLUR / vanilla) | Advantage |
|---|---|---|---|---|---|---|
| House rules | 100% / 10% | 10.0x | 100% / 26% | 3.9x | 100% / 38% | 2.7x |
| Tool routing | 83% / 39% | 2.1x | 83% / 39% | 2.1x | 59% / 42% | 1.4x |
| Past experience | 28% / 28% | 1.0x | 64% / 34% | 1.9x | 44% / 28% | 1.6x |
| Learned style | 56% / 51% | 1.1x | 61% / 63% | 1.0x | 71% / 50% | 1.4x |
The pattern: smarter models guess better on house rules (Haiku 10%, Opus 38%) but none come close to reliable. Memory isn't a reasoning crutch — it's information the model literally cannot infer.
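The advantage column appears to be the success rate with PLUR divided by the vanilla success rate. A short sketch under that assumption, using a few cells from the table above (the function name `advantage` is hypothetical):

```python
def advantage(plur_rate: float, vanilla_rate: float) -> float:
    """Ratio of the success rate with memory to the success rate without it."""
    if vanilla_rate <= 0:
        raise ValueError("vanilla_rate must be positive")
    return plur_rate / vanilla_rate


# (PLUR rate, vanilla rate) copied from the table above.
rows = {
    ("Tool routing", "Haiku"): (0.83, 0.39),
    ("Tool routing", "Opus"): (0.59, 0.42),
    ("Past experience", "Sonnet"): (0.64, 0.34),
    ("Learned style", "Opus"): (0.71, 0.50),
}
ratios = {key: round(advantage(p, v), 1) for key, (p, v) in rows.items()}

for (category, model), ratio in ratios.items():
    print(f"{category} / {model}: {ratio}x")
```

Ratios recomputed from rounded percentages can differ in the last digit from the published column. For example, 100% / 26% gives about 3.8x against the listed 3.9x, presumably because the published figures were computed from unrounded rates.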
The cheapest model with memory outperforms the most expensive without it.
Haiku 4.5 with PLUR: 0.80 avg on discoverability. Cost: ~$1/run. The smallest, cheapest model, but it knows what tools exist because PLUR tells it.
Opus 4.5 without PLUR: 0.31 avg on discoverability. Cost: ~$10/run. The most capable model available, but it can't discover tools it has never seen.
Haiku with PLUR scores 2.6x higher on discoverability at roughly one-tenth the cost. Instead of spending more on a bigger model, spend less on memory.
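The arithmetic behind that comparison, as a small sketch. The scores and per-run costs are the approximate figures quoted above; the variable names are illustrative.

```python
# Figures quoted above; costs are approximate per-run values.
haiku_with_plur = {"score": 0.80, "cost_usd": 1.0}
opus_vanilla = {"score": 0.31, "cost_usd": 10.0}

score_ratio = haiku_with_plur["score"] / opus_vanilla["score"]
cost_ratio = opus_vanilla["cost_usd"] / haiku_with_plur["cost_usd"]

print(f"score ratio: {score_ratio:.1f}x, cost ratio: {cost_ratio:.0f}x")
```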
19 scenarios with clear winners, 35 decided contests across 3 Claude models. We run the same prompt through the agent twice: once with PLUR memory, once vanilla.
Same model. Same prompt. Different context. The only variable is whether the agent has access to persistent memory.
Each scenario is scored two ways:
Baseline isolation is critical. Claude Code walks up the directory tree looking for CLAUDE.md files — running the baseline inside the project would leak context. Baseline runs execute from /tmp/ with a single-line CLAUDE.md. Our first attempt missed this, and all early results were invalid.
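One way to guard against this kind of leak is to create the baseline directory programmatically and check which CLAUDE.md files are visible from it. The sketch below is an illustration, not the harness's own code: the helper names and the stub's content are assumptions, and it only sets up and inspects the directory without launching the agent.

```python
import tempfile
from pathlib import Path
from typing import List

CONTEXT_FILENAME = "CLAUDE.md"


def ancestor_context_files(start: Path) -> List[Path]:
    """Return every CLAUDE.md found in `start` or any directory above it."""
    start = start.resolve()
    return [d / CONTEXT_FILENAME for d in (start, *start.parents)
            if (d / CONTEXT_FILENAME).is_file()]


def make_baseline_dir(stub: str = "# Baseline run\n") -> Path:
    """Create a fresh temp directory holding only a single-line CLAUDE.md."""
    workdir = Path(tempfile.mkdtemp(prefix="baseline-"))
    (workdir / CONTEXT_FILENAME).write_text(stub, encoding="utf-8")
    return workdir


workdir = make_baseline_dir()
found = ancestor_context_files(workdir)
# Anything other than the stub we just wrote would leak context into the baseline.
leaks = [p for p in found if p.parent != workdir.resolve()]

print("baseline dir:", workdir)
print("leaking context files:", leaks)
```

Failing the run when `leaks` is non-empty would turn the isolation requirement into an enforced check rather than a convention.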
Making BM25 and embeddings see entity names, temporal dates, and rationale text delivered +43 percentage points on LongMemEval. This is a PLUR-unique advantage — generic search engines index raw text; PLUR indexes knowledge-enriched text.
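To make the idea concrete, here is a toy sketch of indexing enriched text with a plain Okapi BM25 scorer. The engram field names (`text`, `entities`, `dates`, `rationale`), the `enrich` helper, and the sample data are all hypothetical; PLUR's actual enrichment and ranking are not shown here, and this example does not reproduce the reported gain.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def enrich(engram: Dict) -> str:
    """Append entity names, dates, and rationale to the raw text before indexing."""
    parts = [engram["text"], *engram.get("entities", []),
             *engram.get("dates", []), engram.get("rationale", "")]
    return " ".join(p for p in parts if p)


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document for the query."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for t in tokenized for term in set(t))
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy engrams, made up for illustration.
engrams = [
    {"text": "We moved it to the new box last week.",
     "entities": ["staging server"],
     "dates": ["2025-03-14"],
     "rationale": "old host ran out of disk"},
    {"text": "Lunch is at noon on Fridays."},
]
query = "staging server move"

raw = bm25_scores(query, [e["text"] for e in engrams])
enriched = bm25_scores(query, [enrich(e) for e in engrams])
print("raw text scores:     ", raw)
print("enriched text scores:", enriched)
```

The raw text never names what "it" refers to, so a keyword query for the staging server scores zero against it. Once the entity name is part of the indexed text, the same query matches.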
At 97.6% R@5, the correct memory is almost always in the top results. How well an agent then answers is a separate axis — it depends on the model reasoning over that context. PLUR provides retrieval; the model provides reasoning.
Haiku 4.5 at ~$1/run with PLUR (0.80 avg) outperforms Opus 4.5 at ~$10/run without it (0.31 avg) on discoverability. This reframes the cost equation: instead of spending 10x on a bigger model, spend 0.1x on memory.
Our context file was 572 lines of facts that overpowered engram recall. We cut it to 207 lines of instructions and moved facts to engrams. Scores improved across the board. The lesson: context files should teach the agent how to use the system, not dump everything the system knows.
Haiku cannot explore — memory gives it navigation (2.2x on inferable scenarios). Opus can explore — memory gives it things it cannot discover (1.7x on memory-only scenarios). PLUR serves both ends of the intelligence spectrum.
An honest benchmark is one you can trust. Here is what our data does and does not support:
| Claim | Confidence | Caveat |
|---|---|---|
| 89% A/B win rate | High | 35 decided contests across 3 models. Consistent 86–91% per model. |
| 12–0 house rules | Very high | 12/12 wins, zero losses, across Haiku/Sonnet/Opus. |
| 97.6% R@5 (retrieval) | High | LongMemEval-S, N=500, chunk granularity. Reproducible via the plur-bench harness. |
| Cost reframe (Haiku > Opus) | High | Same benchmark, same scenarios, same scoring. |
Add your own scenarios as YAML files in scenarios/<category>/. The harness handles execution, scoring, and reporting automatically.
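Before adding scenarios, it can help to see what is already there. The sketch below lists scenario files per category, assuming the `scenarios/<category>/` layout described above and a `.yaml` extension. The function name is hypothetical and the scenario file schema is not shown here; consult the harness for the fields it expects.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def discover_scenarios(root: Path) -> Dict[str, List[Path]]:
    """Group scenario YAML files by their category directory under `root`."""
    by_category: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.glob("*/*.yaml")):
        by_category[path.parent.name].append(path)
    return dict(by_category)


# Prints nothing if the scenarios/ directory does not exist or is empty.
for category, files in discover_scenarios(Path("scenarios")).items():
    print(f"{category}: {len(files)} scenario(s)")
```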