Benchmark · LongMemEval-S (N=500) + agent A/B

97.6% recall on LongMemEval.

Two axes, measured separately. Retrieval: 97.6% R@5 on LongMemEval (hybrid + reranker, chunk granularity, N=500), fully local — zero search cost, zero API calls, full data sovereignty. Agent impact: on local-knowledge tasks, agents with PLUR win 89% of decided contests across Haiku, Sonnet, and Opus — house rules 12–0, every model.

Retrieval — recall on LongMemEval

LongMemEval tests whether a memory system retrieves the right evidence for questions about past conversations, across six categories. We report Recall@K at chunk granularity — the fraction of questions whose ground-truth evidence lands in the top K.
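
As a rough illustration of the metric, the sketch below computes Recall@K over a handful of made-up questions. It assumes a question counts as a hit when at least one ground-truth chunk appears in the top K; the function name, the chunk ids, and that "any gold chunk" rule are assumptions for illustration, not the plur-bench implementation.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one gold chunk in the top k results."""
    hits = 0
    for qid, gold_chunks in gold.items():
        top_k = ranked.get(qid, [])[:k]
        # Assumption: one matching gold chunk is enough to count as a hit.
        if any(chunk_id in gold_chunks for chunk_id in top_k):
            hits += 1
    return hits / len(gold) if gold else 0.0


# Hypothetical rankings and gold labels for three questions.
ranked = {
    "q1": ["c7", "c2", "c9"],
    "q2": ["c4", "c5", "c1"],
    "q3": ["c8", "c3", "c6"],
}
gold = {"q1": {"c2"}, "q2": {"c1"}, "q3": {"c0"}}

print(recall_at_k(ranked, gold, k=2))  # one of three questions hits
print(recall_at_k(ranked, gold, k=3))  # two of three questions hit
```

Raising K can only keep or increase the score, which is why the R@10 column is never lower than R@5.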

Stack | R@5 | Notes
PLUR hybrid + reranker | 97.6% | fully local cross-encoder, no API
PLUR hybrid (openai-3-large) | 97.0% | optional cloud embedder
PLUR BM25 only | 92.2% | no embedder, fully airgapped
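
The page does not say how the hybrid stack combines its BM25 and embedding rankings. One common approach is reciprocal rank fusion (RRF), sketched below purely as an assumption about what "hybrid" could look like; the function name, the constant k=60, and the chunk ids are illustrative, and the reranking step is left out.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a chunk's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical top-3 lists from a lexical and a dense retriever.
bm25_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still stays in the candidate list for a later reranker.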

97.6% R@5 on LongMemEval, fully local — zero API calls. Retrieval recall (finding the right memory) and end-to-end answer accuracy are different axes; PLUR reports them separately and never conflates them.

Per-category recall

Category | R@5 | R@10
Single-session assistant | 100.0% | 100.0%
Knowledge updates | 98.7% | 100.0%
Single-session user | 98.6% | 100.0%
Multi-session | 97.7% | 99.2%
Temporal reasoning | 94.0% | 97.0%
Single-session preference | 93.3% | 96.7%

Per-category figures from the openai-3-large control run (hybrid, chunk granularity), produced by the reproducible plur-bench harness on LongMemEval-S (N=500).

The test

We gave the same task to Claude twice: once with memory, once without. Without memory, the agent got house rules right 10–38% of the time, depending on the model. With PLUR: 12–0. Every model. Every run.

Not general intelligence. Not coding ability. Just: can your agent apply knowledge that only exists in your organization's memory? Tag conventions, file routing, deployment servers, which of your 100 tools handles trading. The answer is either in an engram or nowhere.

19 decisive scenarios · 3 Claude models · 31 wins · 4 losses · 89% win rate

Local knowledge benchmark

28 scenarios tested across Haiku 4.5, Sonnet 4.6, and Opus 4.5. Each scenario runs the same prompt through the agent twice: once with PLUR memory, once vanilla. Ties are removed, so only decisive contests count. 19 scenarios produced clear winners, for 35 decided contests across the three models (31 wins, 4 losses).

Knowledge type | PLUR wins | Losses | Win rate | What it tests
House rules | 12 | 0 | 12–0 | Project conventions, tag formats, file placement
Tool routing | 10 | 2 | 83% | Finding the right tool among 100+ options
Past experience | 4 | 0 | 4–0 | API quirks, debugging insights, infrastructure
Learned style | 5 | 2 | 71% | Communication tone, design preferences
General tasks | 0 | 0 | - | Zero penalty (control group)

House rules: 12–0 across every model, every run. Without memory, agents guessed right 10–38% of the time. With it: zero losses. When your agent needs to know how things work here — tag conventions, file routing, DIP format, deployment patterns — memory is the difference between guessing and knowing.
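
The headline win rate can be rechecked from the per-category counts in the table above. The sketch below is a minimal tally, assuming the win rate is wins divided by decided contests (wins plus losses) with ties excluded; the helper name is ours, not part of the harness.

```python
from typing import Dict, Optional, Tuple

# (PLUR wins, losses) per knowledge type, copied from the table above.
RESULTS: Dict[str, Tuple[int, int]] = {
    "House rules": (12, 0),
    "Tool routing": (10, 2),
    "Past experience": (4, 0),
    "Learned style": (5, 2),
    "General tasks": (0, 0),
}


def win_rate(wins: int, losses: int) -> Optional[float]:
    """Wins over decided contests; None when nothing was decided."""
    decided = wins + losses
    return wins / decided if decided else None


total_wins = sum(wins for wins, _ in RESULTS.values())
total_losses = sum(losses for _, losses in RESULTS.values())
overall = win_rate(total_wins, total_losses)

print(f"{total_wins}W / {total_losses}L = {overall:.0%}")  # 31W / 4L = 89%
```

Returning None for the control group keeps a category with no decided contests from showing up as a misleading 0%.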

Per-model breakdown

PLUR helps every model, but for different reasons. Cheaper models cannot explore — memory gives them navigation. Expensive models can explore — memory gives them things they cannot discover.

Model | Win rate | Record | Notes
Haiku 4.5 | 90% | 9W / 1L | Cheapest model benefits most
Sonnet 4.6 | 91% | 10W / 1L | Most popular coding model
Opus 4.5 | 86% | 12W / 2L | Most capable model

PLUR vs vanilla by category and model

Category | Haiku PLUR/Van. | Adv. | Sonnet PLUR/Van. | Adv. | Opus PLUR/Van. | Adv.
House rules | 12–0 / 10% | 10.0x | 12–0 / 26% | 3.9x | 12–0 / 38% | 2.7x
Tool routing | 83% / 39% | 2.1x | 83% / 39% | 2.1x | 59% / 42% | 1.4x
Past experience | 28% / 28% | 1.0x | 64% / 34% | 1.9x | 44% / 28% | 1.6x
Learned style | 56% / 51% | 1.1x | 61% / 63% | 1.0x | 71% / 50% | 1.4x

The pattern: smarter models guess better on house rules (Haiku 10%, Opus 38%) but none come close to reliable. Memory isn't a reasoning crutch — it's information the model literally cannot infer.
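
The Adv. columns appear to be the PLUR score divided by the vanilla score. The sketch below recomputes the tool routing row under that assumption; the function name is ours, and because the inputs are the rounded percentages from the table, a recomputed ratio can differ from a published figure in the last digit.

```python
from typing import Dict, Tuple


def advantage(plur_rate: float, vanilla_rate: float) -> float:
    """Ratio of the PLUR score to the vanilla score."""
    if vanilla_rate <= 0:
        raise ValueError("vanilla rate must be positive")
    return plur_rate / vanilla_rate


# Tool routing row: (PLUR rate, vanilla rate) per model, from the table above.
tool_routing: Dict[str, Tuple[float, float]] = {
    "Haiku": (0.83, 0.39),
    "Sonnet": (0.83, 0.39),
    "Opus": (0.59, 0.42),
}

for model, (plur, vanilla) in tool_routing.items():
    print(f"{model}: {advantage(plur, vanilla):.1f}x")
```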

The cost equation, reframed

The cheapest model with memory outperforms the most expensive without it.

Haiku 4.5 + PLUR

0.80 avg on discoverability. Cost: ~$1/run. The smallest, cheapest model — but it knows what tools exist because PLUR tells it.

Opus 4.5 alone

0.31 avg on discoverability. Cost: ~$10/run. The most capable model available — but it can't discover tools it has never seen.

Haiku with PLUR scores 2.6x higher at roughly one-tenth the cost. Instead of spending more on a bigger model, spend less on memory.
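
The two ratios in this comparison follow directly from the figures quoted above. A small worked check, treating the approximate per-run costs (~$1 and ~$10) as exact for the sake of the arithmetic:

```python
# Discoverability averages and approximate per-run costs quoted above.
haiku_plur = {"score": 0.80, "cost_usd": 1.0}
opus_alone = {"score": 0.31, "cost_usd": 10.0}

score_ratio = haiku_plur["score"] / opus_alone["score"]
cost_ratio = opus_alone["cost_usd"] / haiku_plur["cost_usd"]

print(f"score: {score_ratio:.1f}x higher, cost: {cost_ratio:.0f}x lower")
```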

Methodology

19 decisive scenarios across 3 Claude models. We run the same prompt through the agent twice: once with PLUR memory, once vanilla.

Same model. Same prompt. Different context. The only variable is whether the agent has access to persistent memory.

Scoring

Each scenario is scored two ways: with deterministic checks and with an LLM judge.

Isolation

Baseline isolation is critical. Claude Code walks up the directory tree looking for CLAUDE.md files — running the baseline inside the project would leak context. Baseline runs execute from /tmp/ with a single-line CLAUDE.md. Our first attempt missed this, and all early results were invalid.
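
One way to guard against this kind of leak is to build the baseline directory programmatically and then walk up from it, looking for any CLAUDE.md other than the stub. The sketch below assumes that approach; the helper names and the stub text are hypothetical and not taken from the plur-bench harness.

```python
import tempfile
from pathlib import Path
from typing import List


def find_context_files(start: Path, name: str = "CLAUDE.md") -> List[Path]:
    """Return every `name` file in `start` or any of its ancestor directories."""
    start = start.resolve()
    return [d / name for d in (start, *start.parents) if (d / name).is_file()]


def make_isolated_workdir(note: str = "Baseline run.") -> Path:
    """Create a fresh temp directory holding only a single-line CLAUDE.md."""
    workdir = Path(tempfile.mkdtemp(prefix="baseline-"))
    (workdir / "CLAUDE.md").write_text(note + "\n", encoding="utf-8")
    return workdir


workdir = make_isolated_workdir()
found = find_context_files(workdir)

# Anything other than the stub we just wrote would leak context into the baseline.
leaks = [p for p in found if p.parent != workdir.resolve()]
print("leaking context files:", leaks)
```

Running the same check from inside a project checkout would list that project's CLAUDE.md, which is the failure mode described above.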

What we learned building this

The enriched schema was everything.

Making BM25 and embeddings see entity names, temporal dates, and rationale text delivered +43 percentage points on LongMemEval. This is a PLUR-unique advantage — generic search engines index raw text; PLUR indexes knowledge-enriched text.
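
To make the idea concrete, the sketch below builds the indexed text from a statement plus its entity names, dates, and rationale, then compares term overlap with a query. The field names, the sample engram, and the overlap measure are all assumptions for illustration; the actual PLUR schema and scoring are not shown on this page.

```python
import re
from typing import Any, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens, as a stand-in for a lexical index."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def enriched_text(engram: Dict[str, Any]) -> str:
    """Concatenate the raw statement with entity names, dates, and rationale."""
    parts: List[str] = [str(engram["statement"])]
    parts.extend(engram.get("entities", []))
    parts.extend(engram.get("dates", []))
    parts.append(str(engram.get("rationale", "")))
    return " ".join(p for p in parts if p)


# Hypothetical engram: the raw statement never names the service it is about.
engram = {
    "statement": "Deploy it to the staging box first.",
    "entities": ["Acme API", "staging-01"],
    "dates": ["2024-03-12"],
    "rationale": "Production rollbacks are slow.",
}

query = tokenize("Where does the Acme API get deployed?")
raw_overlap = query & tokenize(engram["statement"])
enriched_overlap = query & tokenize(enriched_text(engram))

print(sorted(raw_overlap), sorted(enriched_overlap))
```

In this toy case the raw statement shares only a stopword with the query, while the enriched text also matches on the entity name, which is the kind of signal a lexical or embedding retriever can then use.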

The answering model is the bottleneck, not retrieval.

At 97.6% R@5, the correct memory is almost always in the top results. How well an agent then answers is a separate axis — it depends on the model reasoning over that context. PLUR provides retrieval; the model provides reasoning.

The cheapest model with memory beats the most expensive without it.

Haiku 4.5 at ~$1/run with PLUR (0.80 avg) outperforms Opus 4.5 at ~$10/run without it (0.31 avg) on discoverability. This reframes the cost equation: instead of paying roughly 10x for a bigger model, pay roughly a tenth as much for a smaller model with memory.

Static docs compete with dynamic memory.

Our context file was 572 lines of facts that overpowered engram recall. We cut it to 207 lines of instructions and moved facts to engrams. Scores improved across the board. The lesson: context files should teach the agent how to use the system, not dump everything the system knows.

Weaker models benefit from navigation. Stronger models benefit from recall.

Haiku cannot explore — memory gives it navigation (2.2x on inferable scenarios). Opus can explore — memory gives it things it cannot discover (1.7x on memory-only scenarios). PLUR serves both ends of the intelligence spectrum.

What we don't claim

An honest benchmark is one you can trust. Here is what our data does and does not support:

Confidence levels

Claim | Confidence | Caveat
89% A/B win rate | High | 35 decided contests across 3 models. Consistent 86–91% per model.
12–0 house rules | Very high | 12/12 wins, zero losses, across Haiku/Sonnet/Opus.
97.6% R@5 (retrieval) | High | LongMemEval-S, N=500, chunk granularity. Reproducible via the plur-bench harness.
Cost reframe (Haiku > Opus) | High | Same benchmark, same scenarios, same scoring.

Try it yourself

# clone the repo
git clone https://github.com/plur-ai/plur
cd plur/bench
# list all scenarios
python run.py --list
# run deterministic checks only (fast, no API cost)
python run.py --deterministic-only
# run specific category
python run.py --category discoverability
# full run with LLM-judge
python run.py

Add your own scenarios as YAML files in scenarios/<category>/. The harness handles execution, scoring, and reporting automatically.