Benchmark · LongMemEval-S (N=500) + agent A/B

97.6% recall on LongMemEval.

Two axes, measured separately. Retrieval: 97.6% R@5 on LongMemEval (hybrid + reranker, chunk granularity, N=500), fully local — zero search cost, zero API calls, full data sovereignty. Agent impact: on local-knowledge tasks, agents with PLUR win 89% of decided contests across Haiku, Sonnet, and Opus — house rules 12–0, every model.

Retrieval — recall on LongMemEval

LongMemEval tests whether a memory system retrieves the right evidence for questions about past conversations, across six categories. We report Recall@K at chunk granularity — the fraction of questions whose ground-truth evidence lands in the top K.
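A minimal sketch of the Recall@K metric described above. It assumes a question counts as a hit when at least one ground-truth chunk appears in the top K; the function name, data shapes, and chunk ids are illustrative and are not taken from the plur-bench harness.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    evidence: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one ground-truth chunk in the top k."""
    if not evidence:
        raise ValueError("no questions to score")
    hits = 0
    for qid, gold in evidence.items():
        top_k = ranked.get(qid, [])[:k]
        if gold.intersection(top_k):
            hits += 1
    return hits / len(evidence)


# Toy data: three questions with hypothetical chunk ids.
ranked = {
    "q1": ["c7", "c2", "c9"],
    "q2": ["c4", "c1", "c3"],
    "q3": ["c8", "c5", "c6"],
}
evidence = {"q1": {"c2"}, "q2": {"c3"}, "q3": {"c0"}}

print(recall_at_k(ranked, evidence, k=2))  # only q1 is a hit
print(recall_at_k(ranked, evidence, k=3))  # q1 and q2 are hits
```

Recall can only stay flat or rise as K grows, which is why every R@10 figure is at least as high as the matching R@5 figure.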

Stack	R@5	Notes
PLUR hybrid + reranker	97.6%	fully local cross-encoder — no API
PLUR hybrid (openai-3-large)	97.0%	optional cloud embedder
PLUR BM25 only	92.2%	no embedder — fully airgapped

97.6% R@5 on LongMemEval, fully local — zero API calls. Retrieval recall (finding the right memory) and end-to-end answer accuracy are different axes; PLUR reports them separately and never conflates them.

Per-category recall

Category	R@5	R@10
Single-session assistant	100.0%	100.0%
Knowledge updates	98.7%	100.0%
Single-session user	98.6%	100.0%
Multi-session	97.7%	99.2%
Temporal reasoning	94.0%	97.0%
Single-session preference	93.3%	96.7%

Per-category figures from the openai-3-large control run (hybrid, chunk granularity), produced by the reproducible plur-bench harness on LongMemEval-S (N=500).

The test

We gave the same task to Claude with and without memory. Without it, the agent got house rules right 10–38% of the time, depending on the model. With PLUR: 12–0. Every model. Every run.

Not general intelligence. Not coding ability. Just: can your agent apply knowledge that only exists in your organization's memory? Tag conventions, file routing, deployment servers, which of your 100 tools handles trading. The answer is either in an engram or nowhere.

19 decisive scenarios · 3 Claude models · 35 decided contests · 31 wins · 4 losses · 89% win rate

Local knowledge benchmark

We tested 28 scenarios across Haiku 4.5, Sonnet 4.6, and Opus 4.5. Each scenario runs the same prompt through the agent twice: once with PLUR memory, once vanilla. Ties are removed, so only decisive contests count. 19 scenarios produced clear winners. Across all models, 35 of 57 contests were decided; the remaining 22 were ties.

Knowledge type	PLUR wins	Losses	Win rate	What it tests
House rules	12	0	12–0	Project conventions, tag formats, file placement
Tool routing	10	2	83%	Finding the right tool among 100+ options
Past experience	4	0	4–0	API quirks, debugging insights, infrastructure
Learned style	5	2	71%	Communication tone, design preferences
General tasks	0	0	—	Zero penalty (control group)
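The win rates in the table can be recomputed from the win and loss counts. This is a small sketch, not harness code; the `win_rate` helper is a name introduced here for illustration.

```python
# Wins and losses per knowledge type, copied from the table above.
records = {
    "House rules": (12, 0),
    "Tool routing": (10, 2),
    "Past experience": (4, 0),
    "Learned style": (5, 2),
    "General tasks": (0, 0),
}


def win_rate(wins: int, losses: int):
    """Win rate over decided contests; None when nothing was decided."""
    decided = wins + losses
    return wins / decided if decided else None


total_wins = sum(w for w, _ in records.values())
total_losses = sum(l for _, l in records.values())

for name, (w, l) in records.items():
    rate = win_rate(w, l)
    label = "n/a" if rate is None else f"{rate:.0%}"
    print(f"{name}: {w}W / {l}L -> {label}")

overall = win_rate(total_wins, total_losses)
print(f"Overall: {total_wins}W / {total_losses}L -> {overall:.0%}")
```

The totals come to 31 wins and 4 losses, which is the 89% headline figure.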

House rules: 12–0 across every model, every run. Without memory, agents guessed right 10–38% of the time. With it: zero losses. When your agent needs to know how things work here — tag conventions, file routing, DIP format, deployment patterns — memory is the difference between guessing and knowing.

Per-model breakdown

PLUR helps every model, but for different reasons. Cheaper models cannot explore — memory gives them navigation. Expensive models can explore — memory gives them things they cannot discover.

Model	Win rate	Record	Notes
Haiku 4.5	90%	9W / 1L	Cheapest model benefits most
Sonnet 4.6	91%	10W / 1L	Most popular coding model
Opus 4.5	86%	12W / 2L	Most capable model

PLUR vs vanilla by category and model

Category	Haiku (PLUR / vanilla)	Advantage	Sonnet (PLUR / vanilla)	Advantage	Opus (PLUR / vanilla)	Advantage
House rules	12–0 / 10%	10.0x	12–0 / 26%	3.9x	12–0 / 38%	2.7x
Tool routing	83% / 39%	2.1x	83% / 39%	2.1x	59% / 42%	1.4x
Past experience	28% / 28%	1.0x	64% / 34%	1.9x	44% / 28%	1.6x
Learned style	56% / 51%	1.1x	61% / 63%	1.0x	71% / 50%	1.4x

The pattern: smarter models guess better on house rules (Haiku 10%, Opus 38%) but none come close to reliable. Memory isn't a reasoning crutch — it's information the model literally cannot infer.
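The advantage column reads as the PLUR score divided by the vanilla score. The sketch below recomputes it for the rows given as percentages, on that assumption. House rules are left out because the PLUR side is reported as a 12–0 record rather than a percentage.

```python
def advantage(plur_pct: float, vanilla_pct: float) -> float:
    """Ratio of the PLUR score to the vanilla score, rounded to one decimal."""
    return round(plur_pct / vanilla_pct, 1)


# (PLUR %, vanilla %) pairs copied from the table above.
rows = {
    ("Tool routing", "Haiku"): (83, 39),
    ("Tool routing", "Opus"): (59, 42),
    ("Past experience", "Sonnet"): (64, 34),
    ("Past experience", "Opus"): (44, 28),
    ("Learned style", "Sonnet"): (61, 63),
    ("Learned style", "Opus"): (71, 50),
}

for (category, model), (plur, vanilla) in rows.items():
    print(f"{category} / {model}: {advantage(plur, vanilla)}x")
```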

The cost equation, reframed

The cheapest model with memory outperforms the most expensive without it.

Haiku 4.5 + PLUR

0.80 avg on discoverability. Cost: ~$1/run. The smallest, cheapest model — but it knows what tools exist because PLUR tells it.

Opus 4.5 alone

0.31 avg on discoverability. Cost: ~$10/run. The most capable model available — but it can't discover tools it has never seen.

Haiku with PLUR scores 2.6x higher on discoverability at roughly one-tenth the cost. Instead of spending more on a bigger model, spend less on memory.
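The two ratios follow directly from the figures quoted for each setup. A short check, treating the approximate per-run costs as exact for the sake of the arithmetic:

```python
# Average discoverability score and approximate cost per run (USD), as quoted above.
haiku_score, haiku_cost = 0.80, 1.0
opus_score, opus_cost = 0.31, 10.0

score_ratio = haiku_score / opus_score  # how much higher Haiku + PLUR scores
cost_ratio = opus_cost / haiku_cost     # how much more Opus alone costs per run

print(f"score ratio: {score_ratio:.1f}x, cost ratio: {cost_ratio:.0f}x")
```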

Methodology

19 decisive scenarios across 3 Claude models, giving 35 decided contests. We run the same prompt through the agent twice:

Run A — Agent with PLUR memory: engrams, CLAUDE.md context, MCP tools, loaded modules
Run B — Agent alone: vanilla setup, no MCP servers, no memory, no context beyond the prompt

Same model. Same prompt. Different context. The only variable is the context the agent starts with: Run A gets the full memory setup, Run B gets nothing beyond the prompt.

Scoring

Each scenario is scored two ways:

Deterministic checks — does the output contain specific strings, match patterns, avoid incorrect answers? Ground truth, not subjective.
LLM-as-judge — a separate Claude instance scores both outputs blind. Order is randomized. Each judgment is repeated three times and averaged to reduce noise.
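The two scoring modes could be sketched as follows. This is an illustration under stated assumptions, not the harness implementation: `deterministic_check`, `blind_judge`, and the `Judge` signature are hypothetical names, and `stub_judge` is a toy stand-in for the separate Claude instance.

```python
import random
import re
from statistics import mean
from typing import Callable, List, Optional, Tuple

# Hypothetical judge signature: scores two outputs in the order they are shown.
Judge = Callable[[str, str], Tuple[float, float]]


def deterministic_check(
    output: str,
    must_contain: List[str],
    must_match: List[str],
    must_not_contain: List[str],
) -> bool:
    """Pass only if every required string and pattern appears and no forbidden string does."""
    return (
        all(s in output for s in must_contain)
        and all(re.search(p, output) is not None for p in must_match)
        and not any(s in output for s in must_not_contain)
    )


def blind_judge(
    plur_out: str,
    vanilla_out: str,
    judge: Judge,
    repeats: int = 3,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Average judge scores over `repeats` calls, randomizing presentation order each time."""
    rng = rng or random.Random()
    plur_scores, vanilla_scores = [], []
    for _ in range(repeats):
        if rng.random() < 0.5:
            p, v = judge(plur_out, vanilla_out)
        else:
            v, p = judge(vanilla_out, plur_out)
        plur_scores.append(p)
        vanilla_scores.append(v)
    return mean(plur_scores), mean(vanilla_scores)


def stub_judge(first: str, second: str) -> Tuple[float, float]:
    """Toy judge: rewards outputs that use the expected tag format."""
    return float("#project/" in first), float("#project/" in second)


plur_out = "Filed under #project/alpha"
vanilla_out = "Filed under alpha"

print(deterministic_check(plur_out, ["#project/alpha"], [r"#project/\w+"], ["#misc"]))
print(deterministic_check(vanilla_out, ["#project/alpha"], [r"#project/\w+"], ["#misc"]))
print(blind_judge(plur_out, vanilla_out, stub_judge, rng=random.Random(0)))
```

Because the judge never learns which output came from which run, swapping the order only changes where each score lands in the returned tuple, and the function maps it back before averaging.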

Isolation

Baseline isolation is critical. Claude Code walks up the directory tree looking for CLAUDE.md files — running the baseline inside the project would leak context. Baseline runs execute from /tmp/ with a single-line CLAUDE.md. Our first attempt missed this, and all early results were invalid.
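One way to guard against this kind of leak is to build the baseline directory programmatically and check which CLAUDE.md files are visible from it. The sketch below is an assumption about how that could look, not the harness code: the helper names and the placeholder file contents are hypothetical, and `tempfile` uses the system temp directory, which may not be /tmp/ on every platform.

```python
import tempfile
from pathlib import Path
from typing import List


def find_context_files(start: Path, name: str = "CLAUDE.md") -> List[Path]:
    """Return every `name` file found in `start` or any ancestor directory."""
    start = start.resolve()
    return [d / name for d in (start, *start.parents) if (d / name).is_file()]


def make_baseline_dir(one_liner: str = "Baseline run.\n") -> Path:
    """Create a fresh temp directory holding a single-line CLAUDE.md (placeholder text)."""
    workdir = Path(tempfile.mkdtemp(prefix="baseline-"))
    (workdir / "CLAUDE.md").write_text(one_liner, encoding="utf-8")
    return workdir


workdir = make_baseline_dir()
found = find_context_files(workdir)

# Only the file just written should be visible; anything else would leak context.
leaks = [p for p in found if p.parent != workdir.resolve()]
print("baseline dir:", workdir)
print("leaked context files:", leaks)
```

Running the same check from inside the project directory would list the project's own CLAUDE.md, which is the leak described above.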

What we learned building this

The enriched schema was everything.

Making BM25 and embeddings see entity names, temporal dates, and rationale text delivered +43 percentage points on LongMemEval. This is a PLUR-unique advantage — generic search engines index raw text; PLUR indexes knowledge-enriched text.
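To illustrate the idea, the sketch below builds indexable text from a record that carries entity names, dates, and rationale alongside the raw text. Everything here is hypothetical: the `Engram` fields, the sample record, and the crude token-overlap score (a stand-in for lexical matching, not BM25) are assumptions, since the actual PLUR schema is not described in this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Engram:
    # Hypothetical fields; the real PLUR schema may differ.
    text: str
    entities: List[str] = field(default_factory=list)
    dates: List[str] = field(default_factory=list)
    rationale: str = ""


def enriched_text(e: Engram) -> str:
    """Concatenate raw text with entity names, dates, and rationale for indexing."""
    parts = [e.text, " ".join(e.entities), " ".join(e.dates), e.rationale]
    return "\n".join(p for p in parts if p)


def term_overlap(query: str, doc: str) -> int:
    """Count distinct query tokens that appear in the document."""
    doc_tokens = set(doc.lower().split())
    return sum(1 for t in set(query.lower().split()) if t in doc_tokens)


# Toy record: the raw text never names the thing it is about.
e = Engram(
    text="We moved it to the new box last week.",
    entities=["staging", "server", "frankfurt"],
    dates=["2024-03-11"],
    rationale="old host ran out of disk",
)
query = "staging server frankfurt"

print(term_overlap(query, e.text))            # raw text only
print(term_overlap(query, enriched_text(e)))  # enriched text
```

The raw sentence shares no tokens with the query, while the enriched version matches all three, which is the kind of gap enrichment is meant to close.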

The answering model is the bottleneck, not retrieval.

At 97.6% R@5, the correct memory is almost always in the top results. How well an agent then answers is a separate axis — it depends on the model reasoning over that context. PLUR provides retrieval; the model provides reasoning.

The cheapest model with memory beats the most expensive without it.

Haiku 4.5 at ~$1/run with PLUR (0.80 avg) outperforms Opus 4.5 at ~$10/run without it (0.31 avg) on discoverability. This reframes the cost equation: instead of paying roughly 10x for a bigger model, pay roughly one-tenth as much for a smaller model with memory.

Static docs compete with dynamic memory.

Our context file was 572 lines of facts that overpowered engram recall. We cut it to 207 lines of instructions and moved facts to engrams. Scores improved across the board. The lesson: context files should teach the agent how to use the system, not dump everything the system knows.

Weaker models benefit from navigation. Stronger models benefit from recall.

Haiku cannot explore — memory gives it navigation (2.2x on inferable scenarios). Opus can explore — memory gives it things it cannot discover (1.7x on memory-only scenarios). PLUR serves both ends of the intelligence spectrum.

What we don't claim

An honest benchmark is one you can trust. Here is what our data does and does not support:

89% is a win rate, not an accuracy score. It measures decisive contests with ties excluded. 22 of 57 total contests across all models were ties — reported as inconclusive, not ignored.
97.6% is retrieval recall (R@5), not answer accuracy. It measures whether the right evidence is retrieved, on LongMemEval-S at chunk granularity (N=500). Whether the agent then answers correctly is a separate axis.
Retrieval and answering are different axes. PLUR provides retrieval; the model reasons over what's retrieved. We never conflate the two or compare across them.
The A/B bench tests the full system, not just PLUR. Datacore includes CLAUDE.md context and module loading alongside PLUR memory, so the A/B results reflect that whole setup rather than PLUR in isolation.
Zero penalty on general tasks refers to scores, not speed. Cold-task scores are identical (0.90 vs 0.90), but memory adds ~2x latency from MCP initialization and engram injection.
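The relationship between the win rate and the tie count can be checked from the reported totals. A short sketch using only figures stated in this document:

```python
# Figures reported in this document.
wins, losses, ties = 31, 4, 22

decided = wins + losses
total = decided + ties

win_rate_decided = wins / decided  # ties excluded: the headline figure
tie_share = ties / total           # share of contests with no clear winner

print(f"decided contests: {decided} of {total}")
print(f"win rate over decided contests: {win_rate_decided:.0%}")
print(f"tie share of all contests: {tie_share:.0%}")
```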

Confidence levels

Claim	Confidence	Caveat
89% A/B win rate	High	35 decided contests across 3 models. Consistent 86–91% per model.
12–0 house rules	Very high	12/12 wins, zero losses, across Haiku/Sonnet/Opus.
97.6% R@5 (retrieval)	High	LongMemEval-S, N=500, chunk granularity. Reproducible via the plur-bench harness.
Cost reframe (Haiku > Opus)	High	Same benchmark, same scenarios, same scoring.

Try it yourself

# clone the repo
git clone https://github.com/plur-ai/plur
cd plur/bench
# list all scenarios
python run.py --list
# run deterministic checks only (fast, no API cost)
python run.py --deterministic-only
# run a specific category
python run.py --category discoverability
# full run with LLM-judge
python run.py

Add your own scenarios as YAML files in scenarios/<category>/. The harness handles execution, scoring, and reporting automatically.