Retrieval quality
Does recall find the right memory? Measured with reproducible
benchmarks - LoCoMo recall@10 ≈ 94.5%, multilingual top-10 ≈ 99.2%.
Outcome impact
Does using memory change outcomes? Measured by Earned Memory, joining
each surfaced lesson to the outcome of the turn it was active in.
Earned Memory - three honest layers
PMB joins each surfaced lesson to the turn’s outcome (tests pass/fail, red→green, build, deploy - no LLM) and reports effectiveness at three levels of rigor, refusing to overclaim at each one.
Associational lift (weakest)
success_rate(lesson active) minus success_rate(no lesson). Useful first
look, but confounded: lessons surface on harder turns, so a helpful
lesson can show negative lift. A flag for review, never ground truth.
Statistical honesty
Each lesson carries a 95% Wilson confidence interval and a conservative
verdict: useful/harmful only when the CI clears the baseline and
n ≥ min_n; otherwise unverified or insufficient. An n=1 fluke can
never read as a real effect.
What PMB will not do
Run it on your own data
Seeing “signal: insufficient” early is the honest answer, not a bug -
outcome turns are rare, so a young workspace simply hasn’t earned a verdict
yet. A lesson only earns “useful”/“helps” once the statistics back it.
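To make the lift and verdict rules above concrete, here is a minimal Python sketch. It is not PMB's implementation: the function names, the default min_n of 20, and the reading of "insufficient" as n below min_n and "unverified" as a CI that still overlaps the baseline are all assumptions made for illustration. The baseline is taken to be the success rate of turns where no lesson was active.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_interval(successes, n, z=Z_95):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is the whole range
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def associational_lift(active_successes, active_n, inactive_successes, inactive_n):
    """success_rate(lesson active) minus success_rate(no lesson).

    Returns None when either group is empty. This is the weakest layer:
    it is confounded, so treat it as a flag for review only.
    """
    if active_n == 0 or inactive_n == 0:
        return None
    return active_successes / active_n - inactive_successes / inactive_n


def verdict(successes, n, baseline, min_n=20):
    """Conservative verdict for one lesson (hypothetical mapping).

    Assumed here: 'insufficient' means n < min_n, and 'unverified'
    means the 95% CI still overlaps the baseline.
    """
    if n < min_n:
        return "insufficient"
    lo, hi = wilson_interval(successes, n)
    if lo > baseline:
        return "useful"
    if hi < baseline:
        return "harmful"
    return "unverified"


if __name__ == "__main__":
    # A single success can never read as a real effect.
    print(verdict(1, 1, baseline=0.5))      # insufficient
    # 45/50 successes against a 0.6 baseline: the CI clears the baseline.
    print(verdict(45, 50, baseline=0.6))    # useful
    # 32/50 looks better than 0.6, but the CI still overlaps it.
    print(verdict(32, 50, baseline=0.6))    # unverified
```

Under these assumptions a young workspace returns "insufficient" for almost every lesson until enough outcome turns accumulate, which matches the behaviour described above.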