Standard NLP evaluation metrics measure one thing: accuracy. They are silent on cost. In production AI systems, both dimensions matter — a system that is accurate but burns 10,000 tokens per query is not viable at scale.
F1 score (harmonic mean of precision and recall) tells you how accurate a system's answers are. It does not tell you what those answers cost. A system with F1 = 0.60 using 300 tokens per query is fundamentally different from a system with F1 = 0.60 using 3,000 tokens per query — but F1 treats them identically.
BLEU (bilingual evaluation understudy) and ROUGE (recall-oriented understudy for gisting evaluation) measure n-gram overlap between generated and reference text. Neither has a concept of token efficiency. They are appropriate for translation and summarization quality; they are not designed for retrieval system evaluation where cost is a first-class constraint.
In enterprise AI, the decision is not "which system is most accurate?" — it is "which system delivers the most accuracy per dollar of compute?" That question requires a metric with two dimensions: accuracy and cost. No standard metric captured this before RDS.
The gap: Teams routinely compare systems on F1 alone, deploy the most accurate one, and then discover it is 10× more expensive to run than the alternatives. RDS prevents this mistake by making cost visible in the comparison itself.
The formula is intentionally simple: RDS = F1 / mean tokens per query. Both inputs are measurable from any benchmark run, and the result is a single number that encodes both quality and efficiency.
System    F1      Tokens  RDS         vs CKG
--------  ------  ------  ----------  ----------
CKG       0.4709  269     0.001751    baseline
RAG       0.1231  2,982   0.0000413   42× lower
GraphRAG  0.1780  ~3,960  ~0.0000449  ~39× lower

Source: Yarmoluk & McCreary, arXiv 2026
Benchmark: 45 domains · 7,928 queries · fully reproducible
Note that GraphRAG achieves higher F1 than standard RAG (0.178 vs. 0.123) but uses significantly more tokens. The RDS for GraphRAG is comparable to RAG — the accuracy gain is consumed by the token cost increase. CKG outperforms both by approximately 39–42×.
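As a minimal sketch, the RDS column above can be recomputed from the reported F1 and token figures, assuming RDS is F1 divided by mean tokens per query (the form used in the worked calculation later in this article). The function and variable names are illustrative, and the GraphRAG inputs are the approximate values reported in the table.

```python
def rds(f1: float, mean_tokens: float) -> float:
    """Retrieval Density Score, assumed here to be F1 / mean tokens per query."""
    if mean_tokens <= 0:
        raise ValueError("mean_tokens must be positive")
    return f1 / mean_tokens


# (F1, mean tokens per query) as reported in the table; GraphRAG is approximate.
systems = {
    "CKG": (0.4709, 269),
    "RAG": (0.1231, 2982),
    "GraphRAG": (0.1780, 3960),
}

scores = {name: rds(f1, tokens) for name, (f1, tokens) in systems.items()}
baseline = scores["CKG"]

for name, score in scores.items():
    # "CKG advantage" is how many times higher the CKG score is than this row.
    print(f"{name:<9} RDS={score:.7f}  CKG advantage={baseline / score:.1f}x")
```

Running this on your own systems only requires swapping in your measured F1 and mean token counts, provided they come from the same query set.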
Full benchmark data: github.com/Yarmoluk/ckg-benchmark
RDS penalizes both failure modes simultaneously: a system that is inaccurate scores lower, and a system that is accurate but verbose scores lower. Only a system that is both accurate and compact achieves a high RDS.
Of the four combinations of accurate or inaccurate and compact or verbose, only one is sustainable for a production AI retrieval system: accurate and compact. RDS is the metric that identifies it.
The benchmark compares three retrieval architectures across the same 45-domain, 7,928-query test set. Every result is reproducible.
GraphRAG's higher F1 relative to RAG is largely offset by its higher token usage, so the two land at comparable RDS (~0.0000449 vs. 0.0000413). Adding graph structure to RAG retrieval improves accuracy modestly but barely moves efficiency. CKG improves both, because it replaces retrieval entirely with pre-structured context.
RDS shifts the evaluation question from "which system is most accurate?" to "which system delivers the most accuracy per token spent?" These are different questions with different answers.
Without RDS, teams compare systems on F1, BLEU, or human evaluation. They pick the most accurate system and discover later that it is too expensive to run at production scale. Or they pick a cheap system that turns out to be inaccurate. RDS surfaces both failure modes upfront.
With RDS, you can immediately answer: does this system's accuracy improvement justify its token cost increase? If GraphRAG improves F1 from 0.123 to 0.178 (1.45×) but increases token cost from 2,982 to roughly 3,960 (1.33×), RDS improves by only about 1.09× (1.45 / 1.33), which is barely worth the architectural complexity. That conclusion is only visible with RDS.
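The comparison can be checked with a few lines of arithmetic. This sketch assumes RDS is F1 divided by mean tokens per query, so the change in RDS equals the F1 ratio divided by the token ratio. It uses the unrounded F1 values from the benchmark table (0.1231 and 0.1780), and the variable names are illustrative.

```python
# Reported values: F1 and mean tokens per query. GraphRAG figures are approximate.
rag_f1, rag_tokens = 0.1231, 2982
graphrag_f1, graphrag_tokens = 0.1780, 3960

f1_gain = graphrag_f1 / rag_f1               # how much accuracy improves
token_growth = graphrag_tokens / rag_tokens  # how much per-query cost grows
rds_gain = f1_gain / token_growth            # net change in F1 per token

print(f"F1 gain:      {f1_gain:.2f}x")
print(f"Token growth: {token_growth:.2f}x")
print(f"RDS gain:     {rds_gain:.2f}x")
```

Splitting the change into an accuracy ratio and a cost ratio makes it easy to see which side of the trade is doing the work when two architectures are compared.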
Teams can set a minimum acceptable RDS threshold based on their production token budget and accuracy requirements — and only consider architectures that meet it. This is analogous to cost-per-acquisition in performance marketing: a single number that encodes both the benefit (accuracy) and the cost (tokens).
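One way to turn this into a screening step is sketched below. It assumes RDS is F1 divided by mean tokens per query, and it derives the threshold as the RDS of a system sitting exactly at a minimum F1 and a maximum token budget. That derivation, the `Candidate` class, and the requirement values (F1 of at least 0.40, at most 1,000 tokens per query) are hypothetical choices for illustration, not something the benchmark prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    f1: float
    mean_tokens: float

    @property
    def rds(self) -> float:
        # Assumed form of the metric: F1 per token.
        return self.f1 / self.mean_tokens


def min_rds(min_f1: float, max_tokens_per_query: float) -> float:
    """One possible threshold: the RDS of a system exactly at both limits."""
    return min_f1 / max_tokens_per_query


# Hypothetical requirements, chosen only for illustration.
threshold = min_rds(min_f1=0.40, max_tokens_per_query=1000)

candidates = [
    Candidate("CKG", 0.4709, 269),
    Candidate("RAG", 0.1231, 2982),
    Candidate("GraphRAG", 0.1780, 3960),  # approximate reported figures
]

# Keep only the architectures whose RDS meets the threshold.
shortlist = [c.name for c in candidates if c.rds >= threshold]
print(f"threshold={threshold:.6f} shortlist={shortlist}")
```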
The business case framing: If your team runs 100,000 queries per month and your current system has RDS = 0.0000413 (RAG), switching to a system with RDS = 0.001751 (CKG) means you get 42× more accuracy per token spent, or 81% token savings at equivalent accuracy. Same RDS math, two ways to communicate the value.
Retrieval Density Score was introduced by Daniel Yarmoluk and Dan McCreary in the arXiv preprint "Compact Knowledge Graphs vs. RAG and GraphRAG: A Reproducible Benchmark Across 45 Educational Domains" (2026).
Dan McCreary is a former Senior Distinguished Engineer at UnitedHealth Group / Optum, former Head of AI at TigerGraph, and patent holder (US 11,204,950 — knowledge graph systems). He built one of the world's largest healthcare knowledge graphs at UHG and served as co-author and benchmark collaborator on the RDS paper.
The benchmark covers 12,261 nodes and 19,626 edges across 45 domains, with 7,928 evaluation queries. It is fully reproducible — the methodology, data, and evaluation code are published.
Calculating RDS requires a fixed benchmark query set, an accuracy measurement, and token counting. Here is the step-by-step process.
Your system results:
  F1 score (across 200 queries): 0.35
  Mean tokens per query: 800
  RDS = 0.35 / 800 = 0.000438

Benchmark calibration:
  CKG:      0.001751   (4.0× above your system)
  Yours:    0.000438
  RAG:      0.0000413  (10.6× below your system)
  GraphRAG: ~0.0000449

Interpretation: Your system is significantly more efficient than RAG but still 4× below CKG. Halving tokens to about 400 and doubling F1 to about 0.70 together would reach CKG range.
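The same calculation can be sketched end to end from per-query results. This example assumes a token-overlap F1 between each predicted and reference answer as the accuracy measure, and it takes the token count per query as a number you supply (for example from your API usage logs). The benchmark's own scoring and token accounting may differ, and the function names and sample records are made up for illustration.

```python
from collections import Counter
from statistics import mean


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def retrieval_density_score(results):
    """results: list of (prediction, reference, tokens_used) per query.

    Returns (mean F1, mean tokens per query, RDS).
    """
    mean_f1 = mean(token_f1(pred, ref) for pred, ref, _ in results)
    mean_tokens = mean(tokens for _, _, tokens in results)
    return mean_f1, mean_tokens, mean_f1 / mean_tokens


# Made-up sample records; replace with your fixed benchmark query set.
sample = [
    ("Paris", "Paris", 250),
    ("the mitochondria", "mitochondria", 300),
    ("1945", "1939", 350),
]

mean_f1, mean_tokens, score = retrieval_density_score(sample)
print(f"F1={mean_f1:.4f} tokens={mean_tokens:.0f} RDS={score:.6f}")
```

Keep the query set and the token-counting rule fixed across every system you compare; otherwise the resulting scores are not comparable.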
Related pages: What Is a Compact Knowledge Graph? — the architecture behind CKG's 0.001751 RDS. How to Reduce LLM Token Costs — the dollar math behind the RDS advantage.
Bring your domain, your query set, or your current benchmark numbers. We will calculate your system's RDS and show you exactly what a CKG would deliver on the same queries.