Benchmarking Knowledge Retrieval Architectures Across Educational and Commercial Domains: RAG, GraphRAG, and Compact Knowledge Graphs

Daniel Yarmoluk (Graphify.md) · Dan McCreary (Intelligent Textbooks)
v0.6.2 · Pre-print · April 2026 · github.com/Yarmoluk/ckg-benchmark
Abstract

Retrieval-augmented generation (RAG) and graph-based retrieval (GraphRAG) are the dominant paradigms for grounding LLM responses in structured knowledge. Both optimize for recall while treating token cost as a secondary concern. We introduce Compact Knowledge Graphs (CKG) — pre-structured DAG representations with explicit concept taxonomy and pipe-delimited dependency encoding — and present a two-track benchmark comparing CKG, RAG, and GraphRAG across educational and commercial domains.

Track 1 evaluates the McCreary Intelligent Textbook Corpus: 44 hand-curated educational domains spanning STEM, professional, and foundational subjects (7,758 queries). CKG achieves macro-average F1 of 0.4709 versus 0.1231 for RAG and 0.1200 for GraphRAG, at 11× fewer tokens per query (269 vs. 2,982 vs. 3,450). CKG F1 increases with hop depth (reaching 0.77 at hop=5) while RAG remains low and irregular. The compound Reasoning Density Score (RDS) advantage is 42×. T1 entity lookup (CKG 0.207, RAG 0.094) serves as a designed negative control.

Track 2 tests whether the architecture generalizes beyond hand-curated educational data. We build a complete GLP-1/Obesity pharmacology CKG programmatically from the ClinicalTrials.gov API with no expert annotation (170 queries). The pipeline-generated CKG reaches macro F1 0.5298 — exceeding the Track 1 hand-curated average by 12.5% and preserving a 28× RDS advantage over RAG. Automated construction does not degrade retrieval quality; it matches or improves it. Together the two tracks establish CKG as a distinct architecture whose structural advantage is domain-agnostic and does not depend on manual curation.

Benchmark, dataset, and evaluation harness are released at github.com/Yarmoluk/ckg-benchmark under CC BY 4.0 / MIT.

Headline results: 42× RDS advantage (CKG over RAG) · CKG macro F1 0.4709 (Track 1, 44 domains) · 11× fewer tokens per query · CKG F1 0.5298 on Track 2 (GLP-1, pipeline-generated) · 7,928 total benchmark queries (Tracks 1 + 2).

1. Introduction

1.1 Motivation

LLM retrieval quality is typically measured by F1 alone, yet token cost is a first-class production constraint — not a research afterthought. Domain-specific knowledge has latent structure that RAG discards through chunking, and GraphRAG re-derives structure from text at significant computational expense. But what if the structure is already known?

Consider a deployed AI tutoring system serving 10,000 student sessions per day, each requiring 5–10 knowledge retrieval queries. At RAG's typical 3,000–5,000 input tokens per query, daily token consumption exceeds 150M–500M tokens. At Claude Sonnet 4.6 pricing ($3 per 1M input tokens), this yields daily inference costs of $450–$1,500 for retrieval alone. A system that achieves equivalent accuracy at 250–400 tokens per query reduces that cost by a factor of 10–15× — the difference between a viable product and an unsustainable one.
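The arithmetic behind these figures can be checked directly. The short sketch below recomputes daily token volume and cost from the assumptions stated above (10,000 sessions, 5–10 queries per session, 3,000–5,000 input tokens per query, $3 per million input tokens); the inputs are the worked example's assumptions, not benchmark measurements.

# Back-of-envelope retrieval cost under the assumptions stated in the text.
SESSIONS_PER_DAY = 10_000
QUERIES_PER_SESSION = (5, 10)        # low / high end of the assumed range
TOKENS_PER_QUERY = (3_000, 5_000)    # typical RAG input context (assumed)
PRICE_PER_M_INPUT = 3.00             # USD per 1M input tokens (Claude Sonnet 4.6)

for queries, tokens in zip(QUERIES_PER_SESSION, TOKENS_PER_QUERY):
    daily_tokens = SESSIONS_PER_DAY * queries * tokens
    daily_cost = daily_tokens / 1_000_000 * PRICE_PER_M_INPUT
    print(f"{daily_tokens / 1e6:.0f}M tokens/day -> ${daily_cost:,.0f}/day")
# Prints 150M -> $450 and 500M -> $1,500, matching the range quoted above.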

This cost gap is not merely a tuning opportunity; it reflects a structural difference in how knowledge is represented. RAG stores knowledge as prose and retrieves it by embedding similarity, recovering structure only indirectly. GraphRAG re-derives structure from text via LLM entity extraction — at significant additional cost. Compact Knowledge Graphs (CKG) use structure that was authored directly, requiring no derivation step and enabling exact retrieval in tens of tokens instead of thousands.

1.2 The Three Paradigms

Three retrieval architecture workflow diagrams (RAG, GraphRAG, CKG) — build-time and runtime swimlanes. figures/workflow-rag.png · figures/workflow-graphrag.png · figures/workflow-ckg.png Figure available in repository
Figure 1. Three retrieval architectures compared, each shown as a build-time swimlane above a runtime swimlane (chronological, top-to-bottom). RAG chunks text and retrieves top-k by vector similarity (~2,982 tokens/query). GraphRAG extracts an entity graph from text, clusters it into communities, then routes queries local vs. global (~3,450 tokens/query). CKG parses a pre-authored DAG directly and extracts the relevant subgraph per query (~269 tokens/query). The 11× context-size gap is the efficiency differential this paper quantifies.
Table 1. Three knowledge retrieval paradigms compared in this benchmark.
System Knowledge Repr. Retrieval Build Cost
RAG Unstructured text chunks Embedding similarity Embed all chunks
GraphRAG Dynamically extracted graph Graph + community search Full entity extraction
CKG Pre-structured DAG + taxonomy Direct concept/edge lookup Zero (CSV-native)*
* Build cost assumes pre-existing expert DAG. Expert curation cost is not measured; see Section 8.5.

1.3 Falsifiable Claims

We make the following falsifiable claims, each testable against the benchmark results presented in Section 7:

  1. CKG achieves higher F1 on T2 (dependency) and T3 (multi-hop path) queries.
  2. CKG F1 does not degrade with hop depth; RAG F1 degrades significantly.
  3. CKG RDS ratio ≥ 10× vs. RAG across all evaluated domains.
  4. GraphRAG hallucinates edges not present in ground truth DAG (HR > 0).
  5. CKG Hallucination Rate = 0 (by construction).
  6. The "Structure Premium" hypothesis: RDS advantage correlates with DAG richness (r > 0.7).
  7. Cross-domain transfer: The CKG structural advantage observed on hand-curated educational domains (Track 1) transfers to a commercial pharmacology domain (Track 2) with F1 equal to or greater than the Track 1 average.
  8. Construction invariance: The CKG F1 advantage does not depend on expert manual curation; a DAG built programmatically from a public API yields comparable or superior retrieval performance.

We explicitly do not claim CKG outperforms RAG on T1 (entity lookup / explanatory) queries, which require prose content absent from the DAG structure. T1 results serve as a negative control validating that the benchmark is not constructed to favor CKG across all query types.

1.4 Contributions

  1. The CKG architecture specification (format, DAG constraints, taxonomy schema) and an accompanying BFS/DFS subgraph-extraction retrieval method.
  2. Five novel evaluation metrics (RDS, CUR, Hop-F1, CPCA, RP).
  3. The McCreary Corpus as the first formal benchmark dataset for structural knowledge retrieval.
  4. An open benchmark: 45 domains (44 educational + 1 commercial) × three systems, ~23,900 evaluated query–system pairs.
  5. Track 2 multi-domain ensemble: the educational McCreary corpus evaluated alongside a commercial life-sciences (GLP-1/Obesity) domain in a single harness, demonstrating educational-to-commercial transfer of the CKG retrieval advantage.
  6. Automated CKG construction pipeline: a four-stage pipeline (API extraction → concept and edge extraction → learning-graph.csv → benchmark queries) that produces a CKG domain from ClinicalTrials.gov with no manual annotation, and achieves macro F1 equal to or greater than the Track 1 hand-curated average.
  7. A GitHub-hosted dataset with one-command reproduction harness.

3. The McCreary Intelligent Textbook Corpus

This section gives the first formal definition of the corpus in the literature.

3.1 Corpus Description

The McCreary Intelligent Textbook Corpus comprises 45 open-source educational textbooks hosted on GitHub (github.com/dmccreary). Each textbook contains a standardized learning graph encoded as a CSV file representing a directed acyclic graph (DAG) of concepts and their prerequisite dependencies. Benchmark queries were generated for 44 domains; one domain was excluded during query generation due to schema incompatibility.

The corpus spans three subject categories:

  • STEM (20 domains): algebra, calculus, pre-calculus, functions, linear algebra, geometry, biology, genetics, bioinformatics, chemistry, ecology, moss, physics, circuits, digital electronics, signal processing, FFT, statistics, quantum computing, computer science.
  • Professional (14 domains): economics, data science, machine learning, blockchain, conversational AI, automating instructional design, healthcare data modeling, organizational analytics, intro to graphs, IT management, learning Linux, MicroSims, infographics, personal finance.
  • Foundational (10 domains): systems thinking, theory of knowledge, digital citizenship, ethics, prompt engineering, AI tracking, US geography, ASL, reading (kindergarten), dementia.

Key statistics

  • Concepts per domain: 25–550 (mean: ~272)
  • Total concepts: 12,261 across 45 domains
  • Total dependency edges: 19,626
  • Taxonomy categories: 1–19 per domain (mean: ~4)
  • Benchmark queries: 7,758 across 44 domains (~175 per domain)
  • Raw textbook content: MkDocs Markdown chapters available for 22 domains
📊 Interactive learning graph viewer — calculus domain (380 concepts, 539 edges). figures/calculus-learning-graph.png Figure available in repository
Figure 2. Interactive learning graph viewer for the calculus domain (380 concepts, 539 edges). Each node is a concept, color-coded by taxonomy category. Directed edges represent prerequisite dependencies. The left panel shows category filters and corpus statistics. All 45 domains in the McCreary corpus use this same DAG structure.

3.2 Corpus Schema

All 45 domains share an identical CSV schema:

ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Function,,FOUND
2,Domain and Range,1,FOUND
3,Function Notation,1,FOUND
4,Composite Function,1|3,FOUND

Dependencies are pipe-delimited integer references to prerequisite ConceptID values. TaxonomyID assigns each concept to a domain-specific category (e.g., FOUND for foundational, CORE for core, ADV for advanced).
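For concreteness, the following sketch shows one way to load such a file into label, taxonomy, and prerequisite dictionaries. The function name and data structures are illustrative, not part of the released harness.

import csv
from collections import defaultdict

def load_learning_graph(path):
    """Parse a learning-graph.csv into label, taxonomy, and prerequisite maps."""
    labels, taxonomy, prereqs = {}, {}, defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            cid = int(row["ConceptID"])
            labels[cid] = row["ConceptLabel"]
            taxonomy[cid] = row["TaxonomyID"]
            if row["Dependencies"]:
                # Pipe-delimited prerequisite ConceptIDs, e.g. "1|3"
                prereqs[cid] = [int(d) for d in row["Dependencies"].split("|")]
    return labels, taxonomy, prereqs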

3.3 Corpus Provenance

The three retrieval architectures compared in this paper do not start from independent source material: a single upstream production pipeline generates the inputs that all three systems consume in the McCreary benchmark.

The course author writes a single-page course description (target audience, prerequisites, learning objectives). The /learning-graph-generator Claude Code skill proposes a learning graph from that description, a subject-matter expert reviews and corrects the graph over two to four hours, and the result is committed as learning-graph.csv. A separate skill, /chapter-content-generator, then consumes that CSV to produce the MkDocs textbook chapter corpus. RAG and GraphRAG index the markdown corpus; CKG reads the learning-graph.csv directly.

🔁 Provenance diagram of inputs consumed by the three retrieval pipelines. figures/corpus-provenance.png Figure available in repository
Figure 3. Provenance of the inputs consumed by the three retrieval pipelines in the McCreary benchmark corpus. CKG consumes learning-graph.csv directly; RAG and GraphRAG consume the markdown chapter corpus, which was itself generated from learning-graph.csv by the /chapter-content-generator skill. The benchmark therefore compares direct access to the authored structure against retrieval from prose that was generated from that same structure.

3.4 Quality Properties

All 45 DAGs are validated for the following structural properties:

  1. Single connected component (no isolated subgraphs).
  2. No self-references (no concept lists itself as a dependency).
  3. Foundational concepts (zero prerequisites) ≥ 2 per domain.
  4. Maximum dependency chain length reported per domain.
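These properties are mechanical to check once the CSV is loaded. The sketch below uses networkx as an illustration; the corpus's actual validation scripts are not specified here and may differ.

import networkx as nx

def validate_dag(labels, prereqs):
    """Check the structural properties listed above for one domain."""
    g = nx.DiGraph()
    g.add_nodes_from(labels)
    g.add_edges_from((dep, cid) for cid, deps in prereqs.items() for dep in deps)
    acyclic = nx.is_directed_acyclic_graph(g)
    return {
        "acyclic": acyclic,
        "single_component": nx.number_weakly_connected_components(g) == 1,
        "no_self_references": all(cid not in deps for cid, deps in prereqs.items()),
        "foundational_concepts": sum(1 for n in g if g.in_degree(n) == 0),
        # Longest prerequisite chain (counted in concepts), reported per domain
        "max_chain_length": len(nx.dag_longest_path(g)) if acyclic else None,
    }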
📈 Per-domain corpus statistics heatmaps — STEM, Professional, and Foundational subsets. figures/corpus-heatmap-stem.png · corpus-heatmap-professional.png · corpus-heatmap-foundational.png Figures available in repository
Figures 4–6. Per-domain statistics for each subset of the McCreary Intelligent Textbook Corpus, sorted by concept count. Columns: Concepts, Edges, Taxonomy Categories, Foundation Concepts (in-degree zero), and Edge/Concept Ratio. Color intensity is normalized using the full-corpus range for each column. Totals: 12,260 concepts and 19,405 dependency edges.

4. Architecture Specifications

All three systems use Claude Sonnet 4.6 at temperature = 0 for generation, ensuring fair comparison. The systems differ only in how knowledge is stored and retrieved.

4.1 RAG Baseline

Table 3. RAG baseline configuration.
Parameter Value
Source MkDocs .md chapters per textbook
Chunking 512 tokens, 50-token overlap
Embeddings all-MiniLM-L6-v2 (sentence-transformers, local)
Index FAISS flat L2
Retrieval Top-5 chunks
Generation Claude Sonnet 4.6, temperature = 0

4.2 GraphRAG

Table 4. GraphRAG configuration.
Parameter Value
Source Same MkDocs .md chapters
System Microsoft GraphRAG v1.x, default configuration
Search Local mode for T1/T2/T5, global mode for T4
Note Does not use learning-graph.csv
Generation Claude Sonnet 4.6, temperature = 0

4.3 CKG (Compact Knowledge Graph)

Table 5. CKG architecture configuration.
Parameter Value
Source learning-graph.csv
Lookup Exact label match → concept node retrieval
Traversal BFS for T2 (1-hop), DFS for T3 (full path), filter for T4
Subgraph Matched concept + direct neighbors + edges
Generation Claude Sonnet 4.6, temperature = 0
Note Zero build cost — CSV-native

Key distinction. GraphRAG re-derives structure from text that was originally generated from the learning graph CSV. CKG uses the graph directly. The efficiency gap is structural, not incidental.
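A minimal sketch of the CKG retrieval step summarized in Table 5 follows: exact label lookup, one-hop neighbor expansion, and serialization of the resulting subgraph into a compact prompt context. Names are illustrative and the released harness may differ in detail; the inputs are the label, taxonomy, and prerequisite dictionaries from the loading sketch in Section 3.2.

def extract_subgraph(query_label, labels, taxonomy, prereqs):
    """Exact-label lookup, then return the matched concept, its prerequisites,
    its dependents, and the connecting edges as a compact text context."""
    by_label = {lbl: cid for cid, lbl in labels.items()}
    cid = by_label.get(query_label)
    if cid is None:
        return None                        # no match: nothing to hallucinate
    parents = prereqs.get(cid, [])         # direct prerequisites (T2)
    children = [c for c, deps in prereqs.items() if cid in deps]
    nodes = [cid, *parents, *children]
    lines = [f"{labels[n]} [{taxonomy[n]}]" for n in nodes]
    edges = ([f"{labels[p]} -> {labels[cid]}" for p in parents]
             + [f"{labels[cid]} -> {labels[c]}" for c in children])
    return "\n".join(lines + edges)        # typically tens of tokens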

5. Benchmark Design

5.1 Query Taxonomy

We define five query types (T1–T5), each targeting a different aspect of knowledge retrieval capability.

Table 6. Query type taxonomy with examples and ground truth sources.
Type Description Example Ground Truth
T1 Entity lookup "What is Composite Function?" ConceptLabel + TaxonomyID
T2 Direct dependency "What are prerequisites for Composite Function?" Dependencies column
T3 Multi-hop path "What is the chain from Function to Taylor Series?" BFS path in DAG
T4 Category aggregate "List all FOUND concepts" Filter by TaxonomyID
T5 Cross-concept "How does Domain and Range relate to Inverse Function?" Shared neighbors

Note on T1 (entity lookup): T1 queries ask for a concept explanation ("What is X?"). CKG contains no explanatory prose — it can return only the concept's TaxonomyID and dependencies in response. T1 is therefore a RAG-favorable query type deliberately included to test the boundary of CKG's capability and to confirm that the benchmark is not constructed to favor CKG universally.

5.2 Query Generation

Queries are auto-generated from each domain's CSV using generate_queries.py with a fixed random seed of 42.

Per domain: ~175 queries (50 T1 + 50 T2 + 25 T3 + 12 T4 + 38 T5). Total: 7,758 queries across 44 domains.

  • T1: Random sample of 50 concepts; query is "What is {label}?"
  • T2: Random sample of 50 concepts with ≥1 dependency
  • T3: Random pairs of foundational/terminal concepts with path length 2–5
  • T4: One query per taxonomy category
  • T5: Random sample of 38 directly connected concept pairs

5.3 Ground Truth Validity

Benchmark ground truth is derived deterministically from DAG edges: T2 answers are the direct dependency labels of a concept; T3 answers are the BFS shortest-path node sequence between two concepts; T4 answers are all concept labels sharing a TaxonomyID; T5 answers are the union of BFS path nodes between a randomly selected concept pair. Because derivation is algorithmic, inter-annotator κ does not apply to the ground truth generation process.
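As an illustration of that determinism, a T3 answer is simply the shortest path between two concepts in the prerequisite DAG. The sketch below is illustrative; the released generate_queries.py may differ in detail.

import networkx as nx

def t3_ground_truth(src, dst, labels, prereqs):
    """Shortest prerequisite chain between two concepts (T3 ground truth)."""
    g = nx.DiGraph()
    g.add_edges_from((dep, cid) for cid, deps in prereqs.items() for dep in deps)
    path = nx.shortest_path(g, source=src, target=dst)  # BFS on an unweighted DAG
    return [labels[c] for c in path]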

Benchmark validity is instead assessed structurally. The T1 negative-control result (CKG F1 = 0.207 on entity lookup) confirms the evaluation is not constructed to favor CKG universally. Additionally, the low GraphRAG T4 score (0.054 vs. CKG 0.964) confirms that the taxonomy advantage is structural — not an artifact of prompting — since GraphRAG and CKG use identical prompts and the same LLM.

5.4 Reproducibility Protocol

  • All systems use Claude Sonnet 4.6 at temperature = 0.
  • Token counts via Anthropic count_tokens() API.
  • 3 runs per query, variance reported.
  • Fixed random seed: 42.
  • Benchmark version locked: v1.0.0.
  • One-command reproduction: python evaluation/harness.py --reproduce-table-1

6. Metrics

We evaluate 16 metrics organized into six categories. Novel metrics introduced in this paper are marked with ★.

6.1 Standard IR Metrics

Token-Level F1 (SQuAD-style)

Used for T1, T2, and T4 queries. Precision, recall, and F1 are computed over token sets:

F1 = (2 · P · R) / (P + R)    P = |pred ∩ truth| / |pred|    R = |pred ∩ truth| / |truth|

Edge-Overlap F1

Used for T3 (path) and T5 (cross-concept) queries:

Edge_F1 = (2 · |Epred ∩ Etruth|) / (|Epred| + |Etruth|)

Exact Match (EM)

Binary — the full answer must match ground truth exactly. Reported alongside F1 as a secondary metric.
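The two F1 variants above reduce to a few lines each (EM is plain string equality). The sketch below omits answer normalization (casing, punctuation, article stripping) and is illustrative rather than the released harness implementation.

from collections import Counter

def token_f1(pred, truth):
    """SQuAD-style token-level F1 over whitespace tokens (T1, T2, T4)."""
    p, t = Counter(pred.split()), Counter(truth.split())
    overlap = sum((p & t).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)

def edge_f1(pred_edges, truth_edges):
    """Edge-overlap F1 over (source, target) concept pairs (T3, T5)."""
    pred, truth = set(pred_edges), set(truth_edges)
    if not pred or not truth:
        return 0.0
    overlap = len(pred & truth)
    return 2 * overlap / (len(pred) + len(truth))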

6.2 Reasoning Density Score ★ (RDS)

The core compound metric introduced in this paper:

RDS(s, q) = F1(s, q) / tokens_consumed(s, q)

Macro-averaged across all queries: RDSmacro(s) = mean(RDS over all queries). The RDS ratio compares two systems: RDSratio(A, B) = RDSmacro(A) / RDSmacro(B). Higher values indicate more reasoning quality per token spent.
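In code, the macro RDS and the RDS ratio reduce to the following sketch; per-query F1 scores and token counts are assumed to be available from the harness logs.

def rds_macro(f1_scores, token_counts):
    """Macro-averaged Reasoning Density Score: mean of per-query F1 / tokens."""
    per_query = [f / t for f, t in zip(f1_scores, token_counts)]
    return sum(per_query) / len(per_query)

def rds_ratio(system_a, system_b):
    """RDS ratio between two systems, each given as (f1_scores, token_counts)."""
    return rds_macro(*system_a) / rds_macro(*system_b)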

6.3 Hop-Depth F1 Degradation ★

Measures how F1 degrades as reasoning chain length increases:

F1@hop(s, k) = mean(F1 | hop_depth = k)

Reported for k = 1, 2, 3, 4, 5+. Expected finding: RAG degrades steeply at k ≥ 2; CKG remains flat due to explicit edge traversal.

6.4 Tokenomics Metrics

Context Utilization Rate ★ (CUR)

Fraction of retrieved tokens relevant to the answer: CUR = relevant_tokens / total_retrieved_tokens.

Cost Per Correct Answer ★ (CPCA)

Real-world cost using Claude Sonnet 4.6 pricing ($3/M input, $15/M output): CPCA = cost_per_query / F1.

Precision at Token Budget (P@T)

Mean F1 over queries where tokens_consumed ≤ budget T. Reported for T = 500, 1000, 2000, 5000, 10000.

Token Budget Breakeven

Minimum budget where RAG/GraphRAG F1 ≥ CKG F1: breakeven = min T such that F1RAG(T) ≥ F1CKG(500).

Index Build Cost

One-time cost: tokens consumed during indexing + wall-clock time + storage. CKG: zero (CSV already exists).

Update Cost ★

Cost to incorporate one new concept: CKG edits one CSV row with zero re-indexing; RAG re-embeds affected chunks; GraphRAG requires full re-extraction.
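The query-level tokenomics metrics above (CPCA, P@T, and the breakeven budget) can be computed from per-query F1 and token logs as in the sketch below; function names are illustrative.

def cpca(cost_per_query, f1):
    """Cost Per Correct Answer: dollars spent per unit of F1."""
    return cost_per_query / f1 if f1 > 0 else float("inf")

def precision_at_budget(f1_scores, token_counts, budget):
    """Mean F1 over queries whose token consumption fits within the budget."""
    within = [f for f, t in zip(f1_scores, token_counts) if t <= budget]
    return sum(within) / len(within) if within else 0.0

def breakeven_budget(rag_f1, rag_tokens, ckg_f1_at_500,
                     budgets=(500, 1000, 2000, 5000, 10000)):
    """Smallest budget at which RAG's P@T reaches CKG's F1 at a 500-token budget."""
    for b in budgets:
        if precision_at_budget(rag_f1, rag_tokens, b) >= ckg_f1_at_500:
            return b
    return None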

6.5 Structural Fidelity Metrics

Relationship Precision ★ (RP)

Of edges returned, what fraction are real DAG edges: RP = |Epred ∩ Etruth| / |Epred|.

Hub Node Recall ★ (HNR)

Recall on high-indegree concepts (top 20% by indegree).

Boundary Completeness ★ (BC)

For T4 queries, fraction of the taxonomy category returned: BC = |retrieved ∩ members| / |members|. CKG achieves BC ≈ 1.0 by construction.

6.6 Robustness Metrics

Paraphrase Stability (PS)

F1 variance across 5 paraphrased versions of each query. CKG should be stable (exact concept match); RAG is embedding-sensitive.

Hallucination Rate (HR)

Fraction of queries returning ≥1 concept not in the corpus. CKG: HR = 0 by construction. GraphRAG's dynamic extraction can hallucinate.

7. Results

All Track 1 results are final. CKG: 44 domains, 7,758 queries. RAG: 40 corpus-complete domains, 7,191 queries. GraphRAG: 15 domains, 2,683 queries (the subset where indexing completed within the evaluation budget). Track 2 (GLP-1/Obesity, pipeline-generated) results are reported in Section 10.

7.1 Macro-Average Performance

Table 7. Macro-average performance. CKG: 44 domains, 7,758 queries. RAG: 40 domains, 7,191 queries. GraphRAG: 15 domains, 2,683 queries.
System Macro F1 Tokens/q RDS Run cost ($)
RAG 0.1231 2,982 0.0000482 76.23
GraphRAG 0.1200 3,450 0.0000452 44.43
CKG 0.4709 269 0.00201 7.81

CKG achieves 3.8× higher macro F1 than RAG and 3.9× higher than GraphRAG while consuming 11× fewer tokens per query than RAG and 13× fewer than GraphRAG. The compound RDS advantage is 42× over RAG and 44× over GraphRAG. CKG's run cost ($7.81) is 90% lower than RAG ($76.23) and 82% lower than GraphRAG ($44.43), across a larger domain set in each case.

📊 RDS comparison and token consumption differential bar charts. figures/fig4_rds_comparison.png Figure available in repository
Figure 7. Left: Reasoning Density Score (RDS = F1 / tokens) for RAG and CKG. CKG's RDS is 45.9× higher. Right: Mean tokens per query — RAG consumes 10.9× more tokens for lower-quality answers.

7.2 F1 by Query Type

📊 F1 by query type (T1–T5) for CKG and RAG. figures/fig3_f1_by_query_type.png Figure available in repository
Figure 8. Token-level F1 by query type for CKG (blue) and RAG (red). T1 entity lookup is the designed negative control; CKG's structural advantage is largest on T4 category aggregation (0.95 vs. 0.29) and T2/T3 dependency/path queries (0.60 vs. 0.08–0.20).
Table 8. Token-level F1 by query type (Track 1).
System T1 entity T2 dep. T3 path T4 aggr. T5 cross
RAG 0.094 0.078 0.201 0.286 0.115
GraphRAG 0.108 0.073 0.208 0.054 0.183
CKG 0.207 0.634 0.660 0.964 0.323

T1 (entity lookup) is the designed negative control: CKG stores graph structure rather than prose definitions, so its T1 F1 of 0.207 is expected.

T4 (category aggregation) shows the sharpest divergence: CKG achieves 0.964 versus RAG's 0.286. Aggregation queries require enumerating all members of a category precisely — a task CKG resolves by reading the taxonomy column directly.

T2 and T3 (dependency resolution and multi-hop path traversal) show CKG at 0.634–0.660 versus RAG at 0.078–0.201, confirming the structural retrieval advantage on prerequisite-chain queries.

T5 (cross-concept relationship) shows CKG at 0.323 versus RAG at 0.115. Performance improved after introducing BFS shortest-path traversal between concept pairs and enriching ground truth.

7.3 Token Efficiency and RDS

Table 9. Token composition and Reasoning Density Score (RDS = F1 / total tokens).
System Mean total tokens Mean retrieved tokens Macro F1 RDS RDS ratio
RAG 2,982 2,392 0.1231 0.0000482 0.024×
GraphRAG 3,450 n/a 0.1200 0.0000452 0.022×
CKG 269 44 0.4709 0.00201 1.0×

RAG's retrieved context (2,392 tokens mean) is 54× larger than CKG's retrieved subgraph (44 tokens mean). Despite their large contexts, RAG and GraphRAG produce less accurate answers because passage-level retrieval does not preserve the structural relationships that structural queries require.

The 42× RDS advantage directly answers a cost question for practitioners: deploying CKG for structural knowledge queries instead of RAG reduces intelligence delivery cost by approximately 97.6% while improving answer quality by 3.8×.

📊 Token composition by component — RAG vs. CKG. figures/fig7_token_composition.png Figure available in repository
Figure 9. Token composition by component for RAG (left) and CKG (right). RAG's 2,500-token retrieved context dominates the budget; CKG's 44-token subgraph is 57× smaller with higher answer quality.

7.4 F1 by Hop Depth

Table 10. F1 by hop depth (depth of prerequisite chain traversed).
System hop=0 hop=1 hop=2 hop=3 hop=4 hop=5
RAG 0.073 0.066 0.226 0.138 0.166 0.170
CKG 0.374 0.519 0.573 0.671 0.751 0.772

CKG F1 increases continuously with hop depth, from 0.374 at hop=0 to 0.772 at hop=5 — the deepest chains produce the highest accuracy. This is structurally opposite to the typical RAG pattern: RAG retrieval recall falls as multi-hop queries require evidence from multiple documents. CKG traverses edges deterministically, so deeper chains do not degrade performance.

RAG shows irregular behavior across hop depths (0.073 at hop=0, peaking at 0.226 at hop=2, then declining), consistent with retrieval recall variance rather than systematic improvement.

📊 F1 vs. hop depth for T3 multi-hop path queries. figures/fig5_hop_degradation.png Figure available in repository
Figure 10. F1 by hop depth for multi-hop path queries (T3). CKG F1 rises monotonically with hop depth (0.37 at hop=0 to 0.77 at hop=5) while RAG stays below 0.23 at every depth; CKG's deterministic BFS traversal is depth-invariant by construction.

7.5 The Structure Premium

📊 CKG RDS vs. DAG edge density across 44 domains — Structure Premium scatter plot. figures/fig8_structure_premium.png Figure available in repository
Figure 11. The Structure Premium hypothesis: CKG Reasoning Density Score (RDS) vs. DAG edge density (edges per concept) across 44 domains. Pearson r = −0.09, indicating the advantage is uniform across DAG richness levels.

8. Discussion

8.1 Where CKG Wins and Why

CKG's advantages are structural:

  • T2/T3 queries: Explicit edges eliminate multi-hop inference errors. RAG must infer transitive dependencies from unstructured text; CKG traverses them directly via BFS/DFS.
  • T4 queries: Taxonomy filtering achieves BC ≈ 1.0 by construction, as the TaxonomyID field provides exact category membership.
  • Hallucination: HR = 0 because CKG only returns concepts present in the source CSV. No generative step can introduce phantom entities.
  • RDS: Near-zero build cost combined with 150–400 tokens per query yields order-of-magnitude efficiency gains.

8.2 Where RAG Is Competitive

RAG remains competitive in specific scenarios:

  • T1 entity lookup on large open-domain corpora where rich context aids natural language generation.
  • Domains without stable taxonomy (rapidly evolving fields).
  • When CKG construction cost exceeds the efficiency savings.

8.3 GraphRAG's Position

GraphRAG occupies a middle ground: better than RAG on multi-hop reasoning (graph structure helps) but worse than CKG (dynamic extraction introduces noise and hallucinated edges). GraphRAG is the most expensive system (high build cost + high query cost). Its best use case is unstructured corpora with no available expert taxonomy.

8.4 The Structure Premium

We tested the hypothesis that the CKG RDS advantage is proportional to the structural richness of the domain's DAG, defined as:

dag_richness(d) = (edges / concepts) × mean_indegree × (1 / orphan_rate)

Across 44 domains, the Pearson correlation between dag_richness and CKG RDS is r = −0.09 (n = 44), and between dag_richness and macro F1 is r = −0.07. Both are negligible. The Structure Premium hypothesis is not supported: CKG's efficiency advantage does not concentrate in domains with denser DAG structure. This is a stronger finding than a positive correlation would have been: the advantage is uniform across DAG richness levels. CKG outperforms RAG and GraphRAG by 3.8–3.9× on F1 and 42× on RDS whether the underlying graph is sparse or dense. The efficiency gain is architectural — a property of pre-structured retrieval itself.
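The correlation test itself is straightforward to reproduce from per-domain statistics (sketch; variable names are illustrative, and orphan_rate is assumed to be non-zero).

import numpy as np

def dag_richness(edges, concepts, mean_indegree, orphan_rate):
    """Structural richness as defined above; orphan_rate must be > 0."""
    return (edges / concepts) * mean_indegree * (1.0 / orphan_rate)

def structure_premium_r(richness_by_domain, rds_by_domain):
    """Pearson correlation between DAG richness and CKG RDS across domains."""
    return float(np.corrcoef(richness_by_domain, rds_by_domain)[0, 1])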

8.5 Limitations

  • Structural query scope: Ground truth for T2, T3, and T4 queries is derived directly from DAG edges. The comparison therefore demonstrates that explicit structure outperforms inferred structure on structural tasks — not that CKG is a general-purpose retrieval system.
  • T1 as boundary test: CKG scores F1 ≈ 0.207 on T1 entity lookup queries because the DAG contains no explanatory prose. For open-ended definitional knowledge retrieval, RAG remains the appropriate architecture.
  • The McCreary corpus is educational — results may not generalize to legal, financial, or medical domains.
  • CKG build cost with automated construction has not been formally measured; the Track 2 pipeline demonstrates feasibility but pipeline engineering cost was not included in the benchmark cost accounting.
  • Ground truth derived from DAG edges may not capture all valid natural language answers.
  • All systems use the same LLM (Claude Sonnet 4.6); results may differ with other models.

8.6 Educational-to-Commercial Transfer

Tracks 1 and 2 together establish that the CKG retrieval advantage generalizes across domain type and construction method. Track 1 (44 hand-curated educational DAGs) and Track 2 (1 pipeline-generated commercial pharmacology DAG) share the same retrieval algorithm, the same metrics, and the same harness, and yield consistent structural-query outcomes (CKG macro F1 = 0.4709 and 0.5298 respectively, against RAG F1 ≈ 0.12–0.15 on both). The practical implication is that any knowledge-intensive field whose entities and relationships are expressible as a directed acyclic graph — pharmaceutical, legal, financial, regulatory, biomedical — is a candidate for CKG deployment.

9. The Economics of Learning Graph Generation

A central limitation of any structure-first retrieval architecture is the up-front cost of constructing the underlying knowledge structure. This section addresses that objection directly. We first provide a formal definition of a learning graph, then present a cost model for generating one via an agentic workflow, and finally extrapolate the trajectory of generation cost over the next 18–24 months.

9.1 Formal Definition of a Learning Graph

Definition 1 — Learning Graph

A learning graph is a 4-tuple G = (C, E, T, τ) where:

  • C = {c1, c2, …, cn} is a finite, non-empty set of concepts, each a named unit of domain knowledge with a unique identifier and a human-readable label.
  • E ⊆ C × C is a set of directed prerequisite edges. An edge (ci, cj) ∈ E asserts that concept ci must be understood before concept cj can be meaningfully taught or applied.
  • T = {t1, t2, …, tk} is a finite set of taxonomy categories that partition the concept space into coarse groupings (e.g., FOUND, CORE, ADV).
  • τ : C → T is a total function assigning each concept to exactly one taxonomy category.

The pair (C, E) must form a directed acyclic graph (DAG): there exists no sequence ci1, ci2, …, cim, ci1 such that every consecutive pair is an edge in E. Acyclicity ensures a valid teaching order exists (topological sort over E).

Three properties follow directly from Definition 1 and are load-bearing for the CKG architecture:

  1. Finite, enumerable context. Because C is finite and the CSV serialization is compact, the entire graph fits in an LLM prompt for any realistic domain (|C| ≤ 1,000).
  2. Deterministic traversal. Prerequisite chains are computed by BFS or DFS over E with no inference required. Query answers for structural questions (T2, T3, T4) are functions, not predictions.
  3. Closed vocabulary. The set of valid concept labels is exactly {label(c) : c ∈ C}. A retrieval system that returns only concepts in C cannot hallucinate entities by construction.
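Acyclicity and deterministic ordering can be made concrete with Kahn's algorithm: a valid teaching order is any topological sort of (C, E), computable in linear time. The sketch below is illustrative.

from collections import deque

def teaching_order(concepts, edges):
    """Topological sort of (C, E): a valid teaching order.
    Raises if a cycle is present, i.e. if Definition 1 is violated."""
    indegree = {c: 0 for c in concepts}
    successors = {c: [] for c in concepts}
    for pre, post in edges:          # edge (pre, post): pre is a prerequisite of post
        indegree[post] += 1
        successors[pre].append(post)
    queue = deque(c for c in concepts if indegree[c] == 0)   # foundational concepts
    order = []
    while queue:
        c = queue.popleft()
        order.append(c)
        for nxt in successors[c]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(concepts):
        raise ValueError("cycle detected: (C, E) is not a DAG")
    return order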

9.2 Agentic Learning Graph Generation

The McCreary corpus graphs used in this benchmark were produced with an agentic workflow: the /learning-graph-generator Claude Code Agent Skill, publicly available at github.com/dmccreary/claude-skills.

Input quality scoring

Because the quality of the generated learning graph depends heavily on the completeness of the course description, the skill begins by scoring the supplied description on a 100-point rubric and reporting the score back to the author before generation proceeds. Low-scoring descriptions trigger specific suggestions (e.g., missing Bloom's-taxonomy outcomes, absent prerequisite list, unstated target audience).

Generation pipeline

Given a description that has cleared the scoring rubric, the skill decomposes generation into three stages:

  1. Concept elicitation. Given a course description, the agent proposes a concept set C of a target size (default n = 200), drawing on the model's domain knowledge and any supplied source materials.
  2. Dependency assignment. For each concept cj ∈ C, the agent selects the subset of previously-enumerated concepts on which cj depends, populating E incrementally.
  3. Taxonomy assignment and validation. The agent assigns τ(c) for each c ∈ C, then runs automated validation: cycle detection, orphan detection, dependency count distribution, and a quality score. Graphs scoring below threshold trigger a correction cycle.

A subject-matter expert (SME) reviews the final graph for domain fidelity and edits edges or labels as needed. In practice, SME review for a 200-concept graph takes 2–4 hours; the agentic generation itself completes in minutes.

9.3 Cost Model

Rather than estimating generation cost from first principles, we measured it directly. Each invocation of the /learning-graph-generator skill is recorded by a Claude Code PostToolUse hook (track-skill-end.sh) that writes a skill-usage.jsonl event. Of 21 recorded invocations, 9 had surviving full session transcripts at measurement time; all 9 measured sessions used Claude Opus 4.6 as the generating model.

Table 11. Measured token consumption and cost for nine complete /learning-graph-generator sessions, sorted by concept count. Costs reflect Claude Opus 4.6 public API pricing at measurement time. (Sessions generated under Claude Max subscription — costs computed from public pay-as-you-go API rates for reproducibility.)
Session Concepts Total Tokens Cached Tokens Cost ($)
Min n/a n/a n/a 9.21
Mean (9 sessions) 311 n/a n/a 13.94
Max n/a n/a n/a 21.38

Fitted cost model

We model generation cost as an affine function of concept count:

Cost(n) ≈ α + β · n

A least-squares fit to the nine measured sessions yields, for Claude Opus 4.6:

CostOpus 4.6(n) ≈ $8.16 + $0.019 · n

with R2 ≈ 0.24. The low coefficient of determination reflects substantial session-to-session variance driven by the number of validation and correction cycles. The mean measured cost was $13.94 across sessions averaging 311 concepts, with observed values ranging from $9.21 to $21.38.

For cheaper model tiers, applying the per-token price ratio between Opus and Sonnet 4.6 (roughly 5×) yields a projected Sonnet cost of approximately $3 for a 200-concept graph.
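Under the fitted coefficients and the approximate 5× price ratio, the projection is simple arithmetic (sketch; the ratio is an approximation, not a measured quantity, so the projected figure is indicative only).

def cost_opus(n_concepts, alpha=8.16, beta=0.019):
    """Fitted affine cost model for Claude Opus 4.6 generation sessions."""
    return alpha + beta * n_concepts

print(round(cost_opus(200), 2))   # ~11.96 USD for a 200-concept graph on Opus 4.6
# Dividing by the ~5x Opus/Sonnet per-token price ratio gives a figure in the
# $2-3 range, consistent with the approximately-$3 Sonnet projection above.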

9.4 Projected Trajectory

Two compounding trends point toward sharply lower generation cost over the next 18–24 months:

  1. Declining model pricing at constant capability. Across the preceding two years, the cost of a fixed level of capability at the Anthropic frontier has approximately halved every 9–12 months, as smaller models match the capability of their predecessors.
  2. Skill-level efficiency gains. The current generation workflow holds the full emerging graph in context, so token consumption grows super-linearly with concept count. Decomposing generation into cache-efficient passes has been observed to reduce token consumption 2–3× at any given model tier.

Combining these trends with the measured Opus 4.6 baseline of roughly $12 per 200-concept graph, we project a plausible 2027 cost of $1–$2 per 200-concept graph when a Haiku-class successor model is used with a cache-efficient rewrite of the generation workflow. At that price point, domain graph generation is effectively free relative to any downstream inference workload.

9.5 Implications for the Build-Cost Objection

The traditional objection to structure-first retrieval — "curating a domain graph is prohibitively expensive" — was empirically true for two decades. It is no longer true. When a 200-concept domain graph can be generated for dollars of compute and a handful of SME-review hours, the decision calculus inverts: the question is no longer whether a domain can afford a learning graph, but whether any high-query-volume domain can afford not to have one, given that per-query retrieval costs fall by roughly an order of magnitude once the graph exists.

This shifts where the economic moat lies. Model compute is commoditizing; the durable value in applying CKG to new domains is the SME review loop — ensuring the generated graph faithfully reflects the expert consensus of the field — and the schema design work required to extend Definition 1 beyond prerequisite-structured domains to domains with richer relation types.

10. Track 2: Pipeline-Generated Domain Validation

TRACK 2 — GLP-1 / OBESITY PHARMACOLOGY

All 44 domains in the primary benchmark (Track 1) originate from the McCreary Intelligent Textbook Corpus: hand-curated educational DAGs where a domain expert manually mapped concept dependencies. A critical open question is whether the CKG architecture's performance advantage depends on that curation quality or whether it transfers to programmatically constructed knowledge graphs derived from external data sources.

Track 2 answers this question by constructing a complete CKG domain from scratch using no pre-existing expert-curated graph, no educational corpus, and no McCreary source material. The domain selected is GLP-1/Obesity pharmacology — a commercially active life sciences domain with rapidly evolving clinical trial data, 8 FDA-approved agents, and a pipeline of 150+ ongoing trials at the time of corpus construction (April 2026).

10.1 Data Source and Pipeline

Track 2 uses ClinicalTrials.gov as the sole external data source, accessed via the NIH/NLM API (v2). The pipeline consists of four stages:

  1. API extraction. Structured query against ClinicalTrials.gov returns 668 semaglutide trials, 224 tirzepatide trials, and 158 pipeline agent trials (retatrutide, cagrisema, orforglipron, mazdutide). Trial metadata includes NCT identifiers, phase, enrollment, endpoints, mechanisms, and completion dates.
  2. Concept extraction. Trial data is parsed to extract pharmacological entities (agents, mechanisms, indications, trial programs, outcomes) and their dependency relationships. Taxonomy labels are assigned from a domain-specific schema: FOUND (foundational mechanism), DRUG (approved agent), TRIAL (completed landmark trial), PATH (pathway/mechanism), COMPL (complication/adverse effect), SPEC (special population), COMBO (combination strategy).
  3. Graph construction. Extracted concepts and dependencies are written to learning-graph.csv in the standard CKG schema. The resulting graph contains 90 concepts and 170 dependency edges covering foundational mechanisms through next-generation pipeline agents and cross-indication expansion (cardiovascular, renal, neurological, addiction).
  4. Query generation. The standard benchmark harness (generate_queries.py) runs unchanged on the learning-graph.csv, producing 170 queries in the T1–T5 taxonomy with deterministic ground truth derived from DAG edges.

The full pipeline — from raw API data to benchmark-ready domain — requires no manual annotation and no subject matter expert review beyond the initial taxonomy schema.
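For reference, stage 1 can be as small as the sketch below. The endpoint and parameter names reflect our understanding of the ClinicalTrials.gov v2 API and should be verified against the current NIH/NLM documentation; this is an assumption-laden illustration, not the released pipeline code.

import requests

BASE = "https://clinicaltrials.gov/api/v2/studies"   # assumed v2 endpoint

def fetch_trials(term, page_size=100):
    """Stage 1 sketch: page through study records matching a free-text term."""
    params = {"query.term": term, "pageSize": page_size, "format": "json"}
    studies, token = [], None
    while True:
        if token:
            params["pageToken"] = token
        resp = requests.get(BASE, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        studies.extend(payload.get("studies", []))
        token = payload.get("nextPageToken")
        if not token:
            return studies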

Corpus construction for RAG and GraphRAG

To enable a fair three-way comparison, a prose corpus was constructed from the same ClinicalTrials.gov data used to build the CKG. Five structured narrative documents were written covering: (1) market overview and approved agents, (2) landmark clinical trial evidence (STEP, SURMOUNT, SELECT, SUMMIT, FLOW programs), (3) next-generation pipeline intelligence, (4) indication expansion across 15+ disease areas, and (5) investment-relevant signals including oral formulation, muscle preservation, and CNS/addiction frontiers.

10.2 Results

Table 12. Track 2 results: pipeline-generated GLP-1/Obesity domain (170 queries) compared with Track 1 benchmark aggregate (44 hand-curated domains, 7,758 queries).
Track System Macro F1 Tokens/q Ret. Tokens/q RDS n queries
Track 1 CKG 0.4709 269 44 0.00201 7,758
Track 1 RAG 0.1231 2,982 2,392 0.0000482 7,191
Track 1 GraphRAG 0.1200 3,450 n/a 0.0000452 2,683
Track 2 CKG 0.5298 346 54 0.00153 170
Track 2 RAG 0.1538 2,828 2,214 0.0000544 170
Track 2 GraphRAG 0.1436 3,450 n/a 0.0000416 170

The Track 2 CKG F1 of 0.5298 exceeds the Track 1 CKG macro-average of 0.4709 by 12.5%. This result is notable because the GLP-1 graph was generated by pipeline from raw API data, not curated by a domain expert. The performance advantage is not degraded by automated construction — it is maintained or improved.

Table 13. Track 2 F1 by query type — GLP-1/Obesity domain (170 queries).
System T1 entity T2 dep. T3 path T4 aggr. T5 cross
CKG 0.225 0.677 0.873 0.998 0.425
RAG 0.150 0.076 0.221 0.108 0.226
GraphRAG 0.129 0.051 0.216 0.031 0.258

T4 (category aggregation) reaches CKG F1 = 0.998 — near-perfect enumeration of agents by drug class, indication by anatomy, and trial by program — compared with RAG's 0.108 and GraphRAG's 0.031. This is the sharpest divergence observed in either track.

T3 (multi-hop path) reaches CKG F1 = 0.873, substantially above the Track 1 T3 average of 0.660. The GLP-1 dependency graph encodes mechanistic chains (receptor → signaling pathway → downstream effect → clinical outcome) that BFS traversal resolves with high precision.

10.3 Implications for the CKG Architecture

Finding 1: Retrieval performance depends on graph structure, not curation source

CKG F1 on the pipeline-generated GLP-1 domain (0.530) exceeds the hand-curated educational average (0.471). If expert curation quality were the driver of CKG performance, the pipeline domain would score lower. It does not. This means the architecture generalizes: any domain with stable concept relationships expressible in a DAG benefits from CKG retrieval regardless of how the DAG was built.

Finding 2: The CKG factory is viable for commercial domains

The GLP-1 benchmark demonstrates an end-to-end automated pipeline from public API data to a benchmarked knowledge retrieval system. The pipeline requires no annotation budget, no expert review, and no existing textbook or corpus — only a structured data source and a taxonomy schema. This generalizes the CKG architecture beyond educational settings into any knowledge-intensive commercial domain (pharmaceutical, legal, financial, regulatory) where public or proprietary structured data is available.

Finding 3: The 28× RDS advantage holds on enterprise domains

RDS for CKG on GLP-1 is 0.00153 versus RAG's 0.0000544 — a ratio of 28×. The Track 1 RDS ratio is 42×. The slight reduction on Track 2 reflects the GLP-1 domain's higher prose complexity, but the order-of-magnitude token efficiency advantage is preserved. A life sciences organization deploying CKG for structured pharmacology queries over RAG realizes approximately 97% cost reduction with 3.4× accuracy improvement.

Implications for practitioners

For organizations considering CKG deployment in a commercial domain, the Track 2 evidence suggests a concrete recipe: (i) identify a structured data source whose records encode entities and dependencies (public APIs, regulatory registries, internal knowledge bases, product catalogues); (ii) define a lightweight taxonomy schema for the domain; (iii) run the four-stage pipeline to produce a learning-graph.csv; (iv) query via the standard harness. This recipe is domain-agnostic, requires no annotation budget, and produces a benchmarkable CKG with retrieval performance comparable to hand-curated knowledge graphs.

The Track 2 data and corpus are available in the benchmark repository under benchmark/domains/glp1-obesity/ and corpus/glp1-obesity/.

11. Conclusion

We presented the CKG Benchmark, a two-track comparison of Compact Knowledge Graphs against RAG and GraphRAG across 44 hand-curated educational domains (Track 1, 7,758 queries) and one pipeline-generated commercial pharmacology domain (Track 2, 170 queries). Our key findings are:

  1. CKG achieves 4× higher macro F1 than RAG on structural knowledge queries (0.4709 vs. 0.1231 vs. 0.1200 for GraphRAG), using 11× fewer tokens per query (269 vs. 2,982 vs. 3,450). The compound RDS advantage is 42× on Track 1.
  2. Structural query performance is near-ceiling for aggregation: T4 category aggregates reach F1 = 0.964 for CKG vs. 0.286 for RAG and 0.054 for GraphRAG, because the answer is fully determined by the taxonomy column of the CSV.
  3. CKG F1 increases continuously with hop depth (peaking at hop=5: 0.772) while RAG is irregular and plateaus near 0.17, confirming that deterministic edge traversal does not suffer the multi-hop retrieval recall failures inherent to passage-based systems.
  4. T1 (entity lookup) is a documented negative control: CKG scores 0.207 on explanatory queries, confirming the benchmark is not constructed to favor CKG universally.
  5. RDS and Hop-Depth F1 are practical additions to IR evaluation that jointly measure quality and efficiency — metrics missing from existing benchmarks such as BEIR and RAGAS.
  6. The CKG advantage transfers to a pipeline-generated commercial domain. Track 2 builds a GLP-1/Obesity CKG programmatically from the ClinicalTrials.gov API, with no expert curation. The resulting domain achieves CKG macro F1 = 0.5298, exceeding the Track 1 hand-curated average by 12.5%, and preserves a 28× RDS advantage over RAG.

Limitations

Ground truth for T2–T4 is derived from the same DAG used by CKG, which is a methodological constraint acknowledged in the design (Section 5.3). T1 performance on explanatory queries is weak for all three systems and represents an open problem. Pipeline engineering cost for automated CKG construction (Track 2) was not included in the benchmark cost accounting.

Future Work

  • Extend the automated construction pipeline to additional structured data sources (regulatory registries, legal codes, financial filings, product catalogues) to test the Track 2 transfer result at scale.
  • Extend T5 cross-concept retrieval with deeper BFS path-finding; the current T5 F1 of 0.323 (Track 1) and 0.425 (Track 2) has a clear improvement trajectory.
  • Hybrid architectures combining CKG structural precision with RAG's prose retrieval for T1 and T5 query types.
  • Formal build-cost accounting that includes pipeline engineering, taxonomy-schema design, and source-API cost, enabling end-to-end CPCA comparison across all three architectures.
  • Model-robustness experiments replicating the benchmark across additional LLM families.

Unified contribution

Track 1 validates the intelligent textbook format — hand-curated learning-graph CSVs — as a benchmark-grade substrate for knowledge retrieval at scale across 44 domains. Track 2 validates automated construction of CKG domains from structured external data, showing the retrieval advantage does not depend on manual authorship. Taken together, the two tracks establish Compact Knowledge Graphs as a distinct architecture category — one that is empirically superior to RAG and GraphRAG on structural queries, an order of magnitude cheaper in tokens, and ready for deployment in any domain whose knowledge can be expressed as a directed acyclic graph.

Open Benchmark

The complete benchmark — corpus, queries, evaluation harness, and all results — is released at github.com/Yarmoluk/ckg-benchmark under CC BY 4.0 (data) and MIT (code). A HuggingFace dataset mirror (graphify-md/ckg-benchmark) is forthcoming. We invite the community to add domains, systems, and metrics.