Benchmarking Knowledge Retrieval Architectures Across Educational and Commercial Domains: RAG, GraphRAG, and Compact Knowledge Graphs

Daniel Yarmoluk (Graphify.md) · Dan McCreary (Intelligent Textbooks)
v0.6.2 · Pre-print · April 2026 · github.com/Yarmoluk/ckg-benchmark
Abstract

Retrieval-augmented generation (RAG) and graph-based retrieval (GraphRAG) are the dominant paradigms for grounding LLM responses in structured knowledge. Both optimize for recall while treating token cost as a secondary concern. We introduce Compact Knowledge Graphs (CKG) — pre-structured DAG representations with explicit concept taxonomy and pipe-delimited dependency encoding — and present a two-track benchmark comparing CKG, RAG, and GraphRAG across educational and commercial domains.

Track 1 evaluates the McCreary Intelligent Textbook Corpus: 44 hand-curated educational domains spanning STEM, professional, and foundational subjects (7,758 queries). CKG achieves macro-average F1 of 0.4709 versus 0.1231 for RAG and 0.1200 for GraphRAG, at 11× fewer tokens per query (269 vs. 2,982 vs. 3,450). CKG F1 increases with hop depth (reaching 0.77 at hop=5) while RAG remains low and irregular. The compound Reasoning Density Score (RDS) advantage is 42×. T1 entity lookup (CKG 0.207, RAG 0.094) serves as a designed negative control.

Track 2 tests whether the architecture generalizes beyond hand-curated educational data. We build a complete GLP-1/Obesity pharmacology CKG programmatically from the ClinicalTrials.gov API with no expert annotation (170 queries). The pipeline-generated CKG reaches macro F1 0.5298 — exceeding the Track 1 hand-curated average by 12.5% and preserving a 28× RDS advantage over RAG. Automated construction does not degrade retrieval quality; it matches or improves it. Together the two tracks establish CKG as a distinct architecture whose structural advantage is domain-agnostic and does not depend on manual curation.

Benchmark, dataset, and evaluation harness are released at github.com/Yarmoluk/ckg-benchmark under CC BY 4.0 / MIT.

Headline results: 42× RDS advantage (CKG over RAG) · CKG macro F1 0.4709 (Track 1, 44 domains) · 11× fewer tokens per query · CKG F1 0.5298 on Track 2 (GLP-1, pipeline-generated) · 7,928 total benchmark queries (Tracks 1 + 2).

1. Introduction

1.1 Motivation

LLM retrieval quality is typically measured by F1 alone, yet token cost is a first-class production constraint — not a research afterthought. Domain-specific knowledge has latent structure that RAG discards through chunking, and GraphRAG re-derives structure from text at significant computational expense. But what if the structure is already known?

Consider a deployed AI tutoring system serving 10,000 student sessions per day, each requiring 5–10 knowledge retrieval queries. At RAG's typical 3,000–5,000 input tokens per query, daily token consumption exceeds 150M–500M tokens. At Claude Sonnet 4.6 pricing ($3 per 1M input tokens), this yields daily inference costs of $450–$1,500 for retrieval alone. A system that achieves equivalent accuracy at 250–400 tokens per query reduces that cost by a factor of 10–15× — the difference between a viable product and an unsustainable one.
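The arithmetic behind these figures can be checked directly. The short sketch below recomputes daily token volume and cost from the assumptions stated above (10,000 sessions, 5–10 queries per session, 3,000–5,000 input tokens per query, $3 per million input tokens); the inputs are the worked example's assumptions, not benchmark measurements.

# Back-of-envelope retrieval cost under the assumptions stated in the text.
SESSIONS_PER_DAY = 10_000
QUERIES_PER_SESSION = (5, 10)        # low / high end of the assumed range
TOKENS_PER_QUERY = (3_000, 5_000)    # typical RAG input context (assumed)
PRICE_PER_M_INPUT = 3.00             # USD per 1M input tokens (Claude Sonnet 4.6)

for queries, tokens in zip(QUERIES_PER_SESSION, TOKENS_PER_QUERY):
    daily_tokens = SESSIONS_PER_DAY * queries * tokens
    daily_cost = daily_tokens / 1_000_000 * PRICE_PER_M_INPUT
    print(f"{daily_tokens / 1e6:.0f}M tokens/day -> ${daily_cost:,.0f}/day")
# Prints 150M -> $450 and 500M -> $1,500, matching the range quoted above.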

This cost gap is not merely a tuning opportunity; it reflects a structural difference in how knowledge is represented. RAG stores knowledge as prose and retrieves it by embedding similarity, recovering structure only indirectly. GraphRAG re-derives structure from text via LLM entity extraction — at significant additional cost. Compact Knowledge Graphs (CKG) use structure that was authored directly, requiring no derivation step and enabling exact retrieval in tens of tokens instead of thousands.

1.2 The Three Paradigms

Three retrieval architecture workflow diagrams (RAG, GraphRAG, CKG) — build-time and runtime swimlanes. figures/workflow-rag.png · figures/workflow-graphrag.png · figures/workflow-ckg.png Figure available in repository
Figure 1. Three retrieval architectures compared, each shown as a build-time swimlane above a runtime swimlane (chronological, top-to-bottom). RAG chunks text and retrieves top-k by vector similarity (~2,982 tokens/query). GraphRAG extracts an entity graph from text, clusters it into communities, then routes queries local vs. global (~3,450 tokens/query). CKG parses a pre-authored DAG directly and extracts the relevant subgraph per query (~269 tokens/query). The 11× context-size gap is the efficiency differential this paper quantifies.
Table 1. Three knowledge retrieval paradigms compared in this benchmark.
System Knowledge Repr. Retrieval Build Cost
RAG Unstructured text chunks Embedding similarity Embed all chunks
GraphRAG Dynamically extracted graph Graph + community search Full entity extraction
CKG Pre-structured DAG + taxonomy Direct concept/edge lookup Zero (CSV-native)*
* Build cost assumes pre-existing expert DAG. Expert curation cost is not measured; see Section 8.5.

1.3 Falsifiable Claims

We make the following falsifiable claims, each testable against the benchmark results presented in Section 7:

  1. CKG achieves higher F1 on T2 (dependency) and T3 (multi-hop path) queries.
  2. CKG F1 does not degrade with hop depth; RAG F1 degrades significantly.
  3. CKG RDS ratio ≥ 10× vs. RAG across all evaluated domains.
  4. GraphRAG hallucinates edges not present in ground truth DAG (HR > 0).
  5. CKG Hallucination Rate = 0 (by construction).
  6. The "Structure Premium" hypothesis: RDS advantage correlates with DAG richness (r > 0.7).
  7. Cross-domain transfer: The CKG structural advantage observed on hand-curated educational domains (Track 1) transfers to a commercial pharmacology domain (Track 2) with F1 equal to or greater than the Track 1 average.
  8. Construction invariance: The CKG F1 advantage does not depend on expert manual curation; a DAG built programmatically from a public API yields comparable or superior retrieval performance.

We explicitly do not claim CKG outperforms RAG on T1 (entity lookup / explanatory) queries, which require prose content absent from the DAG structure. T1 results serve as a negative control validating that the benchmark is not constructed to favor CKG across all query types.

1.4 Contributions

  1. The CKG architecture specification (format, DAG constraints, taxonomy schema) and an accompanying BFS/DFS subgraph-extraction retrieval method.
  2. Five novel evaluation metrics (RDS, CUR, Hop-F1, CPCA, RP).
  3. The McCreary Corpus as the first formal benchmark dataset for structural knowledge retrieval.
  4. An open benchmark: 45 domains (44 educational + 1 commercial) × three systems, ~23,900 evaluated query–system pairs.
  5. Track 2 multi-domain ensemble: the educational McCreary corpus evaluated alongside a commercial life-sciences (GLP-1/Obesity) domain in a single harness, demonstrating educational-to-commercial transfer of the CKG retrieval advantage.
  6. Automated CKG construction pipeline: a four-stage pipeline (API extraction → concept and edge extraction → learning-graph.csv → benchmark queries) that produces a CKG domain from ClinicalTrials.gov with no manual annotation, and achieves macro F1 equal to or greater than the Track 1 hand-curated average.
  7. A GitHub-hosted dataset with one-command reproduction harness.

3. The McCreary Intelligent Textbook Corpus

This section gives the first formal definition of the corpus in the literature.

3.1 Corpus Description

The McCreary Intelligent Textbook Corpus comprises 45 open-source educational textbooks hosted on GitHub (github.com/dmccreary). Each textbook contains a standardized learning graph encoded as a CSV file representing a directed acyclic graph (DAG) of concepts and their prerequisite dependencies. Benchmark queries were generated for 44 domains; one domain was excluded during query generation due to schema incompatibility.

The corpus spans three subject categories:

  • STEM (20 domains): algebra, calculus, pre-calculus, functions, linear algebra, geometry, biology, genetics, bioinformatics, chemistry, ecology, moss, physics, circuits, digital electronics, signal processing, FFT, statistics, quantum computing, computer science.
  • Professional (14 domains): economics, data science, machine learning, blockchain, conversational AI, automating instructional design, healthcare data modeling, organizational analytics, intro to graphs, IT management, learning Linux, MicroSims, infographics, personal finance.
  • Foundational (10 domains): systems thinking, theory of knowledge, digital citizenship, ethics, prompt engineering, AI tracking, US geography, ASL, reading (kindergarten), dementia.

Key statistics

  • Concepts per domain: 25–550 (mean: ~272)
  • Total concepts: 12,261 across 45 domains
  • Total dependency edges: 19,626
  • Taxonomy categories: 1–19 per domain (mean: ~4)
  • Benchmark queries: 7,758 across 44 domains (~175 per domain)
  • Raw textbook content: MkDocs Markdown chapters available for 22 domains
📊 Interactive learning graph viewer — calculus domain (380 concepts, 539 edges). figures/calculus-learning-graph.png Figure available in repository
Figure 2. Interactive learning graph viewer for the calculus domain (380 concepts, 539 edges). Each node is a concept, color-coded by taxonomy category. Directed edges represent prerequisite dependencies. The left panel shows category filters and corpus statistics. All 45 domains in the McCreary corpus use this same DAG structure.

3.2 Corpus Schema

All 45 domains share an identical CSV schema:

ConceptID,ConceptLabel,Dependencies,TaxonomyID
1,Function,,FOUND
2,Domain and Range,1,FOUND
3,Function Notation,1,FOUND
4,Composite Function,1|3,FOUND

Dependencies are pipe-delimited integer references to prerequisite ConceptID values. TaxonomyID assigns each concept to a domain-specific category (e.g., FOUND for foundational, CORE for core, ADV for advanced).
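For concreteness, the following sketch shows one way to load such a file into label, taxonomy, and prerequisite dictionaries. The function name and data structures are illustrative, not part of the released harness.

import csv
from collections import defaultdict

def load_learning_graph(path):
    """Parse a learning-graph.csv into label, taxonomy, and prerequisite maps."""
    labels, taxonomy, prereqs = {}, {}, defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            cid = int(row["ConceptID"])
            labels[cid] = row["ConceptLabel"]
            taxonomy[cid] = row["TaxonomyID"]
            if row["Dependencies"]:
                # Pipe-delimited prerequisite ConceptIDs, e.g. "1|3"
                prereqs[cid] = [int(d) for d in row["Dependencies"].split("|")]
    return labels, taxonomy, prereqs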

3.3 Corpus Provenance

The three retrieval architectures compared in this paper do not start from independent source material: a single upstream production pipeline generates the inputs that all three systems consume in the McCreary benchmark.

The course author writes a single-page course description (target audience, prerequisites, learning objectives). The /learning-graph-generator Claude Code skill proposes a learning graph from that description, a subject-matter expert reviews and corrects the graph over two to four hours, and the result is committed as learning-graph.csv. A separate skill, /chapter-content-generator, then consumes that CSV to produce the MkDocs textbook chapter corpus. RAG and GraphRAG index the markdown corpus; CKG reads the learning-graph.csv directly.

🔁 Provenance diagram of inputs consumed by the three retrieval pipelines. figures/corpus-provenance.png Figure available in repository
Figure 3. Provenance of the inputs consumed by the three retrieval pipelines in the McCreary benchmark corpus. CKG consumes learning-graph.csv directly; RAG and GraphRAG consume the markdown chapter corpus, which was itself generated from learning-graph.csv by the /chapter-content-generator skill. The benchmark therefore compares direct access to the authored structure against retrieval from prose that was generated from that same structure.

3.4 Quality Properties

All 45 DAGs are validated for the following structural properties:

  1. Single connected component (no isolated subgraphs).
  2. No self-references (no concept lists itself as a dependency).
  3. Foundational concepts (zero prerequisites) ≥ 2 per domain.
  4. Maximum dependency chain length reported per domain.
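These properties are mechanical to check once the CSV is loaded. The sketch below uses networkx as an illustration; the corpus's actual validation scripts are not specified here and may differ.

import networkx as nx

def validate_dag(labels, prereqs):
    """Check the structural properties listed above for one domain."""
    g = nx.DiGraph()
    g.add_nodes_from(labels)
    g.add_edges_from((dep, cid) for cid, deps in prereqs.items() for dep in deps)
    acyclic = nx.is_directed_acyclic_graph(g)
    return {
        "acyclic": acyclic,
        "single_component": nx.number_weakly_connected_components(g) == 1,
        "no_self_references": all(cid not in deps for cid, deps in prereqs.items()),
        "foundational_concepts": sum(1 for n in g if g.in_degree(n) == 0),
        # Longest prerequisite chain (counted in concepts), reported per domain
        "max_chain_length": len(nx.dag_longest_path(g)) if acyclic else None,
    }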
📈 Per-domain corpus statistics heatmaps — STEM, Professional, and Foundational subsets. figures/corpus-heatmap-stem.png · corpus-heatmap-professional.png · corpus-heatmap-foundational.png Figures available in repository
Figures 4–6. Per-domain statistics for each subset of the McCreary Intelligent Textbook Corpus, sorted by concept count. Columns: Concepts, Edges, Taxonomy Categories, Foundation Concepts (in-degree zero), and Edge/Concept Ratio. Color intensity is normalized using the full-corpus range for each column. Totals: 12,260 concepts and 19,405 dependency edges.

4. Architecture Specifications

All three systems use Claude Sonnet 4.6 at temperature = 0 for generation, ensuring fair comparison. The systems differ only in how knowledge is stored and retrieved.

4.1 RAG Baseline

Table 3. RAG baseline configuration.
Parameter Value
Source MkDocs .md chapters per textbook
Chunking 512 tokens, 50-token overlap
Embeddings all-MiniLM-L6-v2 (sentence-transformers, local)
Index FAISS flat L2
Retrieval Top-5 chunks
Generation Claude Sonnet 4.6, temperature = 0

4.2 GraphRAG

Table 4. GraphRAG configuration.
Parameter Value
Source Same MkDocs .md chapters
System Microsoft GraphRAG v1.x, default configuration
Search Local mode for T1/T2/T5, global mode for T4
Note Does not use learning-graph.csv
Generation Claude Sonnet 4.6, temperature = 0

4.3 CKG (Compact Knowledge Graph)

Table 5. CKG architecture configuration.
Parameter Value
Source learning-graph.csv
Lookup Exact label match → concept node retrieval
Traversal BFS for T2 (1-hop), DFS for T3 (full path), filter for T4
Subgraph Matched concept + direct neighbors + edges
Generation Claude Sonnet 4.6, temperature = 0
Note Zero build cost — CSV-native

Key distinction. GraphRAG re-derives structure from text that was originally generated from the learning graph CSV. CKG uses the graph directly. The efficiency gap is structural, not incidental.
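A minimal sketch of the CKG retrieval step summarized in Table 5 follows: exact label lookup, one-hop neighbor expansion, and serialization of the resulting subgraph into a compact prompt context. Names are illustrative and the released harness may differ in detail; the inputs are the label, taxonomy, and prerequisite dictionaries from the loading sketch in Section 3.2.

def extract_subgraph(query_label, labels, taxonomy, prereqs):
    """Exact-label lookup, then return the matched concept, its prerequisites,
    its dependents, and the connecting edges as a compact text context."""
    by_label = {lbl: cid for cid, lbl in labels.items()}
    cid = by_label.get(query_label)
    if cid is None:
        return None                        # no match: nothing to hallucinate
    parents = prereqs.get(cid, [])         # direct prerequisites (T2)
    children = [c for c, deps in prereqs.items() if cid in deps]
    nodes = [cid, *parents, *children]
    lines = [f"{labels[n]} [{taxonomy[n]}]" for n in nodes]
    edges = ([f"{labels[p]} -> {labels[cid]}" for p in parents]
             + [f"{labels[cid]} -> {labels[c]}" for c in children])
    return "\n".join(lines + edges)        # typically tens of tokens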

5. Benchmark Design

5.1 Query Taxonomy

We define five query types (T1–T5), each targeting a different aspect of knowledge retrieval capability.

Table 6. Query type taxonomy with examples and ground truth sources.
Type Description Example Ground Truth
T1 Entity lookup "What is Composite Function?" ConceptLabel + TaxonomyID
T2 Direct dependency "What are prerequisites for Composite Function?" Dependencies column
T3 Multi-hop path "What is the chain from Function to Taylor Series?" BFS path in DAG
T4 Category aggregate "List all FOUND concepts" Filter by TaxonomyID
T5 Cross-concept "How does Domain and Range relate to Inverse Function?" Shared neighbors

Note on T1 (entity lookup): T1 queries ask for a concept explanation ("What is X?"). CKG contains no explanatory prose — it can return only the concept's TaxonomyID and dependencies in response. T1 is therefore a RAG-favorable query type deliberately included to test the boundary of CKG's capability and to confirm that the benchmark is not constructed to favor CKG universally.

5.2 Query Generation

Queries are auto-generated from each domain's CSV using generate_queries.py with a fixed random seed of 42.

Per domain: ~175 queries (50 T1 + 50 T2 + 25 T3 + 12 T4 + 38 T5). Total: 7,758 queries across 44 domains.

  • T1: Random sample of 50 concepts; query is "What is {label}?"
  • T2: Random sample of 50 concepts with ≥1 dependency
  • T3: Random pairs of foundational/terminal concepts with path length 2–5
  • T4: One query per taxonomy category
  • T5: Random sample of 38 directly connected concept pairs

5.3 Ground Truth Validity

Benchmark ground truth is derived deterministically from DAG edges: T2 answers are the direct dependency labels of a concept; T3 answers are the BFS shortest-path node sequence between two concepts; T4 answers are all concept labels sharing a TaxonomyID; T5 answers are the union of BFS path nodes between a randomly selected concept pair. Because derivation is algorithmic, inter-annotator κ does not apply to the ground truth generation process.
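As an illustration of that determinism, a T3 answer is simply the shortest path between two concepts in the prerequisite DAG. The sketch below is illustrative; the released generate_queries.py may differ in detail.

import networkx as nx

def t3_ground_truth(src, dst, labels, prereqs):
    """Shortest prerequisite chain between two concepts (T3 ground truth)."""
    g = nx.DiGraph()
    g.add_edges_from((dep, cid) for cid, deps in prereqs.items() for dep in deps)
    path = nx.shortest_path(g, source=src, target=dst)  # BFS on an unweighted DAG
    return [labels[c] for c in path]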

Benchmark validity is instead assessed structurally. The T1 negative-control result (CKG F1 = 0.207 on entity lookup) confirms the evaluation is not constructed to favor CKG universally. Additionally, the low GraphRAG T4 score (0.054 vs. CKG 0.964) confirms that the taxonomy advantage is structural — not an artifact of prompting — since GraphRAG and CKG use identical prompts and the same LLM.

5.4 Reproducibility Protocol

  • All systems use Claude Sonnet 4.6 at temperature = 0.
  • Token counts via Anthropic count_tokens() API.
  • 3 runs per query, variance reported.
  • Fixed random seed: 42.
  • Benchmark version locked: v1.0.0.
  • One-command reproduction: python evaluation/harness.py --reproduce-table-1

6. Metrics

We evaluate 16 metrics organized into six categories. Novel metrics introduced in this paper are marked with ★.

6.1 Standard IR Metrics

Token-Level F1 (SQuAD-style)

Used for T1, T2, and T4 queries. Precision, recall, and F1 are computed over token sets:

F1 = (2 · P · R) / (P + R)    P = |pred ∩ truth| / |pred|    R = |pred ∩ truth| / |truth|

Edge-Overlap F1

Used for T3 (path) and T5 (cross-concept) queries:

Edge_F1 = (2 · |Epred ∩ Etruth|) / (|Epred| + |Etruth|)

Exact Match (EM)

Binary — the full answer must match ground truth exactly. Reported alongside F1 as a secondary metric.
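The two F1 variants above reduce to a few lines each (EM is plain string equality). The sketch below omits answer normalization (casing, punctuation, article stripping) and is illustrative rather than the released harness implementation.

from collections import Counter

def token_f1(pred, truth):
    """SQuAD-style token-level F1 over whitespace tokens (T1, T2, T4)."""
    p, t = Counter(pred.split()), Counter(truth.split())
    overlap = sum((p & t).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)

def edge_f1(pred_edges, truth_edges):
    """Edge-overlap F1 over (source, target) concept pairs (T3, T5)."""
    pred, truth = set(pred_edges), set(truth_edges)
    if not pred or not truth:
        return 0.0
    overlap = len(pred & truth)
    return 2 * overlap / (len(pred) + len(truth))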

6.2 Reasoning Density Score ★ (RDS)

The core compound metric introduced in this paper:

RDS(s, q) = F1(s, q) / tokens_consumed(s, q)

Macro-averaged across all queries: RDSmacro(s) = mean(RDS over all queries). The RDS ratio compares two systems: RDSratio(A, B) = RDSmacro(A) / RDSmacro(B). Higher values indicate more reasoning quality per token spent.
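In code, the macro RDS and the RDS ratio reduce to the following sketch; per-query F1 scores and token counts are assumed to be available from the harness logs.

def rds_macro(f1_scores, token_counts):
    """Macro-averaged Reasoning Density Score: mean of per-query F1 / tokens."""
    per_query = [f / t for f, t in zip(f1_scores, token_counts)]
    return sum(per_query) / len(per_query)

def rds_ratio(system_a, system_b):
    """RDS ratio between two systems, each given as (f1_scores, token_counts)."""
    return rds_macro(*system_a) / rds_macro(*system_b)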

6.3 Hop-Depth F1 Degradation ★

Measures how F1 degrades as reasoning chain length increases:

F1@hop(s, k) = mean(F1 | hop_depth = k)

Reported for k = 1, 2, 3, 4, 5+. Expected finding: RAG degrades steeply at k ≥ 2; CKG remains flat due to explicit edge traversal.

6.4 Tokenomics Metrics

Context Utilization Rate ★ (CUR)

Fraction of retrieved tokens relevant to the answer: CUR = relevant_tokens / total_retrieved_tokens.

Cost Per Correct Answer ★ (CPCA)

Real-world cost using Claude Sonnet 4.6 pricing ($3/M input, $15/M output): CPCA = cost_per_query / F1.

Precision at Token Budget (P@T)

Mean F1 over queries where tokens_consumed ≤ budget T. Reported for T = 500, 1000, 2000, 5000, 10000.

Token Budget Breakeven

Minimum budget where RAG/GraphRAG F1 ≥ CKG F1: breakeven = min T such that F1RAG(T) ≥ F1CKG(500).

Index Build Cost

One-time cost: tokens consumed during indexing + wall-clock time + storage. CKG: zero (CSV already exists).

Update Cost ★

Cost to incorporate one new concept: CKG edits one CSV row with zero re-indexing; RAG re-embeds affected chunks; GraphRAG requires full re-extraction.
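The query-level tokenomics metrics above (CPCA, P@T, and the breakeven budget) can be computed from per-query F1 and token logs as in the sketch below; function names are illustrative.

def cpca(cost_per_query, f1):
    """Cost Per Correct Answer: dollars spent per unit of F1."""
    return cost_per_query / f1 if f1 > 0 else float("inf")

def precision_at_budget(f1_scores, token_counts, budget):
    """Mean F1 over queries whose token consumption fits within the budget."""
    within = [f for f, t in zip(f1_scores, token_counts) if t <= budget]
    return sum(within) / len(within) if within else 0.0

def breakeven_budget(rag_f1, rag_tokens, ckg_f1_at_500,
                     budgets=(500, 1000, 2000, 5000, 10000)):
    """Smallest budget at which RAG's P@T reaches CKG's F1 at a 500-token budget."""
    for b in budgets:
        if precision_at_budget(rag_f1, rag_tokens, b) >= ckg_f1_at_500:
            return b
    return None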

6.5 Structural Fidelity Metrics

Relationship Precision ★ (RP)

Of edges returned, what fraction are real DAG edges: RP = |Epred ∩ Etruth| / |Epred|.

Hub Node Recall ★ (HNR)

Recall on high-indegree concepts (top 20% by indegree).

Boundary Completeness ★ (BC)

For T4 queries, fraction of the taxonomy category returned: BC = |retrieved ∩ members| / |members|. CKG achieves BC ≈ 1.0 by construction.

6.6 Robustness Metrics

Paraphrase Stability (PS)

F1 variance across 5 paraphrased versions of each query. CKG should be stable (exact concept match); RAG is embedding-sensitive.

Hallucination Rate (HR)

Fraction of queries returning ≥1 concept not in the corpus. CKG: HR = 0 by construction. GraphRAG's dynamic extraction can hallucinate.

7. Results

All Track 1 results are final. CKG: 44 domains, 7,758 queries. RAG: 40 corpus-complete domains, 7,191 queries. GraphRAG: 15 domains, 2,683 queries (the subset where indexing completed within the evaluation budget). Track 2 (GLP-1/Obesity, pipeline-generated) results are reported in Section 10.

7.1 Macro-Average Performance

Table 7. Macro-average performance. CKG: 44 domains, 7,758 queries. RAG: 40 domains, 7,191 queries. GraphRAG: 15 domains, 2,683 queries.
System Macro F1 Tokens/q RDS Run cost ($)
RAG 0.1231 2,982 0.0000482 76.23
GraphRAG 0.1200 3,450 0.0000452 44.43
CKG 0.4709 269 0.00201 7.81

CKG achieves 3.8× higher macro F1 than RAG and 3.9× higher than GraphRAG while consuming 11× fewer tokens per query than RAG and 13× fewer than GraphRAG. The compound RDS advantage is 42× over RAG and 44× over GraphRAG. CKG's run cost ($7.81) is 90% lower than RAG ($76.23) and 82% lower than GraphRAG ($44.43), across a larger domain set in each case.

📊 RDS comparison and token consumption differential bar charts. figures/fig4_rds_comparison.png Figure available in repository
Figure 7. Left: Reasoning Density Score (RDS = F1 / tokens) for RAG and CKG. CKG's RDS is 45.9× higher. Right: Mean tokens per query — RAG consumes 10.9× more tokens for lower-quality answers.

7.2 F1 by Query Type

📊 F1 by query type (T1–T5) for CKG and RAG. figures/fig3_f1_by_query_type.png Figure available in repository
Figure 8. Token-level F1 by query type for CKG (blue) and RAG (red). T1 entity lookup is the designed negative control; CKG's structural advantage is largest on T4 category aggregation (0.95 vs. 0.29) and T2/T3 dependency/path queries (0.60 vs. 0.08–0.20).
Table 8. Token-level F1 by query type (Track 1).
System T1 entity T2 dep. T3 path T4 aggr. T5 cross
RAG 0.094 0.078 0.201 0.286 0.115
GraphRAG 0.108 0.073 0.208 0.054 0.183
CKG 0.207 0.634 0.660 0.964 0.323

T1 (entity lookup) is the designed negative control: CKG stores graph structure rather than prose definitions, so its T1 F1 of 0.207 is expected.

T4 (category aggregation) shows the sharpest divergence: CKG achieves 0.964 versus RAG's 0.286. Aggregation queries require enumerating all members of a category precisely — a task CKG resolves by reading the taxonomy column directly.

T2 and T3 (dependency resolution and multi-hop path traversal) show CKG at 0.634–0.660 versus RAG at 0.078–0.201, confirming the structural retrieval advantage on prerequisite-chain queries.

T5 (cross-concept relationship) shows CKG at 0.323 versus RAG at 0.115. Performance improved after introducing BFS shortest-path traversal between concept pairs and enriching ground truth.

7.3 Token Efficiency and RDS

Table 9. Token composition and Reasoning Density Score (RDS = F1 / total tokens).
System Mean total tokens Mean retrieved tokens Macro F1 RDS RDS ratio
RAG 2,982 2,392 0.1231 0.0000482 0.024×
GraphRAG 3,450 n/a 0.1200 0.0000452 0.022×
CKG 269 44 0.4709 0.00201 1.0×

RAG's retrieved context (2,392 tokens mean) is 54× larger than CKG's retrieved subgraph (44 tokens mean). Despite their large contexts, RAG and GraphRAG produce less accurate answers because passage-level retrieval does not preserve the structural relationships that structural queries require.

The 42× RDS advantage directly answers a cost question for practitioners: deploying CKG for structural knowledge queries instead of RAG reduces intelligence delivery cost by approximately 97.6% while improving answer quality by 3.8×.

📊 Token composition by component — RAG vs. CKG. figures/fig7_token_composition.png Figure available in repository
Figure 9. Token composition by component for RAG (left) and CKG (right). RAG's 2,500-token retrieved context dominates the budget; CKG's 44-token subgraph is 57× smaller with higher answer quality.

7.4 F1 by Hop Depth

Table 10. F1 by hop depth (depth of prerequisite chain traversed).
System hop=0 hop=1 hop=2 hop=3 hop=4 hop=5
RAG 0.073 0.066 0.226 0.138 0.166 0.170
CKG 0.374 0.519 0.573 0.671 0.751 0.772

CKG F1 increases continuously with hop depth, from 0.374 at hop=0 to 0.772 at hop=5 — the deepest chains produce the highest accuracy. This is structurally opposite to the typical RAG pattern: RAG retrieval recall falls as multi-hop queries require evidence from multiple documents. CKG traverses edges deterministically, so deeper chains do not degrade performance.

RAG shows irregular behavior across hop depths (0.073 at hop=0, peaking at 0.226 at hop=2, then declining), consistent with retrieval recall variance rather than systematic improvement.

📊 F1 vs. hop depth for T3 multi-hop path queries. figures/fig5_hop_degradation.png Figure available in repository
Figure 10. F1 by hop depth for multi-hop path queries (T3). CKG F1 rises monotonically with hop depth (0.37 at hop=0 to 0.77 at hop=5) while RAG stays below 0.23 at every depth; CKG's deterministic BFS traversal is depth-invariant by construction.

7.5 The Structure Premium

📊 CKG RDS vs. DAG edge density across 44 domains — Structure Premium scatter plot. figures/fig8_structure_premium.png Figure available in repository
Figure 11. The Structure Premium hypothesis: CKG Reasoning Density Score (RDS) vs. DAG edge density (edges per concept) across 44 domains. Pearson r = −0.09, indicating the advantage is uniform across DAG richness levels.

8. Discussion

8.1 Where CKG Wins and Why

CKG's advantages are structural:

  • T2/T3 queries: Explicit edges eliminate multi-hop inference errors. RAG must infer transitive dependencies from unstructured text; CKG traverses them directly via BFS/DFS.
  • T4 queries: Taxonomy filtering achieves BC ≈ 1.0 by construction, as the TaxonomyID field provides exact category membership.
  • Hallucination: HR = 0 because CKG only returns concepts present in the source CSV. No generative step can introduce phantom entities.
  • RDS: Near-zero build cost combined with 150–400 tokens per query yields order-of-magnitude efficiency gains.

8.2 Where RAG Is Competitive

RAG remains competitive in specific scenarios:

  • T1 entity lookup on large open-domain corpora where rich context aids natural language generation.
  • Domains without stable taxonomy (rapidly evolving fields).
  • When CKG construction cost exceeds the efficiency savings.

8.3 GraphRAG's Position

GraphRAG occupies a middle ground: better than RAG on multi-hop reasoning (graph structure helps) but worse than CKG (dynamic extraction introduces noise and hallucinated edges). GraphRAG is the most expensive system (high build cost + high query cost). Its best use case is unstructured corpora with no available expert taxonomy.

8.4 The Structure Premium

We tested the hypothesis that the CKG RDS advantage is proportional to the structural richness of the domain's DAG, defined as:

dag_richness(d) = (edges / concepts) × mean_indegree × (1 / orphan_rate)

Across 44 domains, the Pearson correlation between dag_richness and CKG RDS is r = −0.09 (n = 44), and between dag_richness and macro F1 is r = −0.07. Both are negligible. The Structure Premium hypothesis is not supported: CKG's efficiency advantage does not concentrate in domains with denser DAG structure. This is a stronger finding than a positive correlation would have been: the advantage is uniform across DAG richness levels. CKG outperforms RAG and GraphRAG by 3.8–3.9× on F1 and 42× on RDS whether the underlying graph is sparse or dense. The efficiency gain is architectural — a property of pre-structured retrieval itself.
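The correlation test itself is straightforward to reproduce from per-domain statistics (sketch; variable names are illustrative, and orphan_rate is assumed to be non-zero).

import numpy as np

def dag_richness(edges, concepts, mean_indegree, orphan_rate):
    """Structural richness as defined above; orphan_rate must be > 0."""
    return (edges / concepts) * mean_indegree * (1.0 / orphan_rate)

def structure_premium_r(richness_by_domain, rds_by_domain):
    """Pearson correlation between DAG richness and CKG RDS across domains."""
    return float(np.corrcoef(richness_by_domain, rds_by_domain)[0, 1])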

8.5 Limitations

  • Structural query scope: Ground truth for T2, T3, and T4 queries is derived directly from DAG edges. The comparison therefore demonstrates that explicit structure outperforms inferred structure on structural tasks — not that CKG is a general-purpose retrieval system.
  • T1 as boundary test: CKG scores F1 ≈ 0.207 on T1 entity lookup queries because the DAG contains no explanatory prose. For open-ended definitional knowledge retrieval, RAG remains the appropriate architecture.
  • The McCreary corpus is educational — results may not generalize to legal, financial, or medical domains.
  • CKG build cost with automated construction has not been formally measured; the Track 2 pipeline demonstrates feasibility but pipeline engineering cost was not included in the benchmark cost accounting.
  • Ground truth derived from DAG edges may not capture all valid natural language answers.
  • All systems use the same LLM (Claude Sonnet 4.6); results may differ with other models.

8.6 Educational-to-Commercial Transfer

Tracks 1 and 2 together establish that the CKG retrieval advantage generalizes across domain type and construction method. Track 1 (44 hand-curated educational DAGs) and Track 2 (1 pipeline-generated commercial pharmacology DAG) share the same retrieval algorithm, the same metrics, and the same harness, and yield consistent structural-query outcomes (CKG macro F1 = 0.4709 and 0.5298 respectively, against RAG F1 ≈ 0.12–0.15 on both). The practical implication is that any knowledge-intensive field whose entities and relationships are expressible as a directed acyclic graph — pharmaceutical, legal, financial, regulatory, biomedical — is a candidate for CKG deployment.

9. The Economics of Learning Graph Generation

A central limitation of any structure-first retrieval architecture is the up-front cost of constructing the underlying knowledge structure. This section addresses that objection directly. We first provide a formal definition of a learning graph, then present a cost model for generating one via an agentic workflow, and finally extrapolate the trajectory of generation cost over the next 18–24 months.

9.1 Formal Definition of a Learning Graph

Definition 1 — Learning Graph

A learning graph is a 4-tuple G = (C, E, T, τ) where:

  • C = {c1, c2, …, cn} is a finite, non-empty set of concepts, each a named unit of domain knowledge with a unique identifier and a human-readable label.
  • E ⊆ C × C is a set of directed prerequisite edges. An edge (ci, cj) ∈ E asserts that concept ci must be understood before concept cj can be meaningfully taught or applied.
  • T = {t1, t2, …, tk} is a finite set of taxonomy categories that partition the concept space into coarse groupings (e.g., FOUND, CORE, ADV).
  • τ : C → T is a total function assigning each concept to exactly one taxonomy category.

The pair (C, E) must form a directed acyclic graph (DAG): there exists no sequence ci1, ci2, …, cim, ci1 such that every consecutive pair is an edge in E. Acyclicity ensures a valid teaching order exists (topological sort over E).

Three properties follow directly from Definition 1 and are load-bearing for the CKG architecture:

  1. Finite, enumerable context. Because C is finite and the CSV serialization is compact, the entire graph fits in an LLM prompt for any realistic domain (|C| ≤ 1,000).
  2. Deterministic traversal. Prerequisite chains are computed by BFS or DFS over E with no inference required. Query answers for structural questions (T2, T3, T4) are functions, not predictions.
  3. Closed vocabulary. The set of valid concept labels is exactly {label(c) : c ∈ C}. A retrieval system that returns only concepts in C cannot hallucinate entities by construction.
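Acyclicity and deterministic ordering can be made concrete with Kahn's algorithm: a valid teaching order is any topological sort of (C, E), computable in linear time. The sketch below is illustrative.

from collections import deque

def teaching_order(concepts, edges):
    """Topological sort of (C, E): a valid teaching order.
    Raises if a cycle is present, i.e. if Definition 1 is violated."""
    indegree = {c: 0 for c in concepts}
    successors = {c: [] for c in concepts}
    for pre, post in edges:          # edge (pre, post): pre is a prerequisite of post
        indegree[post] += 1
        successors[pre].append(post)
    queue = deque(c for c in concepts if indegree[c] == 0)   # foundational concepts
    order = []
    while queue:
        c = queue.popleft()
        order.append(c)
        for nxt in successors[c]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(concepts):
        raise ValueError("cycle detected: (C, E) is not a DAG")
    return order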

9.2 Agentic Learning Graph Generation

The McCreary corpus graphs used in this benchmark were produced with an agentic workflow: the /learning-graph-generator Claude Code Agent Skill, publicly available at github.com/dmccreary/claude-skills.

Input quality scoring

Because the quality of the generated learning graph depends heavily on the completeness of the course description, the skill begins by scoring the supplied description on a 100-point rubric and reporting the score back to the author before generation proceeds. Low-scoring descriptions trigger specific suggestions (e.g., missing Bloom's-taxonomy outcomes, absent prerequisite list, unstated target audience).

Generation pipeline

Given a description that has cleared the scoring rubric, the skill decomposes generation into three stages:

  1. Concept elicitation. Given a course description, the agent proposes a concept set C of a target size (default n = 200), drawing on the model's domain knowledge and any supplied source materials.
  2. Dependency assignment. For each concept cj ∈ C, the agent selects the subset of previously-enumerated concepts on which cj depends, populating E incrementally.
  3. Taxonomy assignment and validation. The agent assigns τ(c) for each c ∈ C, then runs automated validation: cycle detection, orphan detection, dependency count distribution, and a quality score. Graphs scoring below threshold trigger a correction cycle.

A subject-matter expert (SME) reviews the final graph for domain fidelity and edits edges or labels as needed. In practice, SME review for a 200-concept graph takes 2–4 hours; the agentic generation itself completes in minutes.

9.3 Cost Model

Rather than estimating generation cost from first principles, we measured it directly. Each invocation of the /learning-graph-generator skill is recorded by a Claude Code PostToolUse hook (track-skill-end.sh) that writes a skill-usage.jsonl event. Of 21 recorded invocations, 9 had surviving full session transcripts at measurement time; all 9 measured sessions used Claude Opus 4.6 as the generating model.

Table 11. Measured token consumption and cost for nine complete /learning-graph-generator sessions, sorted by concept count. Costs reflect Claude Opus 4.6 public API pricing at measurement time. (Sessions generated under Claude Max subscription — costs computed from public pay-as-you-go API rates for reproducibility.)
Session Concepts Total Tokens Cached Tokens Cost ($)
Min n/a n/a n/a 9.21
Mean (9 sessions) 311 n/a n/a 13.94
Max n/a n/a n/a 21.38

Fitted cost model

We model generation cost as an affine function of concept count:

Cost(n) ≈ α + β · n

A least-squares fit to the nine measured sessions yields, for Claude Opus 4.6:

CostOpus 4.6(n) ≈ $8.16 + $0.019 · n

with R2 ≈ 0.24. The low coefficient of determination reflects substantial session-to-session variance driven by the number of validation and correction cycles. The mean measured cost was $13.94 across sessions averaging 311 concepts, with observed values ranging from $9.21 to $21.38.

For cheaper model tiers, applying the per-token price ratio between Opus and Sonnet 4.6 (roughly 5×) yields a projected Sonnet cost of approximately $3 for a 200-concept graph.
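Under the fitted coefficients and the approximate 5× price ratio, the projection is simple arithmetic (sketch; the ratio is an approximation, not a measured quantity, so the projected figure is indicative only).

def cost_opus(n_concepts, alpha=8.16, beta=0.019):
    """Fitted affine cost model for Claude Opus 4.6 generation sessions."""
    return alpha + beta * n_concepts

print(round(cost_opus(200), 2))   # ~11.96 USD for a 200-concept graph on Opus 4.6
# Dividing by the ~5x Opus/Sonnet per-token price ratio gives a figure in the
# $2-3 range, consistent with the approximately-$3 Sonnet projection above.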

9.4 Projected Trajectory

Two compounding trends point toward sharply lower generation cost over the next 18–24 months:

  1. Declining model pricing at constant capability. Across the preceding two years, the cost of a fixed level of capability at the Anthropic frontier has approximately halved every 9–12 months, as smaller models match the capability of their predecessors.
  2. Skill-level efficiency gains. The current generation workflow holds the full emerging graph in context, so token consumption grows super-linearly with concept count. Decomposing generation into cache-efficient passes has been observed to reduce token consumption 2–3× at any given model tier.

Combining these trends with the measured Opus 4.6 baseline of roughly $12 per 200-concept graph, we project a plausible 2027 cost of $1–$2 per 200-concept graph when a Haiku-class successor model is used with a cache-efficient rewrite of the generation workflow. At that price point, domain graph generation is effectively free relative to any downstream inference workload.

9.5 Implications for the Build-Cost Objection

The traditional objection to structure-first retrieval — "curating a domain graph is prohibitively expensive" — was empirically true for two decades. It is no longer true. When a 200-concept domain graph can be generated for dollars of compute and a handful of SME-review hours, the decision calculus inverts: the question is no longer whether a domain can afford a learning graph, but whether any high-query-volume domain can afford not to have one, given that per-query retrieval costs fall by roughly an order of magnitude once the graph exists.

This shifts where the economic moat lies. Model compute is commoditizing; the durable value in applying CKG to new domains is the SME review loop — ensuring the generated graph faithfully reflects the expert consensus of the field — and the schema design work required to extend Definition 1 beyond prerequisite-structured domains to domains with richer relation types.

10. Track 2: Pipeline-Generated Domain Validation

TRACK 2 — GLP-1 / OBESITY PHARMACOLOGY

All 44 domains in the primary benchmark (Track 1) originate from the McCreary Intelligent Textbook Corpus: hand-curated educational DAGs where a domain expert manually mapped concept dependencies. A critical open question is whether the CKG architecture's performance advantage depends on that curation quality or whether it transfers to programmatically constructed knowledge graphs derived from external data sources.

Track 2 answers this question by constructing a complete CKG domain from scratch using no pre-existing expert-curated graph, no educational corpus, and no McCreary source material. The domain selected is GLP-1/Obesity pharmacology — a commercially active life sciences domain with rapidly evolving clinical trial data, 8 FDA-approved agents, and a pipeline of 150+ ongoing trials at the time of corpus construction (April 2026).

10.1 Data Source and Pipeline

Track 2 uses ClinicalTrials.gov as the sole external data source, accessed via the NIH/NLM API (v2). The pipeline consists of four stages:

  1. API extraction. Structured query against ClinicalTrials.gov returns 668 semaglutide trials, 224 tirzepatide trials, and 158 pipeline agent trials (retatrutide, cagrisema, orforglipron, mazdutide). Trial metadata includes NCT identifiers, phase, enrollment, endpoints, mechanisms, and completion dates.
  2. Concept extraction. Trial data is parsed to extract pharmacological entities (agents, mechanisms, indications, trial programs, outcomes) and their dependency relationships. Taxonomy labels are assigned from a domain-specific schema: FOUND (foundational mechanism), DRUG (approved agent), TRIAL (completed landmark trial), PATH (pathway/mechanism), COMPL (complication/adverse effect), SPEC (special population), COMBO (combination strategy).
  3. Graph construction. Extracted concepts and dependencies are written to learning-graph.csv in the standard CKG schema. The resulting graph contains 90 concepts and 170 dependency edges covering foundational mechanisms through next-generation pipeline agents and cross-indication expansion (cardiovascular, renal, neurological, addiction).
  4. Query generation. The standard benchmark harness (generate_queries.py) runs unchanged on the learning-graph.csv, producing 170 queries in the T1–T5 taxonomy with deterministic ground truth derived from DAG edges.

The full pipeline — from raw API data to benchmark-ready domain — requires no manual annotation and no subject matter expert review beyond the initial taxonomy schema.
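For reference, stage 1 can be as small as the sketch below. The endpoint and parameter names reflect our understanding of the ClinicalTrials.gov v2 API and should be verified against the current NIH/NLM documentation; this is an assumption-laden illustration, not the released pipeline code.

import requests

BASE = "https://clinicaltrials.gov/api/v2/studies"   # assumed v2 endpoint

def fetch_trials(term, page_size=100):
    """Stage 1 sketch: page through study records matching a free-text term."""
    params = {"query.term": term, "pageSize": page_size, "format": "json"}
    studies, token = [], None
    while True:
        if token:
            params["pageToken"] = token
        resp = requests.get(BASE, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        studies.extend(payload.get("studies", []))
        token = payload.get("nextPageToken")
        if not token:
            return studies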

Corpus construction for RAG and GraphRAG

To enable a fair three-way comparison, a prose corpus was constructed from the same ClinicalTrials.gov data used to build the CKG. Five structured narrative documents were written covering: (1) market overview and approved agents, (2) landmark clinical trial evidence (STEP, SURMOUNT, SELECT, SUMMIT, FLOW programs), (3) next-generation pipeline intelligence, (4) indication expansion across 15+ disease areas, and (5) investment-relevant signals including oral formulation, muscle preservation, and CNS/addiction frontiers.

10.2 Results

Table 12. Track 2 results: pipeline-generated GLP-1/Obesity domain (170 queries) compared with Track 1 benchmark aggregate (44 hand-curated domains, 7,758 queries).
Track System Macro F1 Tokens/q Ret. Tokens/q RDS n queries
Track 1 CKG 0.4709 269 44 0.00201 7,758
Track 1 RAG 0.1231 2,982 2,392 0.0000482 7,191
Track 1 GraphRAG 0.1200 3,450 n/a 0.0000452 2,683
Track 2 CKG 0.5298 346 54 0.00153 170
Track 2 RAG 0.1538 2,828 2,214 0.0000544 170
Track 2 GraphRAG 0.1436 3,450 n/a 0.0000416 170

The Track 2 CKG F1 of 0.5298 exceeds the Track 1 CKG macro-average of 0.4709 by 12.5%. This result is notable because the GLP-1 graph was generated by pipeline from raw API data, not curated by a domain expert. The performance advantage is not degraded by automated construction — it is maintained or improved.

Table 13. Track 2 F1 by query type — GLP-1/Obesity domain (170 queries).
System T1 entity T2 dep. T3 path T4 aggr. T5 cross
CKG 0.225 0.677 0.873 0.998 0.425
RAG 0.150 0.076 0.221 0.108 0.226
GraphRAG 0.129 0.051 0.216 0.031 0.258

T4 (category aggregation) reaches CKG F1 = 0.998 — near-perfect enumeration of agents by drug class, indication by anatomy, and trial by program — compared with RAG's 0.108 and GraphRAG's 0.031. This is the sharpest divergence observed in either track.

T3 (multi-hop path) reaches CKG F1 = 0.873, substantially above the Track 1 T3 average of 0.660. The GLP-1 dependency graph encodes mechanistic chains (receptor → signaling pathway → downstream effect → clinical outcome) that BFS traversal resolves with high precision.

10.3 Implications for the CKG Architecture

Finding 1: Retrieval performance depends on graph structure, not curation source

CKG F1 on the pipeline-generated GLP-1 domain (0.530) exceeds the hand-curated educational average (0.471). If expert curation quality were the driver of CKG performance, the pipeline domain would score lower. It does not. This means the architecture generalizes: any domain with stable concept relationships expressible in a DAG benefits from CKG retrieval regardless of how the DAG was built.

Finding 2: The CKG factory is viable for commercial domains

The GLP-1 benchmark demonstrates an end-to-end automated pipeline from public API data to a benchmarked knowledge retrieval system. The pipeline requires no annotation budget, no expert review, and no existing textbook or corpus — only a structured data source and a taxonomy schema. This generalizes the CKG architecture beyond educational settings into any knowledge-intensive commercial domain (pharmaceutical, legal, financial, regulatory) where public or proprietary structured data is available.

Finding 3: The 28× RDS advantage holds on enterprise domains

RDS for CKG on GLP-1 is 0.00153 versus RAG's 0.0000544 — a ratio of 28×. The Track 1 RDS ratio is 42×. The slight reduction on Track 2 reflects the GLP-1 domain's higher prose complexity, but the order-of-magnitude token efficiency advantage is preserved. A life sciences organization deploying CKG for structured pharmacology queries over RAG realizes approximately 97% cost reduction with 3.4× accuracy improvement.

Implications for practitioners

For organizations considering CKG deployment in a commercial domain, the Track 2 evidence suggests a concrete recipe: (i) identify a structured data source whose records encode entities and dependencies (public APIs, regulatory registries, internal knowledge bases, product catalogues); (ii) define a lightweight taxonomy schema for the domain; (iii) run the four-stage pipeline to produce a learning-graph.csv; (iv) query via the standard harness. This recipe is domain-agnostic, requires no annotation budget, and produces a benchmarkable CKG with retrieval performance comparable to hand-curated knowledge graphs.

The Track 2 data and corpus are available in the benchmark repository under benchmark/domains/glp1-obesity/ and corpus/glp1-obesity/.

11. Conclusion

We presented the CKG Benchmark, a two-track comparison of Compact Knowledge Graphs against RAG and GraphRAG across 44 hand-curated educational domains (Track 1, 7,758 queries) and one pipeline-generated commercial pharmacology domain (Track 2, 170 queries). Our key findings are:

  1. CKG achieves 4× higher macro F1 than RAG on structural knowledge queries (0.4709 vs. 0.1231 vs. 0.1200 for GraphRAG), using 11× fewer tokens per query (269 vs. 2,982 vs. 3,450). The compound RDS advantage is 42× on Track 1.
  2. Structural query performance is near-ceiling for aggregation: T4 category aggregates reach F1 = 0.964 for CKG vs. 0.286 for RAG and 0.054 for GraphRAG, because the answer is fully determined by the taxonomy column of the CSV.
  3. CKG F1 increases continuously with hop depth (peaking at hop=5: 0.772) while RAG is irregular and plateaus near 0.17, confirming that deterministic edge traversal does not suffer the multi-hop retrieval recall failures inherent to passage-based systems.
  4. T1 (entity lookup) is a documented negative control: CKG scores 0.207 on explanatory queries, confirming the benchmark is not constructed to favor CKG universally.
  5. RDS and Hop-Depth F1 are practical additions to IR evaluation that jointly measure quality and efficiency — metrics missing from existing benchmarks such as BEIR and RAGAS.
  6. The CKG advantage transfers to a pipeline-generated commercial domain. Track 2 builds a GLP-1/Obesity CKG programmatically from the ClinicalTrials.gov API, with no expert curation. The resulting domain achieves CKG macro F1 = 0.5298, exceeding the Track 1 hand-curated average by 12.5%, and preserves a 28× RDS advantage over RAG.

Limitations

Ground truth for T2–T4 is derived from the same DAG used by CKG, which is a methodological constraint acknowledged in the design (Section 5.3). T1 performance on explanatory queries is weak for all three systems and represents an open problem. Pipeline engineering cost for automated CKG construction (Track 2) was not included in the benchmark cost accounting.

Future Work

  • Extend the automated construction pipeline to additional structured data sources (regulatory registries, legal codes, financial filings, product catalogues) to test the Track 2 transfer result at scale.
  • Extend T5 cross-concept retrieval with deeper BFS path-finding; the current T5 F1 of 0.323 (Track 1) and 0.425 (Track 2) has a clear improvement trajectory.
  • Hybrid architectures combining CKG structural precision with RAG's prose retrieval for T1 and T5 query types.
  • Formal build-cost accounting that includes pipeline engineering, taxonomy-schema design, and source-API cost, enabling end-to-end CPCA comparison across all three architectures.
  • Model-robustness experiments replicating the benchmark across additional LLM families.

Unified contribution

Track 1 validates the intelligent textbook format — hand-curated learning-graph CSVs — as a benchmark-grade substrate for knowledge retrieval at scale across 44 domains. Track 2 validates automated construction of CKG domains from structured external data, showing the retrieval advantage does not depend on manual authorship. Taken together, the two tracks establish Compact Knowledge Graphs as a distinct architecture category — one that is empirically superior to RAG and GraphRAG on structural queries, an order of magnitude cheaper in tokens, and ready for deployment in any domain whose knowledge can be expressed as a directed acyclic graph.

Open Benchmark

The complete benchmark — corpus, queries, evaluation harness, and all results — is released at github.com/Yarmoluk/ckg-benchmark under CC BY 4.0 (data) and MIT (code). A HuggingFace dataset mirror (graphify-md/ckg-benchmark) is forthcoming. We invite the community to add domains, systems, and metrics.