Skip to content

The Enrichment Journey: From Empty Graph to Multi-Perspective Understanding

How the knowledge graph learned from 280 commits and 31 pull requests to reconstruct an architectural evolution


The Experiment

On October 12, 2025, we ran an experiment: Could the knowledge graph system analyze its own development history?

We took 280 git commits spanning 8 days of intense development ↓ Converted them to markdown files with chronological numbering ↓ Ingested into ontology "Knowledge Graph Project History" ↓ Added 31 GitHub pull requests to separate ontology "Knowledge Graph Project Pull Requests" ↓ Queried the unified graph to understand the project's evolution

The question: Would ingesting two perspectives on the same events create truthful enrichment or add noise?

The answer: The graph correctly understood commits and PRs as complementary views, merging evidence and relationships without confusion.

Starting Point: Empty Graph

Before ingestion: - 0 concepts - 0 sources - 0 relationships

The system knew nothing about its own existence.

First Ontology: Commit History (280 Documents)

Extracted 280 markdown files from git log
Commit 1: "Neo4j Knowledge Graph MVF" (Oct 4, 2025)
...through 279 incremental changes...
Commit 280: "Cascade delete job records when ontology is deleted" (Oct 12, 2025)

After ingestion: - 1,206 concepts extracted - 280 sources (one per commit) - 5,561 relationships discovered - 17 relationship types (ENABLES, REQUIRES, CAUSES, PREVENTS, etc.)

What the graph learned:

Neo4j Knowledge Graph MVF (commit 1) ↓ CONTRASTS_WITH Admin Tools Migration ↓ RESULTS_FROM Apache AGE Migration (commits 83-127) ↓ ENABLES RBAC Capabilities + Unified Architecture

Key insight discovered: "Apache AGE Migration" concept appeared in 5 commits with causal relationships showing it PREVENTS "Dual Database Complexity" and ENABLES "Zero Licensing Costs."

Second Ontology: Pull Requests (31 Documents)

Fetched 31 merged PRs via GitHub API
Converted to markdown with merge commit hashes
PR #14: "Apache AGE Migration: Replace Neo4j with PostgreSQL + AGE"
Related commit: c392b3c99a26781c36a9dbe86d3269d7c973b65f

After PR ingestion: - 1,335 concepts (+129 new, many merged with existing) - 316 sources (+36 including 31 PRs and 5 Watts lectures) - 6,508 relationships (+947 new relationships)

What changed:

"Apache AGE Migration" concept now has 6 evidence instances: - 5 from commits (granular implementation) - 1 from PR #14 (strategic summary)

New relationships added from PR perspective:

Apache AGE Migration ↓ ENABLES → Atomic Transactions (not explicit in commits) ↓ PREVENTS → RBAC Limitation in Neo4j Community Edition (root cause) ↓ PREVENTS → Backup/Restore Complexity (operational pain) ↓ REQUIRES → AGE Compatibility Fixes (implementation challenges)

The graph understood PRs describe why (strategic motivation) while commits describe how (tactical implementation). Same architectural change, different granularity.

The Semantic Bridge

We never told the system commits and PRs were related. Yet it discovered the connection through:

1. Shared terminology: - Both mention "Apache AGE," "PostgreSQL," "Neo4j," "RBAC" ↓ Vector embeddings recognized semantic similarity ↓ Concepts merged across ontologies

2. Commit hash references: - PR #14 footer: "Related commit: c392b3c99a26781c36a9dbe86d3269d7c973b65f" - Commit 113: Merge commit c392b3c9 ↓ Same hash creates implicit link ↓ LLM extraction recognized relationship

3. Causal language patterns: - Commits: "Fixes issue," "Implements," "Related to ADR-016" - PRs: "Complete migration," "Addresses blocker," "Enables capability" ↓ Temporal markers extracted ↓ Relationship types inferred (CAUSES, ENABLES, RESULTS_FROM)

Time as Emergent Property

We never encoded timestamps or explicit ordering. Yet the graph reconstructed the timeline:

Problem: Neo4j Community lacks RBAC
  ↓ CAUSES
Decision: ADR-016 Apache AGE Migration (commit 83)
  ↓ RESULTS_FROM
Implementation: Commits 83-127
  ↓ CONFIRMED_BY
PR #14: Strategic summary of migration
  ↓ ENABLES
Capability: Production RBAC + Unified Architecture

How time emerged:

Semantic causation (CAUSES, ENABLES, PREVENTS) ↓ Conceptual dependencies (can't migrate to something that doesn't exist) ↓ Narrative structure (problems → decisions → solutions) ↓ Observable time arrow from meaning, not metadata

Testing the Connection

Query: find_connection_by_search("Neo4j Knowledge Graph MVF", "Apache AGE Migration")

Result (4 hops):

Neo4j Knowledge Graph MVF ↓ CONTRASTS_WITH Admin Tools Migration ↓ RESULTS_FROM Apache AGE client ↓ IMPLIES Testing Framework ↓ SUPPORTS Apache AGE Migration

The path remained identical before and after PR ingestion. The PRs didn't create new paths - they enriched existing ones with strategic context.

The Multi-Perspective Insight

Commits tell you what changed: - "Add age_ops.py: AGEConnection wrapper" - "Update config.py: PostgreSQL connection methods" - "Fix relationship counter accuracy"

Pull requests tell you why it mattered: - "Neo4j Community lacks RBAC ($180K/year for Enterprise)" - "Dual database complexity prevents atomic transactions" - "Unified architecture simplifies operations"

The graph understood these are complementary, not duplicate. It merged evidence and enriched relationships without confusion.

What We Learned

1. Cross-ontology enrichment works

Separate ontologies for related content ↓ Shared concepts automatically bridge them ↓ Evidence accumulates from multiple perspectives ↓ Relationships reveal complementary insights

2. Semantic similarity is sufficient for linking

No explicit foreign keys required ↓ Vector embeddings recognize shared concepts ↓ LLM extraction identifies relationships ↓ Graph merges naturally

3. Time is an emergent property

No timestamps in the graph schema ↓ Causal relationships (CAUSES, ENABLES, RESULTS_FROM) ↓ Narrative structure (problems → solutions) ↓ Observable time arrow reconstructed from semantics

4. Confidence scores encode epistemic weight

Relationship confidence (0.7-0.95) ↓ High confidence = strong causal link ↓ Multiple paths = reinforced narrative ↓ Weak confidence = uncertain ordering

5. Granularity layers without noise

Commits = detailed implementation log ↓ PRs = strategic architectural summary ↓ Graph recognizes hierarchy ↓ Enriches understanding without duplication

Beyond Git: Where Else Does This Apply?

The pattern we discovered generalizes to any structured record content:

Discussion threads: - Individual messages = commit-level detail - Thread summaries = PR-level overview - Shared entities (people, decisions, action items) create bridges - Time emerges from conversational flow (responses, references)

Financial transactions: - Individual transactions = detailed operations - Monthly statements = summaries - Shared entities (accounts, merchants, categories) link records - Time emerges from transaction chains (debits → credits → balances)

Travel journals: - Daily entries = granular experiences - Trip summaries = strategic insights - Shared entities (locations, people, themes) connect days - Time emerges from journey progression (departed → visited → returned)

Medical records: - Appointments = detailed observations - Care plans = strategic treatment approach - Shared entities (symptoms, diagnoses, medications) create continuity - Time emerges from treatment progression (symptoms → diagnosis → treatment → outcomes)

Code reviews: - Individual comments = specific feedback - Review summaries = overall assessment - Shared entities (files, functions, patterns) link discussions - Time emerges from review flow (requested → addressed → approved)

The universal pattern:

Detailed records (commits, messages, transactions, entries) ↓ Summary views (PRs, thread summaries, statements, trip reports) ↓ Shared semantic entities create natural bridges ↓ Graph discovers relationships without explicit schema ↓ Time emerges from causal language patterns ↓ Multiple perspectives enrich understanding

Multi-Perspective Ingestion

The experiment revealed content types that naturally enrich each other:

Granular records - Detailed event logs (commits, messages, transactions) - Fine-grained operations and changes - Tactical implementation details

Summary perspectives - High-level views (PRs, thread summaries, statements) - Strategic motivations and outcomes - Aggregated insights across granular events

Contextual documentation - Reference material (wiki, ADRs, guides) - Domain knowledge and standards - Historical context and rationale

External references - Related systems (issue trackers, forums) - Cross-boundary connections - Broader ecosystem context

These content types enrich the graph regardless of ingestion order because semantic similarity naturally merges related content. You could ingest summaries before details, documentation before code, or mix them randomly - the final graph structure would be semantically equivalent (though non-deterministic LLM output means no two runs produce byte-identical results).

On Order-Independence and Computational Leverage

Why order doesn't matter:

Semantic handles exist in the graph ↓ Vector embeddings + relationship terms ↓ Automatic matching during ingestion ↓ No hard-coded expectations of what should link

When processing a new document chunk, the system queries: - All concepts with similar vector embeddings - All terms that match semantically - All existing relationships and their types

These matches are presented as evidence during LLM extraction. The LLM considers them, without judgment about whether they "should" be related. If the new content references similar concepts, relationships form naturally. If not, the new content stands alone until something else connects to it.

This rhymes with Sutton's Bitter Lesson:

Hard-coded approach (brittle): - Define schema first - Specify allowed relationship types - Enforce ingestion order dependencies - Hard-wire what can link to what

Computational approach (this system): - Vector search finds similarity - LLM extraction discovers relationships - Graph stores whatever emerges - Order-independent because matches are computed, not prescribed

We leverage computation (embedding similarity, LLM reasoning) rather than encoding human assumptions about structure. The system discovers connections through search and learning, not through hard-coded rules about what a "commit" can relate to versus a "pull request."

The implications:

You could ingest documents in random order ↓ Each finds its semantic neighbors automatically ↓ Relationships form when evidence supports them ↓ Semantically equivalent graph emerges from content, not ingestion sequence

This isn't just convenient - it's fundamental to why the approach generalizes. Whether you're ingesting git commits, bank statements, or medical records, the system doesn't need domain-specific ingestion ordering. The semantic handles and computational search do the work.

We can't claim this as a benchmark against the Bitter Lesson, but we can observe: querying all possible matching vectors and terms, then presenting them as evidence for the LLM to consider - this pattern avoids hard-coding human knowledge about what should relate, instead leveraging computation to discover what does relate.

The Meta-Validation

This document itself is meta-evidence: The knowledge graph successfully analyzed its own development history to answer questions like:

  • "How did we migrate from Neo4j to Apache AGE?"
  • "What architectural decisions led to the current system?"
  • "What relationships exist between early MVF and production capabilities?"

The system used itself to understand itself - a self-referential loop validating that:

Structured records + LLM extraction + Graph storage ↓ Persistent semantic understanding ↓ Queryable knowledge across multiple perspectives ↓ Emergent temporal structure from causation ↓ Truth-preserving enrichment from related sources

Not just a database. Not just embeddings. Understanding.


This experiment demonstrated that knowledge graphs can discover multi-dimensional narratives from complementary record sets, reconstructing causality and timeline from semantic relationships alone.