Recursive Upsert Architecture
Version: 2.0 | Date: November 22, 2025 | Status: Foundational Pattern
Core Innovation: A semantic matching system that intelligently merges or creates concepts based on vector similarity as documents are ingested. Each new document recursively matches against the accumulating knowledge graph, building a unified conceptual structure across multiple sources.
Table of Contents
- The Problem: Linear Information, Multidimensional Reality
- The Solution: Recursive Semantic Upsert
- How It Works: Multi-Stage Matching
- Why It Works
- The O(N) Problem
- Evolution: How Later ADRs Improve The Pattern
- Architectural Classification: Vector-Augmented Knowledge Graph
- Trade-offs and Design Decisions
The Problem: Linear Information, Multidimensional Reality
As Alan Watts articulated, human intelligence is fundamentally limited by its linear, scanning nature:
"Human intelligence has a very serious limitation. That limitation is that it is a scanning system, of conscious attention, which is linear. That is to say, it examines the world, in lines. Rather as you would pass the beam of a flashlight across a room or a spotlight."
When we ingest documents into a knowledge system, we face two challenges:
- Sequential Processing: Documents arrive one at a time, paragraph by paragraph
- Conceptual Overlap: The same idea appears in different documents with different wording
Traditional approaches fall into one of three traps:
- Treat everything as new → massive duplication, no knowledge synthesis
- Force exact ID matching → misses semantic equivalence, fragments knowledge
- Use rigid schemas → fails to capture emergent concepts
We needed a pattern that could accumulate knowledge organically while recognizing semantic equivalence across different expressions of the same idea.
The Solution: Recursive Semantic Upsert
Recursive upsert is a pattern where each new concept extracted from a document is:
- Compared against all existing concepts using semantic similarity
- Merged with a matching concept (if similarity > threshold)
- Created as new (if no match found)
- Evidence added regardless (quote + source provenance)
The "recursive" aspect comes from processing multiple documents sequentially, where each document's concepts are matched against the accumulating graph state from previous documents.
Document 1: Extract concepts → Create all (graph is empty)
→ Graph has 10 concepts
Document 2: Extract concepts → Match against 10 existing
→ Merge 3, Create 5 new
→ Graph has 15 concepts
Document 3: Extract concepts → Match against 15 existing
→ Merge 7, Create 2 new
→ Graph has 17 concepts
The graph learns as it grows. Concepts that appear frequently across documents accumulate more evidence, while rare concepts remain distinct.
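As a minimal sketch of this outer loop, assuming hypothetical helpers `extract_concepts` and `upsert_concept` that stand in for the LLM extraction step and the matching cascade described in the next section:

```python
def ingest(documents, graph):
    """Each document's concepts are matched against the graph state
    accumulated from all previous documents -- the 'recursive' aspect."""
    for doc in documents:
        for concept in extract_concepts(doc):   # hypothetical LLM extraction step
            upsert_concept(concept, graph)      # match-or-create cascade (next section)
```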
How It Works: Multi-Stage Matching
When a new concept is extracted, we run a cascade of matching strategies:
Stage 1: Exact ID Match
IF extracted concept_id exists in graph:
→ Use existing concept (LLM predicted correct ID)
→ Confidence: 100%
This happens when the LLM has seen similar text before and predicts an existing concept_id. We trust the LLM's judgment here.
Stage 2: Vector Similarity Search (Primary)
1. Embed new concept: label + search_terms → 1536-dim vector
2. Cosine similarity against ALL existing concept embeddings
3. IF max_similarity > 0.85:
→ Use existing concept (semantic match)
→ Confidence: similarity_score
This is the core of the pattern. Vector similarity captures semantic equivalence:
- "linear scanning system" matches "sequential attention mechanism"
- "human-directed evolution" matches "genetic self-modification"

The threshold (0.85) balances:
- Higher threshold (>0.90): fewer merges, more duplicates, safer
- Lower threshold (<0.80): more merges, potential false positives, aggressive
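A minimal NumPy sketch of this full-scan match, consistent with the implementation notes in the O(N) section below; the function name and signature are ours, not the project's API:

```python
import numpy as np

def best_vector_match(new_emb: np.ndarray, existing: np.ndarray, threshold: float = 0.85):
    """Full-scan cosine similarity of one new concept against all stored embeddings.

    new_emb:  (D,) embedding of the new concept (e.g. D = 1536)
    existing: (N, D) matrix of every existing concept embedding
    Returns (row_index, similarity) of the best match, or None if below threshold.
    """
    if existing.shape[0] == 0:                    # empty graph: nothing can match
        return None
    a = new_emb / np.linalg.norm(new_emb)         # normalise so dot product = cosine
    b = existing / np.linalg.norm(existing, axis=1, keepdims=True)
    sims = b @ a                                  # (N,) similarities -- the O(N) scan
    best = int(np.argmax(sims))
    return (best, float(sims[best])) if sims[best] > threshold else None
```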
Stage 3: Create New Concept
If neither Stage 1 nor Stage 2 produces a match, the concept is created as new and its embedding is stored, making it available for future matching in subsequent documents.
Stage 4: Evidence Accumulation (Always)
Regardless of match/create:
→ Store quote (exact text from source)
→ Link to source (document + paragraph)
→ Track provenance
Key insight: Even merged concepts retain all evidence from all sources. You can trace back to where each instance appeared.
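Putting the four stages together, a hedged sketch of the whole cascade; the `graph` store and its methods are hypothetical stand-ins, not the actual persistence layer:

```python
def upsert_concept(concept, graph, threshold=0.85):
    """Match-or-create cascade; evidence is stored in every branch."""
    node = graph.get(concept.concept_id)          # Stage 1: exact ID match
    if node is not None:
        confidence = 1.0                          # trust the LLM's predicted ID
    else:
        match = best_vector_match(concept.embedding, graph.embedding_matrix(), threshold)
        if match is not None:                     # Stage 2: semantic match
            idx, confidence = match
            node = graph.concept_at(idx)
        else:                                     # Stage 3: create new concept; its
            node = graph.create(concept)          # embedding is stored so future
            confidence = 1.0                      # documents can match against it
    # Stage 4 (always): keep the quote and source provenance.
    node.evidence.append({"quote": concept.quote, "source": concept.source})
    return node, confidence
```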
Why It Works
1. Semantic Tolerance
Unlike exact string matching, vector similarity handles:
- Paraphrasing: "linear thinking" ≈ "sequential reasoning"
- Synonyms: "contradicts" ≈ "opposes"
- Different granularity: "AI safety concerns" ≈ "risks of artificial intelligence"
2. LLM Context Awareness
The LLM sees existing concepts during extraction, so it can:
- Predict existing IDs when confident
- Create new search terms that match existing concepts
- Understand domain-specific terminology
3. Gradual Knowledge Synthesis
The graph evolves as documents are added:
- Early documents create foundational concepts
- Later documents either reinforce or extend them
- Cross-document patterns emerge naturally
4. Complete Provenance
Every concept tracks:
- Which documents mentioned it
- What exact quotes support it
- When it first appeared

This enables:
- Evidence-based reasoning
- Source verification
- Cross-document analysis
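A sketch of what such a provenance record could look like; the field names here are illustrative, not the stored schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Evidence:
    quote: str            # exact text from the source
    document_id: str      # which document mentioned the concept
    paragraph: int        # where in that document
    ingested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class ConceptNode:
    concept_id: str
    label: str
    evidence: list = field(default_factory=list)  # grows with every merge, never replaced
```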
The O(N) Problem
The recursive upsert pattern has a significant performance limitation:
Full-Scan Vector Search
Every new concept requires:
1. Compute embedding (OpenAI API call, ~100ms)
2. Load ALL existing concept embeddings (N vectors)
3. Compute cosine similarity for each (N comparisons)
4. Find maximum similarity
5. Compare to threshold
Time Complexity: O(N) per concept
Space Complexity: O(N × D), where D = embedding dimensions (1536)
Real-World Impact
With 1,000 concepts in the graph:
- Each new concept: ~1,000 similarity calculations
- Document with 50 concepts: ~50,000 calculations
- 10 documents per day: ~500,000 calculations

Current implementation: Python NumPy, full scan, no indexing
Typical performance: 10-50ms per concept (acceptable for current scale)
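A back-of-envelope check of these numbers, reusing the `best_vector_match` sketch from Stage 2 on synthetic data (timings will vary by machine):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
existing = rng.normal(size=(1000, 1536)).astype(np.float32)  # 1,000 stored concepts
new = rng.normal(size=1536).astype(np.float32)               # one incoming concept

start = time.perf_counter()
match = best_vector_match(new, existing)     # full O(N) scan from the Stage 2 sketch
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"full scan over 1,000 concepts: {elapsed_ms:.2f} ms, match={match}")
```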
Why We Accept This Trade-off
- Simplicity: No complex indexing logic, easy to debug
- Accuracy: No false negatives from approximate search
- Current scale: <10,000 concepts, performance is acceptable
- Future migration: Can add HNSW/IVF later if needed (see ADR-055)
Evolution: How Later ADRs Improve The Pattern
The recursive upsert pattern is foundational, but later architectural decisions significantly enhance it:
ADR-028: Grounding Strength as Epistemic Feedback
Problem: Upsert creates concepts, but how do we know if they're valid?
Enhancement: Add grounding strength (-1.0 to 1.0) calculated from:
- Evidence consistency across sources
- Relationship support from connected concepts
- Contradiction detection

Impact on Upsert:
- Concepts with high grounding (>0.8) are more likely to match
- Concepts with negative grounding may indicate extraction errors
- Guides threshold tuning: if low-grounding concepts proliferate, raise the threshold (see the sketch below)
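A toy illustration of that tuning rule. The grounding scores themselves come from ADR-028's calculation; only the adjustment direction is shown here, with parameters of our own choosing:

```python
def tune_threshold(concepts, threshold, step=0.01, max_low_fraction=0.2):
    """If too many concepts end up with negative grounding, merge more cautiously."""
    low = sum(1 for c in concepts if c.grounding < 0.0)
    if concepts and low / len(concepts) > max_low_fraction:
        threshold = min(threshold + step, 0.95)   # nudge toward the 'safer' end
    return threshold
```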
ADR-048: GraphQueryFacade (Namespace Safety)
Problem: As vocabulary metadata moves to the graph, upsert queries must not collide with vocabulary nodes.
Enhancement: Explicit :Concept labels in all queries, preventing namespace pollution.
Impact on Upsert:
// Before (unsafe): matches ANY node, including vocabulary nodes
MATCH (n) WHERE n.embedding <-> $vec < 0.15
// After (safe): candidates restricted to :Concept
MATCH (c:Concept) WHERE c.embedding <-> $vec < 0.15
// (<-> denotes vector distance; cosine distance < 0.15 ≈ similarity > 0.85)
ADR-055: HNSW Vector Indexing (Future)
Problem: O(N) full scan becomes prohibitive at large scale.
Enhancement: Hierarchical Navigable Small World (HNSW) indexing for approximate nearest neighbor search.
Impact on Upsert:
- O(log N) search instead of O(N) (see the sketch below)
- Trade-off: 95-99% recall (might miss some matches)
- Enables 100K+ concept graphs
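One way the HNSW path could look, using the hnswlib library; this is a minimal sketch under our own assumptions, and ADR-055 may choose a different backend:

```python
import hnswlib
import numpy as np

dim = 1536
rng = np.random.default_rng(0)
stored = rng.normal(size=(1000, dim)).astype(np.float32)  # placeholder embeddings
query = stored[42] + 0.01 * rng.normal(size=dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)            # cosine distance = 1 - similarity
index.init_index(max_elements=100_000, ef_construction=200, M=16)
index.add_items(stored, np.arange(len(stored)))
index.set_ef(64)                                          # higher ef -> better recall, slower

labels, distances = index.knn_query(query, k=1)           # approximate O(log N) search
similarity = 1.0 - float(distances[0][0])
is_match = similarity > 0.85                              # same threshold as the full scan
```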
ADR-065: Epistemic Status Classification
Problem: Not all concept merges are equally reliable.
Enhancement: Classify relationships by epistemic status:
- AFFIRMATIVE: High grounding, reliable merges
- CONTESTED: Mixed evidence, review recommended
- CONTRADICTORY: Negative grounding, potential false positive

Impact on Upsert:
- Flag uncertain merges for review
- Adjust thresholds based on relationship type
- Provide confidence metrics to users
Architectural Classification: Vector-Augmented Knowledge Graph
Understanding what this system is helps clarify design decisions and future directions.
What It Is (and Isn't)
The recursive upsert pattern operates within a three-layer architecture:
┌─────────────────────────────────────────────────────────────────────┐
│ PROPERTY GRAPH STORAGE │
│ ───────────────────── │
│ Apache AGE / PostgreSQL │
│ Binary edges: (node)─[rel]→(node) │
│ Rich metadata: embeddings, confidence, timestamps │
│ │
├─────────────────────────────────────────────────────────────────────┤
│ QUERY-TIME SEMANTIC COMPUTATION │
│ ────────────────────────────── │
│ Cypher traversals construct node sets dynamically │
│ No pre-materialized structures—computed on demand │
│ Enables flexible analysis patterns without storage overhead │
│ │
├─────────────────────────────────────────────────────────────────────┤
│ VECTOR OPERATIONS │
│ ──────────────── │
│ Cosine similarity for concept matching │
│ Dot product projections onto semantic axes (ADR-058) │
│ Polarity axis triangulation for grounding strength │
│ Collective epistemic agreement/disagreement analysis │
└─────────────────────────────────────────────────────────────────────┘
Accurate terminology: "Vector-augmented knowledge graph" or "Property graph with query-time semantic computation"
Not accurate: "Hypergraph" (our edges connect exactly 2 nodes; true hyperedges connect N nodes as first-class entities)
Why This Distinction Matters
The system's power comes from combining symbolic queryability with continuous semantic operations:
| Layer | Capability | Example |
|---|---|---|
| Property Graph | Structural traversal | "Find all concepts within 2 hops of X" |
| Query-Time Computation | Dynamic set construction | "Collect nodes along SUPPORTS path, expand to neighbors" |
| Vector Operations | Semantic analysis | "Calculate epistemic agreement across collected set" |
This hybrid enables queries like:
1. Traverse SUPPORTS chain from concept X to Y (graph layer)
2. Collect all nodes on path + 2-hop neighbors (computation layer)
3. Project each onto polarity axis (vector layer)
4. Calculate collective grounding agreement (vector layer)
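The vector-layer steps (3-4) could look roughly like this in the current NumPy setting; the axis construction follows the spirit of ADR-058, though the exact formula there may differ:

```python
import numpy as np

def collective_grounding(embeddings: np.ndarray, axis_pos: np.ndarray, axis_neg: np.ndarray):
    """Project each collected node onto a polarity axis and summarise agreement.

    embeddings: (N, D) vectors of the nodes gathered by the graph traversal
    axis_pos/axis_neg: (D,) anchor embeddings defining the axis endpoints
    """
    axis = axis_pos - axis_neg
    axis = axis / np.linalg.norm(axis)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    projections = normed @ axis              # one continuous score per node
    return float(projections.mean()), float(projections.std())  # high mean, low spread = agreement
```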
Connection to Graph Learning Research
The recursive upsert pattern shares conceptual DNA with Adaptive Neighborhood Feature Mixing (ANFM) from graph neural network research:
| ANFM Approach | Recursive Upsert Parallel |
|---|---|
| Adaptive mixing weights via MLP | Threshold-based vector similarity (0.85) |
| Node embeddings for semantic content | Concept embeddings (768-1536 dim) |
| Edge embeddings for relationship semantics | VocabType embeddings for relationship types |
| Rejects uniform aggregation | Uses context-aware matching, not blind merge/create |
Key shared insight: Both systems reject uniform strategies in favor of adaptive, context-aware aggregation. ANFM learns mixing weights through training; our system achieves similar adaptivity through:
- Configurable similarity thresholds
- Grounding-based feedback loops
- Epistemic status classification
Our unique advantage: The recursive aspect. ANFM operates on static graphs; our system builds graphs incrementally where each document matches against the accumulating graph state, creating self-improving feedback loops.
Feedback Loops That Improve Early-Cycle Attachment
Later ADRs don't just extend the pattern—they create feedback loops that improve attachment quality even in early ingestion cycles:
Loop 1: LLM Context Awareness
# LLM sees recent concepts during extraction
existing_concepts = get_document_concepts(ontology, recent_chunks_only=3)
# → Can predict existing IDs, generates matching search terms
Loop 2: Vocabulary Evolution
LLM generates: "validates" → normalized to "VALIDATES"
New type: "ENHANCES" → auto-categorized, embedded, available for future matching
Loop 3: Grounding Feedback (ADR-044, ADR-058)
High-grounding concepts → More likely to match (semantically stable)
Low-grounding concepts → May indicate extraction errors → Adjust threshold
Loop 4: Epistemic Status (ADR-065)
AFFIRMATIVE relationships → Increase merge confidence
CONTESTED relationships → Flag for review
CONTRADICTORY relationships → Valuable dialectical tension, not errors
Future: True Tensor Operations
Current implementation uses CPU-based NumPy for vector operations. Future enhancements could include:
- pgvector for native PostgreSQL vector similarity (ADR-072)
- GPU-accelerated embeddings for batch operations
- Graph Neural Network layers for learned concept clustering
The architecture is designed to support these without fundamental restructuring—the property graph provides stable indexing while the vector layer can be upgraded independently.
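For the pgvector direction, a sketch with psycopg and the pgvector Python adapter; the connection string, table, and column names are placeholders, not the project's schema:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def nearest_concept(conn, embedding: np.ndarray):
    """Push the similarity scan into PostgreSQL; <=> is pgvector's cosine distance."""
    return conn.execute(
        "SELECT concept_id, 1 - (embedding <=> %s) AS similarity "
        "FROM concepts ORDER BY embedding <=> %s LIMIT 1",
        (embedding, embedding),
    ).fetchone()  # (concept_id, similarity), or None on an empty table

conn = psycopg.connect("dbname=knowledge")  # placeholder connection
register_vector(conn)                       # teach psycopg to pass numpy vectors
```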
Trade-offs and Design Decisions
Threshold Selection (0.85)
Higher (0.90+):
- ✅ Fewer false positives (wrong merges)
- ❌ More duplicates (missed semantic matches)
- Use case: High-precision domains (legal, medical)

Current (0.85):
- ✅ Balanced precision/recall
- ✅ Good for philosophical/conceptual documents
- ❌ Occasional false positives

Lower (0.75-0.80):
- ✅ Aggressive merging, fewer duplicates
- ❌ Higher false positive rate
- Use case: Exploratory research, brainstorming
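A configuration sketch reflecting these presets; the names and values are illustrative, and the actual configuration surface is ADR-072's concern:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatchingConfig:
    similarity_threshold: float = 0.85   # merge when cosine similarity exceeds this

# Hypothetical per-domain presets mirroring the trade-offs above.
PRESETS = {
    "high_precision": MatchingConfig(0.92),  # legal, medical: avoid wrong merges
    "balanced":       MatchingConfig(0.85),  # default: philosophical/conceptual texts
    "exploratory":    MatchingConfig(0.78),  # brainstorming: aggressive deduplication
}
```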
LLM Context: Show Existing Concepts?
Current approach: YES, show top 50 existing concepts during extraction
Advantages:
- LLM can predict existing IDs (Stage 1 match)
- LLM generates search terms that match existing concepts
- Reduces semantic drift over time

Disadvantages:
- Token cost (50 concepts ≈ 500 tokens per extraction)
- Potential bias toward existing concepts
- Doesn't scale beyond ~100 concepts in context

Alternative: No context, rely purely on vector similarity
Future: Hybrid approach with semantic retrieval
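A sketch of how that context could be assembled; the selection and formatting here are illustrative, not the actual prompt:

```python
def build_extraction_context(concepts, limit=50):
    """Surface the top-N known concepts so the LLM can reuse IDs and search terms."""
    lines = [f"- {c.concept_id}: {c.label}" for c in concepts[:limit]]
    return ("Known concepts (reuse these IDs when the text expresses the same idea):\n"
            + "\n".join(lines))
```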
Merge vs Create Philosophy
Conservative (current):
- Default to creating new concepts unless strong match
- Preserves nuance and subtle distinctions
- Accepts some duplication

Aggressive:
- Default to merging similar concepts
- Prioritizes deduplication over precision
- Risks losing subtle distinctions

We chose conservative because:
1. Duplicates can be merged later (manual or automated)
2. False merges are harder to split
3. Evidence accumulation works either way
Summary
Recursive upsert is a deceptively simple pattern with profound implications:
- It works because vector similarity captures semantic equivalence
- It scales reasonably well to thousands of concepts (O(N) is acceptable)
- It evolves as later ADRs add grounding, epistemic status, and indexing
- It's foundational to how the system builds knowledge across documents
The key insight: Don't force exact matches, don't accept everything as new—find the semantic middle ground and let evidence accumulate.
Architectural Identity
The system is best described as a vector-augmented knowledge graph—a property graph foundation (Apache AGE) with query-time semantic computation and vector operations. This hybrid architecture:
- Is NOT a hypergraph (our edges connect 2 nodes; true hyperedges connect N nodes)
- IS a property graph with rich embeddings on nodes and relationship types
- Enables hybrid queries combining structural traversal with semantic analysis
The innovation lies in systematically replacing boolean logic with probabilistic values. Where traditional systems ask "does this relationship exist?" (yes/no), we ask "how strongly does this concept ground that one?" (continuous score via vector projection).
Later enhancements (grounding, namespace safety, HNSW indexing, epistemic classification) address the pattern's limitations while preserving its core strength: organic knowledge synthesis through semantic similarity.
Related ADRs
- ADR-016: Apache AGE Migration (graph database foundation)
- ADR-028: Grounding Strength Calculation (epistemic feedback)
- ADR-042: Ollama Integration (affects extraction cost)
- ADR-044: Probabilistic Truth Convergence (grounding calculation)
- ADR-048: GraphQueryFacade (namespace safety)
- ADR-055: HNSW Vector Indexing (scaling beyond O(N))
- ADR-058: Polarity Axis Triangulation (continuous grounding scores)
- ADR-065: Epistemic Status Classification (merge reliability)
- ADR-072: Configuration-Driven Matching (threshold tuning)
For complete system architecture, see ARCHITECTURE.md.