Recursive Upsert
Recursive upsert is the mechanism by which Kappa Graph merges new knowledge into an existing graph rather than appending it as isolated fragments. It operates in two interlocked passes: concept deduplication during ingestion, and relationship-type categorization after the LLM discovers a new edge label.
Why merging is the problem
Documents about the same subject use different words. "Linear scanning system," "sequential attention mechanism," and "step-by-step reasoning" may all express the same concept. If each phrase creates a distinct graph node, the graph accumulates duplicates without accumulating understanding.
Exact-string matching solves nothing here — phrases differ. Schema-fixed vocabularies help for structured data, but fail for open-ended concept extraction where the LLM surfaces terms the schema authors did not anticipate.
Recursive upsert resolves this by using vector similarity as the merge criterion and letting the graph itself grow into the matching surface. Each document's concepts are matched against the accumulating state from all previous documents, not against a fixed dictionary.
Concept matching: a cascade of four stages
When the ingestion pipeline extracts a concept from a document chunk, it runs four stages in order (api/app/lib/ingestion.py).
sequenceDiagram
participant P as Pipeline
participant AGE as AGE Graph
participant VI as Vector Index
participant EV as Evidence Store
P->>AGE: Stage 1: lookup concept_id (exact match)
alt concept_id exists
AGE-->>P: matched existing concept
else no exact match
P->>VI: Stage 2: embed label+description+terms, cosine scan
alt similarity >= 0.85 (strict tier)
VI-->>P: matched existing concept
else 0.75 <= similarity < 0.85 AND labels overlap (label-boosted tier)
VI-->>P: matched existing concept
else no match in either tier
VI-->>P: no match
P->>AGE: Stage 3: create new concept node, store embedding
AGE-->>P: new concept created
end
end
P->>EV: Stage 4: store evidence, source ref, provenance
EV-->>P: evidence recorded
Stage 1: Exact ID match
If the LLM predicted a concept_id that already exists in the graph, the pipeline uses it directly. The LLM receives the top 30 existing concepts as context during extraction (configurable via EXTRACTION_CANDIDATE_K), so it can predict IDs it recognizes. When this match fires, no embedding computation is needed.
Stage 2: Two-tier vector similarity search
If no exact ID match exists, the pipeline embeds the new concept (label + description + search terms) and runs a cosine similarity scan against all existing concept embeddings.
The decision uses two tiers:
| Condition | Action |
|---|---|
| similarity ≥ 0.85 | Match (strict tier) |
| 0.75 ≤ similarity < 0.85 AND labels are identical or one contains the other | Match (label-boosted tier) |
| otherwise | No match |
The label-boosted tier catches paraphrased descriptions of an identically-named concept that would fall just below the strict threshold. Without it, "Machine learning safety" and "Machine learning safety: alignment concerns" would create two nodes when one would be correct.
The current implementation uses a Python NumPy full scan over all stored embeddings. This is O(N) per concept — acceptable at the scales Kappa Graph currently runs (thousands of concepts), and replaceable with HNSW approximate-nearest-neighbor indexing when graphs grow larger (ADR-112).
Stage 3: Create new concept
If no match is found in either tier, the pipeline creates a new concept node and stores its embedding. That node is immediately available as a match target for subsequent chunks and documents.
Stage 4: Evidence accumulation (always)
Regardless of whether the concept was matched to an existing node or created as new, the pipeline stores the supporting evidence: the exact quote from the source, the source document reference, and provenance metadata. Merged concepts retain all evidence from all sources.
How the graph learns
The "recursive" quality of the pattern is that each new document's concepts are matched against a graph that has already absorbed previous documents. This creates a compounding effect:
Document 1 → 10 concepts extracted → 10 created (graph empty)
Document 2 → 8 concepts extracted → 3 merged, 5 created → 15 in graph
Document 3 → 9 concepts extracted → 7 merged, 2 created → 17 in graph
Concepts that appear across many documents accumulate more evidence and higher grounding strength. Rare or peripheral concepts remain as distinct, lightly evidenced nodes. No threshold or schema decides in advance what is central and what is peripheral — the structure emerges from the corpus.
Relationship-type categorization: the edge vocabulary
When the LLM discovers a relationship type not already in the vocabulary — say, "ENHANCES" — the system must decide what kind of relationship it is. This is handled by probabilistic vocabulary categorization (ADR-605).
Discovery during ingestion
The pipeline calls normalize_relationship_type() to fuzzy-match the new type against the existing vocabulary. If no match is found, the type is accepted as new (ADR-603) and written with the temporary category llm_generated:
age_client.add_edge_type(
relationship_type=canonical_type,
category="llm_generated",
added_by="llm_extractor",
is_builtin=False,
auto_categorize=True # triggers ADR-605 categorization
)
Automatic categorization (ADR-605)
The VocabularyCategorizer (api/app/lib/vocabulary_categorizer.py) assigns the new type to one of 11 semantic categories by computing cosine similarity between the new type's embedding and a set of hand-validated seed types.
The 11 categories and their seeds:
| Category | Representative seeds |
|---|---|
| causation | CAUSES, ENABLES, PREVENTS, INFLUENCES, RESULTS_FROM |
| composition | PART_OF, CONTAINS, COMPOSED_OF, SUBSET_OF, COMPLEMENTS |
| logical | IMPLIES, CONTRADICTS, PRESUPPOSES, EQUIVALENT_TO |
| evidential | SUPPORTS, REFUTES, EXEMPLIFIES, MEASURED_BY |
| semantic | SIMILAR_TO, ANALOGOUS_TO, CONTRASTS_WITH, OPPOSITE_OF |
| temporal | PRECEDES, CONCURRENT_WITH, EVOLVES_INTO |
| dependency | DEPENDS_ON, REQUIRES, CONSUMES, PRODUCES |
| derivation | DERIVED_FROM, GENERATED_BY, BASED_ON |
| operation | ANALYZES, CALCULATES, PROCESSES, TRANSFORMS, EVALUATES |
| interaction | INTEGRATES_WITH, COMMUNICATES_WITH, CONNECTS_TO |
| modification | CONFIGURES, UPDATES, ENHANCES, OPTIMIZES, IMPROVES |
For each category, the score is the maximum similarity to any seed in that category (satisficing, not mean). The category with the highest score wins.
Using max rather than mean preserves signal: a type like PREVENTS is strongly similar to CAUSES (causal) and also similar to OPPOSITE_OF (semantic). Averaging would dilute both signals; taking the max correctly identifies causation as primary.
Ambiguity is flagged when the runner-up score also exceeds 0.70 — indicating the type genuinely spans two categories. This is stored rather than collapsed, because a type that is both logical and causal carries richer information than a forced single assignment.
Confidence thresholds
| Confidence | Interpretation |
|---|---|
| ≥ 70% | Auto-categorize; no review needed |
| 50–69% | Auto-categorize; warning logged |
| < 50% | Auto-categorize; flagged for curator review |
If categorization fails (embedding generation error, seed embeddings missing), the type remains llm_generated and ingestion continues. The category can be recomputed later.
What gets written
After categorization, four writes occur in sequence:
- Insert into
kg_api.relationship_vocabularywith categoryllm_generated. - Generate and store the type's embedding vector.
- Compute category scores; update
category,category_confidence,category_scores,category_source='computed'in the vocabulary table. - Create or update a
:VocabTypenode in the Apache AGE graph with the computed category.
The graph metadata node (step 4) enables Cypher traversals to filter or group by relationship category without joining back to the relational table.
Feedback loops
Three feedback loops improve match quality over time, without any explicit retraining step.
LLM context awareness. The LLM receives the top 30 existing concepts during extraction (configurable via EXTRACTION_CANDIDATE_K) and can predict their IDs directly. As the graph grows, this context becomes more useful: the LLM is less likely to invent a novel label for something the graph already tracks.
Vocabulary growth. Each new relationship type the LLM discovers — once categorized — becomes available for fuzzy matching in subsequent ingestion runs. The vocabulary grows with the corpus, reducing the rate of novel type discovery over time.
Grounding feedback (ADR-404, ADR-808, ADR-811). Concepts with high grounding strength (consistent evidence across sources, supported by connected concepts) are more semantically stable. Low grounding may indicate extraction noise. This signal can guide threshold tuning: if low-grounding concepts are proliferating, the merge threshold may be too permissive.
Trade-offs
Conservative merge by default. The strict threshold (0.85) creates new concepts when similarity is ambiguous rather than merging. Duplicates can be merged later — via kg vocab consolidate or manually. False merges are harder to undo, so the system prefers the recoverable error.
O(N) full scan. At current graph sizes, the NumPy full scan adds 10–50ms per concept and is acceptable. ADR-112 documents the path to HNSW indexing when this becomes a bottleneck.
LLM context window. Showing 30 existing concepts to the LLM costs roughly 300 tokens per extraction chunk and does not scale past approximately 100 concepts in context. Future work (ADR-306) explores semantic retrieval to replace the fixed list with a targeted selection.
Architectural layer
Recursive upsert sits at the intersection of three layers that Kappa Graph keeps distinct:
| Layer | Role |
|---|---|
| Property graph (Apache AGE) | Stores concepts, relationships, and vocabulary nodes; provides Cypher traversal |
| Vector operations | Cosine similarity for concept matching and vocabulary categorization |
| Query-time computation | Cypher traversals assemble node sets dynamically; no pre-materialized structures |
The system stores binary edges (exactly two endpoint nodes each). It is a vector-augmented property graph, not a hypergraph.
Operational commands
# View vocabulary with computed categories
kg vocab list
# Inspect category scores for a specific type
kg vocab category-scores RELATIONSHIP_TYPE
# Retry categorization for all llm_generated types
kg vocab refresh-categories
# Retry for a specific type
kg vocab refresh-categories --type RELATIONSHIP_TYPE
# Manually override a category
kg vocab update RELATIONSHIP_TYPE --category causation --manual
Related ADRs
- ADR-404: Grounding Strength Calculation
- ADR-603: Automatic Edge Vocabulary Expansion
- ADR-808: Probabilistic Truth Convergence
- ADR-605: Probabilistic Vocabulary Categorization
- ADR-606: GraphQueryFacade (namespace safety — ensures
:Conceptlabels in all upsert queries) - ADR-112: HNSW Vector Indexing (O(N) → O(log N) migration path)
- ADR-811: Polarity Axis Triangulation
- ADR-610: Epistemic Status Classification
- ADR-306: Configuration-Driven Matching