Vocabulary Lifecycle
Relationship types in Kappa Graph accumulate from document ingestion and require active management to stay coherent. The consolidation pipeline scores, merges, and prunes edge types using a combination of embedding similarity and LLM judgment.
How the vocabulary grows
When the LLM extracts concepts from a document, it also assigns typed edges between them: IMPLIES, SUPPORTS, HISTORICALLY_PRECEDED, and so on. Thirty built-in types cover the most common relationships. The LLM may invent additional custom types when none of the built-in types fit.
Over multiple ingestion runs across different document domains, the vocabulary sprawls. Near-duplicates appear, such as DEFINED_AS and DEFINED, or INCREASES and ENHANCES. Some types are left with zero edges after the documents that introduced them leave the graph. Others become structurally marginal: present but traversed rarely.
Left unmanaged, a bloated vocabulary degrades query quality. Traversals that should hit the same semantic territory split across multiple edge types. Similarity searches return noisier results.
Why the LLM governs consolidation decisions
The system computes objective scores for every relationship type: embedding similarity to other types, edge count, traversal frequency, bridge importance, polarity positioning, epistemic status. These numbers describe the vocabulary's state precisely.
A threshold rule alone handles the easy cases but fails on semantic edge cases. Directional inverses such as HAS_PART and PART_OF occupy similar embedding space but are not synonyms — merging them corrupts traversal direction. CONTRASTS_WITH and EQUIVALENT_TO share embedding space while capturing opposite semantics.
An LLM given the same numerical context as a human reviewer applies the same judgment to those scores and also catches the semantic distinctions that thresholds miss. Combining objective scores with LLM judgment is more consistent than either threshold rules or human review alone.
The human's role in vocabulary governance is bringing external knowledge into the graph — submitting documents, expanding domains. Vocabulary hygiene decisions over measurable quantities are delegated to the pipeline.
The consolidation pipeline
Consolidation follows a loop: score candidates, present them to the LLM, execute the decision, re-score with the updated vocabulary state, repeat.
sequenceDiagram
participant CLI as CLI / API Client
participant VM as VocabularyManager
participant SD as SynonymDetector
participant PS as PruningStrategy
participant LLM as LLM Reasoning Provider
participant DB as PostgreSQL + AGE
CLI->>VM: consolidate(target, threshold, dry_run)
VM->>DB: Query active vocabulary types
DB-->>VM: Types with edge counts
alt vocab_size > target
loop Until target reached or no candidates
VM->>SD: find_synonyms(types)
SD->>DB: Fetch embeddings (768-dim nomic)
SD-->>VM: Ranked synonym candidates
VM->>PS: evaluate_synonym(candidate, scores)
PS->>LLM: Structured prompt (similarity, edges, usage)
LLM-->>PS: MergeDecision(should_merge, reasoning)
PS-->>VM: ActionRecommendation
alt should_merge and not dry_run
VM->>DB: UPDATE edges SET type = target
VM->>DB: SET deprecated type inactive
end
VM->>DB: Re-query vocabulary (state changed)
end
end
VM->>VM: Prune zero-usage custom types
VM-->>CLI: ConsolidationResult
Three behaviors govern the loop:
- Target-gated. Live mode only runs when vocabulary size exceeds the target. --target 90 with 63 active types is a no-op. Set a target below current size to trigger merges, or use --dry-run to preview candidates regardless of target.
- Re-query after each merge. After a merge executes, edge counts shift and some types disappear. The pipeline re-fetches similarity before evaluating the next candidate. Dry-run evaluates all candidates against the initial state without re-querying.
- Prune after merge loop. Zero-usage custom types are removed after the merge loop completes. Built-in types are never pruned.
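The three behaviors above can be sketched as a small in-memory loop. This is an illustrative sketch only, not the VocabularyManager implementation: the Vocabulary class, the consolidate signature, and the find_candidates and should_merge callbacks are hypothetical stand-ins for the database, SynonymDetector, and the LLM decision. It treats a closed gate as a full no-op, following the prose; the sequence diagram draws pruning outside the gate, so the real behavior may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Candidate = Tuple[str, str]  # (deprecated_type, kept_type); hypothetical shape


@dataclass
class Vocabulary:
    """Toy in-memory stand-in for the active vocabulary (hypothetical)."""
    edge_counts: Dict[str, int]
    builtin: Set[str] = field(default_factory=set)


def consolidate(
    vocab: Vocabulary,
    target: int,
    find_candidates: Callable[[Vocabulary], List[Candidate]],
    should_merge: Callable[[Candidate, Vocabulary], bool],
    dry_run: bool = False,
) -> dict:
    result = {"merged": [], "pruned": [], "previewed": []}

    if dry_run:
        # Dry-run: evaluate every candidate against the initial state, change nothing.
        for pair in find_candidates(vocab):
            result["previewed"].append((pair, should_merge(pair, vocab)))
        return result

    # Target-gated: live mode does nothing unless size exceeds the target.
    if len(vocab.edge_counts) <= target:
        return result

    rejected: Set[Candidate] = set()
    while len(vocab.edge_counts) > target:
        # Re-query: candidates are recomputed from the current state on every pass.
        pending = [p for p in find_candidates(vocab) if p not in rejected]
        if not pending:
            break
        pair = pending[0]
        if should_merge(pair, vocab):
            deprecated, kept = pair
            moved = vocab.edge_counts.pop(deprecated)  # deprecated type disappears
            vocab.edge_counts[kept] += moved           # its edges are retyped
            result["merged"].append(pair)
        else:
            rejected.add(pair)

    # Prune after the merge loop: zero-usage custom types only, never built-ins.
    for name in [t for t, n in vocab.edge_counts.items()
                 if n == 0 and t not in vocab.builtin]:
        del vocab.edge_counts[name]
        result["pruned"].append(name)
    return result
```

Recomputing candidates inside the loop is what keeps later decisions honest: once DEFINED is folded into DEFINED_AS, any pair that mentioned DEFINED no longer exists and is never evaluated.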
Decision routing
Candidates are routed through three tiers based on similarity score and LLM judgment:
flowchart TD
A[Synonym Candidate] --> B{Similarity >= 0.90?}
B -->|Yes| C{Zero edges on<br/>deprecated type?}
C -->|Yes| D[Auto-prune]
C -->|No| E[LLM Evaluate]
B -->|No| F{Similarity >= 0.70?}
F -->|Yes| E
F -->|No| G[Skip — not synonyms]
E --> H{LLM Decision}
H -->|Merge| I[Execute merge<br/>Update edges + deactivate]
H -->|Skip| J[Reject with reasoning]
D --> K[Remove from vocabulary]
I --> L[Re-query vocabulary]
L --> A
style D fill:#2d5a27,color:#fff
style I fill:#2d5a27,color:#fff
style G fill:#4a4a4a,color:#fff
style J fill:#4a4a4a,color:#fff
style E fill:#1a3a5c,color:#fff
The LLM receives a structured prompt with embedding similarity, edge counts, and usage context. It returns a merge or skip decision with reasoning. The reasoning appears in the CLI output and is stored in the audit trail.
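The routing in the flowchart reduces to two comparisons. The following is a minimal sketch under stated assumptions: Route and route_candidate are hypothetical names, and the 0.90 bound is assumed to be the same value that --threshold controls. Only the 0.90 and 0.70 cut-offs and the zero-edge check come from the flowchart itself.

```python
from enum import Enum

AUTO_THRESHOLD = 0.90  # upper bound from the flowchart (assumed to match --threshold)
REVIEW_FLOOR = 0.70    # lower bound from the flowchart


class Route(Enum):
    AUTO_PRUNE = "auto_prune"
    LLM_EVALUATE = "llm_evaluate"
    SKIP = "skip"


def route_candidate(
    similarity: float,
    deprecated_edge_count: int,
    auto_threshold: float = AUTO_THRESHOLD,
    review_floor: float = REVIEW_FLOOR,
) -> Route:
    """Pick a tier for one synonym candidate (illustrative only)."""
    # High similarity and nothing to migrate: safe to remove without asking the LLM.
    if similarity >= auto_threshold and deprecated_edge_count == 0:
        return Route.AUTO_PRUNE
    # Plausible synonyms, or high similarity with live edges: the LLM decides.
    if similarity >= review_floor:
        return Route.LLM_EVALUATE
    # Below the floor the pair is not treated as a synonym candidate.
    return Route.SKIP
```

Note that high similarity alone never triggers a merge in this sketch: a pair at 0.95 with edges on the deprecated type still goes to the LLM, which is where directional inverses get rejected.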
Cases the LLM handles that thresholds miss:
- Directional inverses — HAS_PART and PART_OF share embedding space but are opposites.
- Semantic distinctions — CONTRASTS_WITH and EQUIVALENT_TO are numerically close but semantically opposite.
- Genuine redundancy — DEFINED_AS and DEFINED, INCREASES and ENHANCES, where the distinction carries no analytical value.
Embedding architecture
All vector similarity in the consolidation pipeline flows through a single embedding path. The reasoning provider (used for extraction and consolidation decisions) and the embedding provider (used for vector similarity) are separate:
flowchart LR
subgraph Callers
SW[SynonymDetector]
CC[CategoryClassifier]
EW[EmbeddingWorker]
ING[Ingestion Pipeline]
end
subgraph Provider Layer
GP["get_provider()"] --> RP[ReasoningProvider]
RP -->|".embedding_provider"| LP
GEP["get_embedding_provider()"] --> LP[LocalEmbeddingProvider<br/>nomic-embed-text-v1.5]
end
subgraph Output
LP -->|sync call| EMB["768-dim embedding"]
end
SW --> GP
CC --> GP
EW --> GP
ING --> GEP
style LP fill:#1a3a5c,color:#fff
style EMB fill:#2d5a27,color:#fff
style RP fill:#4a3a1c,color:#fff
Design constraints enforced by the implementation:
- generate_embedding() is synchronous across all providers and is never awaited.
- OpenAIProvider.generate_embedding() delegates to self.embedding_provider. If none is configured, it raises RuntimeError. There is no silent fallback to a different embedding model, because a dimension mismatch between stored and query vectors would corrupt similarity results.
- SynonymDetector._cosine_similarity() includes dimension guards that catch mismatches at comparison time with a clear error rather than silently returning garbage similarity scores.
- _get_edge_type_embedding() detects stale cached embeddings (dimension mismatch) and regenerates them automatically.
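Two of these constraints, the dimension guard and the refusal to fall back, are easy to show in isolation. The sketch below is an assumption-laden illustration, not the project's code: cosine_similarity and ReasoningProviderSketch are hypothetical names, the exception type raised by the real dimension guard is not stated in this document (ValueError is used here), and the zero-vector handling is an added choice.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity with an explicit dimension guard (illustrative)."""
    if len(a) != len(b):
        # Fail loudly: comparing, say, a 768-dim vector with a vector of another
        # size would otherwise yield a meaningless score.
        raise ValueError(f"Embedding dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


class ReasoningProviderSketch:
    """Hypothetical provider that delegates embeddings and never falls back."""

    def __init__(self, embedding_provider: Optional[object] = None) -> None:
        self.embedding_provider = embedding_provider

    def generate_embedding(self, text: str) -> List[float]:
        # Synchronous by design: a plain call, never awaited.
        if self.embedding_provider is None:
            raise RuntimeError("No embedding provider configured")
        return self.embedding_provider.generate_embedding(text)
```

Raising in both places trades availability for correctness: a run that stops with a clear error is recoverable, while merges driven by similarity scores from mismatched vectors are not.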
CLI commands
kg vocab status # Show vocabulary size, zone, aggressiveness
kg vocab list # Show all types with categories, edges, status
kg vocab consolidate # Execute merges + prune (live)
kg vocab consolidate --dry-run # Preview candidates without executing
kg vocab consolidate --target N # Target vocabulary size (default 90; range 30–200)
kg vocab consolidate --threshold 0.90 # Auto-execute threshold (default 0.90)
kg vocab merge TYPE_A TYPE_B # Manual merge
Job management for consolidation runs:
kg job list -s pending # Status aliases: pending, running, done, failed
kg job cleanup -s done --confirm # Delete completed jobs
Status aliases (pending → awaiting_approval, running → processing, done → completed) are resolved by a shared resolveStatusFilter() utility across all job subcommands.
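The alias resolution amounts to a lookup with a pass-through default. The sketch below is written in Python for consistency with the other examples and is not the actual resolveStatusFilter() utility: resolve_status_filter is a hypothetical name, and passing unmapped values such as failed through unchanged is an assumption, since the text gives no mapping for it.

```python
# Alias table taken from the text above; anything else passes through (assumption).
STATUS_ALIASES = {
    "pending": "awaiting_approval",
    "running": "processing",
    "done": "completed",
}


def resolve_status_filter(status: str) -> str:
    """Map a CLI status alias to its stored job status (illustrative)."""
    key = status.strip().lower()
    return STATUS_ALIASES.get(key, key)
```

Keeping the table in one shared place means every job subcommand accepts the same aliases, which is the point of routing them all through a single utility.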
Known limitations
Pending reviews are in-memory only. get_pending_reviews() and approve_action() exist in the API but hold state in memory rather than persisting it. The primary workflow — fully automated consolidation — does not depend on these paths.