
ADR-025: Dynamic Relationship Vocabulary Management

Status: Proposed
Date: 2025-10-10
Deciders: System Architects
Related: ADR-024 (Multi-Schema PostgreSQL Architecture), ADR-004 (Pure Graph Design)

Overview

Imagine you're reading a medieval text about alchemy, and the author describes how one element "transmutes" another. Your knowledge graph system sees this word but only knows modern relationship types like "TRANSFORMS" or "CONVERTS". Should it reject this rich, domain-specific term and force everything into your predefined boxes? Or should it learn and adapt?

This ADR makes a fundamental choice: let the AI discover new relationship types as it reads, rather than limiting it to a fixed list. Think of it like a child learning language—starting with basic words but naturally expanding their vocabulary as they encounter new concepts. When the AI extracts "transmutes" from alchemical texts or "prophesies" from religious documents, the system automatically creates these new relationship types, stores them in a vocabulary table, and tracks how often they're used. This creates a living vocabulary that grows organically with your knowledge base, while still maintaining quality through normalization (preventing "transmutes", "TRANSMUTES", and "transmutation" from becoming separate types) and usage tracking (so you can see which terms are actually useful versus one-off oddities). The vocabulary evolves to match the domains you're studying, rather than forcing every domain into the same rigid framework.


Context

During ingestion, the LLM extraction process produces relationship types that don't match our fixed vocabulary of 30 approved types. These relationships are currently skipped with warnings, resulting in lost semantic connections.

Current Problem

Fixed Vocabulary Limitation:

ALLOWED_RELATIONSHIP_TYPES = {
    'IMPLIES', 'SUPPORTS', 'CONTRADICTS', 'RESULTS_FROM', 'ENABLES',
    'REQUIRES', 'INFLUENCES', 'COMPLEMENTS', 'OVERLAPS', 'EXTENDS',
    # ... 20 more
}

Lost Relationships (from actual ingestion):

⚠ Skipping relationship: invalid type 'ENHANCES' (no match)
⚠ Skipping relationship: invalid type 'INTEGRATES' (no match)
⚠ Skipping relationship: invalid type 'CONNECTS_TO' (no match)
⚠ Skipping relationship: invalid type 'ALIGNS_WITH' (no match)
⚠ Skipping relationship: invalid type 'PROVIDES' (no match)
⚠ Skipping relationship: invalid type 'RECEIVES' (no match)
⚠ Skipping relationship: invalid type 'POWERS' (no match)
⚠ Skipping relationship: invalid type 'EMBEDDED_IN' (no match)
⚠ Skipping relationship: invalid type 'CONTRIBUTES_TO' (no match)
⚠ Skipping relationship: invalid type 'ENABLED_BY' (no match)
⚠ Skipping relationship: invalid type 'MAINTAINS' (no match)
⚠ Skipping relationship: invalid type 'SCALES_WITH' (no match)
⚠ Skipping relationship: invalid type 'FOCUSES_ON' (no match)
⚠ Skipping relationship: invalid type 'ENSURES' (no match)
⚠ Skipping relationship: invalid type 'FEEDS' (no match)
⚠ Skipping relationship: invalid type 'INFORMS' (no match)
⚠ Skipping relationship: invalid type 'VALIDATES' (no match)

Many of these are semantically valid and would enrich the knowledge graph (e.g., ENHANCES, INTEGRATES, CONTRIBUTES_TO).

Why Fixed Vocabulary Exists

From ADR-004 (Pure Graph Design):

  • Prevents vocabulary explosion (LLMs can produce hundreds of variants)
  • Ensures semantic consistency across ingestions
  • Enables reliable graph traversal and queries
  • Maintains interpretability of relationship types

Decision

Implement a two-tier dynamic relationship vocabulary system:

  1. Capture Layer - Record all skipped relationships for analysis
  2. Vocabulary Management - Curator-approved expansion of relationship types

Architecture Components

1. Skipped Relationships Table (PostgreSQL kg_api schema)

Track all relationships that didn't match the approved vocabulary:

CREATE TABLE kg_api.skipped_relationships (
    id SERIAL PRIMARY KEY,
    relationship_type VARCHAR(100) NOT NULL,
    from_concept_label VARCHAR(500),
    to_concept_label VARCHAR(500),
    job_id VARCHAR(50),
    ontology VARCHAR(200),
    first_seen TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_seen TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    occurrence_count INTEGER DEFAULT 1,
    sample_context JSONB,  -- Store example {"from": "...", "to": "...", "confidence": ...}
    UNIQUE(relationship_type, from_concept_label, to_concept_label)
);

CREATE INDEX idx_skipped_rels_type ON kg_api.skipped_relationships(relationship_type);
CREATE INDEX idx_skipped_rels_count ON kg_api.skipped_relationships(occurrence_count DESC);
CREATE INDEX idx_skipped_rels_first_seen ON kg_api.skipped_relationships(first_seen DESC);
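
A minimal sketch of the record_skipped_relationship() upsert that feeds this table (the asyncpg driver and the function signature are assumptions; this ADR does not prescribe them). The ON CONFLICT clause leans on the table's unique constraint to bump occurrence_count and refresh last_seen on repeat sightings:

import json
import asyncpg  # assumed driver; this ADR does not prescribe one

async def record_skipped_relationship(conn: asyncpg.Connection, rel_type: str,
                                       from_label: str, to_label: str,
                                       job_id: str, ontology: str, context: dict) -> None:
    """Upsert a skipped relationship, incrementing the counter on repeats."""
    await conn.execute(
        """
        INSERT INTO kg_api.skipped_relationships AS s
            (relationship_type, from_concept_label, to_concept_label,
             job_id, ontology, sample_context)
        VALUES ($1, $2, $3, $4, $5, $6::jsonb)
        ON CONFLICT (relationship_type, from_concept_label, to_concept_label)
        DO UPDATE SET
            occurrence_count = s.occurrence_count + 1,
            last_seen = NOW()
        """,
        rel_type, from_label, to_label, job_id, ontology, json.dumps(context),
    )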

2. Relationship Vocabulary Table (PostgreSQL kg_api schema)

Centralized, version-controlled relationship vocabulary:

CREATE TABLE kg_api.relationship_vocabulary (
    relationship_type VARCHAR(100) PRIMARY KEY,
    description TEXT,
    category VARCHAR(50),  -- e.g., 'causation', 'composition', 'temporal', 'semantic'
    added_by VARCHAR(100),  -- User or system that approved it
    added_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    usage_count INTEGER DEFAULT 0,  -- Count of edges using this type
    is_active BOOLEAN DEFAULT TRUE,
    is_builtin BOOLEAN DEFAULT FALSE,  -- Original 30 types
    synonyms VARCHAR(100)[]  -- Alternative terms that map to this type
);

-- Initialize with existing 30 types
INSERT INTO kg_api.relationship_vocabulary (relationship_type, is_builtin, category, description)
VALUES
    ('IMPLIES', TRUE, 'logical', 'One concept logically implies another'),
    ('SUPPORTS', TRUE, 'evidential', 'One concept provides evidence for another'),
    ('CONTRADICTS', TRUE, 'logical', 'One concept contradicts another'),
    -- ... etc
;

3. Relationship Mapping (Synonym Support)

Allow mapping similar terms to canonical types:

-- Example: Map 'ENHANCES' and 'IMPROVES' to 'SUPPORTS'
UPDATE kg_api.relationship_vocabulary
SET synonyms = ARRAY['ENHANCES', 'IMPROVES', 'STRENGTHENS']
WHERE relationship_type = 'SUPPORTS';

-- Or add as new canonical type:
INSERT INTO kg_api.relationship_vocabulary (relationship_type, category, description)
VALUES ('ENHANCES', 'augmentation', 'One concept enhances or improves another');

Workflow

During Ingestion

def upsert_relationship(from_id, to_id, rel_type, confidence):
    # 1. Check if type is in approved vocabulary
    if rel_type in get_approved_vocabulary():
        create_graph_edge(from_id, to_id, rel_type, confidence)
    else:
        # 2. Check if it's a known synonym
        canonical_type = get_canonical_type(rel_type)
        if canonical_type:
            create_graph_edge(from_id, to_id, canonical_type, confidence)
        else:
            # 3. Log to skipped_relationships for review
            record_skipped_relationship(
                rel_type=rel_type,
                from_label=get_concept_label(from_id),
                to_label=get_concept_label(to_id),
                context={'confidence': confidence, 'job_id': current_job_id}
            )
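
A sketch of the two vocabulary lookups used above, assuming the active vocabulary and synonym map are loaded into process memory at worker start (helper and cache names are illustrative, not defined by this ADR):

import asyncpg  # assumed driver

_approved: set[str] = set()          # active canonical relationship types
_synonym_map: dict[str, str] = {}    # synonym -> canonical type

async def load_vocabulary(conn: asyncpg.Connection) -> None:
    """Load the active vocabulary once so per-relationship lookups are O(1)."""
    rows = await conn.fetch(
        "SELECT relationship_type, synonyms FROM kg_api.relationship_vocabulary "
        "WHERE is_active = TRUE"
    )
    _approved.clear()
    _synonym_map.clear()
    for row in rows:
        _approved.add(row["relationship_type"])
        for synonym in (row["synonyms"] or []):
            _synonym_map[synonym] = row["relationship_type"]

def get_approved_vocabulary() -> set[str]:
    return _approved

def get_canonical_type(rel_type: str) -> str | None:
    """Resolve a synonym to its canonical type, or None if unknown."""
    return _synonym_map.get(rel_type.upper())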

Vocabulary Expansion (Curator Process)

# CLI command to review skipped relationships
kg vocabulary review

# Output:
# Top Skipped Relationship Types:
# 1. ENHANCES (127 occurrences across 15 documents)
#    Example: "Advanced Analytics" ENHANCES "Decision Making"
# 2. INTEGRATES (89 occurrences across 12 documents)
#    Example: "API Layer" INTEGRATES "Data Pipeline"
# ...

# Approve a new relationship type
kg vocabulary add ENHANCES --category augmentation --description "One concept enhances another"

# Or map to existing type
kg vocabulary alias ENHANCES --maps-to SUPPORTS
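
The review listing above can be produced with a single aggregation over skipped_relationships; a sketch (using job_id as a proxy for document count, which is an assumption):

import asyncpg  # assumed driver

async def top_skipped_types(conn: asyncpg.Connection, limit: int = 20):
    """Aggregate skipped relationships for the curator review listing."""
    return await conn.fetch(
        """
        SELECT relationship_type,
               SUM(occurrence_count)  AS total_occurrences,
               COUNT(DISTINCT job_id) AS document_count,
               MAX(last_seen)         AS last_seen
        FROM kg_api.skipped_relationships
        GROUP BY relationship_type
        ORDER BY total_occurrences DESC
        LIMIT $1
        """,
        limit,
    )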

Backfill Process

When a new relationship type is approved, optionally backfill:

def backfill_relationship_type(rel_type):
    """
    Find all skipped instances of this relationship type and create edges.
    """
    skipped = get_skipped_by_type(rel_type)

    for skip in skipped:
        # Find concepts by label (fuzzy match if needed)
        from_id = find_concept_by_label(skip.from_concept_label)
        to_id = find_concept_by_label(skip.to_concept_label)

        if from_id and to_id:
            create_graph_edge(from_id, to_id, rel_type, confidence=0.8)

Integration with ADR-024

Add to kg_api schema in ADR-024:

-- Relationship vocabulary management
CREATE TABLE kg_api.skipped_relationships (...);
CREATE TABLE kg_api.relationship_vocabulary (...);

This fits the schema's purpose: "API state (jobs, sessions, rate limits - ephemeral, write-heavy)"

Skipped relationships are ephemeral metadata that informs vocabulary curation.

Decision Rationale

Why This Approach

  1. Data-Driven Vocabulary Growth
     • Track actual usage patterns from LLM extraction
     • Identify frequently occurring relationship types
     • Prioritize vocabulary expansion based on real needs

  2. Maintain Quality Control
     • Curator approval prevents vocabulary explosion
     • Synonym mapping reduces redundancy
     • Category organization maintains semantic structure

  3. No Data Loss
     • All skipped relationships are recorded
     • Backfill capability when types are approved
     • Audit trail of vocabulary evolution

  4. Performance
     • PostgreSQL tables (not graph) for fast aggregation
     • Indexed by type and occurrence count
     • Vocabulary lookup is an O(1) hash-table lookup in memory

Alternatives Considered

1. Automatic Vocabulary Expansion

  • Rejected: Would lead to uncontrolled vocabulary explosion
  • LLMs can produce hundreds of similar types (ENHANCES, IMPROVES, AUGMENTS, BOOSTS, etc.)
  • Breaks graph query consistency

2. LLM-Based Synonym Mapping

  • Rejected for now: Adds latency and cost to ingestion
  • Could be added later as a batch process
  • Example: Ask LLM "Is ENHANCES semantically similar to SUPPORTS?"

3. Keep Fixed Vocabulary Forever

  • Rejected: Loses valuable semantic information
  • Domain-specific knowledge graphs need domain-specific relationships
  • System should adapt to actual usage patterns

Implementation Plan

Phase 1: Capture Infrastructure (Week 1)

  1. Create PostgreSQL tables in kg_api schema
  2. Update upsert_relationship() to log skipped relationships
  3. Add aggregation queries for vocabulary analysis

Phase 2: CLI Tools (Week 1-2)

  1. kg vocabulary review - Show top skipped types
  2. kg vocabulary add - Approve new relationship type
  3. kg vocabulary alias - Map synonym to canonical type
  4. kg vocabulary stats - Usage statistics

Phase 3: Backfill & Migration (Week 2)

  1. Backfill tool to create edges for approved types
  2. Migration script to populate initial 30 types
  3. Documentation and curator guidelines

Phase 4: Performance Optimization - Edge Usage Cache (Future)

Track frequently traversed edges for performance optimization:

CREATE TABLE kg_api.edge_usage_stats (
    from_concept_id VARCHAR(100),
    to_concept_id VARCHAR(100),
    relationship_type VARCHAR(100),
    traversal_count INTEGER DEFAULT 0,
    last_traversed TIMESTAMPTZ,
    avg_query_time_ms NUMERIC(10,2),
    PRIMARY KEY (from_concept_id, to_concept_id, relationship_type)
);

CREATE INDEX idx_edge_usage_count ON kg_api.edge_usage_stats(traversal_count DESC);
CREATE INDEX idx_edge_usage_type ON kg_api.edge_usage_stats(relationship_type);

-- Hot paths cache (top 1000 most frequently traversed edges)
CREATE MATERIALIZED VIEW kg_api.hot_edges AS
SELECT from_concept_id, to_concept_id, relationship_type, traversal_count
FROM kg_api.edge_usage_stats
WHERE traversal_count > 100
ORDER BY traversal_count DESC
LIMIT 1000;

CREATE INDEX idx_hot_edges_lookup ON kg_api.hot_edges(from_concept_id, to_concept_id);

Concept Access Tracking:

-- Track node-level access patterns for pre-routing and caching
CREATE TABLE kg_api.concept_access_stats (
    concept_id VARCHAR(100) PRIMARY KEY,
    access_count INTEGER DEFAULT 0,
    last_accessed TIMESTAMPTZ,
    avg_query_time_ms NUMERIC(10,2),
    queries_as_start INTEGER DEFAULT 0,  -- How often used as query starting point
    queries_as_result INTEGER DEFAULT 0  -- How often appears in query results
);

CREATE INDEX idx_concept_access_count ON kg_api.concept_access_stats(access_count DESC);

-- Hot concepts cache (top 100 most-accessed concepts)
CREATE MATERIALIZED VIEW kg_api.hot_concepts AS
SELECT concept_id, access_count, queries_as_start
FROM kg_api.concept_access_stats
WHERE access_count > 50
ORDER BY access_count DESC
LIMIT 100;

Use Cases:

  • Pre-load hot concepts into application memory cache
  • Pre-route queries starting from popular concepts (fast path)
  • Identify trending concepts for async insight processing
  • Optimize query planning by prioritizing frequently accessed nodes
  • Detect query patterns for index optimization
  • Smart caching based on actual usage, not time
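
Both hot_edges and hot_concepts are materialized views, so they need to be rebuilt on a schedule; a minimal refresh task sketch (the five-minute interval and the task shape are assumptions):

import asyncio
import asyncpg  # assumed driver

async def refresh_hot_caches(conn: asyncpg.Connection, interval_s: int = 300) -> None:
    """Periodically rebuild the hot-edge and hot-concept materialized views."""
    while True:
        await conn.execute("REFRESH MATERIALIZED VIEW kg_api.hot_edges")
        await conn.execute("REFRESH MATERIALIZED VIEW kg_api.hot_concepts")
        await asyncio.sleep(interval_s)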

Low-Overhead Collection:

async def track_concept_access(concept_id: str, query_type: str):
    """
    Non-blocking access tracking - fire and forget.
    Every query collects stats without performance impact.
    """
    # query_type is interpolated into a column name, so it must come from a fixed
    # internal whitelist (e.g. 'start', 'result'), never from external input
    column = f"queries_as_{query_type}"

    # Async upsert (doesn't block query execution); prefer parameter binding for
    # concept_id in production rather than string interpolation
    asyncio.create_task(
        db.execute(f"""
            INSERT INTO kg_api.concept_access_stats (concept_id, access_count, {column})
            VALUES ('{concept_id}', 1, 1)
            ON CONFLICT (concept_id) DO UPDATE SET
                access_count = concept_access_stats.access_count + 1,
                last_accessed = NOW(),
                {column} = concept_access_stats.{column} + 1
        """)
    )

# Usage in queries:
async def search_concepts(query):
    results = await execute_search(query)
    for result in results:
        # Fire and forget: schedule the tracking coroutine without awaiting it
        asyncio.create_task(track_concept_access(result.concept_id, 'result'))
    return results

async def find_related(concept_id):
    asyncio.create_task(track_concept_access(concept_id, 'start'))  # Track starting point
    return await execute_traversal(concept_id)

Example Query Optimization:

def find_related_concepts(concept_id, max_depth=2):
    # 1. Check hot edges cache first (in-memory Redis/dict)
    cached_neighbors = get_hot_edges_from_cache(concept_id)
    if cached_neighbors and max_depth == 1:
        return cached_neighbors  # Fast path!

    # 2. Fall back to full graph traversal
    return execute_graph_query(concept_id, max_depth)

Phase 5: Advanced Vocabulary Features (Future)

  1. LLM-assisted synonym detection (batch process)
  2. Relationship type embeddings for similarity search
  3. Auto-suggest synonyms during approval
  4. Relationship type analytics dashboard
  5. Edge materialized views for common query patterns

Monitoring & Metrics

Key Metrics

  1. Vocabulary Growth Rate
     • New types approved per month
     • Ratio of builtin vs. custom types

  2. Coverage Rate (a computation sketch follows this list)
     • % of extracted relationships that match vocabulary
     • % of relationships skipped

  3. Backfill Impact
     • Edges created through backfill
     • Concept connectivity improvements

  4. Type Usage Distribution
     • Most/least used relationship types
     • Identify candidates for deprecation
Alerts

  • Alert if skipped relationship rate > 30%
  • Alert if new unique types > 100/day (possible extraction issue)
  • Alert if vocabulary size > 200 types (over-expansion)

Vocabulary Pruning & Lifecycle Management

Problem: Vocabulary Bloat

Performance Impact:

  • Larger vocabulary → slower synonym lookups during ingestion
  • More types to check → increased decision tree complexity
  • Noisy graph with rarely-used relationship types

Solution: Automatic Deprecation Process

Natural Freshness via Upsert Mechanism

Critical Insight: Every ingestion refreshes activation naturally!

Unlike neural networks, where weights decay without explicit training:

  • New document ingestion → Concept matching via embeddings
  • Concept reuse → Creates new edges to existing concept
  • Edge creation → Increments usage_count and traversal_count
  • Automatic refresh → Semantically relevant concepts stay active

async def upsert_concept(label, embedding):
    """
    Every concept match refreshes activation - no time-based decay needed!
    """
    # Find existing concept via semantic similarity
    existing = vector_search(embedding, threshold=0.8)

    if existing:
        # REUSE triggers activation refresh
        await track_concept_access(existing.concept_id, 'upsert')
        return existing.concept_id
    else:
        # Create new concept
        return create_concept(label, embedding)

async def create_relationship(from_id, to_id, rel_type):
    """
    Edge creation refreshes both nodes AND relationship type.
    """
    # Create graph edge
    cypher_create_edge(from_id, to_id, rel_type)

    # Refresh activation for BOTH concepts
    await track_concept_access(from_id, 'relationship_source')
    await track_concept_access(to_id, 'relationship_target')

    # Increment relationship type usage
    await increment_vocabulary_usage(rel_type)
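
A sketch of increment_vocabulary_usage() referenced above (assumed implementation; this ADR only names the function):

import asyncpg  # assumed driver

async def increment_vocabulary_usage(conn: asyncpg.Connection, rel_type: str) -> None:
    """Bump the per-type edge counter each time an edge of this type is created."""
    await conn.execute(
        """
        UPDATE kg_api.relationship_vocabulary
        SET usage_count = usage_count + 1
        WHERE relationship_type = $1
        """,
        rel_type,
    )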

Self-Regulating System:

  1. Foundational concepts stay active
     • Core ideas appear across many documents
     • Constant matching → constant refresh
     • Example: "machine learning" → always relevant

  2. Obsolete concepts naturally fade
     • Old/outdated ideas stop appearing in new documents
     • No matches → no refreshes → activation decays organically
     • Example: "floppy disk drivers" → rarely mentioned

  3. Bridge concepts get refreshed
     • Low-activation bridge to high-activation concept
     • High-activation concept gets new edges
     • Bridge traversal → bridge concept refreshed too

  4. Seasonal concepts cycle naturally
     • Domain-specific ideas wax and wane with document flow
     • "tax optimization" → high in Q1, low rest of year
     • No artificial time windows needed!

No Time-Based Decay Required:

  • ❌ Don't prune based on "last_used" timestamp
  • ✅ Prune based on structural value (edges × traversals × bridges)
  • ✅ Let ingestion patterns determine relevance
  • ✅ Trust the graph to self-regulate

Natural Activation Refresh Loop

New Document Ingestion
         ↓
    Extract Concepts
         ↓
Vector Similarity Match ←─────┐
         ↓                     │
    Existing Match?            │
         ↓                     │
    YES → Reuse Concept        │
         ↓                     │
    Create New Edges           │
         ↓                     │
    Increment usage_count ─────┘  [REFRESH LOOP]
         ↓
Both Endpoints Activated
         ↓
    Relationship Type +1
         ↓
Bridge Detection Updated

Key Architectural Properties

1. Semantic-Based Freshness
  • Relevance = "Does new content reference this concept?"
  • Not "When was it last accessed?"
  • The corpus drives activation naturally

2. Zero Configuration
  • No decay parameters to tune
  • No time windows to configure
  • No manual pruning schedules
  • Graph self-regulates through ingestion patterns

3. Emergent Patterns
  • Foundational concepts → persistent high activation
  • Obsolete concepts → gradual natural decay
  • Seasonal concepts → organic cyclical patterns
  • Bridge concepts → transitive activation via neighbors

4. Catastrophic Forgetting Prevention
  • Bridge bonus preserves low-activation connectors
  • Structural value (topology) > activation alone
  • Graph topology "remembers" important paths

5. Living Knowledge Representation
  • Ideas stay "active" when they appear in new contexts
  • Naturally fade when they stop being relevant
  • Mirrors how human collective consciousness works
  • Self-organizing semantic relevance

Pruning Strategy

-- Pruning is value-based, not time-based
-- Find low-value relationship types (bottom of value score ranking)
WITH value_scores AS (
    SELECT
        v.relationship_type,
        v.usage_count as edge_count,
        COALESCE(e.avg_traversal, 0) as avg_traversal,
        -- Value = edges × traversal frequency
        v.usage_count * (1.0 + COALESCE(e.avg_traversal, 0) / 100.0) as value_score
    FROM kg_api.relationship_vocabulary v
    LEFT JOIN (
        SELECT relationship_type, AVG(traversal_count) as avg_traversal
        FROM kg_api.edge_usage_stats
        GROUP BY relationship_type
    ) e ON v.relationship_type = e.relationship_type
    WHERE v.is_builtin = FALSE AND v.is_active = TRUE
)
SELECT * FROM value_scores
ORDER BY value_score ASC  -- Lowest value first (pruning candidates)
LIMIT 10;

Automated Pruning Process

Triggered when vocabulary exceeds max limit:

def prune_to_maintain_window():
    """
    Prune lowest-value relationship types when vocabulary exceeds max.
    """
    active_count = count_active_custom_types()

    if active_count > VOCABULARY_WINDOW['max']:
        prune_count = active_count - VOCABULARY_WINDOW['max']

        # Get custom types ordered by structural value (highest first, as returned
        # by get_custom_types_ordered_by_value below); the bottom of the list holds
        # the lowest-value pruning candidates
        candidates = get_custom_types_ordered_by_value()

        for candidate in candidates[-prune_count:]:  # Take bottom N (lowest value)
            # Check if edges exist in graph
            edge_count = count_graph_edges(candidate.relationship_type)

            if edge_count == 0:
                # Zero edges - safe to completely remove
                delete_relationship_type(
                    candidate.relationship_type,
                    reason=f"No structural value (0 edges, 0 traversals)"
                )
            else:
                # Has edges - deactivate but preserve graph integrity
                mark_inactive(
                    candidate.relationship_type,
                    reason=f"Low structural value (score={candidate.value_score:.2f})"
                )

Pruning Levels:

  1. Deactivation (is_active = FALSE)
     • Stop accepting new relationships of this type
     • Existing graph edges remain intact
     • Can still query existing data
     • Curator can reactivate if structural importance increases

  2. Removal (delete from vocabulary)
     • Only if ZERO graph edges exist
     • Only for non-builtin types
     • Automatic when value score = 0

Reactivation

# Curator can reactivate if usage pattern changes
kg vocabulary reactivate ENHANCES --reason "New domain requires this type"

Vocabulary Size Limits - Sliding Window Strategy

Maintain a functional vocabulary window with min/max boundaries:

VOCABULARY_WINDOW = {
    'min': 30,   # Core builtin types (never prune)
    'max': 100,  # Soft limit for active custom types
    'total_hard_limit': 500  # Including deprecated
}

Sliding Window Pruning: When vocabulary reaches max limit, automatically prune least valuable types to maintain window:

def maintain_vocabulary_window():
    """
    Keep vocabulary between min and max by pruning least useful types.
    """
    active_count = count_active_vocabulary()

    if active_count > VOCABULARY_WINDOW['max']:
        # Calculate how many to prune
        prune_count = active_count - VOCABULARY_WINDOW['max']

        # Get pruning candidates (excluding builtins)
        candidates = get_custom_types_ordered_by_value()

        # Prune bottom N types
        for candidate in candidates[-prune_count:]:
            if candidate.edge_count == 0:
                delete_type(candidate)  # Safe to remove
            else:
                deprecate_type(candidate)  # Has edges, just deactivate

def get_custom_types_ordered_by_value():
    """
    Order custom types by value score for pruning decisions.

    Value Score = (edge_count * traversal_weight * bridge_bonus)
    - edge_count: How many edges exist in the graph with this type
    - traversal_weight: How frequently these edges are traversed in queries
    - bridge_bonus: Prevents catastrophic forgetting of critical bridge nodes

    Time/age is IRRELEVANT - a graph's value is structural, not temporal.

    CRITICAL INSIGHT: Low-activation nodes can have high structural value!
    A rarely-accessed node might be a BRIDGE to high-activation subgraphs.
    We must remember these bridges even if they're not popular endpoints.
    """
    return db.execute("""
        WITH bridge_scores AS (
            -- Calculate bridge value: low-activation nodes connecting to high-activation nodes
            SELECT
                e.relationship_type,
                COUNT(*) as bridge_count,
                AVG(c_to.access_count) as avg_destination_activation
            FROM kg_api.edge_usage_stats e
            JOIN kg_api.concept_access_stats c_from ON e.from_concept_id = c_from.concept_id
            JOIN kg_api.concept_access_stats c_to ON e.to_concept_id = c_to.concept_id
            WHERE c_from.access_count < 10  -- Low activation source
              AND c_to.access_count > 100    -- High activation destination
            GROUP BY e.relationship_type
        )
        SELECT
            v.relationship_type,
            v.usage_count as edge_count,
            COALESCE(e.avg_traversal, 0) as avg_traversal,
            COALESCE(b.bridge_count, 0) as bridge_count,
            COALESCE(b.avg_destination_activation, 0) as bridge_value,
            -- Value score: edge count × traversal × (1 + bridge bonus)
            v.usage_count *
            (1.0 + COALESCE(e.avg_traversal, 0) / 100.0) *
            (1.0 + COALESCE(b.bridge_count, 0) / 10.0) as value_score
        FROM kg_api.relationship_vocabulary v
        LEFT JOIN (
            SELECT relationship_type, AVG(traversal_count) as avg_traversal
            FROM kg_api.edge_usage_stats
            GROUP BY relationship_type
        ) e ON v.relationship_type = e.relationship_type
        LEFT JOIN bridge_scores b ON v.relationship_type = b.relationship_type
        WHERE v.is_builtin = FALSE AND v.is_active = TRUE
        ORDER BY value_score DESC
    """)

def calculate_bridge_importance(concept_id):
    """
    Prevents catastrophic forgetting by identifying bridge nodes.

    A concept might have LOW activation (rarely accessed) but HIGH value
    because it bridges to important subgraphs.

    Example:
    - Concept "distributed consensus" has 5 accesses (LOW)
    - But it connects to "raft algorithm" (500 accesses, HIGH)
    - And "paxos protocol" (300 accesses, HIGH)
    - → Don't prune! It's a critical bridge.
    """
    return db.execute(f"""
        SELECT
            c.concept_id,
            c.access_count as own_activation,
            COUNT(neighbor.concept_id) as high_value_neighbors,
            AVG(neighbor.access_count) as avg_neighbor_activation
        FROM kg_api.concept_access_stats c
        JOIN kg_api.edge_usage_stats e ON c.concept_id = e.from_concept_id
        JOIN kg_api.concept_access_stats neighbor ON e.to_concept_id = neighbor.concept_id
        WHERE c.concept_id = '{concept_id}'
          AND neighbor.access_count > 100  -- High activation threshold
        GROUP BY c.concept_id, c.access_count
    """)

Pruning Strategy:

  1. Below min (30): Never prune (builtin types protected)
  2. Between min and max (30-100): Stable, no pruning
  3. Above max (100+): Auto-prune lowest-value types to return to max
  4. Hard limit (500 total): Block new types, force curator review (see the guard sketch below)
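
A sketch of the hard-limit guard from item 4, applied before any new approval (the function name and driver are assumptions):

import asyncpg  # assumed driver

VOCABULARY_WINDOW = {'min': 30, 'max': 100, 'total_hard_limit': 500}

async def can_add_relationship_type(conn: asyncpg.Connection) -> bool:
    """Block new approvals once active + deprecated types hit the hard limit."""
    total = await conn.fetchval("SELECT COUNT(*) FROM kg_api.relationship_vocabulary")
    return total < VOCABULARY_WINDOW['total_hard_limit']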

Example Scenario:

Current state:
- Builtin: 30 (protected)
- Custom active: 95 (within window)
- Deprecated: 50 (historical)

New type requested: "ENHANCES"
Action: Approve (still below max of 100)

After 10 more approvals:
- Builtin: 30
- Custom active: 105 (exceeds max!)

Auto-pruning triggers:
1. Calculate value scores for all 105 custom active types
2. Prune bottom 5 types to return to max (100)
3. Types with zero edges: deleted
4. Types with edges: deprecated (is_active = FALSE)

Benefits:

  • ✅ Prevents vocabulary bloat automatically
  • ✅ Keeps the most valuable types (frequently used, structurally important)
  • ✅ No manual intervention needed for steady-state operations
  • ✅ Curator only needed for exceptions or when the hard limit is reached

Security & Governance

Access Control

  • Read vocabulary: All users (needed for ingestion)
  • Approve new types: Curators only (role: kg_curator)
  • Modify builtins: System admins only

Audit Trail

CREATE TABLE kg_api.vocabulary_audit (
    id SERIAL PRIMARY KEY,
    relationship_type VARCHAR(100),
    action VARCHAR(50),  -- 'added', 'aliased', 'deprecated', 'backfilled'
    performed_by VARCHAR(100),
    performed_at TIMESTAMPTZ DEFAULT NOW(),
    details JSONB
);
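
A sketch of how vocabulary changes could be written to this audit trail (assumed helper; the ADR defines only the table):

import json
import asyncpg  # assumed driver

async def log_vocabulary_action(conn: asyncpg.Connection, rel_type: str,
                                action: str, performed_by: str, details: dict) -> None:
    """Append one entry to the vocabulary audit trail."""
    await conn.execute(
        """
        INSERT INTO kg_api.vocabulary_audit
            (relationship_type, action, performed_by, details)
        VALUES ($1, $2, $3, $4::jsonb)
        """,
        rel_type, action, performed_by, json.dumps(details),
    )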

Semantic Drift and Ambiguity Prevention

Risk: As the vocabulary grows, there is a risk that the meaning of relationship types could become ambiguous or drift over time, especially if multiple curators are involved.

Mitigation: The relationship_vocabulary table defined in this ADR must be treated as a formal semantic registry. This is critical for maintaining semantic consistency without significant token overhead during ingestion.

Requirements:

  1. Clear, Unambiguous Descriptions
     • Every relationship type MUST include a precise description of its meaning
     • Description should specify the semantic relationship between concepts
     • Include usage examples to prevent misinterpretation
     • Example:

    INSERT INTO kg_api.relationship_vocabulary
    (relationship_type, description, category)
    VALUES (
        'ENHANCES',
        'One concept improves or strengthens another concept. The source concept adds value, capability, or effectiveness to the target concept without fundamentally changing it.',
        'augmentation'
    );

  2. Single Source of Truth
     • The relationship_vocabulary table is the authoritative definition
     • All curators, developers, and LLM extraction prompts reference this registry
     • No informal or undocumented relationship interpretations
     • Version control all changes via the vocabulary_audit table

  3. Accessibility
     • Vocabulary accessible via API: GET /vocabulary/{relationship_type}
     • CLI command: kg vocabulary show ENHANCES
     • Included in curator training and documentation
     • Displayed during approval workflow for context

  4. Curation Guidelines
     • Before approving a new type, check for semantic overlap with existing types
     • Consider synonym mapping if meaning is substantially similar
     • Document decision rationale in vocabulary_audit.details
     • Require curator to confirm they've read existing descriptions

Token Efficiency:

  • Descriptions stored in the database, NOT in prompts
  • LLM extraction uses relationship type names only (minimal tokens)
  • Full semantic context queried only during curation
  • Result: Rich semantic registry without prompt bloat
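
A sketch of how the extraction prompt can stay compact by injecting only the active type names (the helper name and prompt format are illustrative assumptions):

import asyncpg  # assumed driver

async def relationship_types_for_prompt(conn: asyncpg.Connection) -> str:
    """Return a compact, comma-separated list of active type names for extraction prompts."""
    rows = await conn.fetch(
        "SELECT relationship_type FROM kg_api.relationship_vocabulary "
        "WHERE is_active = TRUE ORDER BY relationship_type"
    )
    return ", ".join(r["relationship_type"] for r in rows)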

Example Curator Workflow:

# Review candidate with semantic context
kg vocabulary review --show-similar ENHANCES

# Output shows existing types with similar semantics:
# SUPPORTS: "One concept provides evidence for another"
# STRENGTHENS: "One concept reinforces another" (SYNONYM of SUPPORTS)
#
# Curator decision: ENHANCES is semantically distinct
# - SUPPORTS = evidential relationship
# - ENHANCES = augmentation relationship
# → Approve as new type

kg vocabulary add ENHANCES \
  --category augmentation \
  --description "One concept improves or strengthens another without fundamentally changing it" \
  --example "Advanced Analytics ENHANCES Decision Making"

Complexity of Backfilling

Risk: When a new relationship type is approved, the process of backfilling it—finding all previously skipped instances and creating the corresponding edges in the graph—can be a complex and computationally expensive operation, especially on a large graph.

Mitigation: Backfilling should be implemented as a dedicated, asynchronous background job that can be scheduled during off-peak hours. The system should also allow for selective backfilling, prioritizing the most frequent or important relationships first.

Implementation Approach:

async def backfill_relationship_type(rel_type: str, options: dict):
    """
    Asynchronous background job for backfilling approved relationship types.

    Args:
        rel_type: The approved relationship type to backfill
        options: Configuration for backfill strategy
            - mode: 'full' | 'selective' | 'dry-run'
            - priority: 'frequency' | 'ontology' | 'manual'
            - batch_size: Number of edges to create per transaction
            - throttle_ms: Delay between batches to reduce load
    """
    # Get all skipped instances for this type
    skipped = await db.execute(f"""
        SELECT relationship_type, from_concept_label, to_concept_label,
               occurrence_count, job_id, ontology
        FROM kg_api.skipped_relationships
        WHERE relationship_type = '{rel_type}'
        ORDER BY occurrence_count DESC  -- High frequency first
    """)

    total = len(skipped)
    created = 0
    failed = 0

    # Process in batches to avoid transaction timeouts
    batch_size = options.get('batch_size', 100)
    throttle = options.get('throttle_ms', 50)

    for i in range(0, total, batch_size):
        batch = skipped[i:i+batch_size]

        for skip in batch:
            try:
                # Find concepts by label (fuzzy match with embedding similarity)
                from_id = await find_concept_by_label(skip.from_concept_label)
                to_id = await find_concept_by_label(skip.to_concept_label)

                if from_id and to_id:
                    if options.get('mode') == 'dry-run':
                        logger.info(f"[DRY-RUN] Would create: {from_id} -{rel_type}-> {to_id}")
                    else:
                        await create_graph_edge(from_id, to_id, rel_type, confidence=0.8)
                        created += 1
                else:
                    failed += 1
                    logger.warning(f"Could not resolve concepts for backfill: {skip}")

            except Exception as e:
                failed += 1
                logger.error(f"Backfill error: {e}")

        # Throttle between batches to reduce database load
        await asyncio.sleep(throttle / 1000.0)

        # Update progress
        progress = ((i + len(batch)) / total) * 100
        await update_job_progress(f"backfill_{rel_type}", progress)

    # Log completion
    await log_backfill_completion(rel_type, created, failed, total)

Selective Backfilling Strategies:

  1. By Frequency (default)
     • Backfill most common relationships first
     • ORDER BY occurrence_count DESC
     • Creates high-value edges before low-value edges

  2. By Ontology
     • Curator selects specific ontology/domain to backfill
     • WHERE ontology = 'Production ML Models'
     • Useful for focused graph enrichment

  3. By Job ID
     • Backfill only specific ingestion jobs
     • WHERE job_id IN (...)
     • Allows targeted corrections

  4. Dry-Run Mode
     • Preview backfill impact before execution
     • Shows: "Would create 1,247 edges of type ENHANCES"
     • Curator can approve after review

CLI Commands:

# Preview backfill impact
kg vocabulary backfill ENHANCES --dry-run

# Full backfill (scheduled as background job)
kg vocabulary backfill ENHANCES --schedule off-peak

# Selective backfill by ontology
kg vocabulary backfill ENHANCES --ontology "ML Concepts" --batch-size 50

# Prioritize by frequency
kg vocabulary backfill ENHANCES --mode frequency --top 1000

Performance Considerations:

  • Batch Processing: Process in configurable batches (default 100 edges)
  • Throttling: Delay between batches prevents database saturation
  • Off-Peak Scheduling: Run during low-traffic periods (3-6 AM)
  • Transaction Management: Commit per-batch to avoid long-running transactions
  • Progress Tracking: Real-time progress updates via job status API
  • Rollback Support: Can revert a backfill if issues are detected (see the sketch below)
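
Rollback support implies that backfilled edges can be identified later. One possible approach, sketched under the assumption (not stated in this ADR) that create_graph_edge() stamps each backfilled edge with a backfill_job_id property:

from typing import Any, Awaitable, Callable

async def rollback_backfill(
    run_cypher: Callable[[str, dict], Awaitable[list[dict[str, Any]]]],
    rel_type: str,
    backfill_job_id: str,
) -> int:
    """Delete edges created by one backfill run and return how many were removed."""
    # run_cypher is whatever graph-query helper the ingestion service already uses
    rows = await run_cypher(
        """
        MATCH ()-[r]->()
        WHERE type(r) = $rel_type AND r.backfill_job_id = $job_id
        DELETE r
        RETURN count(r) AS removed
        """,
        {"rel_type": rel_type, "job_id": backfill_job_id},
    )
    return rows[0]["removed"]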

Resource Impact Mitigation:

# Check graph size before backfill
def estimate_backfill_cost(rel_type: str):
    """
    Estimate resource cost before running backfill.
    """
    stats = db.execute(f"""
        SELECT
            COUNT(*) as edge_count,
            COUNT(DISTINCT ontology) as ontology_count,
            SUM(occurrence_count) as total_occurrences
        FROM kg_api.skipped_relationships
        WHERE relationship_type = '{rel_type}'
    """)

    estimated_minutes = (stats.edge_count * 50) / 1000 / 60  # ~50 ms per edge, converted to minutes

    return {
        'edges_to_create': stats.edge_count,
        'estimated_duration_minutes': estimated_minutes,
        'ontologies_affected': stats.ontology_count,
        'recommendation': 'scheduled' if stats.edge_count > 1000 else 'immediate'
    }

Documentation Impact

New Documentation Needed

  1. Curator Guide: How to review and approve relationship types
  2. Vocabulary Guidelines: Naming conventions, categories
  3. API Documentation: New endpoints for vocabulary management
  4. Migration Guide: How to backfill approved types

Updates Required

  • ADR-024: Add vocabulary tables to kg_api schema
  • CLAUDE.md: Add vocabulary management section
  • API docs: Document /vocabulary/* endpoints

Open Questions

  1. Backfill Strategy: Automatic or manual trigger?
     • Recommendation: Manual trigger with dry-run preview

  2. Synonym Detection: Use LLM or manual mapping only?
     • Recommendation: Start manual, add LLM batch process later

  3. Category Taxonomy: Fixed categories or user-defined?
     • Recommendation: Start with fixed (causation, composition, temporal, semantic, augmentation, evidential, logical), allow custom later

  4. Deprecation Policy: How to handle unused types?
     • Recommendation: Automatic deprecation process (see Vocabulary Pruning & Lifecycle Management above)

Success Criteria

  1. Zero Lost Relationships: All extracted relationships either matched or logged
  2. Curator Workflow: < 5 minutes to review and approve a batch of types
  3. Vocabulary Quality: < 10% synonym overlap (e.g., ENHANCES and SUPPORTS both active)
  4. Performance: Vocabulary lookup adds < 1ms to ingestion per relationship

References

  • ADR-004: Pure Graph Design (original vocabulary rationale)
  • ADR-024: Multi-Schema PostgreSQL Architecture (database infrastructure)
  • openCypher specification: Relationship type constraints
  • Neo4j vocabulary management best practices