
ADR-072: Concept Matching Strategies and Configuration

Status: Draft
Date: 2025-12-04
Supersedes: Initial ingestion implementation (no formal ADR)

Context

Concept similarity matching during ingestion currently uses hardcoded parameters and an exhaustive search strategy that will not scale. The system has evolved since initial implementation:

Current State (Hardcoded)

# api/api/lib/ingestion.py:336
matches = age_client.vector_search(
    embedding=embedding,
    threshold=0.85,  # ❌ Hardcoded
    top_k=5          # ❌ Hardcoded
)

Issues:

  1. No pgvector: Full O(n) Python numpy scan on every concept match
  2. Hardcoded threshold: 0.85 may not suit all use cases (research vs brainstorming)
  3. Exhaustive search: Checks all concepts, ignores graph structure
  4. Not configurable: Users can't tune for their corpus characteristics

New Capabilities Available

Since initial implementation, the platform now supports:

  1. pgvector Extension - Can be added to PostgreSQL for native vector indexing
  2. Degree Centrality Data - Graph structure reveals hub concepts
  3. ADR-041 Configuration Pattern - Database-first configuration with hot-reload
  4. ADR-071 Epsilon-Greedy Pattern - Proven degree-biased search strategies

Performance Bottleneck

With 10k+ concepts, ingestion slows dramatically:

  • Current: 100 concepts/chunk × 500ms search = 50 seconds/chunk
  • With pgvector: 100 concepts × 10ms = 1 second/chunk
  • 50x speedup potential

Search Strategy Insight

ADR-071 discovered that degree centrality bias improves both performance and quality for graph traversal. The same principle applies to concept matching:

High-degree concepts are:

  • More semantically stable (validated by many sources)
  • More likely canonical (referenced frequently)
  • Better deduplication targets (consolidation nodes)

Decision

Implement a concept matching system with three search strategies, database-first configuration, and pgvector indexing.

1. Database Configuration Schema

-- Migration: Add concept matching configuration
CREATE TABLE IF NOT EXISTS kg_api.concept_match_config (
    config_id SERIAL PRIMARY KEY,
    strategy VARCHAR(50) NOT NULL DEFAULT 'exhaustive',
        CHECK (strategy IN ('exhaustive', 'degree_biased', 'degree_only')),
    similarity_threshold FLOAT NOT NULL DEFAULT 0.85,
        CHECK (similarity_threshold >= 0.0 AND similarity_threshold <= 1.0),
    top_k INTEGER NOT NULL DEFAULT 5,
        CHECK (top_k >= 1 AND top_k <= 20),
    degree_percentile FLOAT NOT NULL DEFAULT 0.75,
        CHECK (degree_percentile >= 0.0 AND degree_percentile <= 1.0),
    evidence_weight FLOAT,
        CHECK (evidence_weight IS NULL OR (evidence_weight >= 0.0 AND evidence_weight <= 1.0)),
    description TEXT,
    is_active BOOLEAN NOT NULL DEFAULT TRUE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Insert default configuration (exhaustive, current behavior)
INSERT INTO kg_api.concept_match_config (
    strategy,
    similarity_threshold,
    top_k,
    degree_percentile,
    evidence_weight,
    description,
    is_active
) VALUES (
    'exhaustive',
    0.85,
    5,
    0.75,
    NULL,
    'Default configuration: exhaustive search with 0.85 threshold',
    TRUE
);

2. Three Search Strategies

Following the ADR-071 epsilon-greedy pattern:

Strategy        Description                      Performance    Quality        Use Case
exhaustive      Search all concepts (current)    O(n), slow     Best           Default, small graphs (<10k)
degree_biased   80% high-degree + 20% random     2-3x faster    Good           Growing graphs, balanced
degree_only     Only high-degree concepts        Fastest        Conservative   Large mature graphs (>50k)

Degree Centrality Filter:

  • Calculate degree for each concept: count(relationships)
  • Filter to the top N percentile (default: 75th percentile = top 25%)
  • Only search the filtered set (degree_only) or bias search toward it (degree_biased)
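
A minimal sketch of the percentile cutoff, assuming the get_degree_threshold() helper planned in Phase 3; the signature and inputs are illustrative, not a final API:

# Hypothetical helper: map a percentile to a minimum degree cutoff.
def get_degree_threshold(degrees: list[int], percentile: float = 0.75) -> int:
    """Return the minimum degree needed to land in the top (1 - percentile) slice."""
    if not degrees:
        return 0
    ranked = sorted(degrees)
    index = min(int(len(ranked) * percentile), len(ranked) - 1)
    return ranked[index]

# Example: degrees [1, 2, 2, 3, 8, 15, 40, 120] at percentile 0.75 give a
# threshold of 40, so only the top 25% of concepts by degree are searched
# (degree_only) or favored (degree_biased).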

3. pgvector Integration

-- Add pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Add vector column for fast similarity search
ALTER TABLE ag_catalog.concept
ADD COLUMN embedding_vec vector(1536);

-- Populate from existing JSON embeddings
UPDATE ag_catalog.concept
SET embedding_vec = embedding::vector
WHERE embedding IS NOT NULL;

-- Create index (IVFFlat with cosine distance)
CREATE INDEX concept_embedding_idx
ON ag_catalog.concept
USING ivfflat (embedding_vec vector_cosine_ops)
WITH (lists = 100);

Query Pattern with Degree Centrality (following ADR-071):

// Degree-biased search (calculate degree inline, no caching needed)
MATCH (c:Concept)-[r]-()
WHERE c.embedding IS NOT NULL
WITH c, count(r) AS degree
ORDER BY degree DESC
LIMIT 1000  // Pre-filter to top N by degree

// Then apply pgvector similarity on the filtered set
// (combined Cypher + SQL for best performance)

pgvector Similarity Search:

-- Fast similarity search with pgvector
SELECT c.concept_id, c.label, c.description,
       1 - (c.embedding_vec <=> $1::vector) AS similarity
FROM ag_catalog.concept c
WHERE c.embedding_vec IS NOT NULL
  AND 1 - (c.embedding_vec <=> $1::vector) >= $2  -- $2 = similarity_threshold
ORDER BY c.embedding_vec <=> $1::vector
LIMIT $3;  -- $3 = top_k
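
For illustration, a hedged sketch of calling this query from Python, assuming psycopg 3 with the pgvector-python adapter; find_similar_concepts and the connection handling are illustrative:

# Run the pgvector similarity query from Python (assumes psycopg 3 + pgvector-python).
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def find_similar_concepts(conn: psycopg.Connection, embedding: np.ndarray,
                          threshold: float = 0.85, top_k: int = 5):
    register_vector(conn)  # lets psycopg bind numpy arrays as pgvector values
    return conn.execute(
        """
        SELECT c.concept_id, c.label,
               1 - (c.embedding_vec <=> %s) AS similarity
        FROM ag_catalog.concept c
        WHERE c.embedding_vec IS NOT NULL
          AND 1 - (c.embedding_vec <=> %s) >= %s
        ORDER BY c.embedding_vec <=> %s
        LIMIT %s
        """,
        (embedding, embedding, threshold, embedding, top_k),
    ).fetchall()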

4. Configuration Loading

Following the ADR-041 pattern (database-first with fallbacks):

# api/api/lib/concept_matcher.py
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class ConceptMatchConfig:
    """Configuration for concept similarity matching."""
    strategy: str = "exhaustive"
    similarity_threshold: float = 0.85
    top_k: int = 5
    degree_percentile: float = 0.75
    evidence_weight: Optional[float] = None
    @classmethod
    def from_database(cls, db_client) -> "ConceptMatchConfig":
        """Load active configuration from database."""
        query = """
        SELECT strategy, similarity_threshold, top_k,
               degree_percentile, evidence_weight
        FROM kg_api.concept_match_config
        WHERE is_active = TRUE
        ORDER BY updated_at DESC
        LIMIT 1
        """
        try:
            result = db_client.execute(query)
            if result:
                return cls(**result[0])
        except Exception as e:
            logger.warning(f"Failed to load concept match config: {e}")

        # Fallback to defaults
        return cls()

# Load once at startup, cache, reload on config change
_config_cache: Optional[ConceptMatchConfig] = None

def get_concept_match_config(db_client) -> ConceptMatchConfig:
    """Get active concept matching configuration (cached)."""
    global _config_cache
    if _config_cache is None:
        _config_cache = ConceptMatchConfig.from_database(db_client)
    return _config_cache

def reload_concept_match_config(db_client):
    """Force reload of configuration (called after admin updates)."""
    global _config_cache
    _config_cache = None
    return get_concept_match_config(db_client)
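
Illustrative usage of the cache and reload helpers above (values assume the default configuration row):

config = get_concept_match_config(db_client)   # first call hits the database
assert config.strategy == "exhaustive" and config.similarity_threshold == 0.85

# After an admin updates the configuration, force a reload so the next
# ingestion job sees the new values:
config = reload_concept_match_config(db_client)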

5. API Endpoints

GET /admin/concept-match/config - View current configuration

{
  "strategy": "exhaustive",
  "similarity_threshold": 0.85,
  "top_k": 5,
  "degree_percentile": 0.75,
  "evidence_weight": null,
  "description": "Default configuration: exhaustive search with 0.85 threshold"
}

GET /admin/concept-match/strategies - List available strategies

{
  "strategies": [
    {
      "name": "exhaustive",
      "description": "Search all concepts with embeddings",
      "performance": "O(n) - slower at scale",
      "quality": "Best - finds optimal match",
      "recommended_for": "Small graphs (<10k concepts), strict deduplication"
    },
    {
      "name": "degree_biased",
      "description": "80% high-degree concepts + 20% random exploration",
      "performance": "2-3x faster than exhaustive",
      "quality": "Good - biases toward stable hub concepts",
      "recommended_for": "Growing graphs, balanced performance/quality"
    },
    {
      "name": "degree_only",
      "description": "Only search high-degree hub concepts",
      "performance": "Fastest - searches top 25% by degree",
      "quality": "Conservative - may miss novel connections",
      "recommended_for": "Large mature graphs (>50k concepts)"
    }
  ],
  "parameters": {
    "similarity_threshold": {
      "type": "float",
      "range": [0.0, 1.0],
      "default": 0.85,
      "description": "Minimum cosine similarity for concept matching"
    },
    "top_k": {
      "type": "integer",
      "range": [1, 20],
      "default": 5,
      "description": "Maximum candidate concepts to evaluate"
    },
    "degree_percentile": {
      "type": "float",
      "range": [0.0, 1.0],
      "default": 0.75,
      "description": "Degree centrality threshold (0.75 = top 25%)"
    },
    "evidence_weight": {
      "type": "float",
      "range": [0.0, 1.0],
      "default": null,
      "description": "Weight for evidence-aware matching (null = disabled)"
    }
  }
}

POST /admin/concept-match/config - Update configuration

{
  "strategy": "degree_biased",
  "similarity_threshold": 0.87,
  "top_k": 10,
  "degree_percentile": 0.80,
  "description": "Optimized for large research corpus"
}
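
A hedged sketch of the request model and handler behind this endpoint, assuming FastAPI with Pydantic v2 (per Phase 5); the router wiring, auth, and persistence are elided and all names are illustrative:

# Field constraints mirror the table's CHECK constraints.
from typing import Optional
from fastapi import APIRouter
from pydantic import BaseModel, Field

router = APIRouter(prefix="/admin/concept-match")

class ConceptMatchConfigUpdate(BaseModel):
    strategy: str = Field(pattern="^(exhaustive|degree_biased|degree_only)$")
    similarity_threshold: float = Field(default=0.85, ge=0.0, le=1.0)
    top_k: int = Field(default=5, ge=1, le=20)
    degree_percentile: float = Field(default=0.75, ge=0.0, le=1.0)
    evidence_weight: Optional[float] = Field(default=None, ge=0.0, le=1.0)
    description: Optional[str] = None

@router.post("/config")
def update_config(body: ConceptMatchConfigUpdate) -> dict:
    # Illustrative flow: deactivate the current row, insert the new one as
    # active, then invalidate the cache via reload_concept_match_config().
    ...
    return body.model_dump()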

6. Ingestion Integration

# api/api/lib/ingestion.py

def process_chunk(...):
    # Load configuration once per ingestion job (cached)
    match_config = get_concept_match_config(age_client)

    for concept in extraction["concepts"]:
        # Generate embedding (unchanged)
        embedding = embedding_worker.generate_concept_embedding(embedding_text)

        # Vector search with configured strategy
        matches = age_client.vector_search_with_strategy(
            embedding=embedding,
            threshold=match_config.similarity_threshold,
            top_k=match_config.top_k,
            strategy=match_config.strategy,
            degree_percentile=match_config.degree_percentile
        )

        # Rest of upsert logic unchanged
        if matches:
            ...  # link concept to best existing match
        else:
            ...  # create new concept node
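
The vector_search_with_strategy() dispatcher (Phase 4) might look like the following sketch; the epsilon split follows ADR-071, while min_degree and get_degree_threshold() are assumed helpers, not confirmed age_client methods:

# Hedged dispatcher sketch; epsilon=0.8 mirrors the 80/20 degree_biased mix.
import random

def vector_search_with_strategy(age_client, embedding, threshold, top_k,
                                strategy="exhaustive", degree_percentile=0.75,
                                epsilon=0.8):
    if strategy == "exhaustive":
        return age_client.vector_search(embedding, threshold=threshold, top_k=top_k)

    min_degree = age_client.get_degree_threshold(degree_percentile)

    if strategy == "degree_only":
        # Search only the high-degree hub slice.
        return age_client.vector_search(embedding, threshold=threshold,
                                        top_k=top_k, min_degree=min_degree)

    if strategy == "degree_biased":
        # Epsilon-greedy: usually exploit hubs, occasionally explore the long tail.
        if random.random() < epsilon:
            return age_client.vector_search(embedding, threshold=threshold,
                                            top_k=top_k, min_degree=min_degree)
        return age_client.vector_search(embedding, threshold=threshold, top_k=top_k)

    raise ValueError(f"Unknown concept match strategy: {strategy}")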

7. Operator CLI Integration

# View current configuration
docker exec kg-operator python /workspace/operator/configure.py concept-match status

# Update configuration
docker exec kg-operator python /workspace/operator/configure.py concept-match \
  --strategy degree_biased \
  --threshold 0.87 \
  --top-k 10

# List available strategies
docker exec kg-operator python /workspace/operator/configure.py concept-match strategies
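
A hedged sketch of the argparse surface behind these commands, assuming configure.py uses subcommands; flag names mirror the examples above and the wiring is illustrative:

# CLI surface for the concept-match subcommand (argparse assumed).
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="configure.py")
    sub = parser.add_subparsers(dest="command", required=True)
    cm = sub.add_parser("concept-match", help="View or tune concept matching")
    cm.add_argument("action", nargs="?", default="status",
                    choices=["status", "strategies"])
    cm.add_argument("--strategy",
                    choices=["exhaustive", "degree_biased", "degree_only"])
    cm.add_argument("--threshold", type=float, help="similarity_threshold (0.0-1.0)")
    cm.add_argument("--top-k", type=int, dest="top_k",
                    help="candidate concepts to evaluate (1-20)")
    return parser

# If any of --strategy/--threshold/--top-k is present, treat the call as an
# update (POST /admin/concept-match/config); otherwise dispatch on action.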

Consequences

Positive

Massive Performance Improvement

  • pgvector: ~50x faster than the Python numpy full scan (500ms → ~10ms per search)
  • degree_biased: 2-3x additional speedup via filtering
  • Combined: 50 seconds/chunk → 1 second/chunk potential

Configurability Without Complexity

  • Database-first: single source of truth
  • No per-job overrides (simplicity)
  • Hot-reload capability via cache invalidation

Proven Pattern

  • Follows ADR-041 (database-first configuration)
  • Applies ADR-071 lessons (degree-biased search)
  • Consistent with platform architecture

Future-Ready

  • Evidence-aware matching (when ADR-068 is complete)
  • Per-job overrides (if needed later)
  • Query tuning based on graph characteristics

Backward Compatible

  • Default: exhaustive with 0.85 threshold (current behavior)
  • pgvector migration: transparent performance boost
  • No API changes for ingestion endpoints

Negative

pgvector Dependency

  • Requires PostgreSQL extension installation
  • Adds operational complexity (index maintenance)
  • Migration effort for existing embeddings

Configuration Complexity

  • Users must understand strategy trade-offs
  • Wrong configuration could hurt deduplication quality
  • Requires documentation and guidance

Degree Calculation Overhead

  • Must calculate/cache degree centrality
  • May need periodic recalculation as the graph evolves
  • Additional query complexity

Neutral

⚠️ No Per-Job Override

  • Simplifies implementation (consistency)
  • May need a future ADR if use cases demand flexibility
  • Current decision: wait for actual need

⚠️ Strategy Selection Guidance

  • Need metrics to recommend a strategy per graph size
  • "10k concepts" thresholds are estimates
  • Real-world tuning required

Implementation Sequence

Phase 1: pgvector Foundation

  1. Add pgvector extension to PostgreSQL
  2. Add embedding_vec vector column
  3. Migrate existing embeddings: embedding::vector
  4. Create IVFFlat index with cosine distance
  5. Update vector_search() to use pgvector
  6. Measure performance improvement
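
Phase 1 step 6 could be a simple timing harness like this sketch, which assumes the find_similar_concepts example shown earlier; names and run counts are illustrative:

# Mean wall-clock time per similarity search, before vs after pgvector.
import time

def time_search(conn, embedding, runs: int = 20) -> float:
    start = time.perf_counter()
    for _ in range(runs):
        find_similar_concepts(conn, embedding)
    return (time.perf_counter() - start) / runs  # seconds per search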

Phase 2: Database Configuration

  1. Create concept_match_config table with migration
  2. Insert default configuration (exhaustive)
  3. Implement ConceptMatchConfig dataclass
  4. Add configuration loading with caching
  5. Create operator CLI commands
  6. Test configuration hot-reload

Phase 3: Degree Centrality Helpers

  1. Implement inline degree calculation (following ADR-071 pattern)
  2. Implement get_degree_threshold() for percentile filtering
  3. Test degree queries with various graph sizes
  4. No caching needed - calculate on-demand in Cypher
  5. Document performance characteristics

Phase 4: Search Strategies

  1. Implement vector_search_with_strategy() dispatcher
  2. Implement exhaustive strategy (pgvector only)
  3. Implement degree_only strategy (pgvector + filter)
  4. Implement degree_biased strategy (epsilon-greedy)
  5. Add strategy enum and validation
  6. Test all three strategies with real corpus

Phase 5: API Endpoints

  1. Create GET /admin/concept-match/config
  2. Create GET /admin/concept-match/strategies
  3. Create POST /admin/concept-match/config
  4. Add request/response Pydantic models
  5. Add authentication (admin-only endpoints)
  6. Test configuration updates and reload

Phase 6: Ingestion Integration

  1. Update process_chunk() to load configuration
  2. Pass configuration to vector_search_with_strategy()
  3. Test ingestion with all three strategies
  4. Measure deduplication quality (hit rate)
  5. Document strategy recommendations
  6. Update CLAUDE.md with new workflow

Phase 7: Monitoring & Tuning

  1. Add telemetry for search times per strategy
  2. Add metrics for deduplication hit rate
  3. Create dashboard for configuration impact
  4. Document tuning guidelines
  5. Gather real-world performance data
  6. Write operational runbook

Evidence-Aware Matching (Future)

This ADR leaves room for ADR-073 (Evidence-Aware Concept Matching), which will add:

evidence_weight: Optional[float] = None  # Reserved for future use

When evidence_weight is set:

# Concept similarity (current)
concept_sim = cosine(new_concept_emb, existing_concept_emb)

# Evidence similarity (ADR-073, future)
evidence_chunks = get_source_chunks_for_concept(concept_id)
evidence_sim = max(cosine(new_concept_emb, chunk.embedding) for chunk in evidence_chunks)

# Ensemble score
final_score = (1 - evidence_weight) * concept_sim + evidence_weight * evidence_sim

This requires ADR-068 (Source Text Embeddings) to be implemented first.

Alternatives Considered

Option A: Per-Job Configuration Override

Pros: Maximum flexibility; users tune per ingestion
Cons: Complexity, inconsistency, hard to reason about graph evolution
Decision: Rejected - database-first for simplicity and consistency

Option B: Automatic Strategy Selection

Pros: System auto-selects based on graph size
Cons: Magic behavior, unclear to users, hard to debug
Decision: Rejected - explicit configuration with operator CLI

Option C: Only Add pgvector (No Strategies)

Pros: Simpler, less code
Cons: Misses the opportunity to apply ADR-071 lessons
Decision: Rejected - degree-biased search has proven valuable

Option D: Embedding Model in Configuration

Pros: Could switch embedding models
Cons: Breaks existing embeddings, requires full regeneration
Decision: Rejected - the embedding model is a separate concern (ADR-041)

References

  • pgvector: https://github.com/pgvector/pgvector
  • PostgreSQL Index Types: https://www.postgresql.org/docs/current/indexes-types.html
  • IVFFlat Index: https://github.com/pgvector/pgvector#ivfflat
  • Cosine Distance Operator: https://github.com/pgvector/pgvector#distance-operators

Last Updated: 2025-12-04
Next Steps: Review with stakeholders, validate the approach, begin Phase 1 implementation