ADR-030: Concept Deduplication Quality Validation
Status: Accepted
Date: 2025-10-12
Author: System Architecture
Related: ADR-016 (Apache AGE), ADR-024 (PostgreSQL)
Overview
Imagine you're building a knowledge base about Buddhism by reading multiple books on the subject. As you read, you take notes and create concept cards. When you encounter "Buddhism" in the first book, you create a card. But when the second book mentions "Buddhist Philosophy," should you create a new card or recognize it's the same concept? This decision happens thousands of times as your knowledge base grows, and getting it wrong leads to a fragmented, confusing mess.
This is exactly the challenge our knowledge graph faces during document ingestion. The system uses AI to extract concepts from documents and relies on embedding-based similarity matching to decide whether a new mention is genuinely new or just another way of referring to something already in the graph. Currently, we use an 80% similarity threshold: if two concepts are more than 80% similar in their semantic meaning, we treat them as the same concept.
The problem is that we don't systematically validate whether this deduplication process actually works well over time. Does the quality degrade as the graph grows? Do related documents properly share concepts, or do we end up with "Buddhism," "Buddhist Philosophy," and "Buddhist Teachings" cluttering our search results? This ADR establishes a rigorous test suite to answer these questions and ensure the deduplication system maintains quality.
Context
The knowledge graph relies on embedding-based concept deduplication to prevent creating duplicate concepts when ingesting related documents. The quality of this deduplication directly impacts:
- Graph coherence: Duplicate concepts fragment the knowledge graph
- Query quality: Search returns multiple versions of the same concept
- Relationship density: Connections between concepts are lost if duplicates exist
- Storage efficiency: Graph grows unnecessarily with redundant concepts
- User experience: Confusing results when "Buddhism" and "Buddhist Philosophy" are separate
Current Deduplication Approach
Mechanism:
- Extract concepts from document chunks using LLM
- Generate embeddings for each concept (OpenAI text-embedding-3-small)
- Compare new concepts against existing graph via cosine similarity
- Match threshold: 80% similarity → reuse existing concept
- Below threshold → create new concept
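For illustration, here is a minimal sketch of the concept-level matching step. The real logic lives in src/api/lib/ingestion.py:match_existing_concepts(); the helper names and data shapes below are assumptions, not the actual implementation.
# Illustrative sketch only; names and data shapes are assumed, and the real
# matching logic lives in src/api/lib/ingestion.py:match_existing_concepts().
import numpy as np

SIMILARITY_THRESHOLD = 0.80  # current hardcoded threshold

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_or_create(new_embedding: np.ndarray,
                    existing: dict[str, np.ndarray]) -> str | None:
    """Return the ID of the best existing match at or above the threshold,
    or None, signalling that a new concept should be created."""
    best_id, best_score = None, SIMILARITY_THRESHOLD
    for concept_id, embedding in existing.items():
        score = cosine_similarity(new_embedding, embedding)
        if score >= best_score:
            best_id, best_score = concept_id, score
    return best_id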
Two-Level Deduplication:
1. File-level: Content hash prevents re-ingesting same file (bypass with --force)
2. Concept-level: Embedding similarity prevents duplicate concepts
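The file-level check reduces conceptually to a content-hash comparison. A short sketch, assuming ingestion keeps a set of previously seen hashes; the actual hash algorithm and storage used by kg ingest are not specified here.
# Illustrative file-level dedup; hash algorithm and storage are assumptions.
import hashlib
from pathlib import Path

def content_hash(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def should_ingest(path: str, seen_hashes: set, force: bool = False) -> bool:
    """Mirror the --force bypass: ingest when forced or when content is new."""
    return force or content_hash(path) not in seen_hashes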
The Problem
We lack systematic validation that deduplication maintains quality over time:
- Temporal stability: Does matching degrade as graph grows?
- Domain consistency: Do related documents properly share concepts?
- Re-ingestion quality: Does forcing re-ingestion create duplicates?
- Threshold tuning: Is 80% the optimal similarity threshold?
- Label variance: Does "Buddhism" match "Buddhist philosophy" as expected?
Decision
Establish a Concept Deduplication Quality Test Suite to validate that embedding-based matching prevents concept duplication across document ingestion sequences.
Test Methodology: Temporal Re-ingestion Analysis
Hypothesis: If deduplication works correctly, re-ingesting an early document after ingesting related documents should show:
- High concept reuse rate (70-90%)
- Minimal new concept creation
- No synonym/duplicate concepts in search results
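"Hit rate" here is the share of extracted concepts that matched an existing concept, the same ratio the monitoring query later in this document computes:
def hit_rate(concepts_reused: int, concepts_created: int) -> float:
    """Percentage of extracted concepts matched to existing ones."""
    total = concepts_reused + concepts_created
    return 100.0 * concepts_reused / total if total else 0.0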
Test Protocol:
# Phase 1: Baseline Ingestion
# Ingest first document in domain
kg ingest file -o "TestOntology" file1.md --wait
# Record initial state
kg database stats > test_baseline.txt
INITIAL_CONCEPTS=$(kg database stats --json | jq '.nodes.concepts')
# Phase 2: Domain Expansion
# Ingest related documents in same domain
kg ingest file -o "TestOntology" file2.md --wait
kg ingest file -o "TestOntology" file3.md --wait
kg ingest file -o "TestOntology" file4.md --wait
kg ingest file -o "TestOntology" file5.md --wait
# Record expanded state
kg database stats > test_expanded.txt
EXPANDED_CONCEPTS=$(kg database stats --json | jq '.nodes.concepts')
# Phase 3: Temporal Re-ingestion Test
# Force re-ingest of first file
kg ingest file -o "TestOntology" file1.md --force --wait
# Extract job statistics
JOB_ID=$(kg job list --limit 1 --json | jq -r '.[0].job_id')
kg job status $JOB_ID --json > test_reingestion.json
# Analyze results
FINAL_CONCEPTS=$(kg database stats --json | jq '.nodes.concepts')
HIT_RATE=$(jq -r '.progress.hit_rate' test_reingestion.json)
NEW_CONCEPTS=$(jq -r '.result.concepts_created' test_reingestion.json)
# Phase 4: Duplicate Detection
# Search for known concepts that should be unique
kg search query "Buddhism" --limit 10 --json > test_buddhism_search.json
BUDDHISM_COUNT=$(jq '. | length' test_buddhism_search.json)
Success Criteria
Quantitative Metrics:
| Metric | Target | Indicates |
|---|---|---|
| Re-ingestion hit rate | ≥ 70% | Concepts properly matched |
| New concepts created | ≤ 10 | Minimal duplication |
| Concept count delta | ≤ 5% | Graph stability |
| Search result uniqueness | 1-2 results | No synonyms |
| Similarity score | ≥ 80% | Threshold working |
Qualitative Validation:
- Concept Unity: Searching for "Buddhism" returns 1-2 highly related concepts, not 5-10 variants
- Label Consistency: Similar concepts use consistent labels ("Buddhism" not "Buddhist Philosophy", "Buddhist Teachings", etc.)
- Relationship Preservation: Re-ingestion adds relationships to existing concepts, not new duplicates
- Evidence Accumulation: Concept evidence count increases (more source quotes), not concept count
Failure Indicators
❌ Synonym Explosion:
kg search query "Buddhism" --limit 10
# Returns:
# - Buddhism Religion (85% match)
# - Buddhism (80% match)
# - Buddhist Philosophy (78% match)
# - Buddhist Teachings (75% match)
# - Buddha's Philosophy (72% match)
❌ Low Hit Rate: Re-ingestion shows <50% concept reuse
❌ Concept Drift: Same content creates different concepts over time
❌ Graph Bloat: Concept count grows significantly on re-ingestion
Implementation
Test Suite Location
tests/integration/
├── test_concept_deduplication.py # Main test suite
├── test_temporal_reingestion.py # Re-ingestion validation
├── test_similarity_thresholds.py # Threshold tuning
└── fixtures/
├── watts_lecture_01.md # Known test documents
├── watts_lecture_02.md
└── expected_concepts.json # Ground truth
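A sketch of what the main suite could assert, tied to the quantitative targets above. The run_cli helper and the JSON field names mirror the shell protocol earlier in this ADR; both remain assumptions until the suite is implemented.
# tests/integration/test_concept_deduplication.py (sketch)
# Asserts the quantitative targets from the Success Criteria section.
import json
import subprocess

ONTOLOGY = "DeduplicationTest"
FIXTURE = "tests/fixtures/philosophy/file1.md"

def run_cli(*args: str):
    """Run a kg CLI command with --json and parse its output."""
    out = subprocess.run(["kg", *args, "--json"],
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def concept_count() -> int:
    return run_cli("database", "stats")["nodes"]["concepts"]

def test_reingestion_reuses_concepts():
    expanded = concept_count()
    # Force re-ingestion of a document that is already in the graph.
    subprocess.run(["kg", "ingest", "file", "-o", ONTOLOGY, FIXTURE,
                    "--force", "--wait"], check=True)
    job_id = run_cli("job", "list", "--limit", "1")[0]["job_id"]
    status = run_cli("job", "status", job_id)

    hit_rate = float(str(status["progress"]["hit_rate"]).rstrip("%"))
    new_concepts = status["result"]["concepts_created"]
    final = concept_count()

    assert hit_rate >= 70, "re-ingestion should reuse >=70% of concepts"
    assert new_concepts <= 10, "re-ingestion should create <=10 new concepts"
    assert final - expanded <= expanded * 0.05, "concept count should stay within 5%"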
Automated Validation Script
#!/bin/bash
# scripts/validate-deduplication.sh
set -e
ONTOLOGY="DeduplicationTest"
TEST_DIR="tests/fixtures/philosophy"
echo "=== Concept Deduplication Validation ==="
echo
# Cleanup previous test
kg ontology delete "$ONTOLOGY" --force 2>/dev/null || true
# Phase 1: Initial ingestion
echo "Phase 1: Ingesting initial document..."
kg ingest file -o "$ONTOLOGY" "$TEST_DIR/file1.md" --wait
INITIAL=$(kg database stats --json | jq '.nodes.concepts')
echo "Initial concepts: $INITIAL"
# Phase 2: Domain expansion
echo "Phase 2: Ingesting related documents..."
for file in "$TEST_DIR"/file{2..5}.md; do
kg ingest file -o "$ONTOLOGY" "$file" --wait
done
EXPANDED=$(kg database stats --json | jq '.nodes.concepts')
echo "Expanded concepts: $EXPANDED"
# Phase 3: Re-ingestion test
echo "Phase 3: Re-ingesting first document with --force..."
kg ingest file -o "$ONTOLOGY" "$TEST_DIR/file1.md" --force --wait > /tmp/reingest.log
JOB_ID=$(grep "Job submitted:" /tmp/reingest.log | awk '{print $3}')
# Brief pause so job status reflects final statistics (ingest already ran with --wait)
sleep 5
# Extract metrics
FINAL=$(kg database stats --json | jq '.nodes.concepts')
STATS=$(kg job status "$JOB_ID" --json)
HIT_RATE=$(echo "$STATS" | jq -r '.progress.hit_rate // "0%"' | tr -d '%')
HIT_RATE=${HIT_RATE%.*}  # drop any fractional part so bash integer tests below work
NEW_CONCEPTS=$(echo "$STATS" | jq -r '.result.concepts_created // 0')
echo
echo "=== Results ==="
echo "Final concepts: $FINAL"
echo "Concept growth: $(($FINAL - $EXPANDED))"
echo "Re-ingestion hit rate: ${HIT_RATE}%"
echo "New concepts created: $NEW_CONCEPTS"
# Validation
if [ "$HIT_RATE" -ge 70 ]; then
echo "✓ Hit rate meets threshold (≥70%)"
else
echo "✗ Hit rate below threshold: ${HIT_RATE}% < 70%"
exit 1
fi
if [ "$NEW_CONCEPTS" -le 10 ]; then
echo "✓ New concepts acceptable (≤10)"
else
echo "✗ Too many new concepts: $NEW_CONCEPTS > 10"
exit 1
fi
GROWTH=$(($FINAL - $EXPANDED))
if [ "$GROWTH" -le $((EXPANDED / 20)) ]; then # 5% threshold
echo "✓ Concept count stable (<5% growth)"
else
echo "✗ Concept count grew significantly: +$GROWTH"
exit 1
fi
echo
echo "=== Deduplication Quality: PASSED ==="
CI/CD Integration
# .github/workflows/deduplication-validation.yml
name: Concept Deduplication Quality
on:
push:
paths:
- 'src/api/lib/llm_extractor.py'
- 'src/api/lib/ingestion.py'
- 'tests/fixtures/**'
jobs:
validate-deduplication:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup environment
run: |
docker-compose up -d
pip install -r requirements.txt
cd client && npm install && npm run build
- name: Run deduplication validation
run: ./scripts/validate-deduplication.sh
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
Monitoring and Alerting
Production Metrics
Track in production to detect degradation:
-- Average concept reuse rate by ontology
SELECT
ontology,
AVG(
(result->>'concepts_reused')::int * 100.0 /
NULLIF((result->>'concepts_created')::int + (result->>'concepts_reused')::int, 0)
) as avg_hit_rate,
COUNT(*) as job_count
FROM kg_api.ingestion_jobs
WHERE status = 'completed'
AND created_at > NOW() - INTERVAL '7 days'
GROUP BY ontology
ORDER BY avg_hit_rate DESC;
Alert Thresholds:
- Hit rate drops below 50% for an ontology with >5 documents (see the sketch below)
- Concept count grows >20% in a single re-ingestion
- Search returns >3 concepts for high-frequency terms
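A hedged sketch of how the first alert could be evaluated on a schedule, reusing the query shape above. It assumes psycopg 3 and the kg_api.ingestion_jobs schema; the DSN is a placeholder, and job count stands in for document count.
# Sketch of a scheduled alert check against kg_api.ingestion_jobs.
# Assumes psycopg 3; DSN and thresholds are placeholders.
import psycopg

ALERT_SQL = """
SELECT ontology,
       AVG((result->>'concepts_reused')::int * 100.0 /
           NULLIF((result->>'concepts_created')::int
                  + (result->>'concepts_reused')::int, 0)) AS avg_hit_rate,
       COUNT(*) AS job_count
FROM kg_api.ingestion_jobs
WHERE status = 'completed'
  AND created_at > NOW() - INTERVAL '7 days'
GROUP BY ontology
HAVING COUNT(*) > 5;
"""

def check_hit_rate_alerts(dsn: str = "postgresql://localhost/kg") -> list:
    """Return ontologies whose 7-day average hit rate fell below 50%."""
    with psycopg.connect(dsn) as conn:
        rows = conn.execute(ALERT_SQL).fetchall()
    return [ontology for ontology, avg_hit_rate, _count in rows
            if avg_hit_rate is not None and avg_hit_rate < 50]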
Threshold Tuning Experiments
Test Different Similarity Thresholds
| Threshold | Expected Behavior | Risk |
|---|---|---|
| 70% | Aggressive matching, fewer concepts | False positives (merge unrelated) |
| 80% (current) | Balanced approach | Current baseline |
| 90% | Conservative, more concepts | False negatives (create duplicates) |
| 95% | Strict matching, label-sensitive | Many duplicates |
Experiment Protocol:
1. Ingest the same corpus with different thresholds
2. Compare concept counts and search quality (see the sketch after this list)
3. Manual review of concept matches at each threshold
4. A/B test with domain experts
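To make step 2 concrete, a small offline sweep: given embeddings for extracted concepts and a set of hand-labeled duplicate pairs, count false merges and missed merges at each candidate threshold. All names here are illustrative.
# Offline threshold sweep (illustrative): measures how each threshold trades
# false merges against missed merges on a hand-labeled corpus.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sweep(embeddings: dict, duplicate_pairs: set,
          thresholds=(0.70, 0.80, 0.90, 0.95)) -> None:
    labels = sorted(embeddings)
    for t in thresholds:
        false_merges = missed_merges = 0
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                is_dup = (a, b) in duplicate_pairs or (b, a) in duplicate_pairs
                merged = cosine(embeddings[a], embeddings[b]) >= t
                if merged and not is_dup:
                    false_merges += 1   # too loose: unrelated concepts merged
                elif is_dup and not merged:
                    missed_merges += 1  # too strict: duplicates survive
        print(f"threshold={t:.2f}  false_merges={false_merges}  "
              f"missed_merges={missed_merges}")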
Consequences
Positive
✅ Systematic validation of core deduplication functionality
✅ Early detection of quality degradation
✅ Objective metrics for tuning thresholds
✅ Regression prevention via CI/CD integration
✅ User confidence in graph coherence
Negative
⚠️ Test maintenance: Requires curated test documents
⚠️ API costs: Re-ingestion tests consume OpenAI credits
⚠️ Time overhead: Full validation takes 5-10 minutes
⚠️ Threshold brittleness: May need adjustment per domain
Trade-offs
Precision vs. Recall:
- Higher threshold (90%): Fewer false merges, more duplicates
- Lower threshold (70%): Fewer duplicates, more false merges
- Current 80%: Balanced middle ground
Graph Size vs. Quality:
- Aggressive deduplication: Smaller, denser graph (better navigation)
- Conservative deduplication: Larger, sparser graph (preserves nuance)
Related Decisions
- ADR-016: Apache AGE migration (graph storage layer)
- ADR-024: PostgreSQL multi-schema (job tracking)
- ADR-002: Node fitness scoring (concept quality metrics)
- ADR-005: Source text tracking (evidence preservation)
Future Considerations
- Adaptive thresholds: Per-ontology threshold tuning based on domain
- Concept merging: Tools to manually merge duplicate concepts
- Similarity explainability: Show why concepts matched/didn't match
- Embedding model upgrades: Test impact of newer embedding models
- Multilingual matching: Handle concepts in multiple languages
- Temporal analysis: Track deduplication quality over graph lifetime
References
- Concept matching implementation: src/api/lib/ingestion.py:match_existing_concepts()
- Embedding generation: src/api/lib/ai_providers.py:generate_embedding()
- Job statistics: kg job status <id> shows hit rate
- Current threshold: 80% cosine similarity (hardcoded)
Validation Status
- [ ] Test suite implemented
- [ ] Automated validation script created
- [ ] CI/CD integration complete
- [ ] Production monitoring established
- [ ] Threshold tuning experiments run
- [ ] Documentation updated in user guides
Next Steps:
1. Implement tests/integration/test_concept_deduplication.py
2. Create validation script at scripts/validate-deduplication.sh
3. Run baseline validation on existing test corpus
4. Document findings in test report