ADR-039: Local Embedding Service with Hybrid Client/Server Architecture
Status: Proposed
Date: 2025-10-18
Deciders: System Architecture
Related: ADR-012 (API Server), ADR-013 (Unified Client), ADR-016 (Apache AGE), ADR-034 (Graph Visualization)
Overview
When you ask an LLM about concepts in your knowledge graph, you face a fundamental challenge: every search query requires generating an embedding vector to compare against your stored concepts. If you're using OpenAI's embedding API, each of these searches costs money and adds network latency. More critically, your query text is sent to external servers—a privacy concern for sensitive documents. Think of it like paying a toll every time you want to search your own data, while also revealing what you're searching for to a third party.
This becomes particularly problematic for the interactive visualization app, where users might type queries in real-time. A hundred keystrokes could trigger a hundred API calls, costing dollars and creating noticeable lag. The current system has painted itself into a corner: it depends entirely on external APIs for embedding generation, making it unusable without internet access and subject to rate limits during heavy use.
Here's the good news: modern open-source embedding models like Nomic-embed-text-v1.5 can run locally with quality comparable to OpenAI's offerings. Better yet, they can run in the browser using transformers.js, enabling instant, private searches without server round-trips. The challenge is architecting this thoughtfully—you can't just mix embeddings from different models (they live in incompatible vector spaces), and you need a clear migration path from cloud to local without breaking existing deployments.
This ADR proposes a hybrid architecture where the system can use local embeddings (via sentence-transformers on the server or transformers.js in the browser) while maintaining compatibility with cloud providers. The key insight is treating this as a configuration choice, not a code branch—the same /embedding/generate endpoint works whether you're using OpenAI's API or a local model, with the server managing model consistency transparently.
Context
Current State: OpenAI Embedding Dependency
The system currently depends on OpenAI's API for all embedding generation:
Embedding Use Cases:
1. Concept extraction: Generate embeddings for concepts during document ingestion
2. Semantic search: Generate query embeddings for similarity search
3. Concept matching: Find duplicate concepts via vector similarity (deduplication)
4. Interactive search: Real-time embedding generation as users type in the visualization app
Current Cost & Limitations:
- Every search query = 1 OpenAI API call (~$0.0001 per call)
- Network latency: 100-300ms per embedding request
- API dependency: System unusable without internet or OpenAI access
- Privacy concerns: Query text sent to an external service
- Rate limits: Potential throttling during high usage
Research Findings
Transformers.js Browser Embedding Support:
- Nomic-embed-text-v1/v2 and BGE models fully supported since v2.15.0 (Feb 2024)
- Can generate embeddings directly in the browser with no server required
- Models run via ONNX Runtime in WebAssembly
Model Specifications:
| Model | Dimensions | Context | Size | Quantization |
|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 (64-768 via Matryoshka) | 8K tokens | ~275MB | Int8, Binary, GGUF |
| BGE-large-en-v1.5 | 1024 | 512 tokens | ~1.3GB | Int8, ONNX |
| OpenAI text-embedding-3-small | 1536 | 8K tokens | N/A (API) | N/A |
Critical Finding: Quantization Compatibility
Research confirms that quantized embeddings (4-bit, int8) CAN be compared with full-precision embeddings (float16, float32), but with accuracy degradation:
- Int8 quantization: <1% accuracy loss, 4x memory reduction
- 4-bit quantization: 2-5% accuracy loss, 8x memory reduction
- Cosine similarity shift: Lower precision shifts similarity distribution leftward (lower scores)
- Best practice: Two-pass search system:
- Fast candidate pruning: Quantized scan (browser-side, 100+ QPS)
- Accurate reranking: Full-precision comparison (server-side, top-K only)
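To make the two-pass idea concrete, here is a minimal Python sketch (function names and the shortlist factor are illustrative, not the project's search code), assuming unit-normalized vectors so a dot product equals cosine similarity:

import numpy as np

def quantize_int8(vecs: np.ndarray) -> np.ndarray:
    """Scale unit-normalized float vectors into int8 for a cheap first-pass scan."""
    return np.clip(np.round(vecs * 127), -127, 127).astype(np.int8)

def two_pass_search(query: np.ndarray, corpus: np.ndarray, top_k: int = 20) -> np.ndarray:
    """Pass 1: approximate scores on int8 vectors. Pass 2: exact cosine on the survivors."""
    q8, c8 = quantize_int8(query), quantize_int8(corpus)
    approx = c8.astype(np.int32) @ q8.astype(np.int32)    # fast, approximate similarity
    candidates = np.argsort(approx)[::-1][: top_k * 5]    # keep a generous shortlist
    exact = corpus[candidates] @ query                    # full-precision cosine (unit vectors)
    return candidates[np.argsort(exact)[::-1][:top_k]]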
Key Constraint: Embedding Model Consistency
Embeddings from different models produce incompatible vector spaces. A system MUST use the same model for:
- All stored concept embeddings
- All query embeddings
- All similarity comparisons
Mixing models (e.g., storing nomic embeddings but querying with BGE) produces meaningless similarity scores.
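A server-side guard can make this rule explicit by refusing to compare vectors produced by a different model or with mismatched dimensions. The following is a sketch only; the `EmbeddingConfig` shape and the error type are assumptions:

from dataclasses import dataclass

@dataclass
class EmbeddingConfig:
    provider: str    # 'local' or 'openai'
    model: str       # e.g. 'nomic-embed-text-v1.5'
    dimensions: int  # e.g. 768

class EmbeddingMismatchError(ValueError):
    pass

def assert_compatible(active: EmbeddingConfig, embedding_model: str, embedding: list[float]) -> None:
    """Refuse to compare vectors produced by a different model or with different dimensions."""
    if embedding_model != active.model:
        raise EmbeddingMismatchError(
            f"Embedding from '{embedding_model}' cannot be compared against "
            f"'{active.model}' vectors; the vector spaces are incompatible."
        )
    if len(embedding) != active.dimensions:
        raise EmbeddingMismatchError(
            f"Expected {active.dimensions} dimensions, got {len(embedding)}."
        )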
Decision
Implement a hybrid local embedding architecture with a single, model-aware API endpoint that abstracts provider selection while ensuring embedding consistency.
Architecture Overview
┌───────────────────────────────────────────────────────────────┐
│ Embedding Configuration │
│ (PostgreSQL embedding_config) │
│ │
│ { │
│ provider: "local" | "openai", │
│ model: "nomic-embed-text-v1.5" | "text-embedding-3-small", │
│ dimensions: 768 | 1536, │
│ precision: "float16" | "float32" │
│ } │
└───────────────────────────────────────────────────────────────┘
│
│ Config read at startup
▼
┌─────────────────────────────────────────────────────────────┐
│ FastAPI Server: /embedding/generate │
│ │
│ 1. Read global embedding config │
│ 2. Route to appropriate provider: │
│ - LocalEmbeddingProvider (sentence-transformers) │
│ - OpenAIProvider (API call) │
│ 3. Return embedding with metadata │
│ │
│ Response: { │
│ embedding: [0.123, -0.456, ...], │
│ model: "nomic-embed-text-v1.5", │
│ dimensions: 768, │
│ provider: "local" │
│ } │
└─────────────────────────────────────────────────────────────┘
│
┌─────────┴─────────┐
│ │
┌──────────▼──────────┐ ┌─────▼──────────┐
│ OpenAI Provider │ │ Local Provider │
│ (Current behavior) │ │ (New) │
└─────────────────────┘ └────────────────┘
│ │
│ ┌─────────┴───────────┐
│ │ │
API Request Server │
(internet) (sentence- │
transformers) │
│
┌──────────▼──────────┐
│ Browser (Optional) │
│ transformers.js │
│ │
│ Quantized model │
│ (int8 or int4) │
│ │
│ Two-pass search: │
│ 1. Local fast scan │
│ 2. Server rerank │
└─────────────────────┘
Component Design
1. Embedding Configuration Storage
Database Table: embedding_config
CREATE TABLE IF NOT EXISTS embedding_config (
id SERIAL PRIMARY KEY,
provider VARCHAR(50) NOT NULL, -- 'local' or 'openai'
model VARCHAR(200) NOT NULL, -- 'nomic-embed-text-v1.5', 'text-embedding-3-small'
dimensions INTEGER NOT NULL,
precision VARCHAR(20) NOT NULL, -- 'float16', 'float32'
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
active BOOLEAN DEFAULT TRUE
);

-- Only one active config at a time, enforced with a partial unique index
-- (a UNIQUE table constraint cannot take a WHERE clause)
CREATE UNIQUE INDEX IF NOT EXISTS embedding_config_one_active
ON embedding_config (active) WHERE active = TRUE;
Environment Variables:
# .env
EMBEDDING_PROVIDER=local # or 'openai'
EMBEDDING_MODEL=nomic-embed-text-v1.5 # or 'text-embedding-3-small'
EMBEDDING_PRECISION=float16
EMBEDDING_DIMENSIONS=768
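At startup the server can resolve these variables into a single settings object. The sketch below is illustrative (the helper name is hypothetical); defaults mirror the current OpenAI behavior:

import os
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSettings:
    provider: str
    model: str
    precision: str
    dimensions: int

def load_embedding_config() -> EmbeddingSettings:
    """Read embedding settings from the environment, defaulting to the current OpenAI behavior."""
    return EmbeddingSettings(
        provider=os.getenv("EMBEDDING_PROVIDER", "openai"),
        model=os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
        precision=os.getenv("EMBEDDING_PRECISION", "float32"),
        dimensions=int(os.getenv("EMBEDDING_DIMENSIONS", "1536")),
    )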
2. Single Embedding API Endpoint
Endpoint: POST /embedding/generate
# Request
{
"text": "recursive depth-first traversal algorithms",
"client_hint": {
"supports_local": true,
"supports_quantized": true,
"preferred_precision": "int8" # optional hint from browser
}
}
# Response
{
"embedding": [0.123, -0.456, 0.789, ...], # 768 or 1536 dimensions
"metadata": {
"provider": "local",
"model": "nomic-embed-text-v1.5",
"dimensions": 768,
"precision": "float16",
"server_config_id": 42 # For consistency validation
}
}
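For illustration, a client call against this endpoint could look like the following sketch (the localhost URL and the use of the requests library are assumptions):

import requests

resp = requests.post(
    "http://localhost:8000/embedding/generate",
    json={
        "text": "recursive depth-first traversal algorithms",
        "client_hint": {"supports_local": True, "supports_quantized": True},
    },
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()
# The reported dimensions should match the vector length and the active server config
assert payload["metadata"]["dimensions"] == len(payload["embedding"])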
Endpoint: GET /embedding/config
Returns current embedding configuration so clients know what model to use:
# Response
{
"provider": "local",
"model": "nomic-embed-text-v1.5",
"dimensions": 768,
"precision": "float16",
"supports_browser": true, # transformers.js compatible
"quantized_variants": ["int8", "int4"], # Available for browser
"config_id": 42
}
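A minimal FastAPI sketch of this endpoint follows. It is illustrative only: the config lookup is stubbed from environment variables rather than the embedding_config table, and the browser-compatibility list is an assumption:

import os
from fastapi import APIRouter

router = APIRouter()

# Models assumed to have transformers.js (ONNX) builds usable in the browser
BROWSER_COMPATIBLE = {"nomic-embed-text-v1.5", "bge-base-en-v1.5", "bge-large-en-v1.5"}

@router.get("/embedding/config")
async def get_embedding_config():
    """Expose the active embedding configuration so clients can match the server's model."""
    provider = os.getenv("EMBEDDING_PROVIDER", "openai")
    model = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
    return {
        "provider": provider,
        "model": model,
        "dimensions": int(os.getenv("EMBEDDING_DIMENSIONS", "1536")),
        "precision": os.getenv("EMBEDDING_PRECISION", "float32"),
        "supports_browser": provider == "local" and model in BROWSER_COMPATIBLE,
        "quantized_variants": ["int8", "int4"] if provider == "local" else [],
    }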
3. LocalEmbeddingProvider Implementation
File: src/api/lib/ai_providers.py
from typing import List

import numpy as np


class LocalEmbeddingProvider(AIProvider):
    """
    Local embedding generation using sentence-transformers.
    Supports nomic-embed-text and BGE models.
    """

    def __init__(self, model_name: str = "nomic-ai/nomic-embed-text-v1.5"):
        # Lazy import so the dependency is only loaded when the local provider is used
        from sentence_transformers import SentenceTransformer
        # nomic models ship custom modeling code on HuggingFace and require trust_remote_code=True
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        self.model_name = model_name

    def generate_embedding(self, text: str, precision: str = "float16") -> List[float]:
        """Generate embedding locally using sentence-transformers."""
        embedding = self.model.encode(text, normalize_embeddings=True)
        if precision == "float16":
            embedding = embedding.astype(np.float16)
        return embedding.tolist()

    def get_dimensions(self) -> int:
        """Return embedding dimensions for this model."""
        return self.model.get_sentence_embedding_dimension()
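To show how the single endpoint stays provider-agnostic, here is a sketch of the routing layer. Assumptions: OpenAIProvider is the existing provider class with the same generate_embedding interface, and the Pydantic request model is hypothetical:

import os
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EmbeddingRequest(BaseModel):
    text: str

def get_embedding_provider():
    """Select the provider implied by EMBEDDING_PROVIDER; both expose generate_embedding()."""
    if os.getenv("EMBEDDING_PROVIDER", "openai") == "local":
        return LocalEmbeddingProvider(os.getenv("EMBEDDING_MODEL", "nomic-ai/nomic-embed-text-v1.5"))
    return OpenAIProvider()  # existing cloud provider (assumed interface-compatible)

@app.post("/embedding/generate")
async def generate_embedding(request: EmbeddingRequest):
    provider = get_embedding_provider()
    embedding = provider.generate_embedding(request.text)
    return {
        "embedding": embedding,
        "metadata": {
            "provider": os.getenv("EMBEDDING_PROVIDER", "openai"),
            "model": getattr(provider, "model_name", os.getenv("EMBEDDING_MODEL", "")),
            "dimensions": len(embedding),
        },
    }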
4. Browser-Side Embeddings (Optional Enhancement)
File: viz-app/src/lib/embeddings.ts
import { pipeline } from '@xenova/transformers';
class BrowserEmbeddingService {
private model: any = null;
private serverConfig: EmbeddingConfig | null = null;
async initialize() {
// Fetch server config to ensure model consistency
this.serverConfig = await fetchEmbeddingConfig();
// Only load browser model if server uses compatible local model
if (this.serverConfig.supports_browser) {
try {
// Load quantized version of server's model
this.model = await pipeline('feature-extraction',
`nomic-ai/${this.serverConfig.model}`,
{ quantized: true } // Uses int8 quantization
);
} catch (error) {
console.warn('Browser embedding failed, falling back to server:', error);
this.model = null;
}
}
}
async generateEmbedding(text: string): Promise<number[]> {
// Try browser-side first (fast)
if (this.model) {
const output = await this.model(text, { pooling: 'mean', normalize: true });
return Array.from(output.data);
}
// Fallback to server (always works)
return await fetchServerEmbedding(text);
}
async twoPassSearch(query: string, candidates: Concept[]): Promise<Concept[]> {
// Pass 1: Fast local scan with quantized embeddings
const queryEmbedding = await this.generateEmbedding(query);
const topK = this.localCosineSimilarity(queryEmbedding, candidates)
.sort((a, b) => b.score - a.score)
.slice(0, 20); // Top 20 candidates
// Pass 2: Server-side rerank with full precision
return await fetchServerRerank(query, topK);
}
}
Client Behavior Matrix:
| Server Config | Browser Capability | Client Behavior |
|---|---|---|
| OpenAI | Any | Always use /embedding/generate API (current behavior) |
| Local (nomic) | Supports transformers.js | Two-pass: Browser quantized scan → Server rerank |
| Local (nomic) | Limited resources | Fall back to server-only |
| Local (nomic) | No browser support | Server-only (like OpenAI) |
5. Migration Strategy
Leverage Existing Tool: The system already has an embedding migration command (kg embedding migrate, extended in Phase 3 of the implementation checklist).
Migration Steps:
1. Install sentence-transformers in API server
2. Set EMBEDDING_PROVIDER=local in .env
3. Restart API server (new config applied)
4. Run migration to re-embed all existing concepts
5. Verify consistency with test queries
Migration is One-Way Per Model:
- Switching from OpenAI → nomic requires full re-embedding (incompatible spaces)
- Switching from nomic → BGE requires full re-embedding (incompatible spaces)
- Once migrated, ALL clients must use the same model
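Conceptually, the re-embedding pass performed by the migration command is a loop over all stored concepts. The sketch below illustrates the shape of that loop only; the table and column names, the psycopg-style connection, and the batch size are assumptions, not the actual kg embedding migrate implementation:

def reembed_all_concepts(conn, provider, batch_size: int = 256) -> None:
    """Regenerate every stored concept embedding with the newly configured model."""
    with conn.cursor() as cur:
        cur.execute("SELECT id, label FROM concepts")  # schema assumed for illustration
        rows = cur.fetchall()

    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn.cursor() as cur:
            for concept_id, label in batch:
                embedding = provider.generate_embedding(label)
                cur.execute(
                    "UPDATE concepts SET embedding = %s WHERE id = %s",
                    (embedding, concept_id),
                )
        conn.commit()
        print(f"Re-embedded {min(start + batch_size, len(rows))}/{len(rows)} concepts")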
6. Model Recommendation
Recommended for Most Use Cases: nomic-embed-text-v1.5
| Criterion | nomic-embed-text-v1.5 | BGE-large-en-v1.5 | OpenAI text-embedding-3-small |
|---|---|---|---|
| Dimensions | 768 (flexible 64-768) | 1024 | 1536 |
| Context | 8K tokens | 512 tokens | 8K tokens |
| Model Size | ~275MB | ~1.3GB | N/A (API) |
| Browser Support | ✅ Excellent | ✅ Good (large) | ❌ API only |
| Cost | Free (local) | Free (local) | $0.02 / 1M tokens |
| Latency | <50ms (local) | <100ms (local) | 100-300ms (API) |
| Privacy | ✅ Fully local | ✅ Fully local | ❌ External API |
| Quantization | Int8, Int4, Binary | Int8 | N/A |
Why nomic-embed-text-v1.5:
- Smaller model = faster loading in the browser
- 8K context matches our chunking strategy (1000 words ~= 1500 tokens)
- Matryoshka representation learning allows dimension flexibility (future optimization)
- Strong transformers.js support
- Proven performance on the MTEB benchmark
Model Acquisition and Storage
Model Source: HuggingFace Model Hub
Default Behavior:
- sentence-transformers downloads models from HuggingFace on first use
- Models are cached locally to avoid re-downloading
- Requires internet access for the initial download only
- Subsequent loads use the cached version (offline capable)
Model Identifiers:
# Full HuggingFace model names
"nomic-ai/nomic-embed-text-v1.5" # Recommended
"BAAI/bge-base-en-v1.5" # Alternative
"BAAI/bge-large-en-v1.5" # High-accuracy option
Storage Location and Persistence
Default Cache Location: ~/.cache/huggingface/hub/ (shown in the example below; configurable via HF_HOME / TRANSFORMERS_CACHE)

Example:
~/.cache/huggingface/hub/models--nomic-ai--nomic-embed-text-v1.5/
├── blobs/ # Model weights and tokenizer files
├── refs/ # Git-style references
└── snapshots/ # Versioned model snapshots
Disk Space Requirements:

| Model | Cached Size | Runtime RAM |
|---|---|---|
| nomic-embed-text-v1.5 | ~275MB | ~400MB |
| bge-base-en-v1.5 | ~400MB | ~500MB |
| bge-large-en-v1.5 | ~1.3GB | ~1.5GB |
Storage grows with multiple models: Each model downloaded is cached separately.
Docker Volume Configuration
Problem: Docker containers lose downloaded models on rebuild/restart.
Solution: Mount persistent volume for HuggingFace cache.
docker-compose.yml:
services:
api:
image: knowledge-graph-api
volumes:
# Application code
- ./src:/app/src
# Persistent model cache (critical for local embeddings)
- huggingface-cache:/root/.cache/huggingface
environment:
# Optional: custom cache location (if set, mount the cache volume at /app/models instead of /root/.cache/huggingface)
TRANSFORMERS_CACHE: /app/models
HF_HOME: /app/models
volumes:
postgres_data:
huggingface-cache: # Persistent across container restarts
Alternative: Custom cache directory mounted from host:
volumes:
# Share host's HuggingFace cache with container
- ~/.cache/huggingface:/root/.cache/huggingface
Benefits:
- Models downloaded once, persist across container rebuilds
- Faster API startup (no re-download)
- Shared cache if running multiple containers
- Development and production use the same cache
Model Versioning and Pinning
Unpinned (Latest):
# Downloads latest version from HuggingFace
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5")
Pinned to Specific Revision:
# Pin to specific git commit for reproducibility
model = SentenceTransformer(
"nomic-ai/nomic-embed-text-v1.5",
revision="c35f52e75c6d8068a51e0524f03a30da4e31eac9" # Git commit hash
)
Database Configuration:
-- Track the exact model version used by storing the revision with the model reference
INSERT INTO kg_api.embedding_config (provider, model, dimensions, precision)
VALUES ('local', 'nomic-ai/nomic-embed-text-v1.5@c35f52e7', 768, 'float16');
Recommendation:
- Development: Use latest (auto-update on model improvements)
- Production: Pin to a specific revision (reproducibility, avoid surprise changes)
Model Download Strategies
Strategy 1: Download on First Use (Default)
Workflow:
1. API starts, checks cache
2. If model not cached, downloads from HuggingFace (~30-60s for nomic)
3. Caches model
4. Loads into memory
5. Subsequent starts use cache (fast)
Pros:
- ✅ Zero configuration
- ✅ Always get latest model
- ✅ Works out of the box

Cons:
- ⚠️ First startup slow (30-60s download time)
- ⚠️ Requires internet access on first use
- ⚠️ API health check fails during download
Strategy 2: Pre-Download During Deployment
Workflow:
# Pre-download models during container build or deployment
python3 -c "
from sentence_transformers import SentenceTransformer
print('Downloading nomic-embed-text-v1.5...')
SentenceTransformer('nomic-ai/nomic-embed-text-v1.5')
print('Model cached successfully')
"
Add to Dockerfile:
# Pre-download model during image build
RUN python3 -c "from sentence_transformers import SentenceTransformer; \
SentenceTransformer('nomic-ai/nomic-embed-text-v1.5')"
Pros:
- ✅ Fast API startup (model already cached)
- ✅ No internet required at runtime
- ✅ Health checks pass immediately
- ✅ Production-ready

Cons:
- ⚠️ Larger Docker image (~+300MB)
- ⚠️ Slower image builds
- ⚠️ Must rebuild image to update model
Strategy 3: Separate Init Container (Kubernetes)
Workflow:
# Init container downloads model to shared volume
initContainers:
- name: download-models
image: python:3.11
command:
- python3
- -c
- |
from sentence_transformers import SentenceTransformer
SentenceTransformer('nomic-ai/nomic-embed-text-v1.5')
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
Pros:
- ✅ Separation of concerns (download vs serve)
- ✅ Can update models without rebuilding app image
- ✅ Health checks pass on main container

Cons:
- ⚠️ Only applicable to Kubernetes deployments
- ⚠️ More complex orchestration
Recommendation by Deployment Type
Development (Local):
- Strategy 1: Download on first use
- Use the host's HuggingFace cache
- Accept slow first startup

Single Docker Container:
- Strategy 1 or 2
- Use a persistent volume for the cache
- Consider pre-download if startup speed is critical

Production (Docker Compose):
- Strategy 2: Pre-download in Dockerfile
- Use a persistent volume as backup
- Pin model version for reproducibility

Production (Kubernetes):
- Strategy 3: Init container
- Separate model updates from app deployments
- Use persistent volume claims
Model Management Commands
View cached models:
Clear cache (free disk space):
Check model disk usage:
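The first three tasks above can be done with a few lines of Python against the default cache location (paths are HuggingFace defaults, not project-specific):

import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "huggingface" / "hub"

# View cached models and their disk usage
for model_dir in sorted(cache_dir.glob("models--*")):
    size_mb = sum(f.stat().st_size for f in model_dir.rglob("*") if f.is_file()) / 1e6
    print(f"{model_dir.name}: {size_mb:.0f} MB")

# Clear the cache for one model (frees disk space; it re-downloads on next use)
# shutil.rmtree(cache_dir / "models--nomic-ai--nomic-embed-text-v1.5")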
Verify model availability:
from pathlib import Path
cache_dir = Path.home() / ".cache" / "huggingface" / "hub"
models = list(cache_dir.glob("models--*"))
print(f"Cached models: {len(models)}")
for model in models:
print(f" {model.name}")
Health Check Considerations
API health check should verify model availability:
@app.get("/health")
async def health():
"""Health check with model availability verification"""
# Check if model manager initialized
try:
manager = get_embedding_model_manager()
model_loaded = manager.is_loaded()
except RuntimeError:
model_loaded = False
return {
"status": "healthy" if model_loaded else "degraded",
"embedding_model_loaded": model_loaded,
"provider": os.getenv("EMBEDDING_PROVIDER", "openai")
}
Degraded status during model download:
- API responds with HTTP 200 but status: "degraded"
- Embedding endpoints return 503 Service Unavailable (see the sketch below)
- Allows the container to stay running while the model downloads
- Health check passes once the model is loaded
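The corresponding guard on the embedding endpoints could be a small helper that raises 503 until the model is ready (a sketch reusing get_embedding_model_manager from the health check above; the helper name is hypothetical):

from fastapi import HTTPException

def require_embedding_model():
    """Raise 503 while the local model is still downloading or loading."""
    try:
        manager = get_embedding_model_manager()
    except RuntimeError:
        raise HTTPException(status_code=503, detail="Embedding model is still loading; retry shortly.")
    if not manager.is_loaded():
        raise HTTPException(status_code=503, detail="Embedding model is still loading; retry shortly.")
    return manager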
Disk Space Monitoring
Recommended: Monitor cache directory size in production.
Alert thresholds:
- Warning: >5GB (multiple large models cached)
- Critical: >10GB (potential disk space issue)

Cleanup strategy:
- Remove unused models manually
- Or implement LRU cache pruning (delete least-recently-used models)
Consequences
Positive
- Cost Elimination: Zero ongoing costs for embeddings (no OpenAI API calls)
- Reduced Latency: Local generation: 10-50ms vs OpenAI API: 100-300ms
- Privacy: All embeddings generated locally, no query text sent externally
- Offline Capability: System works without internet access
- Browser Performance: Two-pass search enables <100ms interactive search
- Consistency Guarantee: Single /embedding/config endpoint ensures model alignment
- Provider Flexibility: Easy to switch between local and OpenAI via config
- No API Rate Limits: Unlimited embedding generation
Negative
- Initial Setup Complexity: sentence-transformers adds ~2GB model downloads
- Memory Overhead: Model loaded in API server (~300MB-1.3GB RAM)
- Migration Required: Existing OpenAI embeddings incompatible with local models
- Browser Bundle Size: Quantized model adds ~100MB to viz app (lazy loaded)
- Quantization Trade-off: Browser search slightly less accurate than server (2-5% degradation)
Neutral
- Embedding Quality: Nomic and BGE comparable to OpenAI text-embedding-3-small on benchmarks
- Configuration Complexity: Single global config enforces consistency but reduces flexibility
- Two Codepaths: Must maintain both OpenAI and local provider implementations
- Model Updates: sentence-transformers models can be updated independently of code
Alternatives Considered
Alternative 1: Client-Only Embeddings (No Server Abstraction)
Approach: Each client generates embeddings with its own model choice.
Rejected Because:
- Incompatible vector spaces break similarity search
- No way to enforce model consistency across clients
- Database would contain mixed embeddings (unusable)
Alternative 2: Multiple Embedding Endpoints (Per-Model)
Approach: Separate endpoints for each model: /embedding/openai, /embedding/nomic, /embedding/bge
Rejected Because:
- Increases complexity for clients (must choose an endpoint)
- Doesn't enforce consistency (clients could use the wrong endpoint)
- Harder to switch models system-wide
- More API surface area to maintain
Alternative 3: Dual-Precision Storage (Both Full and Quantized)
Approach: Store both float32 and int8 embeddings in database.
Rejected Because:
- Doubles storage requirements (~2x database size)
- Adds complexity to the ingestion pipeline
- Quantization can be done on the fly with minimal overhead
- Not necessary for two-pass search (quantize at query time)
Alternative 4: Pgvector for Similarity Search (Future Enhancement)
Approach: Use pgvector extension for indexed vector similarity instead of full-scan numpy.
Deferred to Future ADR:
- Decision point determined by actual usage patterns and scale (not speculation)
- Requires schema changes (vector column type)
- Needs index tuning (HNSW, IVFFlat parameters)
- ADR-038 documents full-scan as a "genuinely unusual" architectural choice
- Migration to pgvector should be its own decision when/if scale warrants it
- Compatible with this ADR (provider abstraction unchanged)
Configuration Update Strategy (Worker Recycling)
Challenge: sentence-transformers models are loaded into memory (300MB-1.3GB). Changing embedding configuration requires model reload. How to apply changes without extended downtime?
Phase 1: Manual API Restart (MVP - Current Implementation)
Approach:
- Embedding config stored in kg_api.embedding_config table
- API reads config from database at startup
- Configuration changes via PUT /admin/embedding/config
- Changes require manual API server restart to apply
Workflow:
# Update config via API
curl -X PUT http://localhost:8000/admin/embedding/config \
  -H "Content-Type: application/json" \
  -d '{"model_name": "BAAI/bge-large-en-v1.5", "num_threads": 8}'
# Restart API to apply
./scripts/services/stop-api.sh && ./scripts/services/start-api.sh
Characteristics:
- ✅ Simple implementation
- ✅ No risk of memory leaks
- ✅ Clean process state after restart
- ⚠️ ~2-5 second downtime during restart
- ⚠️ In-flight requests dropped
When to use: Single-instance deployments, infrequent config changes
Phase 2: Hot Reload with Signal Handling (Future Enhancement)
Approach:
- Signal handling: SIGHUP or POST /admin/embedding/config/reload triggers config reload
- Graceful model swap in running process
- No process restart required
Implementation Strategy:
# Method on the embedding model manager (requires asyncio, gc, time, and a logger in scope)
async def reload_model(self, new_config: dict):
"""
Hot reload model without process restart.
Process:
1. Load new model in parallel (brief 2x memory spike)
2. Atomic swap of model reference
3. Allow in-flight requests to finish with old model (5s grace period)
4. Explicit cleanup: del old_model, gc.collect()
5. Log reload time
"""
logger.info(f"🔄 Hot reloading model: {new_config['model_name']}")
start = time.time()
# Load new model (2x memory temporarily)
new_model = SentenceTransformer(new_config['model_name'])
# Atomic swap
old_model = self.model
self.model = new_model
self.model_name = new_config['model_name']
# Grace period for in-flight requests
await asyncio.sleep(5)
# Explicit cleanup
del old_model
gc.collect()
elapsed = time.time() - start
logger.info(f"✅ Model reloaded in {elapsed:.2f}s")
Workflow:
# Update config and reload (single operation)
curl -X POST http://localhost:8000/admin/embedding/config/reload \
  -H "Content-Type: application/json" \
  -d '{"model_name": "BAAI/bge-large-en-v1.5", "num_threads": 8}'
# API responds with reload status
{
"status": "reloading",
"estimated_time_seconds": 5,
"old_model": "nomic-ai/nomic-embed-text-v1.5",
"new_model": "BAAI/bge-large-en-v1.5"
}
Characteristics:
- ✅ Zero downtime (in-flight requests complete)
- ✅ Fast reload (~2-5 seconds)
- ✅ No external tools required
- ⚠️ Temporary 2x memory usage during reload
- ⚠️ Risk of memory leaks if the old model is not properly released
- ⚠️ Requires testing to verify garbage collection
When to use: Single-instance deployments with uptime requirements
Phase 3: Multi-Worker Rolling Restart (Future - Production Scale)
Approach:
- Process pool with multiple workers (e.g., gunicorn with 4-8 workers)
- Rolling restart: reload one worker at a time
- Load balancer routes traffic to healthy workers
- True zero downtime
Implementation Strategy:
# Supervisor manages multiple workers
gunicorn src.api.main:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8000
# Rolling restart command (gunicorn built-in)
kill -HUP <gunicorn-pid> # Gracefully reload all workers
Reload Process:
1. Worker 1: Reload model, rejoin pool
2. Worker 2: Reload model while workers 1, 3, 4 serve traffic
3. Worker 3: Reload model while workers 1, 2, 4 serve traffic
4. Worker 4: Reload model while workers 1, 2, 3 serve traffic
Characteristics:
- ✅ True zero downtime
- ✅ No memory spike (workers reload sequentially)
- ✅ Production-grade reliability
- ✅ Built-in gunicorn support
- ⚠️ More complex deployment architecture
- ⚠️ Requires a load balancer or proxy
- ⚠️ 4-8x memory overhead (multiple workers)
When to use: Multi-user production deployments, high availability requirements
Decision: Start with Phase 1, Add Phase 2 if Hot Reload Performs Well
Rationale:
- Phase 1 is simple and sufficient for most use cases
- Phase 2 can be added later if hot reload proves stable
- Phase 3 is only needed at significant scale (>100 concurrent users)
- Discover actual reload performance before committing to complexity
Metrics to track:
- Model load time: time to load model into memory
- Memory cleanup: RAM freed after old model deletion
- Request impact: number of requests affected during reload
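These metrics could be captured with lightweight instrumentation around the reload path. The sketch below is an assumption-laden example: psutil is an assumed extra dependency for RSS measurement, and request impact would come from the API's own request logs rather than this helper:

import time
import psutil  # assumed extra dependency for process memory (RSS) measurement

def measure_reload(reload_fn) -> dict:
    """Time a model (re)load and report how much process memory it added or freed."""
    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    start = time.time()
    reload_fn()  # e.g. the Phase 2 hot-reload routine
    return {
        "model_load_seconds": round(time.time() - start, 2),
        "rss_delta_mb": round((proc.memory_info().rss - rss_before) / 1e6, 1),
    }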
Implementation Checklist
Phase 1: Server-Side Local Embeddings (Core Functionality)
- [ ] Add sentence-transformers to requirements.txt
- [ ] Create embedding_config database table
- [ ] Implement LocalEmbeddingProvider class
- [ ] Add POST /embedding/generate endpoint
- [ ] Add GET /embedding/config endpoint
- [ ] Update .env.example with embedding variables
- [ ] Test local embedding generation
- [ ] Update API documentation
Phase 2: Embedding Configuration Management
- [ ] Add config validation on API startup
- [ ] Implement config consistency checks
- [ ] Add config to health check endpoint
- [ ] Update admin module with config commands
Phase 3: Migration Tool Extension
- [ ] Extend kg embedding migrate to support local models
- [ ] Add progress tracking for re-embedding
- [ ] Add validation checks (embedding dimensions)
- [ ] Update CLI documentation
Phase 4: Browser-Side Embeddings (Optional Enhancement)
- [ ] Add transformers.js to viz-app dependencies
- [ ] Implement BrowserEmbeddingService
- [ ] Add two-pass search to visualization app
- [ ] Add resource detection and fallback logic
- [ ] Performance testing and optimization
Related ADRs
- ADR-012 (API Server): Embedding endpoint extends FastAPI server architecture
- ADR-013 (Unified Client): kg CLI extended with embedding migration commands
- ADR-016 (Apache AGE): Embedding config stored in PostgreSQL alongside graph
- ADR-030 (Concept Deduplication): Quality tests must account for model changes
- ADR-034 (Graph Visualization): Browser embeddings enhance interactive search
- ADR-038 (O(n) Full-Scan Search): Local embeddings compatible with current search architecture
Future Consideration: - ADR-0XX (Pgvector Migration): Indexed vector search to replace full-scan similarity
References
Research Sources:
- Transformers.js v2.15.0 release notes (Feb 2024)
- Nomic AI: "Nomic Embed Text v2" blog post
- HuggingFace: "Binary and Scalar Embedding Quantization" blog
- OpenCypher specification (ISO/IEC 39075:2024)
- MTEB benchmark (Massive Text Embedding Benchmark)

Model Documentation:
- https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
- https://huggingface.co/BAAI/bge-large-en-v1.5
- https://platform.openai.com/docs/guides/embeddings

Quantization Research:
- "4bit-Quantization in Vector-Embedding for RAG" (arXiv:2501.10534v1)
- Zilliz Vector Database: Int8 Quantization Effects on Sentence Transformers
Last Updated: 2025-10-18