ADR-042: Local LLM Inference for Concept Extraction

Status: Accepted
Date: 2025-10-22
Implemented: 2025-10-22
Deciders: System Architects
Related: ADR-039 (Local Embedding Service), ADR-041 (AI Extraction Config), ADR-025 (Dynamic Relationship Vocabulary)

Overview

Imagine you're working on a knowledge base containing sensitive medical records, classified research, or proprietary business data. Every time you ingest a document, the system currently sends chunks of that text to OpenAI or Anthropic for concept extraction. Your private data leaves your infrastructure, travels across the internet, gets processed by someone else's computers, and then returns. This isn't just a privacy concern—it's often a compliance dealbreaker for regulated industries.

Beyond privacy, there's the cost problem. Cloud API extraction scales linearly with your document volume. Ingesting a thousand-page technical manual? That's hundreds of API calls at several cents each. The bills add up quickly, and there's no ceiling—you pay for every chunk, every time. Plus, you're completely dependent on internet connectivity and the provider staying online. No network? No ingestion.

Here's where the landscape has shifted: modern open-source LLMs running on consumer GPUs can now perform concept extraction with quality approaching GPT-4. Tools like Ollama make it dead simple to run models like Llama 3.1, Mistral, or Qwen locally. You download a model once, and then extraction is free—no API calls, no data leaving your system, no network dependency.

This ADR enables local LLM inference as a first-class extraction provider. The system integrates with Ollama (or any OpenAI-compatible local server) just like it integrates with cloud providers, using the same unified configuration system from ADR-041. Switch from GPT-4 to local Mistral with a single API call—no code changes, no architectural rewrites, just configuration. The key insight is that extraction is an abstract operation: "given text, return concepts"—it doesn't care whether those concepts came from a cloud API or a GPU in your server room.


Context

Currently, the system requires cloud API access (OpenAI or Anthropic) for concept extraction during document ingestion. This creates several challenges:

Current Limitations

  1. External Dependency
     - System cannot function without API access
     - Network failures block ingestion
     - Subject to provider outages

  2. Cost Considerations
     - API costs scale linearly with volume
     - Large ingestion jobs can be expensive
     - No cost ceiling for usage

  3. Privacy Concerns
     - Sensitive documents must be sent to third-party APIs
     - Compliance issues for regulated industries (HIPAA, GDPR, etc.)
     - No air-gapped deployment possible

  4. Latency
     - Network round-trips for each chunk (~1-3 seconds overhead)
     - Rate limiting can slow batch ingestion
     - Geographic latency for non-US regions

  5. Vendor Lock-in
     - Tied to specific providers' model availability
     - Cannot use latest open-source models
     - Model deprecation risk

Current Extraction Architecture

Processing Pipeline:

Document → Chunking → LLM Extraction → Graph Upsert
           (1000w)    (GPT-4o/Claude)   (PostgreSQL+AGE)

Chunking System (src/api/lib/chunker.py):
  - Target: 1000 words/chunk (configurable: 800-1500)
  - Overlap: 200 words between chunks for context
  - Smart Boundaries: Paragraph > Sentence > Pause > Hard cut
  - Average Document: 5000-50000 words = 5-50 chunks
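
A minimal sketch of the word-budget chunking strategy described above (a hypothetical helper, not the actual chunker.py implementation): it targets a word count per chunk, carries a fixed overlap into the next chunk, and prefers paragraph boundaries over hard cuts.

import re

def chunk_text(text: str, target_words: int = 1000, overlap_words: int = 200) -> list[str]:
    """Split text into ~target_words chunks at paragraph boundaries,
    carrying overlap_words of trailing context into the next chunk.
    Illustrative only; the real chunker also falls back to sentence/pause/hard cuts."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []

    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > target_words:
            chunks.append(current)
            current = current[-overlap_words:]  # carry overlap for context
        current.extend(words)

    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]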

LLM Requirements (per chunk):
  - Input Tokens: ~1500-2500 tokens
    - System prompt: ~500-700 tokens (includes relationship types)
    - Chunk text: ~1000-1500 tokens
    - Existing concepts list: 0-300 tokens (variable)
  - Output Tokens: ~500-2000 tokens (JSON structure)
  - Total: ~2000-4500 tokens per chunk
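
As a rough illustration of that budget, the sketch below combines the components above using the common ~1.3 tokens-per-word rule of thumb for English prose (the function and defaults are assumptions, not project code).

def estimate_chunk_tokens(chunk_words: int = 1000, prompt_tokens: int = 600,
                          concepts_tokens: int = 150, output_tokens: int = 1200) -> int:
    """Rough per-chunk token budget: chunk text (~1.3 tokens/word) plus system
    prompt, existing-concepts list, and expected JSON output."""
    return int(chunk_words * 1.3) + prompt_tokens + concepts_tokens + output_tokens

# 1000-word chunk -> roughly 3250 tokens, inside the ~2000-4500 range above
print(estimate_chunk_tokens())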

Extraction Output (JSON structure):

{
  "concepts": [{
    "concept_id": "concept_001",
    "label": "Concept Name",
    "search_terms": ["term1", "term2", "term3"]
  }],
  "instances": [{
    "concept_id": "concept_001",
    "quote": "Exact quote from text"
  }],
  "relationships": [{
    "from_concept_id": "concept_001",
    "to_concept_id": "concept_002",
    "relationship_type": "IMPLIES",  // 30-90 dynamic types (ADR-025)
    "confidence": 0.9
  }]
}

Dynamic Vocabulary Challenge (ADR-025):
  - Relationship types grow from 30 (baseline) to 30-90 (curator-approved)
  - Prompt size varies: ~150 tokens (30 types) to ~450 tokens (90 types)
  - Local models must handle variable-length relationship lists
  - JSON structure must support any valid type from active vocabulary
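
For illustration, a hedged sketch of how the variable relationship vocabulary might be folded into the extraction system prompt; the function name and prompt wording are hypothetical, not the project's actual prompt.

def build_system_prompt(relationship_types: list[str]) -> str:
    """Assemble the extraction system prompt from the active relationship vocabulary.
    Prompt length grows with the vocabulary (roughly 150-450 tokens for 30-90 types)."""
    type_list = "\n".join(f"- {t}" for t in sorted(relationship_types))
    return (
        "Extract concepts, instance quotes, and relationships from the text.\n"
        "Respond ONLY with JSON matching the documented schema.\n"
        "Allowed relationship_type values (use no others):\n"
        f"{type_list}\n"
    )

# Example: the active vocabulary would be loaded from the database (ADR-025)
prompt = build_system_prompt(["IMPLIES", "CONTRASTS_WITH", "PART_OF"])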

Success Criteria for Local Inference

  1. Quality: 90-95%+ of GPT-4o extraction quality
  2. Reliability: 99%+ valid JSON responses
  3. Performance: < 30 seconds per chunk (acceptable for batch ingestion)
  4. Resource Efficiency: Run alongside PostgreSQL and embedding model
  5. Deployment Simplicity: Easy installation and model management

Decision

Extend the existing extraction provider system to support local inference backends.

Architectural Approach

Follow the same pattern established for embeddings (ADR-039):

  1. Provider Abstraction (see the sketch after this list)
     - Add ollama, vllm, and llama-cpp as new provider types
     - Extend ai_extraction_config table to support local providers
     - Same API/CLI as existing providers (openai, anthropic)

  2. Configuration Pattern
     - Users can choose between remote (openai, anthropic) and local (ollama, vllm) providers
     - Similar to embeddings: kg admin extraction set --provider ollama --model mistral:7b-instruct
     - Hot reload support (if model is already loaded)
     - Provider-specific settings (base_url, temperature, etc.)

  3. Deployment Options
     - Docker Compose (Recommended): Self-contained stack with Ollama service
     - External Endpoint: Point to existing local inference server
     - System Installation: User installs Ollama/vLLM themselves
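
A minimal sketch of what the provider abstraction could look like, assuming an AIProvider base class as referenced in Phase 1; the class and method names here are illustrative, not the project's actual interface.

from abc import ABC, abstractmethod

class AIProvider(ABC):
    """Abstract extraction provider: given text, return concepts/instances/relationships."""

    @abstractmethod
    def extract_concepts(self, chunk_text: str, relationship_types: list[str],
                         existing_concepts: list[str]) -> dict:
        """Return the extraction JSON structure shown in the Context section."""
        ...

# Hypothetical registry keyed by the `provider` column of ai_extraction_config.
# Cloud and local providers implement the same interface, so switching is pure configuration:
#   "openai" -> OpenAIProvider, "anthropic" -> AnthropicProvider,
#   "ollama" -> OllamaProvider (this ADR), "vllm" -> VLLMProvider (OpenAI-compatible)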

Technology Choices

Primary: Ollama (Default Local Provider)

Why Ollama:

  1. Simplest Deployment
     - Docker image available: ollama/ollama
     - OpenAI-compatible API (drop-in replacement)
     - Automatic model management
     - JSON mode support
     - Model listing API: GET /api/tags

  2. Uses llama.cpp Under the Hood
     - Ollama wraps llama.cpp for inference
     - Gets llama.cpp's performance and quantization
     - Adds management layer (download, update, list models)
     - Better API ergonomics than raw llama.cpp

  3. Docker Compose Integration

    services:
      ollama:
        image: ollama/ollama:latest
        ports:
          - "11434:11434"
        volumes:
          - ollama-models:/root/.ollama
        environment:
          - OLLAMA_NUM_PARALLEL=2
          - OLLAMA_MAX_LOADED_MODELS=1

  4. Flexibility
     - Users can run Ollama externally and point to it
     - Or use included Docker Compose service
     - Or install Ollama system-wide
     - Model discovery via API (GET /api/tags)
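
As a small illustration of the model-listing API mentioned above, a sketch that queries Ollama's GET /api/tags endpoint to discover which models are already pulled (error handling kept minimal; the helper name is ours).

import json
import urllib.request

def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the names of models available on an Ollama server via GET /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]

# Example: verify the configured model is present before enabling local extraction
if "mistral:7b-instruct" not in list_ollama_models():
    print("Model not found - run: ollama pull mistral:7b-instruct")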

Secondary: vLLM (Optional - Enterprise)

Why Support vLLM:
  - Highest throughput for GPU deployments
  - Tensor parallelism for 70B+ models
  - Production-grade with load balancing
  - Users may already have vLLM running

Integration:

# docker-compose.yml (optional vLLM service)
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    command: --model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.9

Tertiary: llama.cpp (Future)

Why Consider:
  - Pure CPU inference
  - Extremely low resource usage
  - Good for edge deployments

Integration: Via llama-cpp-python or standalone server

Configuration Schema

Extend ai_extraction_config table (follows embedding pattern from ADR-039):

-- Existing columns
provider VARCHAR(50)       -- "openai", "anthropic", "ollama", "vllm", "llama-cpp"
model_name VARCHAR(200)    -- "gpt-4o", "claude-sonnet-4", "mistral:7b-instruct"
supports_vision BOOLEAN
supports_json_mode BOOLEAN
max_tokens INTEGER

-- New columns for local providers
base_url VARCHAR(255)              -- "http://localhost:11434" or "http://ollama:11434"
temperature FLOAT DEFAULT 0.1      -- Lower for consistent JSON
top_p FLOAT DEFAULT 0.9
gpu_layers INTEGER DEFAULT -1      -- -1 = auto, 0 = CPU only (llama.cpp)
num_threads INTEGER DEFAULT 4      -- CPU threads (llama.cpp)

Deployment Scenarios

We provide hardware-optimized docker-compose profiles:

Scenario 1: Docker Compose Stack (Recommended)

docker-compose.ollama.yml (Main Ollama Config)

version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: kg-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_MAX_LOADED_MODELS=1
    restart: unless-stopped
    profiles:
      - nvidia
      - intel
      - amd
      - cpu

  # NVIDIA GPU variant
  ollama-nvidia:
    extends: ollama
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    profiles:
      - nvidia

  # Intel GPU variant (Arc, Iris Xe)
  ollama-intel:
    extends: ollama
    image: ollama/ollama:latest
    devices:
      - /dev/dri:/dev/dri  # Intel GPU device
    environment:
      - OLLAMA_INTEL_GPU=1
      - NEOReadDebugKeys=1
      - ClDeviceGlobalMemSizeAvailablePercent=100
    profiles:
      - intel

  # AMD GPU variant (ROCm)
  ollama-amd:
    extends: ollama
    image: ollama/ollama:rocm
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0  # Adjust for your AMD GPU
    profiles:
      - amd

  # CPU-only variant (optimized for AVX2/AVX512)
  ollama-cpu:
    extends: ollama
    image: ollama/ollama:latest
    environment:
      - OLLAMA_NUM_THREADS=8
      - OLLAMA_USE_MMAP=true
    cpus: 8
    mem_limit: 16g
    profiles:
      - cpu

volumes:
  ollama-models:

Usage by Hardware Type

# NVIDIA GPU (default for most users)
docker-compose -f docker-compose.yml -f docker-compose.ollama.yml --profile nvidia up -d

# Intel GPU (Arc, Iris Xe)
docker-compose -f docker-compose.yml -f docker-compose.ollama.yml --profile intel up -d

# AMD GPU (ROCm-compatible)
docker-compose -f docker-compose.yml -f docker-compose.ollama.yml --profile amd up -d

# CPU-only (no GPU)
docker-compose -f docker-compose.yml -f docker-compose.ollama.yml --profile cpu up -d

Main docker-compose.yml Integration

services:
  api:
    build: .
    depends_on:
      - postgres
      - ollama  # Any ollama variant
    environment:
      - EXTRACTION_PROVIDER=ollama
      - EXTRACTION_BASE_URL=http://kg-ollama:11434
      - EXTRACTION_MODEL=mistral:7b-instruct

Usage:

# Start everything
docker-compose up -d

# Pull model (first time)
docker exec kg-ollama ollama pull mistral:7b-instruct

# Configure extraction
kg admin extraction set --provider ollama --model mistral:7b-instruct

# Test
kg ingest file -o "Test" -y document.txt

Scenario 2: External Ollama Instance

User already has Ollama running elsewhere:

# Point to existing Ollama
kg admin extraction set \
  --provider ollama \
  --model mistral:7b-instruct \
  --base-url http://my-gpu-server:11434

# System connects to external endpoint

Scenario 3: System-Wide Ollama Installation

User installs Ollama on host machine:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull model
ollama pull mistral:7b-instruct

# Configure to use localhost
kg admin extraction set --provider ollama --model mistral:7b-instruct
# (base_url defaults to http://localhost:11434)

Scenario 4: vLLM for Enterprise

# Start vLLM container
docker run -d --gpus all \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-70B-Instruct

# Configure extraction
kg admin extraction set \
  --provider vllm \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --base-url http://localhost:8000

Hardware Deployment Profiles

Based on development machine specifications and realistic deployment scenarios:

Reference Hardware (Development Machine)

CPU:    AMD Ryzen 9 9950X3D (16 cores, 32 threads)
RAM:    123 GB DDR5
GPU:    NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
Disk:   1.9 TB NVMe SSD

Profile 1: Budget CPU-Only (Entry Level)

CPU:    Intel i7-12700K or AMD Ryzen 7 5800X (12 cores, 20 threads)
RAM:    32 GB DDR4 (recommend 48-64 GB)
GPU:    None
Disk:   512 GB NVMe SSD
Cost:   ~$800-1000

Resource Allocation:
- PostgreSQL + AGE:      4-8 GB RAM, 2-4 CPU cores
- Embedding Model:       2-4 GB RAM, 2 CPU cores (quantized)
- Extraction Model:      12-16 GB RAM, 6-8 CPU cores (quantized)
- FastAPI + Workers:     2-4 GB RAM, 2 CPU cores
- System Overhead:       4-6 GB RAM
Total:                   24-38 GB RAM

Recommended Models:
- Mistral 7B Instruct (Q4_K_M: ~4GB)
- Llama 3.1 8B (Q4_K_M: ~4.5GB)
- Phi-3 Medium 14B (Q4_K_M: ~8GB)

Performance:
- ~2-5 tokens/second
- ~30-90 seconds per chunk
- ~5-75 minutes per document (10-50 chunks)
- Best for: Personal use, low-volume ingestion, testing

Profile 2: Mid-Range GPU (Prosumer)

CPU:    AMD Ryzen 9 7900X (12 cores, 24 threads)
RAM:    64 GB DDR5
GPU:    NVIDIA RTX 4070 (12 GB VRAM) or RTX 3060 (12 GB VRAM)
Disk:   1 TB NVMe SSD
Cost:   ~$1500-2000

Resource Allocation:
- PostgreSQL + AGE:      8-12 GB RAM, 3-4 CPU cores
- Embedding Model (GPU): 2-3 GB VRAM, 1 GB RAM
- Extraction Model (GPU):8-10 GB VRAM, 4-6 GB RAM
- FastAPI + Workers:     3-5 GB RAM, 2-3 CPU cores
- System Overhead:       6-8 GB RAM
Total:                   23-32 GB RAM, 10-13 GB VRAM

Recommended Models:
- Mistral 7B Instruct (FP16: ~14GB or 8-bit: ~7GB)
- Llama 3.1 8B Instruct (FP16: ~16GB or 8-bit: ~8GB)
- Qwen2.5 7B Instruct (FP16: ~14GB)
- Mixtral 8x7B (Q4: ~24GB model size, needs CPU offload)

Performance:
- ~20-40 tokens/second
- ~10-20 seconds per chunk
- ~2-17 minutes per document (10-50 chunks)
- Best for: Small teams, moderate volume, development

Profile 3: High-End GPU (Production - Reference Machine)

CPU:    AMD Ryzen 9 9950X3D (16 cores, 32 threads)
RAM:    128 GB DDR5
GPU:    NVIDIA RTX 4060 Ti (16 GB VRAM) or RTX 4080 (16 GB VRAM)
Disk:   2 TB NVMe SSD
Cost:   ~$2500-3500

Resource Allocation:
- PostgreSQL + AGE:      12-16 GB RAM, 4-6 CPU cores
- Embedding Model (GPU): 2-3 GB VRAM, 512 MB RAM
- Extraction Model (GPU):12-14 GB VRAM, 6-8 GB RAM
- FastAPI + Workers:     4-6 GB RAM, 2-3 CPU cores
- System Overhead:       8-10 GB RAM
Total:                   32-43 GB RAM, 14-17 GB VRAM

Recommended Models:
- Llama 3.1 8B Instruct (FP16: ~16GB)
- Mistral 7B Instruct (FP16: ~14GB)
- Qwen2.5 7B Instruct (FP16: ~14GB, excellent reasoning)
- Qwen2.5 14B Instruct (8-bit: ~14GB, highest quality)
- Phi-3.5 Mini Instruct (FP16: ~7.6GB, fastest)
- Gemma 2 9B Instruct (8-bit: ~9GB)

Performance:
- ~30-60 tokens/second (7-8B models)
- ~20-40 tokens/second (14B models)
- ~5-15 seconds per chunk
- ~1-13 minutes per document (10-50 chunks)
- Best for: Production deployments, high-volume ingestion

Profile 4: Professional GPU (Enterprise)

CPU:    AMD Threadripper PRO 5975WX (32 cores, 64 threads)
RAM:    256 GB DDR4 ECC
GPU:    NVIDIA RTX 6000 Ada (48 GB VRAM) or A100 (40-80 GB VRAM)
Disk:   4 TB NVMe RAID
Cost:   ~$8000-15000

Resource Allocation:
- PostgreSQL + AGE:      32-48 GB RAM, 8-12 CPU cores
- Embedding Model (GPU): 2-3 GB VRAM, 1 GB RAM
- Extraction Model (GPU):40-45 GB VRAM, 12-16 GB RAM
- FastAPI + Workers:     8-12 GB RAM, 4-6 CPU cores
- System Overhead:       16-24 GB RAM
Total:                   69-102 GB RAM, 42-48 GB VRAM

Recommended Models:
- Llama 3.1 70B Instruct (8-bit: ~35GB or 4-bit: ~20GB)
- Qwen2.5 72B Instruct (8-bit: ~36GB, best reasoning)
- Mixtral 8x22B (Q4: ~42GB)
- DeepSeek Coder 33B (8-bit: ~17GB, code-focused)
- Hybrid: 70B + 8B routing by complexity

Performance:
- ~40-100 tokens/second (7-8B models)
- ~10-30 tokens/second (70B models)
- ~3-10 seconds per chunk
- ~0.5-8 minutes per document (10-50 chunks)
- Best for: Enterprise, highest quality extraction

Profile 5: Cloud/Bare Metal (Hyperscale)

CPU:    Dual EPYC 7763 (128 cores, 256 threads)
RAM:    512 GB - 1 TB DDR4 ECC
GPU:    4x NVIDIA A100 (80 GB VRAM each) or 8x A40
Disk:   10+ TB NVMe RAID
Cost:   ~$50000-100000

Resource Allocation:
- PostgreSQL + AGE:      64-128 GB RAM, 16-24 CPU cores
- Embedding Model:       4-6 GB VRAM, 2 GB RAM
- Extraction Models:     2-3 GPUs for parallel 70B models
- Vision Model:          1 GPU @ 20-40GB VRAM
- FastAPI + Workers:     16-32 GB RAM, 8-12 CPU cores
Total:                   128-210 GB RAM, tensor parallelism

Recommended Deployment:
- vLLM with multiple models:
  - 2x Llama 3.1 70B Instruct (load balanced)
  - 1x Qwen2.5 72B (fallback/comparison)
  - 1x Llama 3.2 Vision 90B (multimodal)
- Model routing:
  - Complexity-based (simple → 8B, complex → 70B)
  - Content-based (code → Qwen/DeepSeek, general → Llama)
  - Real-time load balancing

Performance:
- ~200+ tokens/second aggregate
- <5 seconds per chunk
- <5 minutes per document (parallel ingestion)
- Best for: Large enterprises, 24/7 production, batch processing

Model Size Recommendations by Profile

| Profile | VRAM | Best Model Size | Quantization | Expected Quality |
|---|---|---|---|---|
| CPU-Only | 0 GB | 7B | 4-bit (Q4_K_M) | Good (85-90% of GPT-4o) |
| Mid-Range | 12 GB | 7-8B | FP16 or 8-bit | Very Good (90-95% of GPT-4o) |
| High-End | 16 GB | 7-14B | FP16 | Excellent (95-98% of GPT-4o) |
| Professional | 48 GB | 70B | 8-bit or 4-bit | Near-GPT-4o (98-100%) |
| Enterprise | 320+ GB | 70B+ | FP16 or multiple | GPT-4o equivalent or better |

Performance Analysis

Document Ingestion Timeline

Example: 25,000-word technical document

| Profile | Model | Chunking | Extraction (25 chunks) | Total | Cost |
|---|---|---|---|---|---|
| GPT-4o API | gpt-4o | 2s | 50s (parallel) | ~52s | $0.25 |
| Anthropic API | claude-sonnet-4 | 2s | 45s (parallel) | ~47s | $0.20 |
| CPU-Only | Mistral 7B Q4 | 2s | 37min (serial) | ~37min | $0 |
| Mid-Range GPU | Llama 8B FP16 | 2s | 7min (serial) | ~7min | $0 |
| High-End GPU | Qwen 14B 8-bit | 2s | 4min (serial) | ~4min | $0 |
| Professional | Llama 70B 8-bit | 2s | 3min (serial) | ~3min | $0 |
| Enterprise | 2x 70B parallel | 2s | 90s (parallel) | ~92s | $0 |

Key Insights:
  - Cloud APIs: Fastest for single documents (~1min), but costs scale linearly
  - CPU-Only: Slow but functional for batch/overnight processing
  - Mid-Range GPU: Sweet spot for most users (7min acceptable for batch)
  - High-End GPU (16GB): Production-ready performance (~4min/document)
  - Enterprise: Approaches cloud speed with parallel processing

Batch Ingestion Analysis

Scenario: 100 documents @ 10,000 words each (1000 chunks total)

| Profile | Model | Total Time | Throughput | Total Cost |
|---|---|---|---|---|
| GPT-4o API | gpt-4o | ~17 hours* | 6 docs/hour | $100 |
| Anthropic API | claude-sonnet-4 | ~15 hours* | 7 docs/hour | $80 |
| CPU-Only | Mistral 7B Q4 | ~62 hours | 2 docs/hour | $0 |
| Mid-Range GPU | Llama 8B FP16 | ~12 hours | 8 docs/hour | $0 |
| High-End GPU | Qwen 14B 8-bit | ~7 hours | 14 docs/hour | $0 |
| Professional | Llama 70B 8-bit | ~5 hours | 20 docs/hour | $0 |
| Enterprise | 2x 70B parallel | ~2.5 hours | 40 docs/hour | $0 |

*Rate limits and throttling included

Break-Even Analysis:
  - Mid-Range GPU ($1500): Pays for itself after ~1,200 documents (vs GPT-4o)
  - High-End GPU ($3000): Pays for itself after ~2,400 documents
  - No ongoing costs - only electricity (~$0.10-0.50/hour for GPU)


Implementation Plan

Phase 1: Ollama Integration (MVP) - Week 1-2

Goals:
  - Basic local extraction with Ollama
  - Support 7-8B models (Mistral, Llama, Qwen)
  - JSON mode for structured output
  - Configuration and CLI commands

Tasks:

  1. Create OllamaProvider class extending AIProvider (see the sketch after this list)
     - Implement extract_concepts() using Ollama API
     - JSON mode configuration
     - Error handling and retries

  2. Add extraction config fields (migration 007):

    ALTER TABLE kg_api.ai_extraction_config
    ADD COLUMN backend VARCHAR(50),           -- "ollama", "vllm", "openai", "anthropic"
    ADD COLUMN base_url VARCHAR(255),         -- "http://localhost:11434"
    ADD COLUMN temperature FLOAT DEFAULT 0.1, -- Low for consistent JSON
    ADD COLUMN top_p FLOAT DEFAULT 0.9,
    ADD COLUMN gpu_layers INTEGER DEFAULT -1, -- -1 = auto, 0 = CPU only
    ADD COLUMN num_threads INTEGER DEFAULT 4; -- CPU threads

  3. CLI commands:

    kg admin extraction set --provider local --backend ollama --model mistral:7b-instruct
    kg admin extraction test  # Test current config

  4. Ollama installation documentation
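
A hedged sketch of what the OllamaProvider might look like, assuming Ollama's native /api/chat endpoint with format: "json" for structured output; the class shape, method signature, and prompt handling are illustrative rather than the shipped implementation.

import json
import urllib.request

class OllamaProvider:
    """Extraction provider backed by a local Ollama server (sketch; the real class
    extends AIProvider in src/api/lib/ai_providers.py)."""

    def __init__(self, model: str, base_url: str = "http://localhost:11434",
                 temperature: float = 0.1):
        self.model = model
        self.base_url = base_url.rstrip("/")
        self.temperature = temperature

    def extract_concepts(self, system_prompt: str, chunk_text: str) -> dict:
        """POST the chunk to Ollama's /api/chat endpoint and parse the JSON reply."""
        payload = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": chunk_text},
            ],
            "format": "json",          # ask Ollama for JSON-mode output
            "stream": False,
            "options": {"temperature": self.temperature},
        }
        req = urllib.request.Request(
            f"{self.base_url}/api/chat",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=300) as resp:
            body = json.load(resp)
        # Ollama returns the assistant message under message.content
        return json.loads(body["message"]["content"])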

Deliverables:
  - Working local extraction with Ollama
  - Documentation and user guide
  - Performance benchmarks

Phase 2: Quality Validation - Week 2-3

Goals:
  - Validate extraction quality vs GPT-4o
  - Test relationship type accuracy
  - JSON reliability testing
  - Edge case handling

Tasks:

  1. Quality Testing Suite
     - 100 test documents (diverse domains)
     - Compare local vs GPT-4o extraction
     - Relationship type accuracy metrics
     - JSON parsing success rate

  2. Benchmarking
     - Tokens/second across models
     - Memory usage profiling
     - CPU vs GPU performance
     - Concurrent extraction + embedding

  3. Error Handling (see the retry sketch after this list)
     - Malformed JSON recovery
     - Timeout handling
     - Retry logic
     - Fallback to cloud if needed
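
A minimal sketch of the malformed-JSON recovery and retry logic, assuming a provider call like the one sketched in Phase 1; the retry count and validation rule are illustrative.

import json

REQUIRED_KEYS = {"concepts", "instances", "relationships"}

def extract_with_retries(provider, system_prompt: str, chunk_text: str,
                         max_attempts: int = 3) -> dict:
    """Call the provider, validate the JSON structure, and retry on failure.
    A production version could fall back to a cloud provider after the last attempt."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = provider.extract_concepts(system_prompt, chunk_text)
            if REQUIRED_KEYS.issubset(result):
                return result
            last_error = ValueError(f"missing keys: {REQUIRED_KEYS - result.keys()}")
        except (json.JSONDecodeError, KeyError, OSError) as exc:
            last_error = exc
    raise RuntimeError(f"extraction failed after {max_attempts} attempts: {last_error}")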

Acceptance Criteria:
  - 99%+ valid JSON responses
  - 90%+ relationship type accuracy
  - 90-95% extraction quality vs GPT-4o baseline

Phase 3: Advanced Features - Week 3-4

Goals:
  - Model switching and hot reload
  - Resource optimization
  - Hybrid mode (local + cloud fallback)
  - Multi-model support

Tasks:

  1. Model Management
     - Hot reload local models
     - Model size detection and warnings
     - Automatic quantization selection

  2. Hybrid Mode (see the fallback sketch after this list)
     - Try local first, fallback to cloud on error
     - Configurable fallback threshold
     - Cost tracking for hybrid mode

  3. Performance Optimization
     - Batch processing for parallel chunks
     - Model quantization recommendations
     - Memory usage optimization
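
A hedged sketch of the hybrid local-first/cloud-fallback flow, reusing the hypothetical provider interface from earlier; the attempt count and bookkeeping details are assumptions.

def hybrid_extract(local_provider, cloud_provider, system_prompt: str,
                   chunk_text: str, local_attempts: int = 2) -> dict:
    """Try the local provider first; fall back to the cloud provider on repeated failure.
    A real implementation would also record which provider served each chunk for cost tracking."""
    for _ in range(local_attempts):
        try:
            return local_provider.extract_concepts(system_prompt, chunk_text)
        except Exception:
            continue  # malformed JSON, timeout, connection error, ...
    return cloud_provider.extract_concepts(system_prompt, chunk_text)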

Deliverables:
  - Production-ready local extraction
  - Hybrid cloud/local mode
  - Performance tuning guide

Phase 4: Enterprise Features (Future)

Goals:
  - vLLM backend support
  - Multi-model deployment
  - Vision support (code translation, image description)

Tasks:

  1. vLLM Integration
     - Alternative backend for GPU deployments
     - Tensor parallelism for 70B+ models
     - Load balancing across models

  2. Vision Models
     - Llama 3.2 Vision for describe_image()
     - Multimodal extraction pipeline
     - Code diagram understanding

  3. Advanced Routing (see the routing sketch after this list)
     - Complexity-based model selection
     - Content-type routing (code → Qwen, general → Llama)
     - Cost/quality optimization
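
A small sketch of what complexity- and content-based routing could look like; the heuristics and model tags are illustrative assumptions, not a tuned policy.

def choose_model(chunk_text: str) -> str:
    """Route a chunk to a model by rough content type and complexity.
    Heuristics are placeholders; a real router would use measured quality/cost data."""
    words = chunk_text.split()
    looks_like_code = chunk_text.count("{") + chunk_text.count(";") > len(words) * 0.05
    if looks_like_code:
        return "qwen2.5:7b"        # code-heavy content
    if len(words) > 800 or sum(len(w) > 12 for w in words) > 40:
        return "llama3.1:70b"      # long or terminology-dense chunks
    return "llama3.1:8b"           # default fast path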

Consequences

Positive

  1. Self-Hosted Capability
     - Complete air-gapped deployment possible
     - No external dependencies for extraction
     - Full control over models and versions

  2. Cost Reduction
     - Zero ongoing API costs after hardware investment
     - Predictable infrastructure costs
     - ROI after ~1,000-2,000 documents

  3. Privacy & Compliance
     - Sensitive documents never leave premises
     - HIPAA, GDPR, SOC2 compliant deployments
     - No data sharing with third parties

  4. Performance
     - No network latency
     - Parallel processing on local hardware
     - Batch ingestion without rate limits

  5. Flexibility
     - Use latest open-source models
     - Custom fine-tuned models
     - Model switching based on workload

  6. Hybrid Capability
     - Can combine local + cloud
     - Fallback to cloud for peak loads
     - Cost optimization per document

Negative

  1. Hardware Requirements
     - Minimum 32GB RAM, ideally 64GB+
     - GPU strongly recommended for production
     - Storage for model files (5-50GB per model)

  2. Initial Setup Complexity
     - Ollama installation required
     - Model downloading (one-time, 5-50GB)
     - GPU drivers and CUDA (if using GPU)

  3. Quality Trade-offs
     - 7-8B models: 90-95% of GPT-4o quality
     - 14B models: 95-98% of GPT-4o quality
     - 70B models needed to match GPT-4o

  4. Maintenance
     - Model updates are manual
     - Monitoring resource usage
     - Troubleshooting local inference issues

  5. Performance Variability
     - Depends heavily on hardware
     - CPU-only deployments are slow (30-90s/chunk)
     - Concurrent load affects other services

Risks & Mitigation

| Risk | Impact | Mitigation |
|---|---|---|
| Poor JSON reliability | High | Extensive testing, retry logic, schema validation |
| Relationship type accuracy | High | Quality benchmarking, 70B models for production |
| Resource contention | Medium | Resource limits, monitoring, load balancing |
| Model availability | Low | Ollama handles downloads, model caching |
| User confusion | Medium | Clear documentation, CLI helpers, error messages |

Security Considerations

  1. Local Model Files
     - Models stored in ~/.ollama/models (Linux/macOS)
     - Large files (5-50GB) - ensure disk space
     - Consider model file integrity checks

  2. Network Access
     - Ollama API on localhost:11434 by default
     - Can expose for distributed deployments (add auth)
     - Firewall rules for remote access

  3. Resource Limits
     - Set memory limits to prevent OOM
     - GPU allocation management
     - Process isolation for Ollama

  4. Data Privacy
     - All inference happens locally
     - No telemetry by default
     - Audit logs for extraction requests

Open Questions

  1. Default Model: Which model should be default? (Mistral 7B vs Llama 8B vs Qwen 7B)
  2. Fallback Strategy: Auto-fallback to cloud or explicit user choice?
  3. Installation: Bundle Ollama installer or document separate install?
  4. Vision Support: Include in Phase 1 or defer to Phase 4?
  5. Quantization: Auto-detect optimal quantization based on VRAM?
  6. Monitoring: Built-in performance metrics or rely on external tools?

Success Metrics

Quality Metrics

  • ✅ 99%+ valid JSON responses
  • ✅ 90%+ relationship type accuracy (compared to GPT-4o baseline)
  • ✅ 95%+ concept extraction quality (F1 score vs GPT-4o)
  • ✅ <5% quote extraction errors (exact match failures)

Performance Metrics

  • ✅ < 30 seconds/chunk on mid-range GPU (acceptable for batch)
  • ✅ < 15 seconds/chunk on high-end GPU (production ready)
  • ✅ < 10 minutes for 10,000-word document (end-to-end)

Adoption Metrics

  • ✅ 50% of users try local inference within 6 months
  • ✅ 25% of production deployments use local inference
  • ✅ Positive user feedback on cost savings

Reliability Metrics

  • ✅ 99.9% uptime for local inference service
  • ✅ <1% error rate during extraction
  • ✅ Graceful degradation under load

Related ADRs

  • ADR-039: Local Embedding Service - Similar architectural decision for embeddings
  • ADR-041: AI Extraction Config - Existing config system this extends
  • ADR-025: Dynamic Relationship Vocabulary - Variable prompt size requirement
  • ADR-040: Database Schema Migrations - Migration 007 for new config fields

References

  • Ollama: https://ollama.com
  • vLLM: https://github.com/vllm-project/vllm
  • llama.cpp: https://github.com/ggerganov/llama.cpp
  • HuggingFace TGI: https://github.com/huggingface/text-generation-inference
  • Llama 3.1 Model Card: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
  • Mistral AI: https://mistral.ai/
  • Qwen 2.5: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

Document Status: Accepted - Implemented
Author: System Architects
Date: 2025-10-22
Implemented: 2025-10-22

Implementation Status

Phase 1 (MVP) - Completed:
  - ✅ OllamaProvider class extending AIProvider (src/api/lib/ai_providers.py)
  - ✅ Database migration 007 for extraction config fields (schema/migrations/007_add_local_extraction_providers.sql)
  - ✅ CLI commands for Ollama configuration (client/src/cli/ai-config.ts)
  - ✅ Hardware-optimized Docker Compose profiles (docker-compose.ollama.yml)
    - NVIDIA GPU profile
    - AMD GPU (ROCm) profile
    - Intel GPU profile
    - CPU-only profile
  - ✅ Startup/stop scripts (scripts/start-ollama.sh, scripts/stop-ollama.sh)

Phase 2-4 (Future):
  - ⏳ Quality validation testing suite
  - ⏳ Hybrid cloud/local fallback mode
  - ⏳ vLLM backend support
  - ⏳ Vision model integration