Compare Extraction Quality

Different AI providers produce different graph characteristics from the same document. This page helps you choose a provider based on measured trade-offs between concept volume, schema compliance, speed, and cost.

All providers flow through the same Kappa Graph ingestion pipeline — the same chunker, relationship normalizer, and embeddings. Quality differences are model-dependent, not architectural (ADR-806).

Quick decision table

Goal	Provider	Why
Production knowledge base	OpenAI GPT-4o	46 concepts, 88% canonical, 2 s/chunk
Maximum concept extraction	Ollama Qwen3 14B	57 concepts, 74% canonical, 16 GB VRAM
Strictest schema compliance	Ollama Qwen 2.5 14B	92% canonical, zero cost
Large private corpus (1 000+ docs)	Ollama Qwen 2.5 14B	92% canonical, privacy-preserving, zero cost
Quick prototyping	OpenAI GPT-4o	5–30× faster than the local models tested (2 s/chunk vs 10–60 s/chunk), no setup
Avoid	Ollama Mistral 7B	38% canonical — vocabulary pollution

Measured results

The figures below come from a single test document (a 6-chunk philosophical lecture transcript) run through each provider with identical settings. The test establishes relative rankings; absolute numbers will vary with document type.

Summary statistics

Metric	Mistral 7B	Qwen 2.5 14B	Qwen3 14B	GPT-OSS 20B	GPT-4o
Concepts extracted	32	24	57	48	46
Evidence instances	36	27	70	49	48
Total relationships	134	98	61	190	172
Unique relationship types	16	13	22	17	17
Canonical adherence	38%	92%	74%	65%	88%
Non-canonical types created	10	1	5	6	2
Inference speed	~10 s/chunk	~15 s/chunk	~60 s/chunk	~20 s/chunk	~2 s/chunk
Cost per document	$0.00	$0.00	$0.00	$0.00	~$0.10
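As a rough aid for reading the table, the sketch below filters the measured figures by thresholds you choose. The `shortlist` helper and its parameter names are illustrative, not part of Kappa Graph, and the approximate speeds are treated as exact values.

```python
# Figures copied from the summary statistics table (single 6-chunk test document).
RESULTS = {
    "Mistral 7B":   {"concepts": 32, "canonical_pct": 38, "s_per_chunk": 10},
    "Qwen 2.5 14B": {"concepts": 24, "canonical_pct": 92, "s_per_chunk": 15},
    "Qwen3 14B":    {"concepts": 57, "canonical_pct": 74, "s_per_chunk": 60},
    "GPT-OSS 20B":  {"concepts": 48, "canonical_pct": 65, "s_per_chunk": 20},
    "GPT-4o":       {"concepts": 46, "canonical_pct": 88, "s_per_chunk": 2},
}


def shortlist(min_canonical_pct=0, max_s_per_chunk=float("inf"), min_concepts=0):
    """Return the providers that meet every threshold, most concepts first."""
    matches = [
        name
        for name, r in RESULTS.items()
        if r["canonical_pct"] >= min_canonical_pct
        and r["s_per_chunk"] <= max_s_per_chunk
        and r["concepts"] >= min_concepts
    ]
    return sorted(matches, key=lambda name: RESULTS[name]["concepts"], reverse=True)


# Schema compliance first: at least 85% canonical.
print(shortlist(min_canonical_pct=85))                   # ['GPT-4o', 'Qwen 2.5 14B']
# Concept volume with a speed ceiling of 30 s/chunk.
print(shortlist(min_concepts=45, max_s_per_chunk=30))    # ['GPT-OSS 20B', 'GPT-4o']
```

Because the figures come from one document, treat the output as a ranking hint rather than a guarantee for your own corpus.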

What "canonical adherence" means for your graph

Each time a model emits a relationship type that is not in the canonical 30-type taxonomy, the fuzzy matcher tries to map it to the closest canonical type. When no good match exists, the system accepts the new type and adds it to the vocabulary (ADR-603), so non-canonical types accumulate as more documents are ingested. Estimated vocabulary after 100 documents:

Provider	Estimated relationship types	Canonical	Non-canonical
GPT-4o	~35	30	~5
Qwen 2.5 14B	~32	30	~2
Qwen3 14B	~40	30	~10
GPT-OSS 20B	~50	30	~20
Mistral 7B	~80	30	~50

Vocabulary that grows unboundedly makes queries less predictable and vocabulary consolidation more expensive.
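The correct-or-accept behaviour described above can be illustrated with a minimal sketch. It uses Python's `difflib` as a stand-in for the real fuzzy matcher, which this page does not specify. The type names, the 0.8 cutoff, and the `normalize_relationship_type` helper are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical subset of the canonical taxonomy; the real 30 types are not listed here.
CANONICAL_TYPES = ["IMPLIES", "SUPPORTS", "CONTRADICTS", "PART_OF", "CAUSES"]


def normalize_relationship_type(raw, vocabulary, cutoff=0.8):
    """Return (type, outcome) and grow `vocabulary` when nothing matches.

    outcome is "exact", "corrected", or "accepted_new".
    The cutoff value is an assumption, not the project's setting.
    """
    candidate = raw.strip().upper().replace(" ", "_").replace("-", "_")
    if candidate in vocabulary:
        return candidate, "exact"
    close = get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    if close:
        return close[0], "corrected"
    # No good match: accept the new type, which is how the vocabulary grows.
    vocabulary.append(candidate)
    return candidate, "accepted_new"


vocab = list(CANONICAL_TYPES)
print(normalize_relationship_type("supports", vocab))  # ('SUPPORTS', 'exact')
print(normalize_relationship_type("SUPPORT", vocab))   # ('SUPPORTS', 'corrected')
print(normalize_relationship_type("ENABLES", vocab))   # ('ENABLES', 'accepted_new')
print(len(vocab))                                      # 6
```

Each `accepted_new` outcome adds one entry to the vocabulary permanently, which is why a model with low canonical adherence inflates the type count as documents accumulate.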

Provider profiles

OpenAI GPT-4o

Extraction philosophy: balanced coverage with high schema compliance.

46 concepts, 172-edge graph, 88% canonical adherence
2 s/chunk — 30× faster than Qwen3 at 60 s/chunk
Cost: ~$0.10 per 6-chunk document ($10 per 100 documents)
Only 2 non-canonical relationship types in the test run

Best for production deployments, shared knowledge bases, and any workload where speed or schema quality cannot be compromised.

Ollama Qwen3 14B

Extraction philosophy: high concept volume with moderate schema compliance.

57 concepts — highest of all models tested
74% canonical adherence; 5 non-canonical types
60 s/chunk; fits in 16 GB VRAM (consumer GPU)
Zero inference cost after hardware is provisioned

Qwen3 (released April 2025) extracts about 2.4× as many concepts as Qwen 2.5 (57 vs 24) on identical hardware. The main trade-off is speed: about 60 seconds per chunk, versus 15 for Qwen 2.5.

Ollama Qwen 2.5 14B

Extraction philosophy: conservative, precise, canonical.

24 concepts, 98-edge graph, 92% canonical adherence — highest of all models tested
Only 1 non-canonical relationship type in the test run
15 s/chunk; fits in 16 GB VRAM
Zero inference cost

Best for large private corpora, sensitive documents, or any context where long-term schema cleanliness matters more than concept volume.

GPT-OSS 20B (Ollama)

Extraction philosophy: maximum relationship density.

48 concepts, 190-edge graph (the most relationships of any model tested)
65% canonical adherence; 6 non-canonical types
~20 s/chunk; requires 20 GB+ VRAM or CPU+GPU split
Zero inference cost

GPT-OSS is a reasoning model. Reasoning models generate extended internal analysis before producing output, which increases token usage and reduces output consistency for structured extraction tasks. An instruction model of equivalent or smaller size will usually produce better JSON output. Use GPT-OSS when relationship density is the priority and schema compliance can be traded away.

Ollama Mistral 7B

38% canonical adherence; 10 non-canonical types
Creates vocabulary pollution that compounds across documents
Sparse graph — many isolated concepts with zero relationships

Do not use Mistral 7B for production ingestion. If you need a local model with a small VRAM footprint, Qwen 2.5 14B on 16 GB is the better choice.

Cost and time at scale

1 000-document corpus (~6 000 chunks)

Provider	Time	Cost	Concepts	Canonical
GPT-4o	~3.3 hours	~$100	~46 000	88%
Qwen3 14B	~100 hours	$0	~57 000	74%
Qwen 2.5 14B	~25 hours	$0	~24 000	92%
GPT-OSS 20B	~33 hours	$0	~48 000	65%
Mistral 7B	~17 hours	$0	~32 000	38%
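The table values follow from simple linear scaling of the per-chunk and per-document figures. The sketch below reproduces that arithmetic so you can plug in your own corpus size. The `estimate` helper is illustrative, and it assumes 6 chunks per document and that concept counts scale linearly, as the table does.

```python
# Per-provider figures from the measured results (single 6-chunk test document).
PROVIDERS = {
    "GPT-4o":       {"s_per_chunk": 2,  "cost_per_doc": 0.10, "concepts_per_doc": 46},
    "Qwen3 14B":    {"s_per_chunk": 60, "cost_per_doc": 0.00, "concepts_per_doc": 57},
    "Qwen 2.5 14B": {"s_per_chunk": 15, "cost_per_doc": 0.00, "concepts_per_doc": 24},
    "GPT-OSS 20B":  {"s_per_chunk": 20, "cost_per_doc": 0.00, "concepts_per_doc": 48},
    "Mistral 7B":   {"s_per_chunk": 10, "cost_per_doc": 0.00, "concepts_per_doc": 32},
}


def estimate(provider, documents, chunks_per_doc=6):
    """Linear extrapolation of ingestion time, API cost, and concept count."""
    p = PROVIDERS[provider]
    seconds = documents * chunks_per_doc * p["s_per_chunk"]
    return {
        "hours": round(seconds / 3600, 1),
        "cost_usd": round(documents * p["cost_per_doc"], 2),
        "concepts": documents * p["concepts_per_doc"],
    }


for name in PROVIDERS:
    print(name, estimate(name, 1000))
# GPT-4o {'hours': 3.3, 'cost_usd': 100.0, 'concepts': 46000}
# Qwen3 14B {'hours': 100.0, 'cost_usd': 0.0, 'concepts': 57000}
# ...
```

The estimate ignores hardware, electricity, and parallelism, and real documents rarely average exactly 6 chunks, so adjust `chunks_per_doc` to match your corpus.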

Break-even by corpus size

Under 50 documents — GPT-4o costs under $5 and saves significant time. Start here.
50–500 documents — Local models become competitive. Choose Qwen 2.5 14B for canonical quality or Qwen3 14B for concept volume.
500+ documents — Local models are strongly recommended. GPT-4o at 500 documents costs ~$50.
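The guidance above can be written as a small lookup. This is a sketch only: the `recommend` helper and its `priority` values are hypothetical, and placing a corpus of exactly 500 documents in the top tier is an assumption, since the ranges above overlap at that point.

```python
def recommend(documents, priority="canonical"):
    """Map corpus size to the break-even guidance.

    priority: "canonical" (schema cleanliness) or "volume" (concept count).
    """
    if priority not in ("canonical", "volume"):
        raise ValueError("priority must be 'canonical' or 'volume'")
    local_model = "Qwen 2.5 14B" if priority == "canonical" else "Qwen3 14B"
    gpt4o_cost = round(documents * 0.10, 2)  # ~$0.10 per 6-chunk document

    if documents < 50:
        tier, provider = "start with GPT-4o", "GPT-4o"
    elif documents < 500:
        tier, provider = "local models competitive", local_model
    else:
        tier, provider = "local models strongly recommended", local_model
    return {"tier": tier, "provider": provider, "gpt4o_cost_usd": gpt4o_cost}


print(recommend(30))
# {'tier': 'start with GPT-4o', 'provider': 'GPT-4o', 'gpt4o_cost_usd': 3.0}
print(recommend(200, priority="volume"))
# {'tier': 'local models competitive', 'provider': 'Qwen3 14B', 'gpt4o_cost_usd': 20.0}
```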

Switch and test providers

Extraction configuration is hot-reloadable. No API restart is required when you change provider or model.

# Switch to GPT-4o
kg admin extraction set --provider openai --model gpt-4o

# Switch to Qwen3 14B (Ollama)
kg admin extraction set --provider ollama --model qwen3:14b

# Switch to Qwen 2.5 14B (Ollama)
kg admin extraction set --provider ollama --model qwen2.5:14b-instruct

# Verify active config
kg admin extraction config

Run a comparison yourself

Pick a document you know well, so you can judge whether the extracted concepts are sensible. A few pages of dense text is enough.

# Pull local models (one-time)
docker exec kg-ollama ollama pull qwen3:14b
docker exec kg-ollama ollama pull qwen2.5:14b-instruct

# Test GPT-4o
kg admin extraction set --provider openai --model gpt-4o
kg ontology delete "quality_test"                        # clear results from any earlier run
kg ingest file -o "quality_test" -y your-document.txt    # expect ~2 s/chunk
kg database stats

# Test Qwen3 14B
kg admin extraction set --provider ollama --model qwen3:14b
kg ontology delete "quality_test"
kg ingest file -o "quality_test" -y your-document.txt    # expect ~60 s/chunk
kg database stats

# Test Qwen 2.5 14B
kg admin extraction set --provider ollama --model qwen2.5:14b-instruct
kg ontology delete "quality_test"
kg ingest file -o "quality_test" -y your-document.txt    # expect ~15 s/chunk
kg database stats

Compare the stats output from each run: concept count, relationship count, unique relationship types, and canonical adherence.
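If you repeat the comparison often, the manual steps above can be wrapped in a short script. The sketch below only replays the same `kg` commands through `subprocess`. The helper names are illustrative, and it assumes `kg` is on your `PATH` and that each command behaves as it does when typed by hand.

```python
import subprocess

ONTOLOGY = "quality_test"

# (provider, model) pairs taken from the commands on this page.
CANDIDATES = [
    ("openai", "gpt-4o"),
    ("ollama", "qwen3:14b"),
    ("ollama", "qwen2.5:14b-instruct"),
]


def build_commands(provider, model, document, ontology=ONTOLOGY):
    """Return the four kg invocations for one provider, in execution order."""
    return [
        ["kg", "admin", "extraction", "set", "--provider", provider, "--model", model],
        ["kg", "ontology", "delete", ontology],
        ["kg", "ingest", "file", "-o", ontology, "-y", document],
        ["kg", "database", "stats"],
    ]


def run_comparison(document):
    """Run every candidate in turn and print the stats output for each."""
    for provider, model in CANDIDATES:
        print(f"=== {provider} / {model} ===")
        for argv in build_commands(provider, model, document):
            # check=False: the delete step may exit non-zero when the
            # ontology does not exist yet (an assumption, not verified).
            subprocess.run(argv, check=False)


# Example (requires a configured kg installation):
# run_comparison("your-document.txt")
```

The script does not parse the stats output, because its format is not documented on this page. Read the printed sections side by side instead.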

Related

Configure AI Providers: set API keys, switch providers, configure Ollama
Consolidate Vocabulary: repair schema drift from non-canonical types
ADR-806: local LLM inference architecture
ADR-603: relationship type acceptance and vocabulary growth policy