Ingesting Documents

How to add documents to your knowledge graph.

Overview

Ingestion is how documents become knowledge. When you ingest a document:

The system stores the original text
Splits it into manageable chunks
Extracts concepts from each chunk
Discovers relationships between concepts
Calculates grounding based on evidence

Supported Formats

Format	Extension	Notes
Plain text	`.txt`	Direct processing
Markdown	`.md`	Preserves structure
PDF	`.pdf`	Text extraction
Word	`.docx`	Text extraction
Web pages	URL	Fetches and processes

Using the CLI

Basic Ingestion

kg ingest /path/to/document.pdf

This creates a job that requires approval (to confirm cost estimate).

Auto-Approve

Skip the approval step:

kg ingest --auto-approve /path/to/document.pdf

Specify an Ontology

Organize documents into collections:

kg ingest --ontology "research-papers" /path/to/paper.pdf

If the ontology doesn't exist, it's created automatically.

Ingest Multiple Files

kg ingest --ontology "project-docs" doc1.md doc2.md doc3.pdf

Ingest a Directory

kg ingest --ontology "codebase" --recursive /path/to/docs/

Check Job Status

kg job list
kg job status <job-id>

Using the Web Interface

Navigate to Ingest in the top menu
Upload file(s) using the file picker or drag-and-drop
Select ontology (or create new)
Review cost estimate - shows expected tokens and cost
Approve to start processing
Monitor progress in the Jobs view

Using the API

curl -X POST "http://localhost:8000/ingest" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf" \
  -F "ontology=research"

Response includes a job ID for tracking.

What Happens During Ingestion

1. Document Storage

The original document is stored in Garage (object storage). You can always retrieve it later.

2. Chunking

Documents are split into chunks of roughly 1000 words, with overlap to preserve context across boundaries. This ensures: - Each chunk is small enough to process - Ideas that span page breaks aren't lost

3. Concept Extraction

Each chunk is analyzed by an AI model (GPT-4, Claude, etc.) to extract: - Concepts: The key ideas - Types: What kind of idea (claim, definition, entity, etc.) - Relationships: How concepts in this chunk relate

4. Matching

New concepts are compared to existing ones. If a concept already exists: - They're merged (same idea, more evidence) - Grounding increases (more sources confirm it)

If concepts conflict: - Both are kept - Contradiction is noted - Sources are preserved for both views

5. Grounding Calculation

After extraction, grounding scores are calculated: - How many sources mention this concept? - Do they agree or disagree? - How strong is the evidence?

Cost Estimation

Before processing, the system estimates: - Number of chunks - Expected tokens (input + output) - Approximate cost (based on your AI provider's pricing)

This is why jobs require approval by default - so you can review before incurring costs.

Ontologies

Ontologies are collections of related knowledge. Use them to: - Separate different topics (research vs meeting-notes) - Query specific domains - Control who can access what

# Create by ingesting with a new name
kg ingest --ontology "climate-research" paper1.pdf

# List ontologies
kg ontology list

# Query specific ontology
kg search --ontology "climate-research" "temperature effects"

Tips

Start Small

Ingest a few documents first to understand the output before processing large collections.

Use Meaningful Ontology Names

You'll query by ontology later. Names like "research-2024" are clearer than "stuff".

Review Extractions

After ingesting, search for concepts and verify they match what you expected. This helps you understand how the system interprets your documents.

Re-ingest if Needed

If extraction quality improves (new models, updated prompts), you can re-ingest documents. The system deduplicates based on content hashes.

Troubleshooting

Job Stuck in Pending

Approve it:

kg job approve <job-id>

Or use --auto-approve when ingesting.

Extraction Seems Wrong

Check which AI provider you're using:

kg config show

Different models have different extraction quality.

Out of Memory

Large documents with many chunks can exhaust memory. Try: - Splitting into smaller files - Reducing MAX_CONCURRENT_JOBS in configuration - Using a machine with more RAM

Next Steps

Exploring Knowledge - Navigate what you've ingested
Understanding Grounding - Interpret confidence scores