ADR-057 Appendix: Single-Stage vs Two-Stage Image Processing
Date: 2025-11-03
Related ADR: ADR-057a: Multimodal Image Ingestion
Question
Should we process images in one stage (image → concepts directly) or two stages (image → prose → concepts)?
Single-Stage: Image → [Vision LLM] → Concepts JSON
Two-Stage: Image → [Vision LLM] → Prose Markdown → [Extraction LLM] → Concepts JSON
Approach 1: Single-Stage (Image → Concepts Directly)
How It Works
async def ingest_image_single_stage(image_bytes: bytes, ontology: str):
    """
    Single LLM call: Image + visual context → Concepts JSON
    """
    # Generate image embedding (for similarity search)
    image_embedding = await nomic_vision.embed_image(image_bytes)

    # Search for visually similar images and build visual context
    similar_images = await search_similar_images(image_embedding, ontology)
    visual_context = await build_visual_context(similar_images)

    # Build prompt with visual context
    prompt = f"""
    Extract semantic concepts from this image for the "{ontology}" ontology.

    ## Visual Context
    This image is visually similar to:
    {format_visual_context(visual_context)}

    ## Task
    Analyze the image and extract concepts with relationships.

    Output JSON:
    {{
      "concepts": [
        {{
          "label": "concept name",
          "relationships": [
            {{"target": "other concept", "type": "IMPLIES", "strength": 0.9, "reason": "..."}}
          ]
        }}
      ]
    }}
    """

    # SINGLE CALL: Image → Concepts JSON
    concepts_json = await vision_backend.extract_concepts_from_image(
        image_bytes,
        prompt,
        format="json",
    )

    # Parse and upsert concepts
    concepts = parse_concepts_json(concepts_json)
    return await upsert_concepts(concepts, ontology)
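Both pipelines in this appendix lean on a few shared helpers (`search_similar_images`, `build_visual_context`, `format_visual_context`) that are not defined here. A minimal sketch of plausible shapes for them, assuming similar images come back with a label, source ID, and similarity score (all names and the 0.75 threshold are illustrative, not the actual implementation):

from dataclasses import dataclass

@dataclass
class SimilarImage:
    source_id: str
    label: str
    similarity: float  # cosine similarity between Nomic image embeddings

async def build_visual_context(similar_images: list[SimilarImage]) -> dict:
    # Keep only reasonably close matches; the 0.75 threshold is an assumption.
    matches = [img for img in similar_images if img.similarity >= 0.75]
    return {"similar_images": matches}

def format_visual_context(visual_context: dict) -> str:
    # Render the matches as a bulleted list for inclusion in the LLM prompt.
    lines = [
        f"- {img.label} (source {img.source_id}, similarity {img.similarity:.2f})"
        for img in visual_context["similar_images"]
    ]
    return "\n".join(lines) or "- (no visually similar images found)"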
Pros
- ✅ Faster: one LLM call instead of two (saves 3-10 seconds)
- ✅ Cheaper: one API call instead of two (~50% cost reduction)
- ✅ More direct: the vision model sees the image directly, no "telephone game"
- ✅ Simpler prompt flow: a single comprehensive instruction
- ✅ Less state management: no need to store intermediate prose
Cons
- ❌ No prose preservation: can't see "what the vision model said" for debugging
- ❌ Less flexible: can't adjust extraction separately from vision
- ❌ Harder to debug: if concepts are wrong, there is no intermediate prose to inspect
- ❌ Can't search prose: no full-text search on descriptions
- ❌ Monolithic: vision and extraction logic are coupled
- ❌ Re-extraction impossible: can't re-run concept extraction without re-processing the image
Approach 2: Two-Stage (Image → Prose → Concepts)
How It Works
async def ingest_image_two_stage(image_bytes: bytes, ontology: str):
    """
    Two LLM calls:
    1. Image → Prose description
    2. Prose + visual context → Concepts JSON
    """
    # Generate image embedding
    image_embedding = await nomic_vision.embed_image(image_bytes)

    # Search for visually similar images and build visual context
    similar_images = await search_similar_images(image_embedding, ontology)
    visual_context = await build_visual_context(similar_images)

    # STAGE 1: Image → Prose Description
    prose_prompt = """
    Analyze this image and describe it in markdown format.

    Include:
    - All visible text verbatim
    - Diagrams, charts, and visual elements
    - Relationships between elements
    - Structure (headings, lists, tables)

    Output pure markdown.
    """
    prose_description = await vision_backend.describe_image(
        image_bytes,
        prose_prompt,
    )

    # Store prose in Source node
    source = await create_source(
        full_text=prose_description,
        ontology=ontology,
    )

    # STAGE 2: Prose + Visual Context → Concepts
    concepts = await extract_and_upsert_concepts(
        text=prose_description,
        source_id=source.source_id,
        ontology=ontology,
        additional_context=visual_context,  # Visual context injection
    )
    return concepts
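For context, a minimal sketch of how this might be invoked from an ingestion job, assuming the image is read from disk (the file path, ontology name, and job wiring are illustrative):

import asyncio
from pathlib import Path

async def run_image_ingestion_job(image_path: str, ontology: str):
    # In production the bytes would come from the upload API or job payload.
    image_bytes = Path(image_path).read_bytes()
    concepts = await ingest_image_two_stage(image_bytes, ontology)
    print(f"Ingested {image_path} into '{ontology}': {concepts}")

if __name__ == "__main__":
    asyncio.run(run_image_ingestion_job("diagrams/parser-flowchart.png", "compilers"))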
Pros
- ✅ Prose preservation: the full description is stored in the Source node
- ✅ Debuggable: can see what the vision model "saw" in the image
- ✅ Flexible: can re-run concept extraction without re-processing the image
- ✅ Searchable: prose descriptions are full-text searchable
- ✅ Modular: vision and extraction logic are decoupled
- ✅ Consistent with the text pipeline: uses the same extraction logic
- ✅ Re-extraction possible: update the extraction prompt and re-run on existing prose
Cons
- ❌ Slower: two LLM calls instead of one (adds 3-10 seconds)
- ❌ More expensive: two API calls (~2× cost)
- ❌ Intermediate state: prose must be stored between stages
- ❌ Potential information loss: vision model → prose → concepts (two transformations)
- ❌ More complex: two prompts to maintain
Research: What Do Others Do?
GPT-4V / Vision API Patterns
OpenAI's GPT-4V documentation recommends:
"Vision models can directly answer questions about images or generate structured output. For complex extraction tasks, consider a two-step approach: first describe the image, then extract structured data from the description."
Common patterns observed:
1. Simple Q&A: single-stage (image → answer)
2. OCR/transcription: single-stage (image → text)
3. Structured extraction: two-stage (image → description → JSON)
4. Complex reasoning: two-stage (the description lets the model "think" before extracting)
Claude 3 / Vision Patterns
Anthropic's Claude 3.5 Sonnet documentation:
"For best results with complex visual analysis, use a chain-of-thought approach: Ask the model to first describe what it sees, then perform the extraction task."
Reasoning: The intermediate description acts as a "thinking step" that improves extraction quality.
LangChain / LlamaIndex Patterns
Both frameworks support both patterns, but recommend two-stage for:
- Knowledge extraction
- Document processing
- Multi-step reasoning

And single-stage for:
- Simple classification
- Direct Q&A
- Speed-critical applications
Industry Consensus
Two-stage is preferred when:
- Output quality matters more than speed
- Debugging is important
- Re-processing is expensive (large images, slow models)
- There is a need to audit what the vision model "saw"
- Building knowledge bases (our use case!)

Single-stage is preferred when:
- Latency is critical (real-time applications)
- Cost is the primary concern
- The task is simple (classification, simple extraction)
- There is no need for prose preservation
Our Use Case: Knowledge Graph Construction
What We Need
- High-quality concepts: Accuracy > speed
- Ground truth preservation: Must verify what was extracted
- Re-extraction capability: Update extraction logic without re-processing images
- Debugging: Need to see what vision model interpreted
- Consistency: Same extraction logic for text and images
- Search: Users should be able to search prose descriptions
What Matters Less
- Latency: Ingestion is asynchronous (job queue)
- Cost: Knowledge extraction is one-time, high-value
- Simplicity: System is already complex (worth it for flexibility)
Comparison Table
| Criterion | Single-Stage | Two-Stage | Winner |
|---|---|---|---|
| Speed | ~5-10s per image | ~10-20s per image | Single |
| Cost | 1× LLM call | 2× LLM calls | Single |
| Quality | Direct from image | Chain-of-thought | Two |
| Debuggability | No intermediate state | Prose visible | Two |
| Re-extraction | Must re-process image | Re-run on prose | Two |
| Search prose | N/A | Full-text search | Two |
| Consistency | Custom logic | Same as text | Two |
| Audit trail | Opaque | Transparent | Two |
| Flexibility | Monolithic | Modular | Two |
Score: Single-Stage: 2/9, Two-Stage: 7/9
Recommendation: Two-Stage with Option for Single-Stage
Default: Two-Stage
Use the two-stage approach by default because:
- Quality matters: We're building a knowledge base, not a real-time system
- Debugging essential: Users need to verify extraction accuracy
- Re-extraction valuable: Can improve extraction prompt without re-processing images
- Consistency: Uses same extraction logic as text ingestion
- Search: Prose descriptions add value for full-text search
Optional: Single-Stage Mode
Provide single-stage as an optional optimization for users who:
- Have cost constraints
- Need faster ingestion
- Trust the extraction quality
- Don't need prose descriptions
# config/ingestion.yaml
image_processing:
  # Processing mode
  mode: "two_stage"  # Options: two_stage (default), single_stage

  # Two-stage settings
  two_stage:
    preserve_prose: true
    prose_searchable: true
    enable_re_extraction: true

  # Single-stage settings (when mode=single_stage)
  single_stage:
    direct_concept_extraction: true
    no_intermediate_storage: true
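A sketch of how an ingestion worker might read this setting and dispatch to the matching pipeline; the loader below and its default path are assumptions, and `ingest_image_hybrid` is defined in the next section:

import yaml  # PyYAML

def load_image_processing_mode(path: str = "config/ingestion.yaml") -> str:
    with open(path) as f:
        config = yaml.safe_load(f)
    # Fall back to the two-stage default if the key is absent.
    return config.get("image_processing", {}).get("mode", "two_stage")

async def ingest_image(image_bytes: bytes, ontology: str):
    mode = load_image_processing_mode()
    return await ingest_image_hybrid(image_bytes, ontology, mode=mode)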
Hybrid Approach: Best of Both
async def ingest_image_hybrid(image_bytes: bytes, ontology: str, mode: str = "two_stage"):
    """
    Hybrid approach: Support both modes with shared infrastructure.
    """
    # Always generate image embedding (needed for both modes)
    image_embedding = await nomic_vision.embed_image(image_bytes)
    similar_images = await search_similar_images(image_embedding, ontology)
    visual_context = await build_visual_context(similar_images)

    if mode == "single_stage":
        # FAST PATH: Image → Concepts directly
        concepts = await vision_backend.extract_concepts_from_image(
            image_bytes,
            visual_context,
            ontology,
        )
        # Store minimal Source (no prose)
        source = await create_source(
            full_text="[Single-stage extraction - no prose description]",
            ontology=ontology,
            extraction_mode="single_stage",
        )
    else:  # mode == "two_stage" (default)
        # QUALITY PATH: Image → Prose → Concepts
        prose = await vision_backend.describe_image(image_bytes)

        # Store prose in Source
        source = await create_source(
            full_text=prose,
            ontology=ontology,
            extraction_mode="two_stage",
        )

        # Extract concepts from prose with visual context
        concepts = await extract_and_upsert_concepts(
            text=prose,
            source_id=source.source_id,
            ontology=ontology,
            additional_context=visual_context,
        )

    return concepts
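One way to sanity-check this trade-off on real data is to run the same image through both paths and diff the results. A rough evaluation sketch, assuming concepts come back as dicts with a `label` key (note that both calls upsert into the graph, so this belongs in a test environment):

from pathlib import Path

async def compare_modes(image_path: str, ontology: str):
    image_bytes = Path(image_path).read_bytes()
    single = await ingest_image_hybrid(image_bytes, ontology, mode="single_stage")
    two = await ingest_image_hybrid(image_bytes, ontology, mode="two_stage")
    single_labels = {c["label"] for c in single}
    two_labels = {c["label"] for c in two}
    print("Only in single-stage:", single_labels - two_labels)
    print("Only in two-stage:", two_labels - single_labels)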
Chain-of-Thought: Why Two-Stage Works Better
The "Thinking Step" Effect
When a vision model generates prose before extracting concepts, it performs implicit reasoning:
Single-Stage:
Image → [What concepts do I see?] → Concepts
(Direct leap from visual features to abstract concepts)
Two-Stage:
Image → [What do I see?] → Prose → [What concepts are in this text?] → Concepts
(Intermediate "thinking" step grounds the extraction)
Example: Flowchart Analysis
Single-Stage Output:
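Illustrative only; a single-stage call on this image typically returns generic visual concepts rather than the underlying algorithm, for example:

{
  "concepts": [
    {"label": "flowchart"},
    {"label": "process flow"}
  ]
}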
Two-Stage Output:

Prose: "This flowchart shows a recursive descent parser.
It starts with a 'parse' function that calls itself
when encountering nested structures..."

Concepts:

{
  "concepts": [
    {"label": "recursive descent parser"},
    {"label": "self-referential control flow"},
    {"label": "nested structure handling"}
  ]
}
Why is this better? The prose description forces the model to:
1. Identify the specific algorithm (not just "process flow")
2. Notice the recursion (from "calls itself")
3. Extract the domain-specific concept ("recursive descent parser")
Cost-Benefit Analysis
Two-Stage Costs (per image)
Time:
- Stage 1 (vision → prose): ~5-10s
- Stage 2 (prose → concepts): ~3-5s
- Total: 8-15s per image

Money (GPT-4o):
- Stage 1: $0.0075 (vision) + $0.0025 (text output) = $0.01
- Stage 2: $0.005 (text input) + $0.0025 (text output) = $0.0075
- Total: ~$0.0175 per image

For 1000 images: ~15,000 seconds (~4.2 hours) and ~$17.50.
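As a quick check of those totals, a small helper whose defaults are the per-image figures above (the helper itself is just for illustration):

def two_stage_batch_cost(images: int,
                         stage1_usd: float = 0.01, stage2_usd: float = 0.0075,
                         stage1_secs: float = 10, stage2_secs: float = 5):
    """Return (total_usd, total_hours) for a batch at the per-image rates above."""
    total_usd = images * (stage1_usd + stage2_usd)
    total_hours = images * (stage1_secs + stage2_secs) / 3600
    return round(total_usd, 2), round(total_hours, 1)

print(two_stage_batch_cost(1000))  # (17.5, 4.2): matches the 1000-image totals above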
Two-Stage Benefits
Debugging value:
- Inspect the prose for all 1000 images
- Identify systemic issues in the vision model's interpretation
- Fix the extraction prompt and re-run on prose (no re-processing)

Re-extraction value (see the sketch after this list):
- Update the concept extraction prompt (new relationship types, improved logic)
- Re-run on all 1000 images in ~1 hour (prose → concepts only)
- Saved: ~4.2 hours of vision processing and ~$10 in vision API costs

Search value:
- Prose descriptions are full-text searchable
- "Find images that mention 'recursive algorithms'"
- Returns images even if "recursive algorithms" wasn't extracted as a concept

Audit value:
- Users can verify: "Did the vision model see what I see?"
- Builds trust in extraction quality
- Identifies cases where the vision model misinterpreted the image
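A sketch of that re-extraction pass, assuming a hypothetical `list_sources` helper that returns the stored Source nodes for an ontology (the `extraction_mode` filter mirrors the field set in the hybrid pipeline above):

async def re_extract_ontology(ontology: str):
    """Re-run concept extraction over stored prose; the vision model is never called."""
    sources = await list_sources(ontology=ontology, extraction_mode="two_stage")
    for source in sources:
        # Stage 2 only: the prose saved at ingestion time is the input.
        await extract_and_upsert_concepts(
            text=source.full_text,
            source_id=source.source_id,
            ontology=ontology,
        )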
Break-Even Analysis
Two-stage pays for itself after:
- 1 re-extraction: saves ~50% of processing time/cost
- 1 debugging session: prose inspection saves hours of manual review
- 1 search query: prose search finds images that concept search misses
Conclusion: For knowledge base construction, two-stage is worth the cost.
Final Recommendation
Adopt Two-Stage as Default
Reasons:
1. ✅ Industry consensus for knowledge extraction
2. ✅ Quality > speed for our use case
3. ✅ Debugging essential for trust
4. ✅ Re-extraction saves time long-term
5. ✅ Prose search adds value
6. ✅ Consistent with the text pipeline
7. ✅ Chain-of-thought improves accuracy
Provide Single-Stage as Opt-In
For users who:
- Have tight cost constraints
- Need faster batch processing
- Trust extraction quality
- Don't need debugging
Implementation Priority
Phase 1 (MVP): Two-stage only
- Implement image → prose → concepts
- Prove the concept works
- Gather feedback on prose quality

Phase 2 (Optimization): Add single-stage option
- Implement direct image → concepts
- Make it configurable
- Let users choose based on their needs

Phase 3 (Intelligence): Adaptive mode (see the sketch below)
- Automatically choose the mode per image
- Use single-stage for simple images (screenshots, charts)
- Use two-stage for complex images (diagrams, dense documents)
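A possible shape for that adaptive selection; the `classify_image_complexity` helper and its labels are assumptions for illustration:

async def choose_processing_mode(image_bytes: bytes) -> str:
    # Hypothetical lightweight check (e.g. a cheap heuristic on file size or
    # detected text density) that labels the image "simple" or "complex".
    complexity = await classify_image_complexity(image_bytes)
    return "single_stage" if complexity == "simple" else "two_stage"

async def ingest_image_adaptive(image_bytes: bytes, ontology: str):
    mode = await choose_processing_mode(image_bytes)
    return await ingest_image_hybrid(image_bytes, ontology, mode=mode)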
Conclusion
We recommend the two-stage approach (image → prose → concepts) because:
- Higher quality: Chain-of-thought reasoning improves extraction
- Debuggable: Prose inspection enables quality verification
- Flexible: Re-extraction without re-processing
- Searchable: Full-text search on descriptions
- Consistent: Same extraction pipeline as text
- Industry standard: Aligned with OpenAI/Anthropic recommendations
The additional cost (~$0.01 per image, 5-10s of latency) is justified by:
- Improved extraction quality
- Long-term time savings from re-extraction
- Enhanced user trust through transparency
- Additional search capabilities
This aligns with our philosophy: high-quality knowledge extraction over speed optimization.