
ADR-057 Appendix: Single-Stage vs Two-Stage Image Processing

Date: 2025-11-03
Related ADR: ADR-057a: Multimodal Image Ingestion


Question

Should we process images in one stage (image → concepts directly) or two stages (image → prose → concepts)?

Single-Stage:  Image → [Vision LLM] → Concepts JSON

Two-Stage:     Image → [Vision LLM] → Prose Markdown
                           Prose → [Extraction LLM] → Concepts JSON

Approach 1: Single-Stage (Image → Concepts Directly)

How It Works

async def ingest_image_single_stage(image_bytes: bytes, ontology: str):
    """
    Single LLM call: Image + visual context → Concepts JSON
    """

    # Generate image embedding (for similarity search)
    image_embedding = await nomic_vision.embed_image(image_bytes)

    # Search for visually similar images and build the visual context
    similar_images = await search_similar_images(image_embedding, ontology)
    visual_context = await build_visual_context(similar_images)

    # Build prompt with visual context
    prompt = f"""
Extract semantic concepts from this image for the "{ontology}" ontology.

## Visual Context
This image is visually similar to:
{format_visual_context(visual_context)}

## Task
Analyze the image and extract concepts with relationships.

Output JSON:
{{
  "concepts": [
    {{
      "label": "concept name",
      "relationships": [
        {{"target": "other concept", "type": "IMPLIES", "strength": 0.9, "reason": "..."}}
      ]
    }}
  ]
}}
"""

    # SINGLE CALL: Image → Concepts JSON
    concepts_json = await vision_backend.extract_concepts_from_image(
        image_bytes,
        prompt,
        format="json"
    )

    # Parse and upsert concepts
    concepts = parse_concepts_json(concepts_json)
    return await upsert_concepts(concepts, ontology)
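
Both code paths in this appendix lean on `build_visual_context` and `format_visual_context`, which are not defined here. A minimal sketch of what they might look like, assuming `search_similar_images` returns records with a label, an ontology, and a similarity score (the field names are illustrative, not the actual schema):

async def build_visual_context(similar_images: list[dict]) -> list[dict]:
    """Keep only the fields the prompt needs from the similarity-search results."""
    # Hypothetical record shape: {"label": ..., "ontology": ..., "score": ...}
    return [
        {
            "label": img.get("label", "unlabeled image"),
            "ontology": img.get("ontology", "unknown"),
            "score": round(img.get("score", 0.0), 2),
        }
        for img in similar_images
    ]


def format_visual_context(visual_context: list[dict]) -> str:
    """Render the visual context as a markdown bullet list for the prompt."""
    if not visual_context:
        return "- (no visually similar images found)"
    return "\n".join(
        f"- {ctx['label']} (ontology: {ctx['ontology']}, similarity: {ctx['score']})"
        for ctx in visual_context
    )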

Pros

- ✅ Faster: One LLM call instead of two (save 3-10 seconds)
- ✅ Cheaper: One API call instead of two (~50% cost reduction)
- ✅ More direct: Vision model sees image directly, no "telephone game"
- ✅ Simpler prompt flow: Single comprehensive instruction
- ✅ Less state management: Don't need to store intermediate prose

Cons

- ❌ No prose preservation: Can't see "what the vision model said" for debugging
- ❌ Less flexible: Can't adjust extraction separately from vision
- ❌ Harder to debug: If concepts are wrong, can't see intermediate prose
- ❌ Can't search prose: No full-text search on descriptions
- ❌ Monolithic: Vision + extraction logic coupled
- ❌ Re-extraction impossible: Can't re-run concept extraction without re-processing image


Approach 2: Two-Stage (Image → Prose → Concepts)

How It Works

async def ingest_image_two_stage(image_bytes: bytes, ontology: str):
    """
    Two LLM calls:
    1. Image → Prose description
    2. Prose + visual context → Concepts JSON
    """

    # Generate image embedding
    image_embedding = await nomic_vision.embed_image(image_bytes)

    # Search for visually similar images and build the visual context
    similar_images = await search_similar_images(image_embedding, ontology)
    visual_context = await build_visual_context(similar_images)

    # STAGE 1: Image → Prose Description
    prose_prompt = """
Analyze this image and describe it in markdown format.

Include:
- All visible text verbatim
- Diagrams, charts, and visual elements
- Relationships between elements
- Structure (headings, lists, tables)

Output pure markdown.
"""

    prose_description = await vision_backend.describe_image(
        image_bytes,
        prose_prompt
    )

    # Store prose in Source node
    source = await create_source(
        full_text=prose_description,
        ontology=ontology
    )

    # STAGE 2: Prose + Visual Context → Concepts
    concepts = await extract_and_upsert_concepts(
        text=prose_description,
        source_id=source.source_id,
        ontology=ontology,
        additional_context=visual_context  # Visual context injection
    )

    return concepts

Pros

- ✅ Prose preservation: Full description stored in Source node
- ✅ Debuggable: Can see what vision model "saw" in the image
- ✅ Flexible: Can re-run concept extraction without re-processing image
- ✅ Searchable: Prose descriptions are full-text searchable
- ✅ Modular: Vision and extraction logic decoupled
- ✅ Consistent with text pipeline: Uses same extraction logic
- ✅ Re-extraction possible: Update extraction prompt, re-run on existing prose

Cons

- ❌ Slower: Two LLM calls instead of one (add 3-10 seconds)
- ❌ More expensive: Two API calls (~2× cost)
- ❌ Intermediate state: Need to store prose between stages
- ❌ Potential information loss: Vision model → prose → concepts (two transformations)
- ❌ More complex: Two prompts to maintain


Research: What Do Others Do?

GPT-4V / Vision API Patterns

OpenAI's GPT-4V documentation recommends:

"Vision models can directly answer questions about images or generate structured output. For complex extraction tasks, consider a two-step approach: first describe the image, then extract structured data from the description."

Common patterns observed:

1. Simple Q&A: Single-stage (image → answer)
2. OCR/Transcription: Single-stage (image → text)
3. Structured extraction: Two-stage (image → description → JSON)
4. Complex reasoning: Two-stage (description allows model to "think" before extracting)

Claude 3 / Vision Patterns

Anthropic's Claude 3.5 Sonnet documentation:

"For best results with complex visual analysis, use a chain-of-thought approach: Ask the model to first describe what it sees, then perform the extraction task."

Reasoning: The intermediate description acts as a "thinking step" that improves extraction quality.

LangChain / LlamaIndex Patterns

Both frameworks support both patterns, but recommend two-stage for:

- Knowledge extraction
- Document processing
- Multi-step reasoning

And single-stage for:

- Simple classification
- Direct Q&A
- Speed-critical applications

Industry Consensus

Two-stage is preferred when:

- Output quality matters more than speed
- Debugging is important
- Re-processing is expensive (large images, slow models)
- Need to audit what the vision model "saw"
- Building knowledge bases (our use case!)

Single-stage is preferred when:

- Latency is critical (real-time applications)
- Cost is primary concern
- Task is simple (classification, simple extraction)
- No need for prose preservation


Our Use Case: Knowledge Graph Construction

What We Need

  1. High-quality concepts: Accuracy > speed
  2. Ground truth preservation: Must verify what was extracted
  3. Re-extraction capability: Update extraction logic without re-processing images
  4. Debugging: Need to see what vision model interpreted
  5. Consistency: Same extraction logic for text and images
  6. Search: Users should be able to search prose descriptions

What Matters Less

  1. Latency: Ingestion is asynchronous (job queue)
  2. Cost: Knowledge extraction is one-time, high-value
  3. Simplicity: System is already complex (worth it for flexibility)

Comparison Table

| Criterion | Single-Stage | Two-Stage | Winner |
|---|---|---|---|
| Speed | ~5-10s per image | ~10-20s per image | Single |
| Cost | 1× LLM call | 2× LLM calls | Single |
| Quality | Direct from image | Chain-of-thought | Two |
| Debuggability | No intermediate state | Prose visible | Two |
| Re-extraction | Must re-process image | Re-run on prose | Two |
| Search prose | N/A | Full-text search | Two |
| Consistency | Custom logic | Same as text | Two |
| Audit trail | Opaque | Transparent | Two |
| Flexibility | Monolithic | Modular | Two |

Score: Single-Stage: 2/9, Two-Stage: 7/9


Recommendation: Two-Stage with Option for Single-Stage

Default: Two-Stage

Use two-stage approach by default because:

  1. Quality matters: We're building a knowledge base, not a real-time system
  2. Debugging essential: Users need to verify extraction accuracy
  3. Re-extraction valuable: Can improve extraction prompt without re-processing images
  4. Consistency: Uses same extraction logic as text ingestion
  5. Search: Prose descriptions add value for full-text search

Optional: Single-Stage Mode

Provide single-stage as an optional optimization for users who:

- Have cost constraints
- Need faster ingestion
- Trust the extraction quality
- Don't need prose descriptions

# config/ingestion.yaml

image_processing:
  # Processing mode
  mode: "two_stage"  # Options: two_stage (default), single_stage

  # Two-stage settings
  two_stage:
    preserve_prose: true
    prose_searchable: true
    enable_re_extraction: true

  # Single-stage settings (when mode=single_stage)
  single_stage:
    direct_concept_extraction: true
    no_intermediate_storage: true
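
As a rough illustration of how this configuration might drive the pipeline, here is a minimal loader sketch assuming PyYAML and the file layout shown above (the loader itself is hypothetical, not existing code):

import yaml


def load_image_processing_config(path: str = "config/ingestion.yaml") -> dict:
    """Read the image_processing block and validate the processing mode."""
    with open(path) as f:
        config = yaml.safe_load(f)
    image_cfg = config.get("image_processing", {})
    mode = image_cfg.get("mode", "two_stage")
    if mode not in ("two_stage", "single_stage"):
        raise ValueError(f"Unknown image_processing.mode: {mode}")
    return image_cfg


# Usage (with the hybrid function shown below):
#   cfg = load_image_processing_config()
#   concepts = await ingest_image_hybrid(image_bytes, ontology, mode=cfg["mode"])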

Hybrid Approach: Best of Both

async def ingest_image_hybrid(image_bytes: bytes, ontology: str, mode: str = "two_stage"):
    """
    Hybrid approach: Support both modes with shared infrastructure.
    """

    # Always generate image embedding (needed for both modes)
    image_embedding = await nomic_vision.embed_image(image_bytes)
    similar_images = await search_similar_images(image_embedding, ontology)
    visual_context = await build_visual_context(similar_images)

    if mode == "single_stage":
        # FAST PATH: Image → Concepts directly
        concepts = await vision_backend.extract_concepts_from_image(
            image_bytes,
            visual_context,
            ontology
        )

        # Store minimal Source (no prose)
        source = await create_source(
            full_text="[Single-stage extraction - no prose description]",
            ontology=ontology,
            extraction_mode="single_stage"
        )

        # Upsert directly; there is no stored prose to re-extract from later
        concepts = await upsert_concepts(concepts, ontology)

    else:  # mode == "two_stage" (default)
        # QUALITY PATH: Image → Prose → Concepts
        prose = await vision_backend.describe_image(image_bytes)

        # Store prose in Source
        source = await create_source(
            full_text=prose,
            ontology=ontology,
            extraction_mode="two_stage"
        )

        # Extract concepts from prose with visual context
        concepts = await extract_and_upsert_concepts(
            text=prose,
            source_id=source.source_id,
            ontology=ontology,
            additional_context=visual_context
        )

    return concepts

Chain-of-Thought: Why Two-Stage Works Better

The "Thinking Step" Effect

When a vision model generates prose before extracting concepts, it performs implicit reasoning:

Single-Stage:
  Image → [What concepts do I see?] → Concepts
  (Direct leap from visual features to abstract concepts)

Two-Stage:
  Image → [What do I see?] → Prose → [What concepts are in this text?] → Concepts
  (Intermediate "thinking" step grounds the extraction)

Example: Flowchart Analysis

Single-Stage Output:

{
  "concepts": [
    {"label": "process flow"},
    {"label": "decision point"}
  ]
}

Two-Stage Output:

Prose: "This flowchart shows a recursive descent parser.
        It starts with a 'parse' function that calls itself
        when encountering nested structures..."

Concepts:
{
  "concepts": [
    {"label": "recursive descent parser"},
    {"label": "self-referential control flow"},
    {"label": "nested structure handling"}
  ]
}

Why better? The prose description forces the model to:

1. Identify the specific algorithm (not just "process flow")
2. Notice recursion (from "calls itself")
3. Extract the domain-specific concept ("recursive descent parser")


Cost-Benefit Analysis

Two-Stage Costs (per image)

Time:

- Stage 1 (vision → prose): ~5-10s
- Stage 2 (prose → concepts): ~3-5s
- Total: ~8-15s per image

Money (GPT-4o):

- Stage 1: $0.0075 (vision) + $0.0025 (text output) = $0.01
- Stage 2: $0.005 (text input) + $0.0025 (text output) = $0.0075
- Total: ~$0.0175 per image

For 1000 images: 15,000 seconds (4.2 hours), $17.50
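
These batch figures are plain arithmetic over the assumed per-image estimates; a quick sanity check:

# Back-of-the-envelope batch estimate using the assumed per-image figures above.
IMAGES = 1000
SECONDS_PER_IMAGE = 15      # upper end of the 8-15s range
COST_PER_IMAGE = 0.0175     # stage 1 (~$0.01) + stage 2 (~$0.0075)

total_seconds = IMAGES * SECONDS_PER_IMAGE
total_hours = total_seconds / 3600
total_cost = IMAGES * COST_PER_IMAGE

print(f"{total_seconds:,} seconds (~{total_hours:.1f} hours), ${total_cost:.2f}")
# -> 15,000 seconds (~4.2 hours), $17.50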

Two-Stage Benefits

Debugging value:

- Can inspect prose for all 1000 images
- Identify systemic issues in vision model interpretation
- Fix extraction prompt, re-run on prose (no re-processing)

Re-extraction value (a sketch follows below):

- Update concept extraction prompt (new relationship types, improved logic)
- Re-run on all 1000 images in ~1 hour (prose → concepts only)
- Saved: ~3 hours of vision processing + $10 in vision API costs
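
A sketch of what a re-extraction pass could look like: it re-runs the same `extract_and_upsert_concepts` used by the text pipeline over stored prose, never touching the vision model. `list_sources_with_prose` is an assumed helper, not existing code:

async def re_extract_ontology(ontology: str) -> int:
    """Re-run concept extraction over stored prose (no vision calls)."""
    updated = 0
    sources = await list_sources_with_prose(ontology)  # assumed helper
    for source in sources:
        await extract_and_upsert_concepts(
            text=source.full_text,        # prose captured in stage 1
            source_id=source.source_id,
            ontology=ontology,
        )
        updated += 1
    return updated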

Search value:

- Prose descriptions are full-text searchable
- "Find images that mention 'recursive algorithms'"
- Returns images even if "recursive algorithms" wasn't extracted as a concept
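
To make the search claim concrete, a deliberately minimal prose-search sketch (an in-memory scan for illustration only; a real deployment would use a full-text index, and `list_sources_with_prose` is the same assumed helper as above):

async def search_prose(query: str, ontology: str) -> list[str]:
    """Return source_ids whose stored prose mentions the query string."""
    sources = await list_sources_with_prose(ontology)  # assumed helper
    needle = query.lower()
    return [
        source.source_id
        for source in sources
        if needle in source.full_text.lower()
    ]

# e.g. await search_prose("recursive algorithms", ontology)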

Audit value:

- Users can verify: "Did the vision model see what I see?"
- Builds trust in extraction quality
- Identifies cases where vision model misinterpreted image

Break-Even Analysis

Two-stage pays for itself after:

- 1 re-extraction: Saves ~50% of processing time/cost
- 1 debugging session: Prose inspection saves hours of manual review
- 1 search query: Prose search finds images that concept search misses

Conclusion: For knowledge base construction, two-stage is worth the cost.


Final Recommendation

Adopt Two-Stage as Default

Reasons:

1. ✅ Industry consensus for knowledge extraction
2. ✅ Quality > speed for our use case
3. ✅ Debugging essential for trust
4. ✅ Re-extraction saves time long-term
5. ✅ Prose search adds value
6. ✅ Consistent with text pipeline
7. ✅ Chain-of-thought improves accuracy

Provide Single-Stage as Opt-In

For users who:

- Have tight cost constraints
- Need faster batch processing
- Trust extraction quality
- Don't need debugging

Implementation Priority

Phase 1 (MVP): Two-stage only

- Implement image → prose → concepts
- Prove the concept works
- Gather feedback on prose quality

Phase 2 (Optimization): Add single-stage option

- Implement direct image → concepts
- Make it configurable
- Let users choose based on their needs

Phase 3 (Intelligence): Adaptive mode (heuristic sketch below)

- Automatically choose mode per image
- Use single-stage for simple images (screenshots, charts)
- Use two-stage for complex images (diagrams, dense documents)
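
Phase 3 could start as a simple heuristic dispatcher; one hypothetical sketch (the signals and thresholds are placeholders, not tuned values):

def choose_mode(width: int, height: int, estimated_text_blocks: int) -> str:
    """Return "single_stage" for simple images, "two_stage" otherwise."""
    # Placeholder heuristics: small images with little detected text count as "simple".
    is_small = width * height < 500_000       # roughly below 0.5 megapixels
    is_sparse = estimated_text_blocks <= 3    # few text regions detected upstream
    if is_small and is_sparse:
        return "single_stage"
    return "two_stage"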


Conclusion

We recommend the two-stage approach (image → prose → concepts) because:

  1. Higher quality: Chain-of-thought reasoning improves extraction
  2. Debuggable: Prose inspection enables quality verification
  3. Flexible: Re-extraction without re-processing
  4. Searchable: Full-text search on descriptions
  5. Consistent: Same extraction pipeline as text
  6. Industry standard: Aligned with OpenAI/Anthropic recommendations

The additional cost (~$0.0075-0.01 per image and 3-10s of added latency for the second stage) is justified by:

- Improved extraction quality
- Long-term time savings from re-extraction
- Enhanced user trust through transparency
- Additional search capabilities

This aligns with our philosophy: high-quality knowledge extraction over speed optimization.