Skip to content

ADR-071a: Parallel Graph Query Implementation Findings

Status: Accepted Date: 2025-12-04 Parent: ADR-071: Parallel Graph Query Optimization

Context

This subletter documents the actual implementation of ADR-071's parallel graph query system and the critical performance discoveries made during testing.

Implementation Environment: - System: 32-core machine - Database: PostgreSQL + Apache AGE - Test workload: Polarity axis analysis (Modern Operating Model ↔ Traditional Operating Model) - Dataset: 40 1-hop neighbors, ~75 2-hop neighbors - Baseline: 4:21 (261 seconds) for max_hops=2 query

Implementation

Critical Bug Fix

The initial implementation failed with "graph 'concept' does not exist" errors. The root cause:

Broken Code (manual query wrapping):

query = f"""
    SELECT * FROM ag_catalog.cypher('concept', $$
        MATCH (seed:Concept)-[]-(neighbor:Concept)
        WHERE seed.concept_id IN $seed_ids
        RETURN DISTINCT neighbor.concept_id as concept_id
    $$) as (concept_id agtype);
"""

Problem: Manually wrapping queries with ag_catalog.cypher('concept', ...) used wrong graph name.

Fix (use AGEClient._execute_cypher):

# Plain Cypher query - AGEClient._execute_cypher() handles wrapping
query = f"""
    MATCH (seed:Concept)-[]-(neighbor:Concept)
    WHERE seed.concept_id IN $seed_ids
    RETURN DISTINCT neighbor.concept_id as concept_id
    LIMIT {self.config.per_worker_limit}
"""

results = self.client._execute_cypher(query, params={'seed_ids': seed_ids})

Impact: This single fix enabled the 3x speedup. The parallelization infrastructure works correctly, but the speedup primarily comes from batched query optimization, not from parallelization itself.

Configuration Changes

AGEClient Connection Pool:

# Increased from 10 to support parallel workers
self.pool = psycopg2.pool.SimpleConnectionPool(
    1,   # minconn
    20,  # maxconn (supports up to 16 parallel workers + buffer)
    ...
)

GraphParallelizer Configuration:

config = ParallelQueryConfig(
    max_workers=8,      # ThreadPoolExecutor size
    chunk_size=100,     # Optimal: 1-2 workers (see findings)
    timeout_seconds=120.0,
    per_worker_limit=max_candidates * 2
)

Performance Testing Results

Complete Performance Curve

Tested configurations from 1 to 8 workers on clean system (no CPU contention):

Workers Chunk Size Phase 1 Phase 2 Total Success Speedup Overhead
Baseline N/A - ~180s 4:21 (261s) - 1.0x -
1 worker 100 41.5s 40.8s 1:23 (83s) 100% 3.15x 0s
2 workers 20 41.4s 43.3s 1:25 (85s) 100% 3.07x +2s
4 workers 10 41.7s 80.5s 2:02 (122s) 100% 2.14x +39s
8 workers 5 41.8s 162s 3:24 (204s) 50% 1.28x +79s + timeouts

Phase Breakdown: - Phase 1 (1-hop): ~41-42s (consistent across all configs, single batched query) - Phase 2 (2-hop): 41s (1 worker) → 43s (2 workers) → 80s (4 workers) → 162s (8 workers)

Key Findings

🎯 Discovery #1: Speedup source is batched queries, NOT parallelization

The 3x speedup comes from: 1. ✅ Fixing the query format to use _execute_cypher() correctly 2. ✅ Batched IN clause queries instead of individual queries per concept 3. ❌ NOT from parallelization - parallelization adds overhead

Evidence: - 1 worker (sequential): 83 seconds - 2 workers (parallel): 85 seconds (+2s) - Difference: Only 2 seconds faster with parallelization

🎯 Discovery #2: 1 worker is optimal for this workload

  • 1 worker: 83s, zero overhead, simplest implementation
  • 2 workers: 85s, +2s overhead (negligible)
  • 4 workers: 122s, +39s overhead (significant)
  • 8 workers: 204s, +79s overhead + timeouts (catastrophic)

Why more workers hurt performance: 1. Per-worker overhead - Thread spawning, coordination, context switching 2. Fixed query costs - Connection acquisition, query planning, setup 3. Database contention - Multiple concurrent queries compete for locks 4. Diminishing returns - Amdahl's Law in action

🎯 Discovery #3: Worker overhead scales badly

Phase 2 overhead vs number of workers: - 1 worker: 41s (baseline) - 2 workers: 43s (+2s, 5% overhead) - 4 workers: 80s (+39s, 95% overhead) - 8 workers: 162s (+121s, 295% overhead + timeouts)

🎯 Discovery #4: Timeout mechanism validates correctly

With 8 workers and CPU contention: - 4/8 chunks completed successfully - 4/8 chunks timed out after exactly 120s - System gracefully returned partial results - No hangs, no crashes, no data corruption

Decision

Optimal Configuration: 1-2 workers

After comprehensive testing, we recommend:

Option A: 1 worker (simplest, fastest)

config = ParallelQueryConfig(
    max_workers=8,
    chunk_size=100,  # Forces 1 worker, sequential Phase 2
    timeout_seconds=120.0,
    per_worker_limit=max_candidates * 2
)

Option B: 2 workers (minimal overhead, nice symmetry)

config = ParallelQueryConfig(
    max_workers=8,
    chunk_size=20,   # Creates 2 workers for typical 40-concept datasets
    timeout_seconds=120.0,
    per_worker_limit=max_candidates * 2
)

Rationale: - Both deliver ~3x speedup over broken baseline - 1 worker eliminates all parallelization complexity - 2 workers adds only 2s overhead, validates parallel infrastructure - 4+ workers add significant overhead without proportional benefit - Simpler = fewer failure modes, easier debugging, better maintainability

Consequences

Positive

  1. 3x speedup achieved - 4:21 → 1:23 (261s → 83s)
  2. Simple implementation - 1 worker eliminates parallelization complexity
  3. Root cause identified - Query format bug, not algorithm design
  4. Robust infrastructure - Parallel system works, validated with timeout testing
  5. Scalability path - Can increase workers for larger datasets if needed
  6. Performance baseline - Comprehensive testing establishes performance curve

Negative

  1. Parallelization infrastructure underutilized - Built for 8+ workers, optimal is 1-2
  2. Design assumptions wrong - ADR-071 predicted 160x speedup from parallelization, actual speedup is from batched queries
  3. Overhead higher than expected - Worker overhead grows faster than linear

Neutral

  1. Configuration tunability - chunk_size controls worker count, but optimal is fixed
  2. Connection pool increase - Increased from 10 → 20, but only need 2-3
  3. Timeout mechanism - Validated and working, but not needed at optimal config

Alternatives Considered

Alternative 1: Remove parallelization entirely

Approach: Delete GraphParallelizer, just use batched queries

Pros: - Simplest possible code - No thread management, no semaphores, no timeouts - Identical performance to 1-worker config

Cons: - Loses scalability path for larger datasets - Loses timeout safety mechanism - Wastes implementation effort

Decision: Keep parallelizer with optimal 1-2 worker config. Infrastructure is valuable for: - Future large-scale queries (100+ concepts) - Timeout safety mechanism - Demonstrates due diligence in optimization

Alternative 2: Adaptive worker count

Approach: Dynamically choose worker count based on dataset size

# Adaptive configuration
def get_optimal_workers(num_concepts):
    if num_concepts < 50:
        return 1  # Sequential is fastest
    elif num_concepts < 100:
        return 2  # Minimal overhead
    else:
        return 4  # More parallelism for large datasets

Pros: - Optimal performance across different dataset sizes - Handles both small and large queries

Cons: - Increased complexity - Untested for large datasets - Premature optimization

Decision: Defer until we have real-world large dataset use cases. Current config is simple and optimal for observed workloads.

Alternative 3: Database-level parallelization

Approach: Investigate PostgreSQL parallel query execution instead of application-level parallelization

Pros: - Database engine handles parallelization - No application-level thread management - May be more efficient

Cons: - AGE Cypher queries are function calls, not parallelizable by PostgreSQL - Would require changes to AGE extension itself - Out of scope for application-level optimization

Decision: Not feasible with current AGE architecture.

Implementation Notes

Files Modified

  1. api/api/lib/graph_parallelizer.py (409 lines)
  2. Fixed query format to use _execute_cypher()
  3. Removed manual ag_catalog.cypher() wrapping
  4. Removed unnecessary json import

  5. api/api/lib/polarity_axis.py

  6. Added discover_candidate_concepts_parallel() function
  7. Updated analyze_polarity_axis() with use_parallel parameter
  8. Configured ParallelQueryConfig with optimal settings

  9. api/api/models/queries.py

  10. Added use_parallel: bool field to PolarityAxisRequest

  11. api/api/routes/queries.py

  12. Passed use_parallel parameter to analysis function

  13. api/api/lib/age_client.py

  14. Increased connection pool: maxconn=10maxconn=20

  15. scripts/development/manual-tests/test_parallel_performance.py

  16. Performance testing script for future benchmarking

Testing Performed

Test Environment: - Clean system (no competing workload) - 32-core machine - PostgreSQL + AGE running in Docker - Test poles: Modern Operating Model ↔ Traditional Operating Model

Configurations Tested: - ✅ 1 worker (chunk_size=100): 1:23, 100% success - ✅ 2 workers (chunk_size=20): 1:25, 100% success - ✅ 4 workers (chunk_size=10): 2:02, 100% success - ✅ 8 workers (chunk_size=5): 3:24, 50% success (timeouts)

Stress Testing: - CPU contention (kernel compilation running): Validated graceful degradation - Timeout mechanism: Validated with 8-worker config - Connection pool: No exhaustion observed with 20-connection pool

  • ADR-071: Parallel Graph Query Optimization (parent) - Original design document
  • ADR-048: GraphQueryFacade - Namespace-safe query interface (not used, went with _execute_cypher)
  • ADR-043: Resource Management - VRAM/CPU contention handling (informed testing approach)

Lessons Learned

  1. Profile before parallelizing - The bottleneck was broken query format, not sequential execution
  2. Measure, don't assume - Design predicted 160x from parallelization, actual gain is from batching
  3. Simple is fast - 1 worker (sequential) beats complex parallelization
  4. Amdahl's Law is real - Overhead grows faster than parallelization benefit
  5. Infrastructure has value - Even if underutilized, timeout safety and scalability path justify keeping GraphParallelizer

Recommendations

For Production

  1. Use 2-worker configuration for balance of:
  2. Near-optimal performance (only 2s slower than 1 worker)
  3. Validates parallel infrastructure is working
  4. Provides scalability path for larger datasets

  5. Monitor Phase 2 timing to detect performance regressions

  6. Consider 1-worker config if simplicity > 2s performance difference

For Future Work

  1. Test with larger datasets (100+ concepts) to validate if more workers help at scale
  2. Profile database-level bottlenecks to understand why larger chunks are faster
  3. Investigate query plan caching to reduce per-query overhead
  4. Consider adaptive worker count based on dataset size (if large datasets become common)

Database Tuning Research

After establishing the optimal application-level configuration (2 workers), we researched PostgreSQL and Apache AGE database-level optimizations to understand potential incremental improvements.

Current Configuration (32-core, 123GB RAM system)

PostgreSQL Memory Settings: - shared_buffers: 128MB (default) - effective_cache_size: 4GB - work_mem: 4MB (default) - maintenance_work_mem: 64MB

Parallelism Settings: - max_worker_processes: 8 - max_parallel_workers: 8 - max_parallel_workers_per_gather: 2

Current Usage: 44MB / 123GB (0.03%)

Research Findings

Key Insight: All sources confirm that indexing strategy is the highest-impact optimization. Apache AGE does NOT auto-create indexes, and graph performance depends heavily on proper indexing of node/edge properties.

Recommended Configuration for Production:

# Memory Settings (25-33% of 123GB RAM)
shared_buffers = 32GB              # Currently: 128MB
effective_cache_size = 75GB        # Currently: 4GB
work_mem = 256MB                   # Currently: 4MB (for graph traversals)
maintenance_work_mem = 2GB         # Currently: 64MB
huge_pages = on                    # CRITICAL for 32GB+ shared_buffers

# Parallelism (32-core system)
max_worker_processes = 32          # Currently: 8
max_parallel_workers = 32          # Currently: 8
max_parallel_workers_per_gather = 16  # Currently: 2

Expected Performance Impact: - Indexing improvements: 2-3x speedup for graph traversals (highest priority) - Memory tuning: 10-15% improvement for complex queries - Parallelism tuning: Diminishing returns beyond 12 workers (validates ADR-071a findings) - Huge pages: Can reduce CPU overhead from 51% to 15% (source: PostgreSQL community benchmarks)

Critical Missing Indexes

Research revealed that AGE requires explicit indexing:

-- Node property indexes (for MATCH filtering)
CREATE INDEX idx_concept_label ON ag_catalog.concept_vertex
  USING btree ((properties->>'label'));
CREATE INDEX idx_concept_id ON ag_catalog.concept_vertex
  USING btree ((properties->>'concept_id'));

-- Relationship indexes (for traversal)
CREATE INDEX idx_edge_start_end ON ag_catalog.concept_edge
  USING btree (start_id, end_id);
CREATE INDEX idx_edge_type ON ag_catalog.concept_edge
  USING btree ((properties->>'type'));

Additional Optimization Strategies

  1. Connection Pooling (PgBouncer)
  2. Industry standard for PostgreSQL production deployments
  3. Transaction pooling mode recommended
  4. Target: 5-10 active connections even with high concurrency

  5. Query Optimization

  6. Use PROFILE to analyze Cypher query execution plans
  7. Filter early in MATCH clauses (reduces intermediate results)
  8. Use explicit node labels (:Concept, :Source) - aligns with ADR-048
  9. Minimize OPTIONAL MATCH usage (generates large intermediate results)

  10. Prepared Statements

  11. Parse and optimize once, execute many times
  12. Particularly effective for ingestion pipelines
  13. Already supported via psycopg2 in age_client.py

Alignment with ADR-071a Performance Testing

Database tuning research confirms our empirical findings: - Parallelism shows diminishing returns beyond 10-12 workers - Application-level batching provides bigger gains than database parallelism - Proper query structure (early filtering, explicit labels) more important than raw resources

Conclusion: The 3x speedup from fixing query format validates that query optimization > resource tuning. Database configuration improvements would provide incremental 10-15% gains, not transformative performance changes.

Deferred Items

The following optimizations are documented but not implemented (await production deployment or larger datasets): - PostgreSQL memory configuration tuning - Huge pages enablement - Explicit graph indexing strategy - PgBouncer connection pooling - Query plan profiling and optimization

These should be prioritized when: - Graph exceeds 100K concepts (indexing becomes critical) - Multi-user production deployment (connection pooling needed) - Query performance degrades (profile and optimize)

Conclusion

The implementation of ADR-071 successfully achieved a 3.15x speedup (4:21 → 1:23), but the source of the speedup was different than designed.

Key Insight: The performance gain comes from fixing the query format to use proper batched queries, not from parallelization. The parallel infrastructure works correctly and provides value (timeout safety, scalability path), but 1-2 workers is optimal for current workloads.

This is a valuable lesson in "make it work, make it right, make it fast" - we made it work (fixed query), and it's already fast. The parallelization exploration was valuable for understanding performance characteristics and establishing that simpler (sequential) is better.

Final Configuration: 2 workers (chunk_size=20), delivering 3.07x speedup with 100% reliability and minimal complexity.