ADR-071a: Parallel Graph Query Implementation Findings
Status: Accepted
Date: 2025-12-04
Parent: ADR-071: Parallel Graph Query Optimization
Context
This sub-ADR documents the actual implementation of ADR-071's parallel graph query system and the critical performance discoveries made during testing.
Implementation Environment:
- System: 32-core machine
- Database: PostgreSQL + Apache AGE
- Test workload: Polarity axis analysis (Modern Operating Model ↔ Traditional Operating Model)
- Dataset: 40 1-hop neighbors, ~75 2-hop neighbors
- Baseline: 4:21 (261 seconds) for max_hops=2 query
Implementation
Critical Bug Fix
The initial implementation failed with "graph 'concept' does not exist" errors. The root cause:
Broken Code (manual query wrapping):
query = f"""
SELECT * FROM ag_catalog.cypher('concept', $$
MATCH (seed:Concept)-[]-(neighbor:Concept)
WHERE seed.concept_id IN $seed_ids
RETURN DISTINCT neighbor.concept_id as concept_id
$$) as (concept_id agtype);
"""
Problem: Manually wrapping queries in ag_catalog.cypher('concept', ...) hard-coded a graph name that does not exist. AGEClient manages the correct graph name internally, so the fix delegates the wrapping to it.
Fix (use AGEClient._execute_cypher):
```python
# Plain Cypher query - AGEClient._execute_cypher() handles wrapping
query = f"""
    MATCH (seed:Concept)-[]-(neighbor:Concept)
    WHERE seed.concept_id IN $seed_ids
    RETURN DISTINCT neighbor.concept_id as concept_id
    LIMIT {self.config.per_worker_limit}
"""
results = self.client._execute_cypher(query, params={'seed_ids': seed_ids})
```
Impact: This single fix enabled the 3x speedup. The parallelization infrastructure works correctly, but the speedup primarily comes from batched query optimization, not from parallelization itself.
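For illustration, here is a minimal sketch of the batching pattern that produces the gain: one IN-clause query replaces N per-seed round trips, paying connection and planning costs once instead of N times. The helper name fetch_neighbors_batched is hypothetical; client._execute_cypher is assumed to behave as shown above.

```python
from typing import Any, Dict, List

def fetch_neighbors_batched(client, seed_ids: List[str], limit: int) -> List[Dict[str, Any]]:
    """Hypothetical helper: one round trip for the whole seed list.

    The slow anti-pattern this replaces pays per-query overhead N times:
        for sid in seed_ids:
            client._execute_cypher(per_seed_query, params={'seed_id': sid})
    """
    query = f"""
        MATCH (seed:Concept)-[]-(neighbor:Concept)
        WHERE seed.concept_id IN $seed_ids
        RETURN DISTINCT neighbor.concept_id as concept_id
        LIMIT {limit}
    """
    return client._execute_cypher(query, params={'seed_ids': seed_ids})
```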
Configuration Changes
AGEClient Connection Pool:
```python
# Increased from 10 to support parallel workers
self.pool = psycopg2.pool.SimpleConnectionPool(
    1,   # minconn
    20,  # maxconn (supports up to 16 parallel workers + buffer)
    ...
)
```
GraphParallelizer Configuration:
```python
config = ParallelQueryConfig(
    max_workers=8,           # ThreadPoolExecutor size
    chunk_size=100,          # Optimal: 1-2 workers (see findings)
    timeout_seconds=120.0,
    per_worker_limit=max_candidates * 2,
)
```
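Note that the worker count is derived from chunking rather than set directly. Assuming GraphParallelizer splits the Phase 2 frontier into ceil(n / chunk_size) chunks and assigns one worker per chunk (consistent with the worker/chunk-size pairs in the results table below), the arithmetic for the 40-concept test frontier is:

```python
import math

def effective_workers(num_seeds: int, chunk_size: int, max_workers: int) -> int:
    # One worker per chunk, capped by the thread pool size.
    return min(max_workers, math.ceil(num_seeds / chunk_size))

# The 40-concept 1-hop frontier from the test dataset:
assert effective_workers(40, chunk_size=100, max_workers=8) == 1  # sequential Phase 2
assert effective_workers(40, chunk_size=20, max_workers=8) == 2
assert effective_workers(40, chunk_size=10, max_workers=8) == 4
assert effective_workers(40, chunk_size=5, max_workers=8) == 8
```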
Performance Testing Results
Complete Performance Curve
Tested configurations from 1 to 8 workers on clean system (no CPU contention):
| Workers | Chunk Size | Phase 1 | Phase 2 | Total | Success | Speedup | Overhead |
|---|---|---|---|---|---|---|---|
| Baseline | N/A | - | ~180s | 4:21 (261s) | - | 1.0x | - |
| 1 worker | 100 | 41.5s | 40.8s | 1:23 (83s) | 100% | 3.15x | 0s ✅ |
| 2 workers | 20 | 41.4s | 43.3s | 1:25 (85s) | 100% | 3.07x | +2s |
| 4 workers | 10 | 41.7s | 80.5s | 2:02 (122s) | 100% | 2.14x | +39s |
| 8 workers | 5 | 41.8s | 162s | 3:24 (204s) | 50% | 1.28x | +79s + timeouts |
Phase Breakdown (sketched below):
- Phase 1 (1-hop): ~41-42s (consistent across all configs, single batched query)
- Phase 2 (2-hop): 41s (1 worker) → 43s (2 workers) → 80s (4 workers) → 162s (8 workers)
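For orientation, the two phases sketched in Python. The structure is assumed from the phase timings above; fetch_batched stands in for the batched 1-hop query shown earlier, and GraphParallelizer may implement the chunk loop differently.

```python
from typing import Callable, List

def discover_candidates(
    fetch_batched: Callable[[List[str]], List[str]],
    pole_ids: List[str],
    chunk_size: int,
) -> List[str]:
    # Phase 1: one batched query, seed poles -> 1-hop neighbors
    # (~41-42s measured, identical across all worker configurations).
    one_hop = fetch_batched(pole_ids)
    # Phase 2: 1-hop frontier -> 2-hop neighbors. The frontier is split into
    # chunks; each chunk becomes one worker's batched IN-clause query.
    chunks = [one_hop[i:i + chunk_size] for i in range(0, len(one_hop), chunk_size)]
    two_hop: List[str] = []
    for chunk in chunks:  # sequential here; the parallelizer fans these out
        two_hop.extend(fetch_batched(chunk))
    return sorted(set(one_hop) | set(two_hop))
```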
Key Findings
🎯 Discovery #1: Speedup source is batched queries, NOT parallelization
The 3x speedup comes from:
1. ✅ Fixing the query format to use _execute_cypher() correctly
2. ✅ Batched IN clause queries instead of individual queries per concept
3. ❌ NOT from parallelization - parallelization adds overhead
Evidence:
- 1 worker (sequential): 83 seconds
- 2 workers (parallel): 85 seconds (+2s)
- Difference: parallelization is 2 seconds slower, not faster
🎯 Discovery #2: 1 worker is optimal for this workload
- 1 worker: 83s, zero overhead, simplest implementation
- 2 workers: 85s, +2s overhead (negligible)
- 4 workers: 122s, +39s overhead (significant)
- 8 workers: 204s, +79s overhead + timeouts (catastrophic)
Why more workers hurt performance:
1. Per-worker overhead - thread spawning, coordination, context switching
2. Fixed query costs - connection acquisition, query planning, setup
3. Database contention - multiple concurrent queries compete for locks
4. Diminishing returns - Amdahl's Law in action (see the bound sketched below)
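To make the Amdahl point concrete with the measured timings: Phase 1 (~42s) is a single batched query and stays serial, so even a zero-overhead Phase 2 parallelization is tightly bounded. A back-of-envelope check using the numbers above:

```python
# Amdahl-style bound from the measured phase timings (seconds).
serial = 42.0    # Phase 1: single batched query, cannot be split
parallel = 41.0  # Phase 2 at 1 worker: the only parallelizable portion
total_1_worker = serial + parallel  # 83s, matching the measured 1:23

for n in (2, 4, 8):
    best_case = serial + parallel / n  # assumes zero coordination overhead
    print(f"{n} workers: best case {best_case:.1f}s, "
          f"max speedup over 1 worker {total_1_worker / best_case:.2f}x")
# Even the 2-worker best case (62.5s, 1.33x over 1 worker) is modest, and the
# measured 85s shows coordination overhead already erases it.
```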
🎯 Discovery #3: Worker overhead scales badly
Phase 2 overhead vs. number of workers:
- 1 worker: 41s (baseline)
- 2 workers: 43s (+2s, 5% overhead)
- 4 workers: 80s (+39s, 95% overhead)
- 8 workers: 162s (+121s, 295% overhead + timeouts)
🎯 Discovery #4: Timeout mechanism validates correctly
With 8 workers and CPU contention:
- 4/8 chunks completed successfully
- 4/8 chunks timed out after exactly 120s
- System gracefully returned partial results
- No hangs, no crashes, no data corruption
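A minimal sketch of the partial-results pattern validated here, using concurrent.futures (Python 3.9+). GraphParallelizer's internals may differ; run_chunk and chunks are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def run_chunks_with_timeout(run_chunk, chunks, max_workers=8, timeout_seconds=120.0):
    """Run chunks in worker threads; return whatever finishes before the deadline."""
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = [pool.submit(run_chunk, chunk) for chunk in chunks]
    done, not_done = wait(futures, timeout=timeout_seconds)
    # Python 3.9+: drop queued chunks; already-running ones finish in the
    # background but their results are discarded.
    pool.shutdown(wait=False, cancel_futures=True)
    results = []
    for future in done:
        if future.exception() is None:
            results.extend(future.result())
    return results, len(not_done)  # partial results + count of timed-out chunks
```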
Decision
Optimal Configuration: 1-2 workers
After comprehensive testing, we recommend:
Option A: 1 worker (simplest, fastest)
```python
config = ParallelQueryConfig(
    max_workers=8,
    chunk_size=100,          # Forces 1 worker, sequential Phase 2
    timeout_seconds=120.0,
    per_worker_limit=max_candidates * 2,
)
```
Option B: 2 workers (minimal overhead, nice symmetry)
```python
config = ParallelQueryConfig(
    max_workers=8,
    chunk_size=20,           # Creates 2 workers for typical 40-concept datasets
    timeout_seconds=120.0,
    per_worker_limit=max_candidates * 2,
)
```
Rationale:
- Both deliver ~3x speedup over the broken baseline
- 1 worker eliminates all parallelization complexity
- 2 workers add only 2s overhead and keep the parallel infrastructure exercised
- 4+ workers add significant overhead without proportional benefit
- Simpler = fewer failure modes, easier debugging, better maintainability
Consequences
Positive
- 3x speedup achieved - 4:21 → 1:23 (261s → 83s)
- Simple implementation - 1 worker eliminates parallelization complexity
- Root cause identified - Query format bug, not algorithm design
- Robust infrastructure - Parallel system works, validated with timeout testing
- Scalability path - Can increase workers for larger datasets if needed
- Performance baseline - Comprehensive testing establishes performance curve
Negative
- Parallelization infrastructure underutilized - Built for 8+ workers, optimal is 1-2
- Design assumptions wrong - ADR-071 predicted 160x speedup from parallelization, actual speedup is from batched queries
- Overhead higher than expected - Worker overhead grows faster than linear
Neutral
- Configuration tunability - chunk_size controls worker count, but optimal is fixed
- Connection pool increase - Increased from 10 → 20, but only need 2-3
- Timeout mechanism - Validated and working, but not needed at optimal config
Alternatives Considered
Alternative 1: Remove parallelization entirely
Approach: Delete GraphParallelizer, just use batched queries
Pros:
- Simplest possible code
- No thread management, no semaphores, no timeouts
- Identical performance to the 1-worker config
Cons:
- Loses the scalability path for larger datasets
- Loses the timeout safety mechanism
- Wastes implementation effort
Decision: Keep the parallelizer with the optimal 1-2 worker config. The infrastructure is valuable for:
- Future large-scale queries (100+ concepts)
- The timeout safety mechanism
- Demonstrating due diligence in optimization
Alternative 2: Adaptive worker count
Approach: Dynamically choose worker count based on dataset size
```python
# Adaptive configuration
def get_optimal_workers(num_concepts):
    if num_concepts < 50:
        return 1   # Sequential is fastest
    elif num_concepts < 100:
        return 2   # Minimal overhead
    else:
        return 4   # More parallelism for large datasets
```
Pros:
- Optimal performance across different dataset sizes
- Handles both small and large queries
Cons:
- Increased complexity
- Untested for large datasets
- Premature optimization
Decision: Defer until we have real-world large dataset use cases. Current config is simple and optimal for observed workloads.
Alternative 3: Database-level parallelization
Approach: Investigate PostgreSQL parallel query execution instead of application-level parallelization
Pros:
- Database engine handles parallelization
- No application-level thread management
- May be more efficient
Cons:
- AGE Cypher queries are function calls, not parallelizable by PostgreSQL
- Would require changes to the AGE extension itself
- Out of scope for application-level optimization
Decision: Not feasible with current AGE architecture.
Implementation Notes
Files Modified
- api/api/lib/graph_parallelizer.py (409 lines)
  - Fixed query format to use _execute_cypher()
  - Removed manual ag_catalog.cypher() wrapping
  - Removed unnecessary json import
- api/api/lib/polarity_axis.py
  - Added discover_candidate_concepts_parallel() function
  - Updated analyze_polarity_axis() with use_parallel parameter
  - Configured ParallelQueryConfig with optimal settings
- api/api/models/queries.py
  - Added use_parallel: bool field to PolarityAxisRequest
- api/api/routes/queries.py
  - Passed use_parallel parameter to analysis function
- api/api/lib/age_client.py
  - Increased connection pool: maxconn=10 → maxconn=20
- scripts/development/manual-tests/test_parallel_performance.py
  - Performance testing script for future benchmarking
Testing Performed
Test Environment:
- Clean system (no competing workload)
- 32-core machine
- PostgreSQL + AGE running in Docker
- Test poles: Modern Operating Model ↔ Traditional Operating Model
Configurations Tested:
- ✅ 1 worker (chunk_size=100): 1:23, 100% success
- ✅ 2 workers (chunk_size=20): 1:25, 100% success
- ✅ 4 workers (chunk_size=10): 2:02, 100% success
- ✅ 8 workers (chunk_size=5): 3:24, 50% success (timeouts)
Stress Testing:
- CPU contention (kernel compilation running): validated graceful degradation
- Timeout mechanism: validated with the 8-worker config
- Connection pool: no exhaustion observed with the 20-connection pool
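For reference, a sketch of the kind of harness test_parallel_performance.py provides. The analyze_polarity_axis signature and result shape shown here are assumptions based on the Files Modified notes, not the script's actual code.

```python
import time

def benchmark(analyze_polarity_axis, configs):
    """Time each ParallelQueryConfig against the standard test poles."""
    for label, config in configs.items():
        start = time.perf_counter()
        result = analyze_polarity_axis(
            pole_a="Modern Operating Model",       # assumed parameter names
            pole_b="Traditional Operating Model",
            max_hops=2,
            use_parallel=True,
            parallel_config=config,
        )
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.1f}s, {len(result.candidates)} candidates")
```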
Related ADRs
- ADR-071: Parallel Graph Query Optimization (parent) - Original design document
- ADR-048: GraphQueryFacade - Namespace-safe query interface (not used, went with _execute_cypher)
- ADR-043: Resource Management - VRAM/CPU contention handling (informed testing approach)
Lessons Learned
- Profile before parallelizing - The bottleneck was broken query format, not sequential execution
- Measure, don't assume - Design predicted 160x from parallelization, actual gain is from batching
- Simple is fast - 1 worker (sequential) beats complex parallelization
- Amdahl's Law is real - Overhead grows faster than parallelization benefit
- Infrastructure has value - Even if underutilized, timeout safety and scalability path justify keeping GraphParallelizer
Recommendations
For Production
- Use the 2-worker configuration for a balance of:
  - Near-optimal performance (only 2s slower than 1 worker)
  - Validation that the parallel infrastructure is working
  - A scalability path for larger datasets
- Monitor Phase 2 timing to detect performance regressions
- Consider the 1-worker config if simplicity outweighs the 2s performance difference
For Future Work
- Test with larger datasets (100+ concepts) to validate if more workers help at scale
- Profile database-level bottlenecks to understand why larger chunks are faster
- Investigate query plan caching to reduce per-query overhead
- Consider adaptive worker count based on dataset size (if large datasets become common)
Database Tuning Research
After establishing the optimal application-level configuration (2 workers), we researched PostgreSQL and Apache AGE database-level optimizations to understand potential incremental improvements.
Current Configuration (32-core, 123GB RAM system)
PostgreSQL Memory Settings:
- shared_buffers: 128MB (default)
- effective_cache_size: 4GB
- work_mem: 4MB (default)
- maintenance_work_mem: 64MB
Parallelism Settings:
- max_worker_processes: 8
- max_parallel_workers: 8
- max_parallel_workers_per_gather: 2
Current Usage: 44MB / 123GB (0.03%)
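To verify these values on a given deployment, they can be read directly from pg_settings; a small sketch using psycopg2 (the DSN is a placeholder):

```python
import psycopg2

SETTINGS = (
    "shared_buffers", "effective_cache_size", "work_mem",
    "maintenance_work_mem", "max_worker_processes",
    "max_parallel_workers", "max_parallel_workers_per_gather",
)

# Placeholder DSN: substitute the deployment's actual connection string.
with psycopg2.connect("dbname=graph user=postgres") as conn:
    with conn.cursor() as cur:
        for name in SETTINGS:
            cur.execute("SELECT setting, unit FROM pg_settings WHERE name = %s", (name,))
            setting, unit = cur.fetchone()
            print(f"{name} = {setting}{unit or ''}")
```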
Research Findings
Key Insight: All sources confirm that indexing strategy is the highest-impact optimization. Apache AGE does NOT auto-create indexes, and graph performance depends heavily on proper indexing of node/edge properties.
Recommended Configuration for Production:
```
# Memory Settings (25-33% of 123GB RAM)
shared_buffers = 32GB                  # Currently: 128MB
effective_cache_size = 75GB            # Currently: 4GB
work_mem = 256MB                       # Currently: 4MB (for graph traversals)
maintenance_work_mem = 2GB             # Currently: 64MB
huge_pages = on                        # CRITICAL for 32GB+ shared_buffers

# Parallelism (32-core system)
max_worker_processes = 32              # Currently: 8
max_parallel_workers = 32              # Currently: 8
max_parallel_workers_per_gather = 16   # Currently: 2
```
Expected Performance Impact:
- Indexing improvements: 2-3x speedup for graph traversals (highest priority)
- Memory tuning: 10-15% improvement for complex queries
- Parallelism tuning: diminishing returns beyond 12 workers (validates ADR-071a findings)
- Huge pages: can reduce CPU overhead from 51% to 15% (source: PostgreSQL community benchmarks)
Critical Missing Indexes
Research revealed that AGE requires explicit indexing:
```sql
-- Node property indexes (for MATCH filtering)
CREATE INDEX idx_concept_label ON ag_catalog.concept_vertex
    USING btree ((properties->>'label'));

CREATE INDEX idx_concept_id ON ag_catalog.concept_vertex
    USING btree ((properties->>'concept_id'));

-- Relationship indexes (for traversal)
CREATE INDEX idx_edge_start_end ON ag_catalog.concept_edge
    USING btree (start_id, end_id);

CREATE INDEX idx_edge_type ON ag_catalog.concept_edge
    USING btree ((properties->>'type'));
```
Additional Optimization Strategies
- Connection Pooling (PgBouncer)
  - Industry standard for PostgreSQL production deployments
  - Transaction pooling mode recommended
  - Target: 5-10 active connections even with high concurrency
- Query Optimization
  - Use PROFILE to analyze Cypher query execution plans
  - Filter early in MATCH clauses (reduces intermediate results)
  - Use explicit node labels (:Concept, :Source) - aligns with ADR-048
  - Minimize OPTIONAL MATCH usage (generates large intermediate results)
- Prepared Statements (see the sketch below)
  - Parse and optimize once, execute many times
  - Particularly effective for ingestion pipelines
  - Already supported via psycopg2 in age_client.py
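To illustrate the prepared-statement pattern (whether age_client.py uses this exact mechanism is not specified here), a minimal psycopg2 sketch using PostgreSQL's PREPARE/EXECUTE. The DSN and statement name are placeholders, and pg_settings serves only as a stand-in target table.

```python
import psycopg2

with psycopg2.connect("dbname=graph user=postgres") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        # Parsed and planned once per session...
        cur.execute(
            "PREPARE get_setting AS "
            "SELECT setting FROM pg_settings WHERE name = $1;"
        )
        # ...then executed repeatedly with only parameter binding.
        for name in ("work_mem", "shared_buffers", "max_parallel_workers"):
            cur.execute("EXECUTE get_setting(%s)", (name,))
            print(name, "=", cur.fetchone()[0])
```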
Alignment with ADR-071a Performance Testing
Database tuning research confirms our empirical findings:
- Parallelism shows diminishing returns beyond 10-12 workers
- Application-level batching provides bigger gains than database parallelism
- Proper query structure (early filtering, explicit labels) matters more than raw resources
Conclusion: The 3x speedup from fixing query format validates that query optimization > resource tuning. Database configuration improvements would provide incremental 10-15% gains, not transformative performance changes.
Deferred Items
The following optimizations are documented but not implemented (they await production deployment or larger datasets):
- PostgreSQL memory configuration tuning
- Huge pages enablement
- Explicit graph indexing strategy
- PgBouncer connection pooling
- Query plan profiling and optimization
These should be prioritized when:
- The graph exceeds 100K concepts (indexing becomes critical)
- Multi-user production deployment requires connection pooling
- Query performance degrades (profile and optimize)
Conclusion
The implementation of ADR-071 successfully achieved a 3.15x speedup (4:21 → 1:23), but the speedup came from a different source than the design predicted.
Key Insight: The performance gain comes from fixing the query format to use proper batched queries, not from parallelization. The parallel infrastructure works correctly and provides value (timeout safety, scalability path), but 1-2 workers is optimal for current workloads.
This is a valuable lesson in "make it work, make it right, make it fast": we made it work (fixed the query), and it is already fast. The parallelization exploration was still valuable for understanding performance characteristics and establishing that simpler (sequential) is better.
Final Configuration: 2 workers (chunk_size=20), delivering 3.07x speedup with 100% reliability and minimal complexity.