
Fix graph pipeline performance regressions vs batch_pipeline #1786

Merged
jperez999 merged 7 commits into NVIDIA:main from charlesbluca:graph-pipeline-pps-fixes on Apr 6, 2026

Conversation

Collaborator

@charlesbluca commented Apr 2, 2026

Description

The graph_pipeline entrypoint introduced in #1778 had several performance regressions compared to the old batch_pipeline entrypoint it replaced. This PR restores parity.

Root causes fixed:

1. BatchTuningParams were silently discarded.
The batch_tuning field on EmbedParams/ExtractParams was explicitly excluded when building actor kwargs, but nothing translated it into Ray-level scheduling config (batch_size, concurrency, num_gpus per node). GraphIngestor was creating RayDataExecutor with batch_size=1 and no node_overrides regardless of CLI flags. A new batch_tuning_to_node_overrides() function in ingestor_runtime.py performs this translation, and GraphIngestor.ingest() now merges the result with any explicit node_overrides passed at construction.

2. No heuristic defaults when CLI flags are absent.
The old pipeline scaled actor counts and batch sizes from cluster GPU count via resolve_requested_plan(). The new pipeline had no equivalent, so it fell back to batch_size=1 and a single embed actor at num_gpus=0.1. batch_tuning_to_node_overrides() now accepts cluster_resources and uses resolve_requested_plan() as a fallback for any field not explicitly set — matching the heuristic behaviour of batch_pipeline.
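A minimal sketch of this fallback behaviour, under stated assumptions: the scaling rule inside resolve_requested_plan() and the helper name with_heuristic_defaults are hypothetical, chosen only to show how unset fields fall back to a GPU-count heuristic while explicit CLI values take precedence.

```python
def resolve_requested_plan(cluster_resources):
    # Hypothetical GPU-count heuristic, standing in for the real
    # resolve_requested_plan(); the actual scaling factors differ.
    gpus = int(cluster_resources.get("GPU", 0))
    return {
        "batch_size": max(1, 16 * gpus),  # scale batch size with GPUs
        "concurrency": max(1, gpus),      # one embed actor per GPU
        "num_gpus": 1.0 if gpus else 0.0,
    }

def with_heuristic_defaults(explicit, cluster_resources):
    plan = resolve_requested_plan(cluster_resources)
    # Any field the user did not set (None) falls back to the plan;
    # explicitly set values override the heuristic.
    return {**plan, **{k: v for k, v in explicit.items() if v is not None}}

cfg = with_heuristic_defaults(
    {"batch_size": None, "concurrency": 2},
    {"GPU": 4, "CPU": 64},
)
# batch_size falls back to the heuristic; concurrency keeps the
# explicit value 2.
```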

3. PDF extract concurrency was not capped.
Without a CPU budget check, PDF extract actors could oversubscribe the cluster and cause downstream actors to deadlock waiting for CPU slots. The cap is now applied: pdf_extract_tasks = min(requested, max(1, (total_cpus - non_pdf_overhead) // cpus_per_task)), where overhead accounts for the 4 fixed pipeline tasks plus initial page-elements, OCR, and embed actors.
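A worked example of the cap formula above, with illustrative numbers (the overhead and per-task CPU figures are assumptions, not values from the PR):

```python
def cap_pdf_extract_tasks(requested, total_cpus, non_pdf_overhead, cpus_per_task):
    # The formula from the description: never exceed the CPU budget
    # left after fixed-pipeline overhead, but always allow at least one task.
    return min(requested, max(1, (total_cpus - non_pdf_overhead) // cpus_per_task))

# e.g. a 64-CPU cluster, 8 CPUs reserved for the 4 fixed pipeline tasks
# plus initial page-elements, OCR, and embed actors, 2 CPUs per extract task:
capped = cap_pdf_extract_tasks(requested=40, total_cpus=64,
                               non_pdf_overhead=8, cpus_per_task=2)
# (64 - 8) // 2 = 28, so a request for 40 tasks is capped to 28
```

The `max(1, ...)` floor matters on small clusters: even when the overhead exceeds total CPUs, one extract task still runs rather than zero.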

4. Ray's new progress UI was not enabled.
RayDataExecutor now opts in via DataContext.enable_rich_progress_bars = True / use_ray_tqdm = False, which also suppresses the hint log line.

Observed improvement: ingestion time on jp20 corpus dropped from ~206s back to ~74s (2.8×), matching batch_pipeline throughput.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

@charlesbluca changed the title from "graph pipeline pps fixes" to "Fix graph pipeline performance regressions vs batch_pipeline" Apr 2, 2026
Collaborator

@jperez999 left a comment


Great catch on the perf drop here. I was actually seeing it go faster for bo767 on my machine... I will run again after this is in, should see some more perf gains.

@charlesbluca marked this pull request as ready for review April 3, 2026 16:32
@charlesbluca requested review from a team as code owners April 3, 2026 16:32
@charlesbluca requested a review from nkmcalli April 3, 2026 16:32
@jperez999 jperez999 merged commit f116fc5 into NVIDIA:main Apr 6, 2026
6 checks passed