Fix graph pipeline performance regressions vs batch_pipeline #1786
Merged
jperez999 merged 7 commits into NVIDIA:main on Apr 6, 2026
jperez999 (Collaborator) requested changes on Apr 3, 2026 and left a comment:
Great catch on the perf drop here. I was actually seeing it go faster for bo767 on my machine... I will run again after this is in, should see some more perf gains.
jperez999 approved these changes on Apr 6, 2026
Description
The `graph_pipeline` entrypoint introduced in #1778 had several performance regressions compared to the old `batch_pipeline` entrypoint it replaced. This PR restores parity.

Root causes fixed:
1. `BatchTuningParams` were silently discarded.
The `batch_tuning` field on `EmbedParams`/`ExtractParams` was explicitly excluded when building actor `kwargs`, but nothing translated it into Ray-level scheduling config (`batch_size`, `concurrency`, `num_gpus` per node). `GraphIngestor` was creating `RayDataExecutor` with `batch_size=1` and no `node_overrides` regardless of CLI flags. A new `batch_tuning_to_node_overrides()` function in `ingestor_runtime.py` performs this translation, and `GraphIngestor.ingest()` now merges the result with any explicit `node_overrides` passed at construction.

2. No heuristic defaults when CLI flags are absent.
The old pipeline scaled actor counts and batch sizes from cluster GPU count via `resolve_requested_plan()`. The new pipeline had no equivalent, so it fell back to `batch_size=1` and a single embed actor at `num_gpus=0.1`. `batch_tuning_to_node_overrides()` now accepts `cluster_resources` and uses `resolve_requested_plan()` as a fallback for any field not explicitly set, matching the heuristic behaviour of `batch_pipeline`.

3. PDF extract concurrency was not capped.
Without a CPU budget check, PDF extract actors could oversubscribe the cluster and cause downstream actors to deadlock waiting for CPU slots. The cap is now applied: `pdf_extract_tasks = min(requested, max(1, (total_cpus - non_pdf_overhead) // cpus_per_task))`, where overhead accounts for the 4 fixed pipeline tasks plus initial page-elements, OCR, and embed actors.

4. Ray's new progress UI was not enabled.
Suppressed the hint log line by opting in to `DataContext.enable_rich_progress_bars = True` / `use_ray_tqdm = False` in `RayDataExecutor`.

Observed improvement: ingestion time on jp20 corpus dropped from ~206s back to ~74s (2.8×), matching `batch_pipeline` throughput.

Checklist
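For reference, the CPU cap from root cause 3 in the description can be sketched in isolation. This is a minimal illustration and not the code from this PR: the function name `cap_pdf_extract_tasks`, its signature, and the example numbers (including an overhead of 7 CPUs, which assumes one CPU for each of the 4 fixed tasks and the three initial actors) are assumptions. Only the formula itself is taken from the description.

```python
def cap_pdf_extract_tasks(requested, total_cpus, non_pdf_overhead, cpus_per_task):
    """Hypothetical helper: cap PDF extract concurrency to the CPU budget.

    Mirrors the formula quoted in the description:
        min(requested, max(1, (total_cpus - non_pdf_overhead) // cpus_per_task))
    """
    if cpus_per_task <= 0:
        raise ValueError("cpus_per_task must be positive")
    # CPUs left once the non-PDF stages have reserved their share.
    available = total_cpus - non_pdf_overhead
    # Always allow at least one PDF extract task, even on a tiny cluster.
    budget = max(1, int(available // cpus_per_task))
    return min(requested, budget)


# Example values (assumed, not measured): 32 CPUs, 7 reserved, 2 CPUs per task.
print(cap_pdf_extract_tasks(requested=16, total_cpus=32, non_pdf_overhead=7, cpus_per_task=2))
print(cap_pdf_extract_tasks(requested=4, total_cpus=32, non_pdf_overhead=7, cpus_per_task=2))
print(cap_pdf_extract_tasks(requested=8, total_cpus=4, non_pdf_overhead=7, cpus_per_task=2))
```

With these assumed inputs, the first call is capped by the CPU budget, the second keeps the requested value because it already fits, and the third falls back to the floor of one task because the overhead exceeds the cluster's CPUs.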