remove unnecessary parallelize usage in sepal #1120

Open

selmanozleyen wants to merge 10 commits into scverse:main from selmanozleyen:feat/remove-parallelize-minimal

Conversation

@selmanozleyen
Member

selmanozleyen commented Feb 20, 2026

Hi,

From the rsc sepal implementation I know we can easily get rid of parallelize here. I also have a longer-term plan to remove parallelize entirely. I benchmarked with this script: https://gist.github.com/selmanozleyen/c4b0e4780243fd4621f68bc2d78cae5c

Results, prettified with AI:

Benchmark results (502 genes, 2688 cells, visium_hne_adata, 3 runs, 8 cores):

| variant | mean | median | min | max |
| --- | --- | --- | --- | --- |
| main, n_jobs=1 (sequential) | 134.9s | 134.4s | 134.3s | 136.0s |
| main, n_jobs=6 (joblib/loky) | 40.5s | 40.2s | 37.9s | 43.5s |
| this PR, prange (no pre-alloc) | 37.7s | 37.6s | 36.9s | 38.6s |
| this PR, prange + pre-alloc buffers | 31.9s | 31.3s | 30.9s | 33.6s |

Motivation

This PR replaces the internal parallelize + joblib.Parallel machinery in
sepal() with Numba's native prange threading, removing the need for
multi-process orchestration (subprocess spawning, pickling, IPC).

Key changes

  1. prange over genes -- The outer gene loop is parallelized with
     numba.prange inside a @njit(parallel=True) function, replacing
     joblib.Parallel(n_jobs=..., backend="loky").

  2. Pre-allocated per-thread workspace buffers (commit e011ddc) -- The rationale
     is explained under "Allocation hoisting" in the Numba parallel diagnostics
     docs: https://numba.readthedocs.io/en/stable/user/parallel.html#diagnostics

@codecov

codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 32.35294% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.16%. Comparing base (6dc2a91) to head (357966a).

Files with missing lines Patch % Lines
src/squidpy/gr/_sepal.py 32.35% 23 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1120      +/-   ##
==========================================
- Coverage   66.36%   66.16%   -0.21%     
==========================================
  Files          44       44              
  Lines        7132     7137       +5     
  Branches     1212     1209       -3     
==========================================
- Hits         4733     4722      -11     
- Misses       1923     1941      +18     
+ Partials      476      474       -2     
Files with missing lines Coverage Δ
src/squidpy/gr/_sepal.py 45.07% <32.35%> (-9.68%) ⬇️

@flying-sheep
Member

flying-sheep left a comment
Nice, very straightforward.

The only potential issue I see is that you densify vals as one, whereas the other version densified at most n_jobs columns of vals at a time.

Are there use cases for this where that matters?

```python
sat_shape = sat.shape[0]
n_threads = get_num_threads()

# Pre-allocate per-thread workspace to avoid allocator contention
```
@flying-sheep
Member

flying-sheep commented Feb 20, 2026

I wonder if there isn’t a way to tell numba to do the pre-allocation instead of doing it manually.

@selmanozleyen
Member Author

I am also wondering the same, but I couldn't find a way to do it without calling get_num_threads().

@selmanozleyen
Member Author

@Intron7 do you know anything about this?

@timtreis
Member

> I also have a longer term plan to remove parallelize.

Is that plan fully based on prange? How would you e.g. replace a case like https://github.com/scverse/squidpy/pull/982/changes#diff-bd3d3c041f3b69cca0f8b6ece0e25425eaf6329c636038e53062b1fcf1108285R342 ?

@selmanozleyen
Member Author

> Is that plan fully based on prange? How would you f.e. replace a case like https://github.com/scverse/squidpy/pull/982/changes#diff-bd3d3c041f3b69cca0f8b6ece0e25425eaf6329c636038e53062b1fcf1108285R342 ?

I would need extra time to review that one, but I am confident we can find a way that uses all the cores without multi-process and serialization overhead.

@selmanozleyen
Member Author

> The only potential issue I see is that you densify vals as one, whereas the other version densified at most n_jobs columns of vals at a time.
>
> Are there use cases for this where that matters?

It matters for potential OOMs. Densifying in batches is doable, but densifying up front is better for speed, and I don't think we will see inputs that cause memory issues, since sepal works only with spot-based datasets. Is that correct @timtreis ?

@timtreis
Member

IIRC sepal works in any situation where the obs are arranged in a regular grid or pattern. That'd include Visium with about 4k spots, yes, but also Visium HD, which depending on the bin size you choose could have more than a million obs. Not sure that'd actually be useful, but it could be a potential input.

@selmanozleyen
Member Author

OK, I added a sparse_batch_size, which defaults to 128. Given that 100m cells per gene is around 400mb, I think this is a reasonable default.

Here are the results:

```
Running sepal() 3 time(s)...
  run 1: 45.700s
  run 2: 44.678s
  run 3: 43.301s

Results (3 runs):
  mean:   44.560s
  median: 44.678s
  min:    43.301s
  max:    45.700s
```
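For readers following along, the batched densification idea can be sketched like this. This is a hypothetical helper, not the code in this PR; the name `iter_dense_batches` and the default of 128 are just taken from the discussion above:

```python
import numpy as np
from scipy import sparse


def iter_dense_batches(vals, batch_size=128):
    """Yield (start, dense_block) pairs, densifying at most
    `batch_size` columns (genes) at a time to bound peak memory."""
    n_genes = vals.shape[1]
    for start in range(0, n_genes, batch_size):
        stop = min(start + batch_size, n_genes)
        block = vals[:, start:stop]
        if sparse.issparse(block):
            # Only this slice is ever dense at once.
            block = block.toarray()
        yield start, np.ascontiguousarray(block)
```

Each dense block can then be handed to the njit-compiled kernel, so peak memory is bounded by `n_cells * batch_size` values regardless of the total gene count.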
