
Implement njit map_rows #451

Draft

Graciaaa3 wants to merge 8 commits into main from map_rows-njit

Conversation

@Graciaaa3
Collaborator

Change Description

Adding an njit version of map_rows (originally reduce) to achieve a potential speedup for row-wise operations. The njit-decorated helper methods are factored out into a new module, njit_funcs.py.

Adding a documentation notebook for njit map_rows usage with examples and benchmark results.

Closes #277

Solution Description

  • Added an njit-compiled execution path for map_rows.
  • Support for user-defined functions with one or two arguments.
  • Explicit preprocessing of base and nested columns to satisfy Numba’s static typing requirements.
  • Keeping the existing Python map_rows execution as the default implementation when njit doesn't support the function.
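The dispatch described above could look roughly like the sketch below. This is not the PR's actual code: the helper names `_map_rows_njit` / `_map_rows_python`, the row/column representation, and the two-argument limit are all illustrative assumptions; only the `CPUDispatcher` check and the Python fallback are taken from the change description.

```python
# Hypothetical sketch of the njit/Python dispatch in map_rows.
try:
    from numba.core.registry import CPUDispatcher
except ImportError:  # numba not installed: always take the Python path
    CPUDispatcher = None


def map_rows(rows, func, columns):
    # Use the jitted path only for numba-compiled funcs that fit the
    # criteria (here, illustratively, at most two columns).
    if (
        CPUDispatcher is not None
        and isinstance(func, CPUDispatcher)
        and len(columns) <= 2
    ):
        return _map_rows_njit(rows, func, columns)
    return _map_rows_python(rows, func, columns)


def _map_rows_python(rows, func, columns):
    # Existing default: plain Python iteration over the rows.
    return [func(*(row[c] for c in columns)) for row in rows]


def _map_rows_njit(rows, func, columns):
    # Placeholder for the njit-compiled execution path.
    raise NotImplementedError("jitted path not shown in this sketch")
```

With a plain (non-jitted) `func`, the `isinstance` check fails and the call silently falls through to the existing Python implementation, which matches the "default implementation" bullet above.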

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

I will add the tests to cover added logic and update the documentation for map_rows afterward.


@github-actions

github-actions bot commented Feb 6, 2026

| Before [39a4681] | After [3be5eb5] | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- |
| 256M | failed | n/a | benchmarks.AssignSingleDfToNestedSeries.peakmem_run |
| 29.8±2ms | failed | n/a | benchmarks.AssignSingleDfToNestedSeries.time_run |
| 139M | failed | n/a | benchmarks.CountNestedBy.peakmem_run |
| 66.5±0.9ms | failed | n/a | benchmarks.CountNestedBy.time_run |
| 105M | failed | n/a | benchmarks.NestedFrameAddNested.peakmem_run |
| 11.0±0.3ms | failed | n/a | benchmarks.NestedFrameAddNested.time_run |
| 110M | failed | n/a | benchmarks.NestedFrameQuery.peakmem_run |
| 10.7±0.04ms | failed | n/a | benchmarks.NestedFrameQuery.time_run |
| 109M | failed | n/a | benchmarks.NestedFrameReduce.peakmem_run |
| 1.26±0.02ms | failed | n/a | benchmarks.NestedFrameReduce.time_run |


@hombit
Collaborator

hombit commented Feb 9, 2026

Hey, @Graciaaa3, is it actually ready for review? I'm a little bit confused because the PR is marked as a draft, but I'm selected as a reviewer.

@Graciaaa3
Collaborator Author

> Hey, @Graciaaa3, is it actually ready for review? I'm a little bit confused because the PR is marked as a draft, but I'm selected as a reviewer.

Hi @hombit, sorry for the confusion, it's ready for review! I put it in as a draft PR since I haven't finished writing the tests, but the code part and the tutorial notebook are finished.

Collaborator

@hombit left a comment


Very deep and useful work, thank you!

I'm not 100% sure how the numba compilation pipeline works, but I guess that our @njit functions would be recompiled for every new user's func. For the same reason, I guess it is not worth asking the user to pre-jit-compile their function; we should just try to do it ourselves, maybe conditionally on some keyword argument, e.g. engine=None, like in pandas' apply method. Other reasons to make it conditional on a new argument: 1) your benchmarks show that speed-up is not guaranteed, so the choice should be more explicit, 2) if the user wants to use a numba-compiled func, but not a numba-compiled map_rows, for example for an arbitrary number of columns, we absolutely should allow it, and the warning would be confusing.

I'm wondering how hard it would be to generalize to an arbitrary number of columns using a bit of Python metaprogramming magic?

Thank you for making the notebook and the plots, they are great and very informative! One cosmetic issue: could you please set ylim to [0; None], so 0 is always visible? I would also like to see how much time jit-compilation takes, because it is not clear if it is actually worth it for single-frame use, and what guidelines we should give LSDB users about using it.

Tests would be appreciated.
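The `engine=None` idea suggested above could be sketched as follows. This is a hypothetical API, not the PR's implementation: `engine` accepts any callable decorator (e.g. `numba.njit`), and `None` keeps the default Python execution, mirroring the pandas `apply` precedent mentioned in the comment.

```python
# Hypothetical engine keyword for map_rows; rows/columns handling is
# simplified to a plain-Python loop for illustration.
def map_rows(rows, func, columns, engine=None):
    if engine is not None:
        # e.g. engine=numba.njit jit-compiles the user's func for them,
        # so users never need to pre-wrap their function in njit().
        func = engine(func)
    return [func(*(row[c] for c in columns)) for row in rows]
```

Because the opt-in is explicit, there is no need for a warning when the jitted path is unavailable or slower, addressing both reasons (1) and (2) above.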

```python
    iterators.append(self[col])
else:
    iterators.append(self[layer].array.iter_field_lists(col))
# check if func is jitted by numba and fits the criteria
```
Collaborator


There is a lot of code here, maybe worth refactoring to a new helper function/method?

Collaborator

@dougbrn left a comment


Thank you for all your work on this! I just have an initial comment here, and then I'll look through the tutorial notebook separately.

```python
from nested_pandas.series.nestedseries import NestedSeries
from nested_pandas.series.packer import pack, pack_lists, pack_sorted_df_into_struct
from nested_pandas.series.utils import is_pa_type_a_list
from numba.core.registry import CPUDispatcher
```
Collaborator


You'll need to add numba as a dependency to pyproject.toml, and that should hopefully allow the docs to build.

I'm a little conflicted about adding it as a core dependency, but I think it's the right choice for now.

Collaborator


I agree with keeping numba optional. It always causes a lot of weird dependency problems with CPython/numpy upgrades.
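One common pattern for the optional-dependency approach discussed above: declare numba under `[project.optional-dependencies]` in pyproject.toml and guard the import at runtime so the core package works without it. The extra name `njit` and the error message below are illustrative, not the project's actual configuration.

```python
# Runtime guard for an optional numba dependency (illustrative sketch).
try:
    from numba import njit
    HAS_NUMBA = True
except ImportError:
    njit = None
    HAS_NUMBA = False


def require_numba():
    # Raise a helpful error if the njit path is requested without numba.
    if not HAS_NUMBA:
        raise ImportError(
            "numba is required for the njit map_rows path; "
            "install it with the hypothetical extra "
            "`pip install nested-pandas[njit]`"
        )
```

The corresponding pyproject.toml entry would look something like `njit = ["numba"]` under `[project.optional-dependencies]`, so `map_rows` can keep its Python path as the zero-dependency default.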

@dougbrn
Collaborator

dougbrn commented Feb 11, 2026

Please see this for notebook comments: https://app.reviewnb.com/lincc-frameworks/nested-pandas/blob/map_rows-njit/docs%2Fpre_executed%2Fnjit_map_rows.ipynb%2F/discussion

I think the notebook is looking really good!

@Graciaaa3
Collaborator Author

> Very deep and useful work, thank you!
>
> I'm not 100% sure how the numba compilation pipeline works, but I guess that our @njit functions would be recompiled for every new user's func. For the same reason, I guess it is not worth asking the user to pre-jit-compile their function; we should just try to do it ourselves, maybe conditionally on some keyword argument, e.g. engine=None, like in pandas' apply method. Other reasons to make it conditional on a new argument: 1) your benchmarks show that speed-up is not guaranteed, so the choice should be more explicit, 2) if the user wants to use a numba-compiled func, but not a numba-compiled map_rows, for example for an arbitrary number of columns, we absolutely should allow it, and the warning would be confusing.
>
> I'm wondering how hard it would be to generalize to an arbitrary number of columns using a bit of Python metaprogramming magic?
>
> Thank you for making the notebook and the plots, they are great and very informative! One cosmetic issue: could you please set ylim to [0; None], so 0 is always visible? I would also like to see how much time jit-compilation takes, because it is not clear if it is actually worth it for single-frame use, and what guidelines we should give LSDB users about using it.
>
> Tests would be appreciated.

Thanks for the comment! I agree we should make njit optional if the user doesn't want to use it. I looked into pandas' apply and its engine kwarg, which accepts a decorator or {'python', 'numba'}. I saw that "String parameters will stop being supported in a future pandas version" in the warning section. What would be our take on this?

And would we want to support a similar feature of compiling a Python custom func with njit inside map_rows? If so, I think it might be better for me to finish up what I have currently and add that feature afterward?

Also, for the time it takes to jit-compile the custom function and/or the map_rows logic, which one, or both, would be more useful to see? To make sure I'm understanding it correctly: we are looking for a comparison between a single call to njit map_rows (with the method and/or custom func not pre-compiled) and the Python execution?

@hombit
Collaborator

hombit commented Feb 15, 2026

> I saw that "String parameters will stop being supported in a future pandas version" in the warning section. What would be our take on this?

I don't have a strong opinion. I'm OK with supporting callables only, e.g. engine=None/engine=numba.njit, etc.

> And would we want to support a similar feature of compiling python custom func with njit inside map_rows? If so, I think maybe it would be better for me to finish up what I have currently and add in the feature after?

No, but I also believe that any func passed as an argument to another function would be (re)compiled anyway. That's why I see no reason to ask the user to wrap func into njit(). We can easily do it for the user. See some benchmark experiments below.

> Also, for the time it takes for jit-compiling custom function and/or map_rows logic, which one, or both, would be more useful to see? To make sure I'm understanding it correctly, we are looking for the comparison between just one call to map_rows in njit (with method and/or custom func not pre-compiled) with python execution?

Yes, I think even a single compilation + single jitted map_rows evaluation to pure Python map_rows evaluation ratio would be useful. Again, it is interesting to see these numbers because njitted map_rows would be recompiled for every new func it sees.
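A small, library-agnostic helper for the measurement described above: time the first call to a function (which includes jit compilation when the function is a freshly created numba dispatcher) separately from steady-state calls, so the compile overhead is visible against the per-call cost. The helper itself is an illustrative sketch and works with any callable.

```python
import time


def first_vs_steady(fn, *args, repeat=100):
    # First call: pays one-time costs (jit compilation, caches, etc.).
    t0 = time.perf_counter()
    fn(*args)
    first = time.perf_counter() - t0
    # Steady state: average over repeated calls to cached machine code.
    t0 = time.perf_counter()
    for _ in range(repeat):
        fn(*args)
    steady = (time.perf_counter() - t0) / repeat
    return first, steady
```

Applied to a jitted map_rows versus the pure-Python path, the `first` numbers would answer whether compilation is worth it for single-frame use, while `steady` reflects the repeated-use case relevant to LSDB.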

Justification that a function argument causes recompilation of an outer function:

```python
# Feel free to set NUMBA_DEBUG_JIT=1 to see compilation logs
from numba import njit

def inner1(x, y):
    return x**0.2 + y**0.3

inner1_compiled = njit(inner1)
inner1_compiled(1.0, 1.0)

def inner2(x, y):
    return x**0.3 - y**0.2

inner2_compiled = njit(inner2)
inner2_compiled(1.0, 2.0)

def outer(f, x, y):
    return f(x, y) + x**y

outer_compiled = njit(outer)
outer_compiled(inner1_compiled, 1.0, 1.0)

## Check if recompilation is caused for any new input function
## Using the same function argument is fast, no recompilation I guess
%time outer_compiled(inner1_compiled, 2.0, 2.0)
# Wall time: 19.8 μs
## Using a different function is slow: note milliseconds vs microseconds
%time outer_compiled(inner2_compiled, 2.0, 2.0)
# Wall time: 16.4 ms
```



Development

Successfully merging this pull request may close these issues.

Implement a jit-compiled version of reduce

3 participants