Skip to content

Conversation

@koperagen
Copy link
Collaborator

@koperagen koperagen commented Dec 12, 2025

I suggest to look at this PR by individual commits to see how code evolved
First iteration is simple and managed to achieve sub 1s speed for 1 m rows, which is acceptable
Later i introduced wrapper that exploits the fact that Kotlin Notebook takes small part of sorted dataframe, so after sorting we can avoid materializing the entire thing
And, since we can only sort 1 column at a time now, added special case of sorting just 1 column. Comparator can be a bit faster. + It was easy to apply Arrays.parallelSort to that, which all combined gave us another magnitude of speedup
So now sorting of 1 m rows is within ~30-50 ms on my PC, depends on how much core you have.

Bonus: it's now possible to sort columns with DataFrame and List values by their size

Intermediate results after removing reflection from sorting loop:

Type Size eager
int 1000000 66.746
category 1000000 130.657
double 1000000 519.548
string 1000000 366.065
list 1000000 252.999
frame 1000000 246.892

Final results with lazy materialization + improved comparators for optimal bytecode where possible + Arrays.parallelSort:

Type Size optimized
int 1000000 3.562
category 1000000 7.268
double 1000000 30.330
string 1000000 42.588
list 1000000 32.648
frame 1000000 42.252

fixes #1436

@koperagen koperagen added this to the 1.0.0-Beta5 milestone Dec 12, 2025
@koperagen koperagen self-assigned this Dec 12, 2025
@koperagen koperagen added the performance Something related to how fast the library can handle data label Dec 12, 2025
@koperagen koperagen force-pushed the comparable-containers branch from fdbca04 to cf3a7dc Compare December 12, 2025 19:29
@koperagen koperagen force-pushed the comparable-containers branch from cf3a7dc to 7795911 Compare December 15, 2025 12:14
/**
* Sorts a dataframe-like object by multiple columns.
* If a column type is not comparable, sorting by string representation is applied instead.
* Sorts DataFrames by their size because looking at the smallest / biggest groups after groupBy is very popular.
Copy link
Collaborator

@Jolanrensen Jolanrensen Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and lists

testImplementation(projects.dataframeCsv)
}

benchmark {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see first benchmark in our project!

val finalComparator = if (isDesc[0]) comparator.reversed() else comparator

val permutation = Array(column.size()) { it }
Arrays.parallelSort(permutation, finalComparator)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot is it thread-safe to use parallesort here?

Copy link

Copilot AI commented Dec 17, 2025

@zaleslaw I've opened a new pull request, #1646, to work on those changes. Once the pull request is ready, I'll request review from you.

@zaleslaw
Copy link
Collaborator

zaleslaw commented Dec 17, 2025

https://stackoverflow.com/questions/36128533/why-is-the-minimum-granularity-defined-as-8192-in-java8-in-order-to-switch-from

val threshold = 8192 // Arrays.parallelSort uses this internally
if (permutation.size >= threshold) {
    Arrays.parallelSort(permutation, finalComparator)
} else {
    Arrays.sort(permutation, finalComparator)
}

ForkJoinPool under the hood uses the common thread pool and also, thinking am I - should it live inside Kotlin kernel thread (I have no idea about how it's organized in Kotlin Notebook), where exactly the child threads will be created

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements significant performance optimizations for sorting DataFrames in Kotlin Notebook environments, achieving a ~10-100x speedup (from 250-500ms to 30-50ms for 1 million rows). The optimization combines three key techniques: (1) removing reflection from the sorting loop, (2) implementing lazy materialization via SortedDataFrameView to avoid materializing entire sorted DataFrames when only a subset is needed for rendering, and (3) using Arrays.parallelSort for single-column sorts.

Key Changes:

  • Introduced SortedDataFrameView wrapper that defers row materialization until accessed
  • Optimized single-column sorting path using Arrays.parallelSort with index-based comparators
  • Added support for sorting DataFrame and List columns by their size (row count/length)

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/KotlinNotebookPluginUtils.kt Core implementation: adds SortedDataFrameView for lazy materialization, refactors sorting logic with optimized single-column path, adds comparators for DataFrame/List types by size
core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/KotlinNotebookPluginUtilsTests.kt New test file covering sorting of lists and DataFrames by size with various edge cases
core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/SortingBenchmark.kt New benchmark suite measuring sorting performance across different column types and sizes
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/DataFrameGet.kt Simplifies getRows to delegate to get methods for consistency
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/DataFrame.kt Moves get(IntRange) and get(Iterable<Int>) implementations from getRows to interface default methods
core/build.gradle.kts Adds kotlinx.benchmark plugin and configuration for performance testing

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +37 to +44
fun `sort empty lists`() {
val lists = listOf(listOf(1, 2), emptyList(), listOf(1), emptyList())
val df = dataFrameOf("listColumn" to lists)

val res = KotlinNotebookPluginUtils.sortByColumns(df, listOf(listOf("listColumn")), desc = listOf(true))

res["listColumn"].values() shouldBe listOf(listOf(1, 2), listOf(1), emptyList(), emptyList())
}
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test verifies that the result is an empty list twice, but doesn't test preservation of order (stability) for empty lists. Since the test is named sort empty lists, it would be helpful to verify that when sorting descending, the two empty lists maintain their original relative order (stability), similar to the test on line 47-54. Consider modifying the test to use tagged empty lists (e.g., different container objects) to verify stability.

Copilot uses AI. Check for mistakes.
Comment on lines +118 to +120
// Not sure how to have generic logic that would produce Comparator<Int> and Comparator<DataRow> without overhead
// For now Comparator<DataRow> is needed for fallback case of sorting multiple columns. Although it's now impossible in UI
// Please make sure to change both this and createColumnComparator
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 119 states "Although it's now impossible in UI", but this doesn't clarify what "this" refers to or why it's impossible. It appears to refer to multi-column sorting being disabled in the UI, but this should be stated more explicitly. Consider clarifying the comment to explain: "Although multi-column sorting is currently disabled in the Kotlin Notebook Plugin UI, this fallback case is retained for API compatibility and potential future re-enablement."

Suggested change
// Not sure how to have generic logic that would produce Comparator<Int> and Comparator<DataRow> without overhead
// For now Comparator<DataRow> is needed for fallback case of sorting multiple columns. Although it's now impossible in UI
// Please make sure to change both this and createColumnComparator
// Not sure how to have generic logic that would produce Comparator<Int> and Comparator<DataRow> without overhead.
// For now Comparator<DataRow> is needed as a fallback for multi-column sorting. Although multi-column sorting
// is currently disabled in the Kotlin Notebook Plugin UI, this fallback is retained for API compatibility and
// potential future re-enablement. Please make sure to change both this and createColumnComparator.

Copilot uses AI. Check for mistakes.
Comment on lines +16 to +24
fun `sort lists by size descending`() {
val random = Random(123)
val lists = List(20) { List(random.nextInt(1, 100)) { it } } + null
val df = dataFrameOf("listColumn" to lists)

val res = KotlinNotebookPluginUtils.sortByColumns(df, listOf(listOf("listColumn")), desc = listOf(true))

res["listColumn"].values() shouldBe lists.sortedByDescending { it?.size ?: 0 }
}
Copy link

Copilot AI Dec 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test only verifies sorting lists by size but doesn't test edge cases like: (1) all nulls, (2) mix of empty lists and non-empty lists with nulls, or (3) very large lists. Consider adding test cases for these edge cases to ensure the sorting behavior is well-defined and consistent.

Copilot uses AI. Check for mistakes.
@koperagen
Copy link
Collaborator Author

ForkJoinPool under the hood uses the common thread pool and also, thinking am I - should it live inside Kotlin kernel thread (I have no idea about how it's organized in Kotlin Notebook), where exactly the child threads will be created

Two options. First, default - Kotlin Kernel lives in its own JVM process. Second is attached mode where Kernel lives inside intellij process and uses common pool from there. But IMO in both cases it's no different from using Stream.parallelStream which is fairly widespread and, to my knowledge, recommended

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Something related to how fast the library can handle data

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Effecient KotlinNotebokPluginUtils.sortByColumns

4 participants