[pull] main from apache:main#82

Merged
pull[bot] merged 3 commits into buraksenn:main from apache:main
Apr 4, 2026
pull[bot] commented Apr 4, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

kosiew and others added 3 commits April 4, 2026 06:48
## Which issue does this PR close?

* Closes #20492.

## Rationale for this change

`HashJoinExec` currently continues polling and consuming the probe side
even after the build side has completed with zero rows.

For join types whose output is guaranteed to be empty when the build
side is empty, this work is unnecessary. In practice, it can trigger
large avoidable scans and extra compute despite producing no output.
This is especially costly for cases such as INNER, LEFT, LEFT SEMI, LEFT
ANTI, LEFT MARK, and RIGHT SEMI joins.

This change makes the stream state machine aware of that condition so
execution can terminate as soon as the build side is known to be empty
and no probe rows are needed to determine the final result.

The change also preserves the existing behavior for join types that
still require probe-side rows even when the build side is empty, such as
RIGHT, FULL, RIGHT ANTI, and RIGHT MARK joins.
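
The split between the two groups of join types can be written as a single predicate. The following is a minimal sketch, not the code from the PR: the `JoinType` enum here is a hypothetical local stand-in that only mirrors the join types named above, and it assumes the build side is the left input, which is consistent with the two lists.

```rust
/// Hypothetical stand-in for the engine's `JoinType`; it only lists the
/// join types named in the rationale above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
    Full,
    LeftSemi,
    RightSemi,
    LeftAnti,
    RightAnti,
    LeftMark,
    RightMark,
}

impl JoinType {
    /// Returns true when an empty build side guarantees an empty result,
    /// so the probe side never needs to be read.
    /// Assumption: the build side is the left input.
    pub fn empty_build_side_produces_empty_result(self) -> bool {
        matches!(
            self,
            JoinType::Inner
                | JoinType::Left
                | JoinType::LeftSemi
                | JoinType::LeftAnti
                | JoinType::LeftMark
                | JoinType::RightSemi
        )
    }
}
```

RIGHT, FULL, RIGHT ANTI, and RIGHT MARK fall through to `false` because each of them can still emit probe-side rows when nothing matches.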

## What changes are included in this PR?

* Added `JoinType::empty_build_side_produces_empty_result` to centralize
logic determining when an empty build side guarantees empty output.
* Updated `HashJoinStream` state transitions to:
  * Skip transitioning to `FetchProbeBatch` when the build side is empty
    and output is deterministically empty.
  * Immediately complete the stream in such cases.
* Refactored logic in `build_batch_empty_build_side` to reuse the new
helper method and simplify match branches.
* Ensured probe-side consumption still occurs for join types that
require probe rows (e.g., RIGHT, FULL).
* Added helper `state_after_build_ready` to unify post-build decision
logic.
* Introduced reusable helper for constructing hash joins with dynamic
filters in tests.
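
The post-build decision described in the list above can be illustrated with a toy state machine. This is a sketch under stated assumptions rather than the actual `HashJoinStream` code: the `StreamState` enum, the function signature, and the `probe_batches_polled` driver are all hypothetical, and the real stream has more states than the two shown.

```rust
/// Hypothetical, reduced set of stream states; the real stream has more.
#[derive(Debug, PartialEq, Eq)]
pub enum StreamState {
    FetchProbeBatch,
    Completed,
}

/// Decide the next state once the build side has been fully collected.
/// `empty_build_means_empty_output` stands for the join-type check
/// (hypothetical parameter; the real helper's signature is not shown here).
pub fn state_after_build_ready(
    build_rows: usize,
    empty_build_means_empty_output: bool,
) -> StreamState {
    if build_rows == 0 && empty_build_means_empty_output {
        // Nothing can be produced: finish without touching the probe side.
        StreamState::Completed
    } else {
        StreamState::FetchProbeBatch
    }
}

/// Toy driver: counts how many probe batches would be polled.
pub fn probe_batches_polled(
    build_rows: usize,
    empty_build_means_empty_output: bool,
    available_probe_batches: usize,
) -> usize {
    let mut state = state_after_build_ready(build_rows, empty_build_means_empty_output);
    let mut polled = 0;
    while state == StreamState::FetchProbeBatch {
        if polled == available_probe_batches {
            // Probe side exhausted.
            state = StreamState::Completed;
        } else {
            polled += 1;
        }
    }
    polled
}
```

With an empty build side and a join type such as INNER, the driver polls zero probe batches. With a join type such as FULL, or with a non-empty build side, it still drains the probe side.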


## Are these changes tested?

Yes, comprehensive tests have been added:

* Verified that probe side is **not consumed** when:

  * Build side is empty
  * Join type guarantees empty output
* Verified that probe side **is still consumed** when required by join
semantics (e.g., RIGHT, FULL joins)
* Covered both filtered and non-filtered joins
* Added tests ensuring correct behavior with dynamic filters
* Added regression test ensuring correct behavior after partition bounds
reporting

These tests validate both correctness and the intended optimization
behavior.


## Are there any user-facing changes?

No API changes.

However, this introduces a performance optimization:

* Queries involving joins with empty build sides may complete
significantly faster
* Reduced unnecessary IO and compute

No behavioral changes in query results.


## LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated
content has been manually reviewed and tested.
## Which issue does this PR close?

- Closes #12709.

## Rationale for this change

There is currently no support for concatenating binary strings.

There are two ways to add it:
1. Cast binaries to text and then concatenate as a text type (the most
common behaviour across DBs, as discussed in the ticket)
2. Concatenate binaries and return a binary type (only Spark does this)

This PR takes approach (1). We could customise the operator/UDF
per dialect, but that is not currently supported.
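
As a rough illustration of approach (1), the sketch below converts each binary input to text and concatenates the results into a text value. It is not the PR's implementation: the function name is hypothetical, and it assumes the cast rejects bytes that are not valid UTF-8, which the description above does not specify.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: treat every binary input as text, then concatenate.
/// Assumption: a binary value that is not valid UTF-8 makes the cast fail.
pub fn concat_binary_as_text(parts: &[&[u8]]) -> Result<String, Utf8Error> {
    let mut out = String::new();
    for part in parts {
        // "Cast" the binary value to text, then append it to the result.
        out.push_str(std::str::from_utf8(part)?);
    }
    Ok(out)
}
```

The result type is text rather than binary, which is the point of approach (1). Approach (2) would instead append the raw bytes and return a `Vec<u8>`.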

## What changes are included in this PR?

- Support binary concatenation via the `||` (pipe) operator
- Support binary concatenation in `concat` UDF

## Are these changes tested?

Yes, SLT (sqllogictest) tests were added.

## Are there any user-facing changes?

No

---------

Co-authored-by: Siew Kam Onn <kosiew@gmail.com>
…tring/binary types (#21090)

## Which issue does this PR close?

- Closes #17899.

## Rationale for this change
```
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        main ┃ first_val_group_acc ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │   872.01 ms │           904.35 ms │     no change │
│ QQuery 1  │   156.37 ms │           164.83 ms │  1.05x slower │
│ QQuery 2  │   448.58 ms │           497.05 ms │  1.11x slower │
│ QQuery 3  │   233.99 ms │           274.10 ms │  1.17x slower │
│ QQuery 4  │  1448.99 ms │          1556.95 ms │  1.07x slower │
│ QQuery 5  │ 10816.83 ms │         11315.69 ms │     no change │
│ QQuery 6  │  2053.16 ms │          2030.02 ms │     no change │
│ QQuery 7  │  2154.74 ms │          2274.63 ms │  1.06x slower │
│ QQuery 8  │   405.62 ms │           405.72 ms │     no change │
│ QQuery 9  │ 17160.65 ms │          4167.34 ms │ +4.12x faster │
│ QQuery 10 │  1206.03 ms │          1090.69 ms │ +1.11x faster │
│ QQuery 11 │  2437.12 ms │          2446.51 ms │     no change │
│ QQuery 12 │   331.27 ms │           317.73 ms │     no change │
└───────────┴─────────────┴─────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                  │ 39725.35ms │
│ Total Time (first_val_group_acc)   │ 27445.62ms │
│ Average Time (main)                │  3055.80ms │
│ Average Time (first_val_group_acc) │  2111.20ms │
│ Queries Faster                     │          2 │
│ Queries Slower                     │          5 │
│ Queries with No Change             │          6 │
│ Queries with Failure               │          0 │
└────────────────────────────────────┴────────────┘
```
Previously, the `first_value` and `last_value` aggregate functions only
supported GroupsAccumulator for primitive types. For string or binary
types (Utf8, LargeUtf8, Binary, etc.), they fell back to the slower
row-based Accumulator path.

This change implements specialized state management for byte-based
types, so that strings and binary data can also use the faster
`GroupsAccumulator` path for grouped aggregation, especially when used
with `ORDER BY`.

## What changes are included in this PR?
- New `ValueState` trait: abstracts the state management for
`first_value` and `last_value` to support different storage backends.
- `PrimitiveValueState`: re-implements the existing primitive handling
using the new trait.
- `BytesValueState`: a new state implementation for Utf8,
LargeUtf8, Utf8View, Binary, LargeBinary, and BinaryView. It
optimizes memory by reusing `Vec<u8>` buffers for group updates.
- Refactored `FirstLastGroupsAccumulator`: migrated the accumulator to
use the generic `ValueState` trait, allowing it to handle both primitive
and byte types uniformly.
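
The buffer-reuse idea behind `BytesValueState` can be sketched as follows. The trait methods and struct fields here are hypothetical, since the real `ValueState` signatures are not shown in this description. The sketch only demonstrates keeping one `Vec<u8>` per group and overwriting it in place instead of allocating a new buffer for every update.

```rust
/// Hypothetical shape of a per-group value store; the real `ValueState`
/// trait may look different.
pub trait ValueState {
    type Value: ?Sized;
    /// Make room for `groups` groups.
    fn resize(&mut self, groups: usize);
    /// Overwrite the stored value for `group` (None means NULL).
    fn set(&mut self, group: usize, value: Option<&Self::Value>);
    /// Read back the stored value for `group`.
    fn get(&self, group: usize) -> Option<&Self::Value>;
}

/// Sketch of a byte-based state: one reusable buffer per group.
#[derive(Default)]
pub struct BytesValueState {
    buffers: Vec<Vec<u8>>,
    valid: Vec<bool>,
}

impl ValueState for BytesValueState {
    type Value = [u8];

    fn resize(&mut self, groups: usize) {
        self.buffers.resize_with(groups, Vec::new);
        self.valid.resize(groups, false);
    }

    fn set(&mut self, group: usize, value: Option<&[u8]>) {
        let buf = &mut self.buffers[group];
        // `clear` keeps the allocation, so later updates can reuse it.
        buf.clear();
        match value {
            Some(bytes) => {
                buf.extend_from_slice(bytes);
                self.valid[group] = true;
            }
            None => self.valid[group] = false,
        }
    }

    fn get(&self, group: usize) -> Option<&[u8]> {
        if self.valid[group] {
            Some(self.buffers[group].as_slice())
        } else {
            None
        }
    }
}
```

A primitive-typed state could implement the same trait with a plain `Vec<T>` of values, which is what would let one accumulator be generic over both storage backends.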

## Are these changes tested?

Yes.

## Are there any user-facing changes?

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
pull[bot] locked and limited conversation to collaborators Apr 4, 2026
pull[bot] added the ⤵️ pull label Apr 4, 2026
pull[bot] merged commit 587f4c0 into buraksenn:main Apr 4, 2026