Use octet_length instead of length in ClickBench SQL queries #23086

Description

@alamb

Is your feature request related to a problem or challenge?

DataFusion's ClickBench benchmark queries currently use length(...) for string-length aggregations:

In DataFusion, length is an alias of character_length, which returns the number of characters in a string. DataFusion's octet_length returns the length of a string in bytes.

This differs from the benchmark semantics used by ClickHouse and DuckDB in ClickBench:

Because byte length can be computed from string offsets, while character length must inspect UTF-8 contents, using octet_length in these benchmark queries may avoid unnecessary work and improve ClickBench performance without changing the intended benchmark semantics.
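To illustrate the cost difference, here is a minimal sketch assuming an Arrow-style layout in which all values share one UTF-8 buffer delimited by offsets. `StringColumn` and its methods are hypothetical names made up for this example, not DataFusion's or Arrow's actual types or kernels.

```rust
/// Simplified stand-in for an offset-based string column (hypothetical type,
/// for illustration only): all values are concatenated into one UTF-8 buffer,
/// and offsets[i]..offsets[i + 1] marks the bytes of value i.
struct StringColumn {
    data: String,
    offsets: Vec<usize>,
}

impl StringColumn {
    /// Build the column by appending each value and recording its end offset.
    fn from_values(values: &[&str]) -> Self {
        let mut data = String::new();
        let mut offsets = vec![0];
        for v in values {
            data.push_str(v);
            offsets.push(data.len());
        }
        StringColumn { data, offsets }
    }

    /// Byte length: subtracts adjacent offsets and never reads the string data.
    fn octet_lengths(&self) -> Vec<usize> {
        self.offsets.windows(2).map(|w| w[1] - w[0]).collect()
    }

    /// Character length: has to walk the UTF-8 bytes of every value.
    fn character_lengths(&self) -> Vec<usize> {
        self.offsets
            .windows(2)
            .map(|w| self.data[w[0]..w[1]].chars().count())
            .collect()
    }
}
```

In this sketch, `octet_lengths` does constant work per row regardless of string size, while `character_lengths` does work proportional to the total number of bytes. The two results only differ for values that contain multi-byte UTF-8 characters.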

This was pointed out by @gatesn

Describe the solution you'd like

Update DataFusion's ClickBench SQL benchmark queries to use octet_length(...) where the upstream ClickBench / DuckDB benchmark is measuring byte length.

Note that this change would need to be made in both the DataFusion repository and the ClickBench repository.

Likely changes:

  • AVG(length("URL")) -> AVG(octet_length("URL")) in Q27
  • AVG(length("Referer")) -> AVG(octet_length("Referer")) in Q28

After changing the benchmark SQL, compare ClickBench timings before and after, especially Q27 and Q28.

Describe alternatives you've considered

Leave the queries as-is. That preserves current DataFusion SQL semantics, but likely benchmarks character-counting work that ClickHouse and DuckDB are not doing for Q27 and Q28 in ClickBench.

Add a ClickHouse compatibility alias where length means byte length. I do not think this is appropriate globally because DataFusion currently documents length as an alias of character_length.

Additional context

Related ClickBench tracking:

Labels: enhancement (New feature or request), performance (Make DataFusion faster)