Is your feature request related to a problem or challenge?
DataFusion's ClickBench benchmark queries currently use `length(...)` for string-length aggregations:

- `benchmarks/queries/clickbench/queries/q27.sql`
- `benchmarks/queries/clickbench/queries/q28.sql`

In DataFusion, `length` is an alias of `character_length`, which returns the number of characters in a string. DataFusion's `octet_length` returns the length of a string in bytes.

This differs from the benchmark semantics used by ClickHouse and DuckDB in ClickBench:

- ClickHouse uses `length(URL)` / `length(Referer)` in Q27 and Q28: `ClickHouse/ClickBench/clickhouse/queries.sql`. In ClickHouse, `length` is byte-oriented; the ClickHouse string function docs distinguish this from `lengthUTF8`, which returns Unicode code points rather than bytes.
- DuckDB uses `STRLEN(URL)` / `STRLEN(Referer)` in Q27 and Q28: `ClickHouse/ClickBench/duckdb/queries.sql`, `duckdb-parquet/queries.sql`, and `duckdb-parquet-partitioned/queries.sql`. DuckDB documents `strlen(string)` as returning the number of bytes in a string, while `length(string)` returns the number of characters.

Because byte length can be computed from string offsets, while character length must inspect UTF-8 contents, using `octet_length` in these benchmark queries may avoid unnecessary work and improve ClickBench performance without changing the intended benchmark semantics.

This was pointed out by @gatesn
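
To illustrate the offsets-versus-contents point, here is a minimal Rust sketch. `StringColumn` is a hypothetical, simplified stand-in for an offset-based string array (all values concatenated, plus an offsets vector); it is not DataFusion's or Arrow's actual type, and the method names merely mirror the SQL function names.

```rust
/// Hypothetical, simplified model of an offset-based string column:
/// all values are concatenated in `data`, and `offsets[i]..offsets[i + 1]`
/// delimits value `i`.
struct StringColumn {
    data: String,
    offsets: Vec<usize>,
}

impl StringColumn {
    fn from_values(values: &[&str]) -> Self {
        let mut data = String::new();
        let mut offsets = vec![0];
        for v in values {
            data.push_str(v);
            offsets.push(data.len());
        }
        StringColumn { data, offsets }
    }

    /// Byte length: one subtraction on the offsets; `data` is never read.
    fn octet_length(&self, i: usize) -> usize {
        self.offsets[i + 1] - self.offsets[i]
    }

    /// Character length: has to walk the UTF-8 bytes of the value.
    fn character_length(&self, i: usize) -> usize {
        self.data[self.offsets[i]..self.offsets[i + 1]].chars().count()
    }
}

fn main() {
    let col = StringColumn::from_values(&["http://example.com/", "http://example.com/é"]);
    for i in 0..2 {
        println!(
            "row {i}: bytes = {}, chars = {}",
            col.octet_length(i),
            col.character_length(i)
        );
    }
}
```

In this model the byte length costs the same regardless of how long the string is, while the character count scales with the number of bytes in each value.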
Describe the solution you'd like
Update DataFusion's ClickBench SQL benchmark queries to use `octet_length(...)` where the upstream ClickBench / DuckDB benchmark is measuring byte length.

Note that this would need to be changed in both the DataFusion repo and the ClickBench repository.

Likely changes:

- `AVG(length("URL"))` -> `AVG(octet_length("URL"))` in Q27
- `AVG(length("Referer"))` -> `AVG(octet_length("Referer"))` in Q28

After changing the benchmark SQL, compare ClickBench timings before and after, especially Q27 and Q28.
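
When comparing before and after, the averaged value itself may also shift wherever the data contains non-ASCII strings. The sketch below, using made-up sample strings rather than ClickBench data, shows the two averages agreeing on ASCII input and diverging otherwise. `avg_len`, `char_len`, and `byte_len` are hypothetical helpers that stand in for `AVG(length(...))` and `AVG(octet_length(...))`.

```rust
/// Average of a per-string length function over a non-empty slice,
/// loosely mirroring AVG(f(column)).
fn avg_len(values: &[&str], len_fn: fn(&str) -> usize) -> f64 {
    let total: usize = values.iter().map(|&v| len_fn(v)).sum();
    total as f64 / values.len() as f64
}

/// Stand-in for character_length / length: counts Unicode scalar values.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

/// Stand-in for octet_length: counts bytes.
fn byte_len(s: &str) -> usize {
    s.len()
}

fn main() {
    // Made-up sample values, not taken from the ClickBench dataset.
    let ascii = ["/a", "/abcd"];
    let mixed = ["/a", "/日本"];

    println!(
        "ascii: chars = {}, bytes = {}",
        avg_len(&ascii, char_len),
        avg_len(&ascii, byte_len)
    );
    println!(
        "mixed: chars = {}, bytes = {}",
        avg_len(&mixed, char_len),
        avg_len(&mixed, byte_len)
    );
}
```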
Describe alternatives you've considered
Leave the queries as-is. That preserves current DataFusion SQL semantics, but likely benchmarks character-counting work that ClickHouse and DuckDB are not doing for Q27 and Q28 in ClickBench.
Add a ClickHouse compatibility alias where `length` means byte length. I do not think this is appropriate globally because DataFusion currently documents `length` as an alias of `character_length`.

Additional context
Related ClickBench tracking: