0.9.0

Pre-release

andygrove released this 04 Jul 17:01 · 1007 commits to main since this release · 1c462bc

DataFusion Comet 0.9.0 Changelog

This release consists of 139 commits from 24 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

  • fix: typo for instr in fuzz testing #1686 (mbutrovich)
  • fix: Bucketed scan fallback for native_datafusion Parquet scan #1720 (mbutrovich)
  • fix: Skip row index Spark SQL tests for native_datafusion Parquet scan #1724 (mbutrovich)
  • fix: Check acquired memory when CometMemoryPool grows #1732 (wForget)
  • fix: Fix data race in memory profiling #1727 (andygrove)
  • fix: Enable some DPP Spark SQL tests #1734 (andygrove)
  • fix: support literal null list and map #1742 (kazuyukitanimura)
  • fix: get_struct field is incorrect when struct in array #1687 (comphead)
  • fix: cast map types correctly in schema adapter #1771 (parthchandra)
  • fix: correct schema type checking in native_iceberg_compat #1755 (parthchandra)
  • fix: default values for native_datafusion scan #1756 (mbutrovich)
  • fix: [native_scans] Support CASE_SENSITIVE when reading Parquet #1782 (andygrove)
  • fix: cargo install tpchgen-cli in benchmark doc #1797 (zhuqi-lucas)
  • fix: support map_keys #1788 (comphead)
  • fix: fall back on nested types for default values #1799 (mbutrovich)
  • fix: Re-enable Spark 4 tests on Linux #1806 (andygrove)
  • fix: fallback to Spark scan if encryption is enabled (native_datafusion/native_iceberg_compat) #1785 (parthchandra)
  • fix: native_iceberg_compat: move checking parquet types above fetching batch #1809 (mbutrovich)
  • fix: translate missing or corrupt file exceptions, fall back if asked to ignore #1765 (mbutrovich)
  • fix: Fix Spark SQL AQE exchange reuse test failures #1811 (coderfender)
  • fix: Enable more Spark SQL tests #1834 (andygrove)
  • fix: support map_values #1835 (comphead)
  • fix: Handle case where num_cols == 0 in native execution #1840 (andygrove)
  • fix: Fix shuffle writing rows containing null struct fields #1845 (Kontinuation)
  • fix: Fall back to Spark for RANGE BETWEEN window expressions #1848 (andygrove)
  • fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack #1865 (andygrove)
  • fix: support read Struct by user schema #1860 (comphead)
  • fix: map parquet field_id correctly (native_iceberg_compat) #1815 (parthchandra)
  • fix: cast_struct_to_struct aligns to Spark behavior #1879 (mbutrovich)
  • fix: correctly handle schemas with nested array of struct (native_iceberg_compat) #1883 (parthchandra)
  • fix: set RangePartitioning for native shuffle default to false #1907 (mbutrovich)
  • fix: conflict between #1905 and #1892. #1919 (mbutrovich)
  • fix: Add overflow check to evaluate of sum decimal accumulator #1922 (leung-ming)
  • fix: Fix overflow handling when casting float to decimal #1914 (leung-ming)
  • fix: Ignore a test case fails on Miri #1951 (leung-ming)

Performance related:

  • perf: Add memory profiling #1702 (andygrove)
  • perf: Add performance tracing capability #1706 (andygrove)
  • perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config #1936 (andygrove)

Implemented enhancements:

  • feat: add jemalloc as optional custom allocator #1679 (mbutrovich)
  • feat: support array_repeat #1680 (comphead)
  • feat: More warning info for users #1667 (hsiang-c)
  • feat: decode() expression when using 'utf-8' encoding #1697 (mbutrovich)
  • feat: regexp_replace() expression with no starting offset #1700 (mbutrovich)
  • feat: Improve performance tracing feature #1730 (andygrove)
  • feat: Set/cancel with job tag and make max broadcast table size configurable #1693 (wForget)
  • feat: Add support for expm1 expression from datafusion-spark crate #1711 (andygrove)
  • feat: Add config option for showing all Comet plan transformations #1780 (andygrove)
  • feat: Support Type widening: byte → short/int/long, short → int/long #1770 (huaxingao)
  • feat: Translate Hadoop S3A configurations to object_store configurations #1817 (Kontinuation)
  • feat: Upgrade to official DataFusion 48.0.0 release #1877 (andygrove)
  • feat: Add experimental auto mode for COMET_PARQUET_SCAN_IMPL #1747 (andygrove)
  • feat: support RangePartitioning with native shuffle #1862 (mbutrovich)
  • feat: Add support for signum expression #1889 (andygrove)
  • feat: Add support to lookup map by key #1898 (comphead)
  • feat: support array_max #1892 (drexler-sky)
  • feat: pass ignore_nulls flag to first and last #1866 (rluvaton)
  • feat: Implement ToPrettyString #1921 (andygrove)
  • feat: Support hadoop s3a config in native_iceberg_compat #1925 (parthchandra)
  • feat: rand expression support #1199 (akupchinskiy)
  • feat: supports array_distinct #1923 (drexler-sky)
  • feat: auto scan mode should check for supported file location #1930 (andygrove)
  • feat: Encapsulate Parquet objects #1920 (huaxingao)
  • feat: Change default value of COMET_NATIVE_SCAN_IMPL to auto #1933 (andygrove)
  • feat: Supports array_union #1945 (drexler-sky)

Documentation updates:

  • docs: Add changelog for 0.8.0 #1675 (andygrove)
  • docs: Add instructions on running TPC-H on macOS #1647 (andygrove)
  • docs: Add documentation for accelerating Iceberg Parquet scans with Comet #1683 (andygrove)
  • docs: Add note on setting core.abbrev when generating diffs #1735 (andygrove)
  • docs: Remove outdated param in macos bench guide #1748 (ding-young)
  • docs: Add instructions for running individual Spark SQL tests from sbt #1752 (coderfender)
  • docs: Add documentation for native_datafusion Parquet scanner's S3 support #1832 (Kontinuation)
  • docs: Add docs stating that Comet does not support reading decimals encoded in Parquet BINARY format #1895 (andygrove)

Other:

  • chore: Start 0.9.0 development #1676 (andygrove)
  • chore: Update viable crates #1677 (EmilyMatt)
  • chore: match Maven plugin versions with Spark 3.5 #1668 (hsiang-c)
  • chore: Remove fallback reason "because the children were not native" #1672 (andygrove)
  • chore: Rename scalarExprToProto to scalarFunctionExprToProto #1688 (comphead)
  • chore: fix build errors #1690 (comphead)
  • chore: Make Aggregate transformation more compact #1670 (EmilyMatt)
  • chore: update dev/release/rat_exclude_files.txt #1689 (hsiang-c)
  • chore: Move Comet rules into their own files #1695 (andygrove)
  • chore: Remove fast encoding option #1703 (andygrove)
  • chore: fix CI job name #1712 (hsiang-c)
  • minor: Warn if memory pool is dropped with bytes still reserved #1721 (andygrove)
  • chore: Correct memory acquired size in unified memory pool #1738 (zuston)
  • chore: allow large errors for Clippy #1743 (comphead)
  • chore: Refactor DataTypeSupport #1741 (andygrove)
  • chore: More refactoring of type checking logic #1744 (andygrove)
  • chore: Enable more complex type tests #1753 (andygrove)
  • chore: Add scanImpl attribute to CometScanExec #1746 (andygrove)
  • chore: Prepare for DataFusion 48.0.0 #1710 (andygrove)
  • Docs: Setup Comet on IntelliJ #1760 (coderfender)
  • chore: Reenable nested types for CometFuzzTestSuite with int96 #1761 (mbutrovich)
  • chore: Enable partial Spark SQL tests for native_iceberg_compat scan #1762 (andygrove)
  • chore: [native_iceberg_compat / native_datafusion] Ignore Spark SQL Parquet encryption tests #1763 (andygrove)
  • build: Ignore array_repeat test to fix CI issues #1774 (andygrove)
  • chore: Upload crash logs if Java tests fail #1779 (andygrove)
  • chore: Drop support for Java 8 #1777 (andygrove)
  • chore: Bump arrow to 18.3.0 #1773 (Kontinuation)
  • build: Stop running Comet's Spark 4 tests on Linux for PR builds #1802 (andygrove)
  • Chore: Moved strings expressions to separate file #1792 (kazantsev-maksim)
  • chore: Speed up "PR Builds" CI workflows #1807 (andygrove)
  • chore: [native scans] Ignore Spark SQL test for string predicate pushdown #1768 (andygrove)
  • chore: Bump DataFusion to git rev 2c2f225 #1814 (andygrove)
  • Feat: support bit_count function #1602 (kazantsev-maksim)
  • Chore: implement bit_not as ScalarUDFImpl #1825 (kazantsev-maksim)
  • build: Specify -Dsbt.log.noformat=true in sbt CI runs #1822 (andygrove)
  • chore: Use unique artifact names in Java test run #1818 (andygrove)
  • minor: Refactor PhysicalPlanner::default() to avoid duplicate code #1821 (andygrove)
  • Chore: implement bit_count as ScalarUDFImpl #1826 (kazantsev-maksim)
  • chore: IgnoreCometNativeScan on a few more Spark SQL tests #1837 (mbutrovich)
  • chore: Enable tests in RemoveRedundantProjectsSuite.scala related to issue #242 #1838 (rishvin)
  • minor: Replace many instances of checkSparkAnswer with checkSparkAnswerAndOperator #1851 (andygrove)
  • chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate #1847 (andygrove)
  • chore: Ignore Spark SQL WholeStageCodegenSuite tests #1859 (andygrove)
  • chore: Upgrade to DataFusion 48.0.0-rc3 #1863 (andygrove)
  • upgraded spark 3.5.5 to 3.5.6 #1861 (YanivKunda)
  • build: Disable some rounding tests when miri is enabled #1873 (andygrove)
  • chore: Enable Spark SQL tests for native_iceberg_compat #1876 (andygrove)
  • chore: Enable more Spark SQL tests #1869 (andygrove)
  • chore: refactor planner read schema tests #1886 (comphead)
  • chore: Implement date_trunc as ScalarUDFImpl #1880 (leung-ming)
  • Chore: implement datetime funcs as ScalarUDFImpl #1874 (trompa)
  • minor: Improve testing of math scalar functions #1896 (andygrove)
  • minor: Avoid rewriting join to unsupported join #1888 (andygrove)
  • chore: Enable native_iceberg_compat Spark SQL tests (for real, this time) #1910 (andygrove)
  • chore: rename makeParquetFileAllTypes to makeParquetFileAllPrimitiveTypes #1905 (parthchandra)
  • chore: add a test case to read from an arbitrarily complex type schema #1911 (parthchandra)
  • test: Trigger Spark 3.4.3 SQL tests for iceberg-compat #1912 (kazuyukitanimura)
  • build: Fix conflict between #1910 and #1912 #1924 (andygrove)
  • minor: fix kube/Dockerfile build failed #1918 (zhangxffff)
  • chore: Improve reporting of fallback reasons for CollectLimit #1694 (andygrove)
  • chore: move udf registration to better place #1899 (rluvaton)
  • chore: Comet + Iceberg (1.8.1) CI #1715 (hsiang-c)
  • chore: Introduce exprHandlers map in QueryPlanSerde #1903 (andygrove)
  • chore: Enable Spark SQL tests for auto scan mode #1885 (andygrove)
  • Feat: support bit_get function #1713 (kazantsev-maksim)
  • chore: Clippy fixes for Rust 1.88 #1939 (andygrove)
  • Minor: Add unit tests for ceil/floor functions #1728 (tlm365)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    62	Andy Grove
    16	Matt Butrovich
    10	Oleks V
     8	Parth Chandra
     5	Kazantsev Maksim
     5	hsiang-c
     4	Kristin Cowalcijk
     4	Leung Ming
     3	B Vadlamani
     3	drexler-sky
     2	Emily Matheys
     2	Huaxin Gao
     2	KAZUYUKI TANIMURA
     2	Raz Luvaton
     2	Zhen Wang
     1	Artem Kupchinskiy
     1	Junfan Zhang
     1	Qi Zhu
     1	Rishab Joshi
     1	Tai Le Manh
     1	Yaniv Kunda
     1	Zhang Xiaofeng
     1	ding-young
     1	trompa

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.