# MLX-DF

GPU-accelerated DataFrame library for Apple Silicon, built on MLX.

MLX-DF brings cuDF-style GPU DataFrame operations to Mac, exploiting Apple's unified memory for zero-copy CPU/GPU data sharing. The API mirrors Pandas for easy migration.

> [!WARNING]
> MLX-DF currently supports only Apple Silicon devices (M-series chips).

## Installation

```bash
pip install mlxdf
```

Using uv:

```bash
uv add mlxdf
```

With PyArrow/Parquet support (quoted so the brackets survive zsh, the macOS default shell):

```bash
pip install "mlxdf[arrow]"
```

Using uv:

```bash
uv add "mlxdf[arrow]"
```

From source:

```bash
uv sync
uv build
```

## Quick Start

```python
from mlxdf import MlxDataFrame, merge, read_parquet

# Create a DataFrame (string columns auto-detected as CategoricalSeries)
df = MlxDataFrame({
    "product_id": [1.0, 2.0, 1.0, 3.0, 2.0],
    "quantity":   [5.0, 3.0, 2.0, 7.0, 1.0],
    "category":   ["A", "B", "A", "C", "B"],
})

# Filter rows with a boolean mask
high_qty = df[df["quantity"] > 2.0]

# Add a computed column
df["double_qty"] = df["quantity"] * 2

# GroupBy aggregation
result = df.groupby("category")["quantity"].sum()
result.show()

# Join two DataFrames
prices = MlxDataFrame({
    "product_id": [1.0, 2.0, 3.0],
    "price":      [10.0, 25.0, 15.0],
})
joined = df.merge(prices, on="product_id", how="inner")

# Parquet I/O (requires mlxdf[arrow])
df.to_parquet("output.parquet")
df2 = read_parquet("output.parquet", columns=["product_id", "quantity"])
```
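The `category` column above is stored dictionary-encoded rather than as raw strings. As a rough illustration of that technique in plain NumPy (a sketch of the idea, not MLX-DF's internals):

```python
import numpy as np

# Dictionary-encode a string column: the unique values become a small
# lookup table, and the column itself becomes an array of integer codes.
values = np.array(["A", "B", "A", "C", "B"])
categories, codes = np.unique(values, return_inverse=True)
# categories -> ['A' 'B' 'C'], codes -> [0 1 0 2 1]

# Filtering `category == "B"` becomes a single integer comparison over
# the codes instead of a per-row string comparison.
target = np.searchsorted(categories, "B")
mask = codes == target
print(values[mask])  # ['B' 'B']
```

Integer comparisons like this vectorize trivially on a GPU, which is why dictionary-encoded filtering can beat string-based filtering by a wide margin.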

## Features

- **MlxSeries** — column with a boolean null mask, vectorized arithmetic, comparisons, and aggregations
- **CategoricalSeries** — dictionary-encoded string column (55× faster filtering vs. Pandas)
- **MlxDataFrame** — dict-like table with column access, boolean filtering, and head/tail/slicing
- **GroupBy** — bincount/sort-based groupby with sum/mean/count/max/min aggregations
- **Join** — hash-index join supporting inner/left/right/outer (4× faster vs. Pandas at 200M rows)
- **Pandas interop** — `to_pandas()` / `from_pandas()` with automatic type conversion
- **PyArrow & Parquet** — read/write Parquet with column pruning and predicate pushdown
- **JIT compilation** — `compile_fn` for fused GPU kernel execution
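A bincount-based groupby reduces a grouped aggregation to a single weighted-histogram pass over integer group codes. A minimal NumPy sketch of the idea (illustrative only, not MLX-DF's code):

```python
import numpy as np

# Group-by sum via bincount: encode the key column as integer codes,
# then accumulate the value column into one bin per group.
keys = np.array(["A", "B", "A", "C", "B"])
quantities = np.array([5.0, 3.0, 2.0, 7.0, 1.0])

groups, codes = np.unique(keys, return_inverse=True)
sums = np.bincount(codes, weights=quantities)

for g, s in zip(groups, sums):
    print(g, s)  # A 7.0 / B 4.0 / C 7.0
```

Because the accumulation is one data-parallel pass with no per-group Python loop, the same pattern maps well onto GPU kernels.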

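A hash-index join works in two phases: build a key-to-row index on one table, then probe it with the other table's keys to produce gather indices. A simplified sketch of the build/probe pattern (assuming unique keys on the right table; MLX-DF's actual implementation may differ):

```python
import numpy as np

# Inner join on product_id, mirroring the Quick Start example.
left_keys   = np.array([1, 2, 1, 3, 2])
right_keys  = np.array([1, 2, 3])
right_price = np.array([10.0, 25.0, 15.0])

index = {k: i for i, k in enumerate(right_keys)}  # build phase
probe = np.array([index[k] for k in left_keys])   # probe phase

# Gather the matched right-side rows into left-table order.
joined_price = right_price[probe]
print(joined_price)  # [10. 25. 10. 15. 25.]
```

Left/right/outer variants follow from the same index: unmatched probes emit nulls instead of being dropped.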
## Development

### Setup

```bash
uv sync
```

### Running Tests

```bash
# Run all unit tests (benchmarks are excluded by default)
uv run pytest

# Run a specific test file
uv run pytest tests/test_series.py

# Run a specific test case
uv run pytest tests/test_series.py::TestArithmetic::test_add_series -v

# Run with verbose output
uv run pytest -v

# Stop on the first failure
uv run pytest -x
```

### Benchmarks

Benchmarks are integrated into pytest via the `bench` marker and are deselected by default, so they don't slow down regular test runs.

```bash
# Run all benchmarks
uv run pytest -m bench

# Run a specific benchmark
uv run pytest -m bench -k parquet
uv run pytest -m bench -k tpch
uv run pytest -m bench -k categorical
uv run pytest -m bench -k compile

# Run tests and benchmarks together
uv run pytest -m ""

# Benchmark scripts can also be run directly
uv run python benchmarks/bench_vs_pandas.py
```

Available benchmarks: `bench_vs_pandas`, `bench_categorical`, `bench_parquet`, `bench_compile_df`, `bench_tpch_q1/q3/q4/q6/q18/q19`.
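Marker-based deselection like this is usually wired up in pytest's configuration. A hypothetical fragment showing one common setup (the repo's actual `pyproject.toml` may differ):

```toml
[tool.pytest.ini_options]
markers = ["bench: performance benchmarks"]
addopts = "-m 'not bench'"  # deselect benchmarks by default
```

Because `addopts` is prepended to the command line, an explicit `-m bench` or `-m ""` given at invocation time overrides the default filter, which is what makes `uv run pytest -m ""` run tests and benchmarks together.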
