@mwiewior commented Dec 6, 2025

Summary

  • Adds a coordinate_system_zero_based: bool parameter to all TableProvider constructors (VCF, GFF, BAM, BED, CRAM)
  • When true (default), positions are converted from noodles' internal 1-based system to 0-based coordinates; when false, the 1-based noodles positions are returned as-is
  • Stores coordinate system preference in Arrow schema metadata using key bio.coordinate_system_zero_based
  • Updates all binaries, examples, and tests to use the new parameter
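
As a rough illustration of the metadata point above, here is a minimal sketch that uses a plain `HashMap<String, String>` as a stand-in for the Arrow schema metadata map. The key string and the constant name come from this PR; the helper functions `with_coordinate_metadata` and `read_coordinate_metadata`, and the choice to store the flag as the strings "true"/"false", are assumptions made for the example and may not match the actual implementation.

```rust
use std::collections::HashMap;

/// Metadata key named in this PR (the constant lives in bio-format-core).
pub const COORDINATE_SYSTEM_METADATA_KEY: &str = "bio.coordinate_system_zero_based";

/// Hypothetical helper: record the coordinate system flag in a
/// string-to-string metadata map.
/// Assumption: the flag is encoded as "true" or "false".
pub fn with_coordinate_metadata(
    mut metadata: HashMap<String, String>,
    zero_based: bool,
) -> HashMap<String, String> {
    metadata.insert(
        COORDINATE_SYSTEM_METADATA_KEY.to_string(),
        zero_based.to_string(),
    );
    metadata
}

/// Hypothetical helper: read the flag back.
/// Returns None when the key is missing or the value is not a valid bool.
pub fn read_coordinate_metadata(metadata: &HashMap<String, String>) -> Option<bool> {
    metadata.get(COORDINATE_SYSTEM_METADATA_KEY)?.parse().ok()
}
```

A downstream consumer could call the reader on the schema's metadata to decide whether interval arithmetic should assume 0-based or 1-based starts.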

Details

This change supports the polars-bio initiative (issue #259) to switch the default coordinate system from 1-based to 0-based for consistency with common bioinformatics tools and conventions.

Implementation

  • Added COORDINATE_SYSTEM_METADATA_KEY constant in bio-format-core
  • Each format's physical_exec.rs now converts positions when reading records:
    • start = noodles_position.get() - 1 when coordinate_system_zero_based=true
    • End positions are similarly adjusted
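
The start conversion described above could be sketched as follows. This is only an illustration: `convert_start` is a hypothetical helper, a plain `usize` stands in for the value returned by `noodles_position.get()`, and the use of `checked_sub` is a defensive choice for the sketch. End-position handling is not shown because the description does not spell it out.

```rust
/// Hypothetical helper: map a 1-based start position (the value a noodles
/// position yields via `.get()`) into the requested coordinate system.
///
/// Returns None if a 0 is passed while converting to 0-based, which should
/// not happen for valid 1-based input but would otherwise underflow.
pub fn convert_start(one_based: usize, coordinate_system_zero_based: bool) -> Option<usize> {
    if coordinate_system_zero_based {
        // 1-based -> 0-based: subtract 1.
        one_based.checked_sub(1)
    } else {
        // Keep the 1-based position as-is.
        Some(one_based)
    }
}
```

For example, a record starting at position 100 in the file would be reported as 99 with `coordinate_system_zero_based = true` and as 100 with `false`.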

Files Changed

  • bio-format-core/src/lib.rs - Central constant definition
  • bio-format-vcf/src/{table_provider,physical_exec}.rs
  • bio-format-gff/src/{table_provider,physical_exec}.rs
  • bio-format-bam/src/{table_provider,physical_exec}.rs
  • bio-format-bed/src/{table_provider,physical_exec}.rs
  • bio-format-cram/src/{table_provider,physical_exec}.rs
  • All related binaries, examples, and tests

Test plan

  • Updated test assertions to expect 0-based coordinates
  • All format-specific tests pass with coordinate conversion
  • Pre-commit hooks (fmt, cargo check) pass

🤖 Generated with Claude Code

@mwiewior force-pushed the feature/coordinate-system-parameter branch from 9beb883 to a16521c on December 6, 2025 14:05
github-actions bot commented Dec 6, 2025

📊 Benchmark Results

Benchmarks have been completed and stored for this PR.

View Results: https://biodatageeks.org/datafusion-bio-formats/benchmark-comparison/

  • Target: feature/coordinate-system-parameter
  • Baseline: v0.1.1
  • Platforms: Linux, macOS
  • Mode: fast

Raw data: https://biodatageeks.org/datafusion-bio-formats/benchmark-data/

@mwiewior force-pushed the feature/coordinate-system-parameter branch from a16521c to 67f1b17 on December 6, 2025 18:16

This commit implements the coordinate system parameter for all bio format
table providers (VCF, GFF, BAM, BED, CRAM) to support both 0-based and
1-based coordinate output.

Changes:
- Add `coordinate_system_zero_based: bool` parameter to all TableProvider
  constructors
- Store coordinate system preference in Arrow schema metadata with key
  `bio.coordinate_system_zero_based`
- Add `COORDINATE_SYSTEM_METADATA_KEY` constant in bio-format-core
- Update position conversion in physical_exec.rs for each format:
  - When true (default): subtract 1 from noodles 1-based positions
  - When false: use noodles positions as-is (1-based)
- Update all binaries, examples, and tests to pass the new parameter
- Fix test assertions to expect 0-based coordinates (default)

Breaking Change: All TableProvider::new() constructors now require
an additional bool parameter as the last argument.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@mwiewior force-pushed the feature/coordinate-system-parameter branch from 67f1b17 to e32541d on December 6, 2025 18:23

- Add coordinate_system_zero_based parameter to BgzfGffTableProvider::try_new()
- Pass parameter through to BgzfGffExec for coordinate conversion
- Apply coordinate conversion at parse time (subtract 1 from start when zero_based=true)
- Update binaries and tests to use new API

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>