adamklie commented Jan 7, 2025

Updated the first tutorial to introduce Zarr and Xarray.

adamklie requested a review from d-laub January 7, 2025 23:26
adamklie added the documentation (Improvements or additions to documentation) label Jan 7, 2025
adamklie marked this pull request as ready for review January 8, 2025 22:16
- v0.3.0: Improved out of core functionality, robust BED classification datasets
- v0.0.4 — Interoperability with AnnData and SnapATAC2
- v0.X.0: Improved out of core functionality, robust BED classification datasets
- v0.X.4 — Interoperability with AnnData and SnapATAC2

Collaborator comment:

v0.X.0


### Loading data from BigWig files
[BigWig files](https://genome.ucsc.edu/goldenpath/help/bigWig.html) are a common way to store track-based data and are the workhorse of modern genomic sequence-based ML. ...
Because BAM files contain read alignments, we can use different strategies for quantifying the pileup at each position. See the TODO for a deeper dive into...TODO
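
To make "different strategies for quantifying the pileup" concrete, here is a minimal pure-Python sketch contrasting two common choices: counting every base a read overlaps versus counting only each read's 5' end. It does not use SeqData or any BAM parser; the `(start, end, strand)` read tuples, the function names, and the toy coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical read representation: 0-based, half-open interval plus strand.
Read = Tuple[int, int, str]


def coverage_pileup(reads: List[Read], region_start: int, region_end: int) -> List[int]:
    """Count every base in the region that each read overlaps."""
    counts = [0] * (region_end - region_start)
    for start, end, _strand in reads:
        # Clip the read to the region before counting.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            counts[pos - region_start] += 1
    return counts


def five_prime_pileup(reads: List[Read], region_start: int, region_end: int) -> List[int]:
    """Count only the 5' end of each read (start on '+', last base on '-')."""
    counts = [0] * (region_end - region_start)
    for start, end, strand in reads:
        pos = start if strand == "+" else end - 1
        if region_start <= pos < region_end:
            counts[pos - region_start] += 1
    return counts


# Toy reads over a 10 bp region.
reads = [(2, 6, "+"), (4, 8, "-"), (5, 9, "+")]
full = coverage_pileup(reads, 0, 10)
ends = five_prime_pileup(reads, 0, 10)
print(full)
print(ends)
```

The same reads give a smooth coverage track under the first strategy and a sparse, position-specific track under the second, which is why the choice matters for downstream models.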

Collaborator comment:

some TODOs here

### Working with Zarr stores and XArray objects
The SeqData API is built to convert data from common formats to Zarr stores on disk. The Zarr store... When coupled with XArray and Dask, we can also lazily load data and work with datasets that are too large to fit in memory.
### Building a dataloader
One of the main goals of SeqData is to allow a seamless flow from files on disk to machine learning ready datasets. This can be achieved after loading data from the above functions by building a PyTorch dataloader with the `get_torch_dataloader` function:

Collaborator comment:

Suggested wording: "after constructing Xarray datasets with the above functions" in place of "after loading data from the above functions".

Collaborator comment:

Looks pretty good! Some typos, e.g. "Adatped from..."
Converting Xarray to other formats looks unfinished?
Consider reworking the Zarr stores section? The intent with SeqData is to never use the default `Dataset.to_zarr()` method, because it concatenates the length axis of string/char arrays by default, and there may be other issues I don't remember off the top of my head.
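
For readers of the tutorial, the layout difference behind the string/char concern can be shown in plain Python. This is only a sketch of the two layouts described in the comment above: it does not call xarray or Zarr, and the variable names and toy sequences are assumptions.

```python
# Two toy sequences of equal length.
seqs = [b"ACGT", b"TGCA"]

# Char layout: one single-byte element per base, shape (n_seq, length).
char_array = [[bytes([b]) for b in seq] for seq in seqs]
char_shape = (len(char_array), len(char_array[0]))

# Joined layout: the length axis is concatenated into one string per
# sequence, so the array becomes shape (n_seq,).
joined = [b"".join(row) for row in char_array]
joined_shape = (len(joined),)

# With the char layout, a position can be indexed directly along the length axis.
third_base_per_seq = [row[2] for row in char_array]

print(char_shape, joined_shape)
print(third_base_per_seq)
```

Once the length axis is joined away, position-wise slicing has to go through string indexing instead of array indexing, which is the kind of change a sequence tutorial would want to avoid introducing silently.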
