Prepare Annotations

A dedicated toolkit for downloading, processing, and preparing genomic annotation datasets.

Features

- Prefect-based Pipelines: robust workflows for data preparation.
- Support for multiple sources:
  - Ensembl: Human genetic variations.
  - ClinVar: Clinical variant data.
  - dbSNP: Single Nucleotide Polymorphism database.
  - gnomAD: Genome Aggregation Database.
- VCF to Parquet: Efficient conversion of large VCF files to columnar format.
- Variant Splitting: Splitting variants by type (SNV, Indel, etc.) for optimized annotation.
- Hugging Face Hub Integration: Direct upload of processed datasets with automatic dataset card generation.
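
To make the columnar conversion and variant splitting features more concrete, here is a minimal, standard-library-only sketch. It is not the project's implementation: the function names (`classify_variant`, `vcf_to_columns`), the type labels, and the length-based classification rule are assumptions made for illustration. It parses VCF data lines into one column-oriented table per variant type. That is the shape a Parquet writer (for example pyarrow or polars, also an assumption here) would then serialize.

```python
from collections import defaultdict


def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by allele length (illustrative rule only)."""
    # Missing, spanning-deletion, symbolic, and breakend alleles are not sequence changes.
    if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "snv"
    if len(ref) == len(alt):
        return "mnv"
    return "indel"


def vcf_to_columns(lines):
    """Parse VCF data lines into one column-oriented table per variant type."""
    tables = defaultdict(lambda: {"chrom": [], "pos": [], "ref": [], "alt": []})
    for line in lines:
        # Skip blank lines, meta lines (##...) and the header line (#CHROM...).
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        # Multi-allelic records are expanded to one row per ALT allele.
        for alt in alts.split(","):
            table = tables[classify_variant(ref, alt)]
            table["chrom"].append(chrom)
            table["pos"].append(int(pos))
            table["ref"].append(ref)
            table["alt"].append(alt)
    return dict(tables)


# Hypothetical sample records, made up for this example.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t.\t.\t.",
    "1\t200\trs2\tAT\tA\t.\t.\t.",
    "2\t300\trs3\tC\tT,CA\t.\t.\t.",
]
tables = vcf_to_columns(sample)
print(sorted(tables))
```

Keeping each variant type in its own table means a downstream annotation step can load only the rows it needs, which is the usual motivation for splitting before writing columnar files.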

Installation

This project uses uv for dependency management.

git clone https://github.com/dna-seq/prepare-annotations.git
cd prepare-annotations
uv sync

Usage

Command Line Interface

The main entry point is the prepare-annotations command.

# Show version
uv run prepare-annotations version

# Download and process Ensembl variations
uv run prepare-annotations ensembl --split --upload

# Download and process ClinVar data
uv run prepare-annotations clinvar --split --upload

Options

--dest-dir: Destination directory for downloads.
--split: Split downloaded files by variant type.
--upload: Upload results to Hugging Face Hub.
--repo-id: Custom Hugging Face repository ID.
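
When the CLI is driven from a script, the options above can be assembled programmatically. The sketch below is a hypothetical helper (`build_command` is not part of the package). It assumes that `--dest-dir` and `--repo-id` each take their value as the next argument and that options follow the subcommand, as in the examples above. The path and repository ID are placeholders.

```python
import shlex


def build_command(source, dest_dir=None, split=False, upload=False, repo_id=None):
    """Build an argument list for the prepare-annotations CLI (illustrative helper)."""
    args = ["uv", "run", "prepare-annotations", source]
    if dest_dir is not None:
        # Assumption: the directory is passed as a separate argument.
        args += ["--dest-dir", str(dest_dir)]
    if split:
        args.append("--split")
    if upload:
        args.append("--upload")
    if repo_id is not None:
        # Assumption: the repository ID is passed as a separate argument.
        args += ["--repo-id", repo_id]
    return args


# Placeholder values; replace with a real directory and repository ID.
cmd = build_command(
    "clinvar",
    dest_dir="data/clinvar",
    split=True,
    upload=True,
    repo_id="my-org/clinvar-annotations",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to execute it. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.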

Development

See AGENTS.md for development guidelines and repository layout.

Running Tests

uv run python -m pytest

License

Apache 2.0
