Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
57e39bb
docs: design spec for `rustqc atac` subcommand
xuzhougeng May 4, 2026
ff9d9c9
docs: implementation plan for `rustqc atac` subcommand
xuzhougeng May 4, 2026
bc56619
chore: gitignore .worktrees/ for isolated workspaces
xuzhougeng May 4, 2026
289a405
refactor: extract shared BAM/preseq infra out of rna module
xuzhougeng May 4, 2026
29c3af1
feat(atac): add atac subcommand with CLI args
xuzhougeng May 4, 2026
478bb3d
feat(atac): scaffold atac subcommand routing
xuzhougeng May 4, 2026
7bf87bb
feat(atac): add AtacConfig
xuzhougeng May 4, 2026
dd04537
feat(atac): resolved-config struct
xuzhougeng May 4, 2026
36f827a
feat(atac): mito chromosome auto-detection
xuzhougeng May 4, 2026
89200cd
feat(atac): paired-end startup check
xuzhougeng May 4, 2026
088276b
feat(gtf): TSS extraction helper for ATAC
xuzhougeng May 4, 2026
32819f4
chore: suppress clippy dead_code warnings for Phase 2 scaffold
xuzhougeng May 4, 2026
91f1412
fix(atac): merge YAML AtacConfig into resolve()
xuzhougeng May 4, 2026
c244750
refactor(atac): extract_tss uses parse_gtf instead of parallel parser
xuzhougeng May 4, 2026
950c9e8
chore(atac): mark PE-accept test gap as TODO
xuzhougeng May 4, 2026
464834a
feat(atac): bamQC flag/MAPQ accumulator
xuzhougeng May 4, 2026
bd9cd70
feat(atac): per-chromosome PBC fingerprint accumulator
xuzhougeng May 4, 2026
2e41af8
feat(atac): finalize BamQcReport with NRF/PBC1/PBC2
xuzhougeng May 4, 2026
f866eb5
chore(atac): suppress dead_code on bam_qc public surface until Phase 13
xuzhougeng May 4, 2026
6974eaf
fix(atac): rename test to satisfy non_snake_case lint
xuzhougeng May 4, 2026
ff4df3c
fix(atac): add M2==0 PBC2 edge-case test, fix qnames doc comment
xuzhougeng May 4, 2026
b869c01
feat(atac): fragSize histogram accumulator
xuzhougeng May 4, 2026
f7b6051
feat(atac): fragSize TSV writer
xuzhougeng May 4, 2026
8488de9
feat(atac): readsDupFreq accumulator
xuzhougeng May 4, 2026
f5cd4e7
feat(atac): library complexity via preseq
xuzhougeng May 4, 2026
71d9ef2
feat(atac): sparse per-TSS coverage buffer
xuzhougeng May 4, 2026
d38ded8
fix(atac): suppress dead_code on TssCov until Phase 13
xuzhougeng May 4, 2026
3aac079
feat(atac): NFRscore
xuzhougeng May 4, 2026
ea231b9
feat(atac): PTscore
xuzhougeng May 4, 2026
062c8d9
fix(atac): restore R-faithful smallNumber formula in NFR/PT
xuzhougeng May 4, 2026
72cba5b
feat(atac): minimal loess port for TSSE
xuzhougeng May 4, 2026
980e7e3
feat(atac): TSSEscore
xuzhougeng May 4, 2026
2e908e8
feat(atac): flank resolution
xuzhougeng May 4, 2026
4801401
feat(atac): coordinate-only Tn5 shift
xuzhougeng May 4, 2026
e5f5b84
feat(atac): fixed-interval fragment split
xuzhougeng May 4, 2026
292e017
feat(atac): CIGAR 5' trim helper for Tn5 BAM emission
xuzhougeng May 4, 2026
d6767f8
fix(atac): use iter_mut().enumerate() for sums loop in TSSE compute
xuzhougeng May 4, 2026
e801e64
feat(atac): add GL1 fixture and stub GTF for integration testing
xuzhougeng May 4, 2026
3a80644
feat(atac): single-pass driver wiring all Phase 3-12 metric accumulators
xuzhougeng May 4, 2026
f521979
feat(atac): SVG plots for fragsize, TSSE profile, and lib complexity
xuzhougeng May 4, 2026
f4c2130
feat(atac): JSON summary writer with schema_keys_match_spec test
xuzhougeng May 4, 2026
b5795e4
test(atac): GL1 smoke integration test for rustqc atac end-to-end run
xuzhougeng May 4, 2026
e69e2d7
test(atac): extract GL2-4 BAM fixtures + splited/ outputs
xuzhougeng May 4, 2026
bac62f8
test(atac): R reference script + synthetic GL_tss GTF for goldens
xuzhougeng May 4, 2026
c103576
test(atac): per-metric integration tests vs R goldens (skip when absent)
xuzhougeng May 4, 2026
6278454
test(atac): reserved Tn5 shift / split BAM round-trip tests
xuzhougeng May 4, 2026
3043f98
docs(atac): add ATAC-seq QC documentation pages
xuzhougeng May 4, 2026
c93ccbf
docs: add ATAC-seq tools table to README
xuzhougeng May 4, 2026
7971fc5
docs: changelog entry for rustqc atac
xuzhougeng May 4, 2026
c0d1656
fix(atac): align JSON summary schema to spec
xuzhougeng May 4, 2026
5b7829f
chore(atac): clean up stale #[allow(dead_code)], CLI help, dead code
xuzhougeng May 4, 2026
e03f2c7
chore(atac): suppress dead_code on reserved threads/mapq_cut fields
xuzhougeng May 4, 2026
e1c5e5d
fix(atac): align numerical metrics with ATACseqQC R reference
xuzhougeng May 5, 2026
9e0174f
test(atac): R-reference goldens for GL1-4 + script portability fixes
xuzhougeng May 5, 2026
ecd5a7e
style: cargo fmt across atac module + touched RNA files
xuzhougeng May 5, 2026
adc89d6
fix(atac): use is_multiple_of(2) to satisfy clippy 1.95
xuzhougeng May 5, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,9 @@ target
# Test output directories (generated by cargo test / manual runs)
tests/output*/

# Git worktrees
.worktrees/

# Benchmark directory (results now live in https://github.com/seqeralabs/RustQC-benchmarks)
benchmark/

Expand Down
24 changes: 24 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,29 @@
# RustQC Changelog

## [Unreleased]

### Added

- `rustqc atac` subcommand for single-pass ATAC-seq QC, replicating
ATACseqQC 1.36.0's `bamQC`, `fragSizeDist`, `TSSEscore`, `NFRscore`,
`PTscore`, and `estimateLibComplexity`. Numerical fidelity targets the
upstream R package (byte-identical / 1e-6 / 1e-3 per metric — see the
documentation site).
- Tn5 +4/−5 shift and fixed-interval (NFR / mono / di / tri) length split
helpers — opt-in via `--emit-shifted-bam` / `--emit-split-bams` (file
writing reserved for a follow-up release).
- Shared BAM/preseq infrastructure lifted out of `src/rna/` to crate root
(`src/bam_flags.rs`, `src/bam_io.rs`, `src/cpp_rng.rs`, `src/preseq.rs`).
No behavior change to existing `rustqc rna` outputs.

### TODO

- Factor-footprinting metrics (`factorFootprints`, `vPlot`, etc.) —
requires BSgenome / PWM motif DBs.
- Random-forest split branch (only the fixed-interval split lands here).
- `EmitWriters::open` actual file writing for `--emit-shifted-bam` /
`--emit-split-bams`.

## [Version 0.3.0](https://github.com/AI4S-YB/RustQC/releases/tag/v0.3.0) - 2026-04-22

First release of the AI4S-YB fork. Focus: Windows support and a pure-Rust
Expand Down
49 changes: 49 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 3 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,9 @@ plotters = { version = "0.3", features = ["fontconfig-dlopen"] }
[target.'cfg(windows)'.dependencies]
plotters = "0.3"

[dev-dependencies]
tempfile = "3"

[build-dependencies]
cc = "1"

Expand Down
26 changes: 23 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
</picture>
</h1>

<h4 align="center">Fast genomics quality control tools for sequencing data, written in Rust.</h4>
<h4 align="center">Fast genomics quality control tools for RNA-seq and ATAC-seq data, written in Rust.</h4>

<p align="center">
<a href="https://github.com/seqeralabs/RustQC/actions/workflows/ci.yml"><img src="https://github.com/seqeralabs/RustQC/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
Expand Down Expand Up @@ -36,7 +36,12 @@

It currently includes:

- `rustqc rna` is a single-command RNA-Seq QC tool that runs all QC analyses in one pass. Designed to slot into the [nf-core/rnaseq pipeline](https://nf-co.re/rnaseq/), but works anywhere:
- `rustqc rna` — single-command RNA-Seq QC, designed to slot into the [nf-core/rnaseq pipeline](https://nf-co.re/rnaseq/), but works anywhere:
- `rustqc atac` — single-command ATAC-seq QC, designed to slot into the [nf-core/atacseq pipeline](https://nf-co.re/atacseq/), but works anywhere.

### `rustqc rna`

`rustqc rna` is a single-command RNA-Seq QC tool that runs all QC analyses in one pass:

| Tool | Reimplements | Description |
| ------------------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
Expand All @@ -56,7 +61,22 @@ It currently includes:
| idxstats | [samtools](http://www.htslib.org/) `idxstats` | Per-chromosome read counts |
| stats | [samtools](http://www.htslib.org/) `stats` | Full samtools stats output including all histogram sections |

All outputs are format- and numerically identical to the upstream tools, and compatible with [MultiQC](https://multiqc.info/) for reporting.
All RNA outputs are format- and numerically identical to the upstream tools, and compatible with [MultiQC](https://multiqc.info/) for reporting.

### `rustqc atac`

`rustqc atac` is a single-command ATAC-seq QC tool that runs all QC analyses in one pass, replicating [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) 1.36.0:

| Module | Reimplements | Description |
| ------------------- | --------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| bamQC | [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) `bamQC` | Mapping rates, NRF, PBC1/2, MAPQ histogram |
| fragSizeDist | [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) `fragSizeDist` | Fragment-length distribution + plot |
| TSSEscore | [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) `TSSEscore` | TSS enrichment score (sliding window + loess smoothing) |
| NFRscore | [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) `NFRscore` | Nucleosome-free region score per TSS |
| PTscore | [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) `PTscore` | Promoter / transcript-body score per TSS |
| estimateLibComplexity | [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) `estimateLibComplexity` | Library complexity extrapolation (preseq) |
| Tn5 shift | [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) `shiftGAlignmentsList` | +4/−5 Tn5 shift |
| Split (fixed) | [ATACseqQC](https://bioconductor.org/packages/ATACseqQC/) `splitGAlignmentsByCut` (fixed-interval) | NFR/mono/di/tri length split |

## Quick start

Expand Down
8 changes: 8 additions & 0 deletions docs/astro.config.mjs
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,14 @@ export default defineConfig({
{ label: "Samtools", slug: "rna/samtools" },
],
},
{
label: "ATAC",
items: [
{ label: "Overview", slug: "atac/index" },
{ label: "CLI Reference", slug: "atac/cli" },
{ label: "Numerical Fidelity", slug: "atac/numerical-fidelity" },
],
},
{
label: "About",
items: [
Expand Down
82 changes: 82 additions & 0 deletions docs/src/content/docs/atac/cli.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
---
title: CLI Reference
description: Command-line interface reference for rustqc atac.
---

## Synopsis

```
rustqc atac [OPTIONS] --gtf <GTF> <INPUT>...
```

`<INPUT>` is one or more paired-end BAM/SAM/CRAM files. Single-end inputs are
rejected at startup.

## Full help output

```
ATAC-Seq QC — single-pass analysis of paired-end BAM/SAM/CRAM files.

Runs bamQC, fragSizeDist, TSSEscore, NFRscore, PTscore, and library complexity
in one pass. Requires a GTF annotation and paired-end input. Optionally emits
Tn5-shifted and length-split BAMs.
```

## Options

### Input / Output

| Flag | Env var | Default | Description |
|------|---------|---------|-------------|
| `-g, --gtf <GTF>` | `RUSTQC_GTF` | — | GTF gene annotation (plain or `.gz`); TSS coords source |
| `-r, --reference <FASTA>` | `RUSTQC_REFERENCE` | — | Reference FASTA (required for CRAM) |
| `-o, --outdir <OUTDIR>` | `RUSTQC_OUTDIR` | `.` | Output directory |
| `--sample-name <NAME>` | `RUSTQC_SAMPLE_NAME` | derived from filename | Override sample name |
| `--flat-output` | `RUSTQC_FLAT_OUTPUT` | false | Write outputs to a flat directory (no subdirs) |
| `-c, --config <CONFIG>` | — | — | YAML configuration file |
| `-j, --json-summary [<PATH>]` | `RUSTQC_JSON_SUMMARY` | — | JSON summary path (use `-` for stdout) |

### ATAC-specific

| Flag | Env var | Default | Description |
|------|---------|---------|-------------|
| `--mito-chrom <NAME>` | `RUSTQC_MITO_CHROM` | auto-detect `^chrM$\|^MT$\|^Mito$` | Mitochondrial chromosome name |
| `--emit-shifted-bam` | `RUSTQC_EMIT_SHIFTED_BAM` | false | Emit +4/−5 Tn5-shifted BAM |
| `--emit-split-bams` | `RUSTQC_EMIT_SPLIT_BAMS` | false | Emit NFR/mono/di/tri BAMs (fixed intervals) |
| `--tsse-flank <N>` | `RUSTQC_TSSE_FLANK` | `1000` | TSSEscore flank (bp) |

### General

| Flag | Env var | Default | Description |
|------|---------|---------|-------------|
| `-t, --threads <THREADS>` | `RUSTQC_THREADS` | `1` | Number of threads |
| `-Q, --mapq <MAPQ_CUT>` | `RUSTQC_MAPQ` | `30` | MAPQ cutoff |
| `-q, --quiet` | `RUSTQC_QUIET` | false | Suppress output except warnings/errors |
| `-v, --verbose` | — | false | Show additional detail |

## Examples

```bash
# Basic run
rustqc atac sample.markdup.bam --gtf genes.gtf --outdir results/

# With Tn5 shift + length-split BAMs (Phase 14 reserved — not yet wired to disk)
rustqc atac sample.markdup.bam --gtf genes.gtf --outdir results/ \
--emit-shifted-bam --emit-split-bams

# Custom mito chromosome name
rustqc atac sample.markdup.bam --gtf genes.gtf --mito-chrom Mito --outdir results/

# With sample-name override and JSON summary to stdout
rustqc atac sample.markdup.bam --gtf genes.gtf --outdir results/ \
--sample-name MyExperiment -j -
```

## Notes on `--emit-shifted-bam` / `--emit-split-bams`

These flags are accepted by the CLI and the Tn5 shift / fixed-interval split
logic runs in memory, but **file writing is reserved for a follow-up release**.
The internal `EmitWriters::open` function is a stub in this version — the
shifted/split BAMs are not written to disk. The flags are present so pipeline
configurations can be written now and the output will appear automatically when
file writing is wired up.
58 changes: 58 additions & 0 deletions docs/src/content/docs/atac/index.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
---
title: ATAC-Seq QC
description: Single-pass ATAC-seq quality control and Tn5 preprocessing in Rust.
---

`rustqc atac` is a single-command ATAC-seq QC tool that runs all QC analyses
in one pass. Designed to slot into the [nf-core/atacseq pipeline][nfc] but works
anywhere.

[nfc]: https://nf-co.re/atacseq/

## Modules

| Module | Reimplements | Description |
|--------|--------------|-------------|
| bamQC | [ATACseqQC][a] `bamQC` | Mapping rates, NRF, PBC1/2, MAPQ histogram |
| fragSizeDist | [ATACseqQC][a] `fragSizeDist` | Fragment-length distribution + plot |
| TSSEscore | [ATACseqQC][a] `TSSEscore` | TSS enrichment score (sliding window + loess smoothing) |
| NFRscore | [ATACseqQC][a] `NFRscore` | Nucleosome-free region score per TSS |
| PTscore | [ATACseqQC][a] `PTscore` | Promoter / transcript-body score per TSS |
| estimateLibComplexity | [ATACseqQC][a] `estimateLibComplexity` | Library complexity extrapolation (preseq) |
| Tn5 shift | [ATACseqQC][a] `shiftGAlignmentsList` | +4/−5 Tn5 shift |
| Split (fixed) | [ATACseqQC][a] `splitGAlignmentsByCut` (fixed-interval branch) | NFR/mono/di/tri split |

[a]: https://bioconductor.org/packages/ATACseqQC/

## Usage

```bash
rustqc atac sample.markdup.bam --gtf genes.gtf --outdir results/
```

ATAC-seq input must be paired-end. Single-end inputs are rejected at startup.

## Outputs

```
outdir/
bamqc/<sample>.bamqc.tsv # rates, NRF, PBC1/2
bamqc/<sample>.mapq.tsv # MAPQ histogram
fragsize/<sample>.fragsize.{tsv,svg} # fragment-length distribution
tsse/<sample>.tsse.{tsv,svg} # TSS enrichment
nfr/<sample>.nfr.tsv # NFR score per TSS
pt/<sample>.pt.tsv # PT score per TSS
lib_complexity/<sample>.libcomplexity.{tsv,svg}
<sample>.atac.summary.json
```

## Scope and TODOs

This release covers the core ATAC-seq QC metrics plus Tn5 +4/−5 shift and fixed-interval
split (NFR < 100, mono 180–247, di 315–473, tri 558–615 bp). The following ATACseqQC
features are **not yet implemented**:

- Factor footprinting (`factorFootprints`, `footprintsScanner`, `vPlot`,
`enrichedFragments`, `distanceDyad`) — depends on BSgenome / PWM motif DBs.
- Random-forest split branch — only the fixed-interval branch is implemented.
- Multi-BAM batching — current driver requires exactly one BAM per invocation.
42 changes: 42 additions & 0 deletions docs/src/content/docs/atac/numerical-fidelity.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
---
title: Numerical Fidelity
description: Validation strategy and per-metric tolerances against ATACseqQC 1.36.0.
---

`rustqc atac`'s outputs are validated against the upstream ATACseqQC R package
(version 1.36.0, Bioconductor 3.23). The reference outputs are generated by the
script at `tests/atac/golden/run_r_reference.R` and committed as fixtures
(integration tests skip gracefully when goldens are absent — CI does not run R).

## Per-metric acceptance bars

| Metric | Bar |
|--------|-----|
| `bamQC` (rates / NRF / PBC1 / PBC2 / MAPQ histogram) | byte-identical (`abs diff <= 1e-12`) |
| `fragSizeDist` | byte-identical histogram |
| `NFRscore` (per TSS) | abs diff ≤ 1e-6 |
| `PTscore` (per TSS) | abs diff ≤ 1e-6 |
| `TSSEscore` (post-loess scalar) | abs diff ≤ 1e-3 |

## Loess port

ATACseqQC's `TSSEscore` calls R's `stats::loess.smooth(family = "gaussian")`. We
port a minimal locally-weighted polynomial fit (tricube weights, degree 2,
span 2/3, no robust reweighting) sufficient to land within `1e-3` of R on the
GL1–GL4 fixtures. The pre-loess sliding-window vector (length 20) is byte-
identical to R; only the loess smoothing introduces the 1e-3 tolerance.

## Fixtures

The GL1–GL4 BAMs at `tests/data/atac/` are the same files shipped inside
ATACseqQC's `inst/extdata/` and align to chr1 of hg19. The `GL_tss.gtf` in the
same directory is a synthetic 14-transcript subset on chr1 covering the read
ranges of GL1 (genome-wide) and GL2-4 (~565kb–996kb dense cluster).

## What this release does **not** validate

- Factor footprinting (out of scope for v1).
- Random-forest split branch (only fixed-interval split is implemented).
- Tn5 shift / split BAM round-trip vs ATACseqQC outputs — `EmitWriters::open`
is a stub in this release; integration tests are reserved (`if true { return; }`)
until file writing lands.
Loading