@bwlang
Member

@bwlang bwlang commented Jan 25, 2026

Fixes #15

The barcode extraction from FASTQ headers was failing with non-standard formats like SRA headers (e.g., "@SRR20318439.1 ... length=111") where the extracted "barcode" contained spaces, breaking downstream shell commands.

Changes:

  • Refactored barcode extraction to sample first 10k reads and return the most frequent valid barcode (avoids single-read sequencing errors)
  • Validate barcodes against pattern ^[ACGTN+-]+$ (nucleotides with optional dual-index separator)
  • Fall back to "unknown" for files without valid barcodes
  • Extracted shared function to eliminate code duplication between paired-end and single-end processes
  • Added test case with SRA-style headers to verify the fix
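
The selection logic in the list above can be sketched as follows. This is an illustrative Python sketch, not the shell function added to fastq_to_ubam.nf in this PR. The names `fastq_headers` and `extract_barcode` are hypothetical, and the sketch assumes the barcode is the last colon-delimited field of an Illumina-style header, which the PR description does not state.

```python
import re
from collections import Counter
from itertools import islice

# Pattern quoted in the PR description: nucleotides plus an optional
# dual-index separator ("+" or "-").
VALID_BARCODE = re.compile(r"[ACGTN+-]+")


def fastq_headers(lines):
    """Yield every fourth line of a FASTQ stream, i.e. the read headers."""
    return islice(lines, 0, None, 4)


def extract_barcode(headers, max_reads=10_000, fallback="unknown"):
    """Return the most frequent valid barcode among the first max_reads headers.

    Assumption (not from the PR): the barcode is the text after the last
    colon in the header, as in Illumina-style headers.
    """
    counts = Counter()
    for header in islice(headers, max_reads):
        candidate = header.rstrip("\n").rsplit(":", 1)[-1]
        # Reject anything that is not purely a barcode, e.g. SRA headers
        # whose "last field" is the whole line, spaces included.
        if VALID_BARCODE.fullmatch(candidate):
            counts[candidate] += 1
    if not counts:
        return fallback
    return counts.most_common(1)[0][0]
```

Taking the most frequent candidate rather than the first one means a single read with a sequencing error in its index cannot decide the result, and the fallback guarantees the returned value never contains spaces.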

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 25, 2026 23:59
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes barcode extraction from FASTQ files with non-standard header formats (e.g., SRA headers) that was causing failures due to spaces in extracted barcodes breaking downstream shell commands.

Changes:

  • Refactored barcode extraction to use a shared shell function that samples 10k reads, validates barcodes against a nucleotide pattern, and returns the most frequent valid barcode
  • Added test case with SRA-style headers to verify the fix

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated no comments.

File | Description
fastq_to_ubam.nf | Introduced shared barcode extraction function and replaced direct barcode extraction in both paired-end and single-end processes
tests/fastq_to_ubam.nf.test | Added test case for non-standard SRA header format
tests/fastq_to_ubam.nf.test.snap | Added snapshot for the new SRA header test case

@bwlang
Member Author

bwlang commented Jan 26, 2026

This went much better than pull request #49 with Claude:

issue15_conversation.txt

@bwlang bwlang requested a review from lnblum January 26, 2026 01:39
Contributor

@lnblum lnblum left a comment

I like the added sampling of reads to find the barcode.

It was interesting to read the prompt log. I wondered whether the agents would be confused by the fact that the issue referred to code that was substantially different from the current code, but it seems both identified that the barcode extraction code had been moved to fastq_to_ubam.

Development

Successfully merging this pull request may close these issues.

fastq/bam header parsing when unexpected format