
Conversation

@MillarCD
Contributor

Hi @wdecoster,
I've used Chopper a couple of times and found it to be a very useful and easy-to-use trimming tool. I wanted to contribute my small grain of sand. I apologize, as I initially started with a simple idea, but along the way I ended up making additional changes, which might make it a bit harder to review everything in detail. Here’s a brief overview of my contribution:

What's new?

A new trimming strategy called Highest Quality Subread was added. It is based on the trimming algorithm used by Phred (a modified Mott algorithm), which identifies the highest-quality subread based on a cutoff value (--cutoff).

How does the algorithm work?

The behavior of this new trimming approach can be summarized in four steps:

  1. For each base in the read, the Phred quality score is converted into an error probability.
  2. For each base, a score is calculated as the difference between the --cutoff value and the error probability; bases with a negative score are considered to have insufficient quality.
  3. The subsequence with the highest cumulative score is identified using Kadane’s algorithm.
  4. The indices of the highest-quality subread are returned, if one is found.
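
A minimal Rust sketch of these four steps is shown below for concreteness. It is illustrative only, not the PR's trimmers.rs implementation: the function names phred_to_error_prob and best_segment are invented here, and cutoff is assumed to already be the error-probability form of the --cutoff value (the numeric score is converted before the strategy runs).

// Step 1: convert a numeric Phred score Q into an error probability 10^(-Q/10).
fn phred_to_error_prob(q: u8) -> f64 {
    10f64.powf(-(q as f64) / 10.0)
}

// Steps 2-4: Kadane's algorithm over per-base scores (cutoff - error probability);
// returns the half-open (start, end) range of the highest-scoring segment, if any.
fn best_segment(numeric_quals: &[u8], cutoff: f64) -> Option<(usize, usize)> {
    let mut best_score = 0.0_f64;
    let mut best_range: Option<(usize, usize)> = None;
    let mut current = 0.0_f64;
    let mut start = 0usize;
    for (i, &q) in numeric_quals.iter().enumerate() {
        let score = cutoff - phred_to_error_prob(q); // negative => insufficient quality
        if current < 0.0 {
            current = 0.0; // drop the running segment and restart here
            start = i;
        }
        current += score;
        if current > best_score {
            best_score = current;
            best_range = Some((start, i + 1));
        }
    }
    best_range
}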

Types of changes

New feature

  • A new trimming approach was implemented, and the TrimStrategy trait was created to standardize the logic across different strategies, making the system easier to extend and maintain.
  • The --trim-approach flag was added to allow users to select one of the available trimming strategies, while ensuring only one strategy is applied at a time.
  • The --cutoff flag was added to define a quality threshold for strategies that require it (e.g., trim-by-quality and best-subread).
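
The sketch below illustrates the intended shape of this design. It is a simplified assumption, not the PR's actual src/trimmers.rs code: the real trait operates on fastq records and the real fixed-length implementation is FixedCropStrategy, whereas the slice-based signature and the FixedCrop struct here are made up for the example.

use std::sync::Arc;

// Each strategy answers one question: which (start, end) range of the read to keep?
trait TrimStrategy: Send + Sync {
    fn trim(&self, seq: &[u8], quals: &[u8]) -> Option<(usize, usize)>;
}

struct FixedCrop {
    head_crop: usize,
    tail_crop: usize,
}

impl TrimStrategy for FixedCrop {
    fn trim(&self, seq: &[u8], _quals: &[u8]) -> Option<(usize, usize)> {
        // saturating_sub avoids underflow when tail_crop exceeds the read length
        let end = seq.len().saturating_sub(self.tail_crop);
        (end > self.head_crop).then_some((self.head_crop, end))
    }
}

fn main() {
    // One strategy is chosen up front and shared across worker threads.
    let strategy: Arc<dyn TrimStrategy> = Arc::new(FixedCrop { head_crop: 2, tail_crop: 2 });
    assert_eq!(strategy.trim(b"ACGTACGTAC", &[40; 10]), Some((2, 8)));
}

Holding the chosen strategy behind Arc<dyn TrimStrategy>, as the PR does, is also why the trait is bounded by Send + Sync: the same strategy object can then be used safely from multiple threads.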

Additional implementations

  • A parse_gc_value function was added to parse GC content parameters and validate that their values fall within the [0, 1] range.
  • A parse_usize_or_inf function was added to parse the maximum read length, allowing the terminal to display INF (instead of usize::MAX) as the default value.
  • The build_trimming_approach function was added to construct the appropriate trimming strategy and validate that all required parameters (--cutoff, --headcrop, --tailcrop) have been provided for the selected strategy.
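
As a rough illustration of the two parsing helpers, here is a hedged sketch; the actual implementations in the PR may differ, and the &str -> Result signatures below are an assumption based on how clap-style custom value parsers usually look.

// Parse a GC-content argument and require it to lie in [0, 1].
fn parse_gc_value(s: &str) -> Result<f64, String> {
    let v: f64 = s.parse().map_err(|e| format!("invalid GC value '{s}': {e}"))?;
    if (0.0..=1.0).contains(&v) {
        Ok(v)
    } else {
        Err(format!("GC content must be within [0, 1], got {v}"))
    }
}

// Parse the maximum read length, accepting "INF" as usize::MAX so the default can
// be displayed as INF instead of the raw usize::MAX value.
fn parse_usize_or_inf(s: &str) -> Result<usize, String> {
    if s.eq_ignore_ascii_case("inf") {
        Ok(usize::MAX)
    } else {
        s.parse::<usize>().map_err(|e| format!("invalid length '{s}': {e}"))
    }
}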

Refactoring

  • The --trim flag was removed and replaced by --cutoff, which must now be used together with the --trim-approach trim-by-quality flag.
  • The new --trim-approach flag allows selecting one of three strategies: fixed-length trimming (fixed-crop), quality-based trimming (trim-by-quality), and best subread extraction (best-subread).
  • CLI argument presentation was reorganized and grouped into three categories in the help output: Filtering options, Trimming options, and Setup options.
  • Trimming logic was extracted from the write_record function and encapsulated into dedicated structs implementing the TrimStrategy trait.
  • The write_record function was simplified to focus solely on writing reads to output.
  • The filter function was refactored to improve readability and modularity.
  • All tests using the Cli structure were updated accordingly.

Bugfix

  • Validation for --gcmin and --gcmax was added to the filter function when the --inverse flag is used.

Testing

  • All existing tests were verified and passed successfully.
  • The tool was tested on a Linux system (Debian 12) and compared against version 0.10.0 installed via Conda.
  • Multiple FASTQ files were used to confirm consistent results. Identical output was observed in all cases, except when using the --inverse flag with --gcmin and/or --gcmax, where the new logic now takes the GC limits into account.

Benchmark

To evaluate performance, chopper v0.10.0 (installed via Conda) was compared with the new version using an 8.9 GB FASTQ file. The following parameters were used:

# Old version
chopper -t <THREADS> --trim 25 --mingc 0.2 --maxgc 0.8 -q 25 -i <TEST_FILE>

# New version
<PATH_TO_CHOPPER_RELEASE> -t <THREADS> --trim-approach trim-by-quality --cutoff 25 --mingc 0.2 --maxgc 0.8 -q 25 -i <TEST_FILE>

Each version was executed 20 times using 1 and 4 threads, and the average execution time (in seconds) was recorded. The table below shows similar performance between both versions, with the new version being slightly faster on average.

Version              Avg time (1 thread)   Avg time (4 threads)
chopper_v0.10.0      95.1797 s             39.2535 s
chopper_new_version  93.4072 s             39.2436 s

MillarCD added 7 commits June 3, 2025 23:39
Introduced a new TrimStrategy trait to support pluggable trimming strategies. Also implemented HighestQualityTrimStrategy, which applies a modified Mott algorithm to extract the highest-quality sub-read based on cumulative error probability.
…trategy` implementation

- Introduced a new CLI argument to select the trimming strategy
- Added validation logic to ensure required parameters are provided for each strategy
- Implemented `HighestQualityTrimStrategy` as a new trimming option
- Updated tests for the `filter` function to reflect strategy-based trimming behavior
…ng into write_record

- Implemented FixedCropStrategy and TrimByQualityStrategy trimming strategies
- Integrated trimming strategies into the write_record function
…trategy::trim

- Reorganized CLI flags into logical categories to improve readability
- Updated README.md to reflect the new flag structure and document filtering and trimming options
- Removed an unused variable from the `HighestQualityTrimStrategy::trim` method
…date tests

- Manually parsed the maximum read length to display 'INF' in the CLI
- Added parser for GC content arguments to validate values are within [0, 1]
- Renamed `check_trimming_approach_args` to `build_trimming_approach` and changed its return type to `Option<Arc<dyn TrimStrategy>>`
- Removed trimming parameter validation from
- Refactored `filter` function for simplicity and integrated trimming
- Refactored `write_record` function to remove trimming logic
- Created helper functions to validate each filter option independently
- Updated tests accordingly
- Constrained `TrimStrategy` trait to require `Send` and `Sync`
@wdecoster
Owner

Hi @MillarCD,

Wow! Thanks for contributing! I will need some time to go through this. Looks like a great addition, while making things a bit more complicated. I haven't worked with Traits much, so that's nice too. Also, thanks for the bug fix of the GC content.

One thing that I am conflicted about is how you named the strategy. A 'subread' is terminology used by PacBio for individual passes of the polymerase around the fragment, before merging them all in a ccs read. Can you think of an alternative?

Best,
Wouter

@MillarCD
Contributor Author

MillarCD commented Jul 3, 2025

Thanks for the clarification!

I hadn’t considered the specific meaning of subread in the PacBio context.
Some alternative names that come to mind are "read segment" or "read region", so perhaps the strategy could be renamed to something like best-read-segment or best-read-region.
Let me know what you think.

@wdecoster
Owner

best-read-segment sounds good to me!

@wdecoster
Owner

I have a deadline to reach early next week, and will pick up on this after that.

@MillarCD
Contributor Author

MillarCD commented Jul 4, 2025

Don't worry, good luck with the deadline!

@wdecoster
Owner

wdecoster commented Jul 31, 2025

I took a look at your code and made two suggestions (and then figured out that one of those was already taken care of). Could you also still rename the strategy? Thanks!

src/main.rs Outdated
},
TrimApproach::BestSubread => {
    if let Some(cutoff) = args.cutoff {
        Some(Arc::new(HighestQualityTrimStrategy::new(phred_score_to_probability(cutoff + 33))))
Owner

I am not sure about the +33 here?

Contributor Author

Yes, it's a bit confusing. This happens because the phred_score_to_probability function assumes that the phred parameter is the decimal value of the ASCII character representing quality (for example, "!", which corresponds to 33 in decimal, represents Q0). This function is responsible for converting that ASCII value to its numeric equivalent by subtracting 33.

This happened because I extracted phred_score_to_probability from the ave_qual function, where this conversion was originally done.

Later on, since cutoff represents a numeric Phred score, I added 33 to convert it to its ASCII character before passing it to phred_score_to_probability, which again performed the reverse conversion.

At the time, I didn’t think much of it, but now I believe it would be better to refactor phred_score_to_probability so that it works directly with the numeric Phred score. That way, it could receive cutoff without any transformation, and ave_qual would be responsible for converting from the ASCII representation to the numeric value required by phred_score_to_probability.
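
To make the difference concrete, roughly (a sketch of the two conventions, not the actual code; error probability taken as 10^(-Q/10)):

// Before: expects the ASCII-encoded quality byte, so it subtracts the 33 offset itself.
fn phred_score_to_probability_ascii(ascii_qual: u8) -> f64 {
    10f64.powf(-((ascii_qual - 33) as f64) / 10.0)
}

// After the proposed refactor: takes the numeric Phred score directly, so cutoff can be
// passed in without the + 33 round trip, and ave_qual handles the ASCII conversion.
fn phred_score_to_probability(q: u8) -> f64 {
    10f64.powf(-(q as f64) / 10.0)
}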

What do you think?

Owner

yes, sounds good!

Contributor Author

Done!

0.5
};

if args.headcrop + args.tailcrop < read_len {
Owner

do we still need something like this - checking that the read is not fully trimmed away?

Owner

Never mind, I see this is handled in trimmers.rs in the trim implementation of FixedCropStrategy

@wdecoster
Owner

Oh, and I see in trimmers.rs a TrimByQualitytrategy which should presumably be TrimByQualityStrategy :)

… fix TrimByQualityStrategy typo

- Updated documentation to replace sub-read with "read segment"
- Updated README.md to reflect the renamed strategy in the help message
@MillarCD
Contributor Author

MillarCD commented Aug 1, 2025

I’ve completed the name corrections. Let me know if you have any questions or suggestions. :)

…as input, replacing previous handling of ASCII-encoded FASTQ quality characters
wdecoster requested a review from Copilot August 3, 2025 18:46
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a new "Highest Quality Subread" trimming approach to Chopper, implementing a Phred-based algorithm for extracting the highest-quality read segment. The implementation refactors the existing trimming logic into a trait-based system and reorganizes the CLI interface for better usability.

Key changes include:

  • Addition of a new trimming strategy using Kadane's algorithm to find optimal quality subreads
  • Refactoring of trimming logic into a TrimStrategy trait with three implementations
  • Reorganization of CLI arguments into categorized help sections with improved parameter validation

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File             Description
src/trimmers.rs  New file implementing the TrimStrategy trait and three trimming strategies including the new highest-quality approach
src/main.rs      Major refactoring of CLI structure, argument validation, and filter logic to support the new trimming system
README.md        Updated documentation reflecting the new trimming approaches and reorganized CLI help structure
Comments suppressed due to low confidence (1)

src/main.rs:109

  • [nitpick] The enum variant name BestReadSegment is inconsistent with the CLI value best-read-segment shown in the help text. Consider renaming to BestSubread or HighestQuality to better match the feature description.
    BestReadSegment

Comment on lines +51 to +57
let prob_error = self.cutoff - phred_score_to_probability(*phred_qual - 33);
if current_cumulative_error < 0.0 {
    current_cumulative_error = 0.0;
    current_start = i;
}

current_cumulative_error += prob_error;
Copilot AI Aug 3, 2025

The variable name prob_error is misleading as it represents a score difference, not an error probability. Consider renaming to quality_score or score_diff to better reflect its purpose in the algorithm.

Suggested change
let prob_error = self.cutoff - phred_score_to_probability(*phred_qual - 33);
if current_cumulative_error < 0.0 {
    current_cumulative_error = 0.0;
    current_start = i;
}
current_cumulative_error += prob_error;

let score_diff = self.cutoff - phred_score_to_probability(*phred_qual - 33);
if current_cumulative_error < 0.0 {
    current_cumulative_error = 0.0;
    current_start = i;
}
current_cumulative_error += score_diff;

src/trimmers.rs Outdated
Comment on lines 133 to 136
if record.seq().len() - self.tail_crop <= self.head_crop {
    return None;
}
Some((self.head_crop, record.seq().len() - self.tail_crop))
Copilot AI Aug 3, 2025

Potential integer underflow if self.tail_crop is greater than record.seq().len(). This could cause a panic in debug mode or wraparound in release mode. Check that self.tail_crop <= record.seq().len() before performing the subtraction.

Suggested change
if record.seq().len() - self.tail_crop <= self.head_crop {
    return None;
}
Some((self.head_crop, record.seq().len() - self.tail_crop))

let read_len = record.seq().len();
if self.tail_crop > read_len {
    return None;
}
if read_len - self.tail_crop <= self.head_crop {
    return None;
}
Some((self.head_crop, read_len - self.tail_crop))

Owner

This seems a valid concern, right @MillarCD?

Contributor Author

Yes, you're right. That’s my mistake. I didn’t cover that case in the unit tests. Using the saturating_* function should handle it correctly:

if record.seq().len().saturating_sub(self.tail_crop) <= self.head_crop {
    return None;
}

@wdecoster
Owner

(looks good to me but I wonder about what use copilot is for things like this, let's see what it thinks... Will merge soon!)

wdecoster and others added 2 commits August 3, 2025 20:48
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@wdecoster
Owner

It came up with two spelling fixes and one potential underflow, which appears a valid concern to me, do you agree?

…crop < 0`

- Use `saturating_sub` to prevent underflow when `tail_crop` exceeds sequence length
- Add test case to cover this scenario
@MillarCD
Contributor Author

MillarCD commented Aug 3, 2025

> It came up with two spelling fixes and one potential underflow, which appears a valid concern to me, do you agree?

Yes, it was a small detail that was easy to fix but with a significant impact.

wdecoster merged commit daf1351 into wdecoster:master Aug 4, 2025
1 check passed
@wdecoster
Owner

Thanks again!
