
Conversation

@MillarCD
Contributor

Hi @wdecoster,
I've used Chopper a couple of times and found it to be a very useful and easy-to-use trimming tool. I wanted to contribute my small grain of sand. I apologize, as I initially started with a simple idea, but along the way I ended up making additional changes, which might make it a bit harder to review everything in detail. Here’s a brief overview of my contribution:

What's new?

A new trimming strategy called Highest Quality Subread was added. It is based on the trimming algorithm used by Phred (a modified Mott algorithm), which identifies the highest-quality subread based on a cutoff value (--cutoff).

How does the algorithm work?

The behavior of this new trimming approach can be summarized in four steps:

  1. For each base in the read, the Phred quality score is converted into an error probability.
  2. For each base, a score is calculated as the difference between the --cutoff value and the error probability; bases with a negative score are considered to have insufficient quality.
  3. The subsequence with the highest cumulative score is identified using Kadane’s algorithm.
  4. The indices of the highest-quality subread are returned, if one is found.
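
A minimal Rust sketch of these four steps is shown below for concreteness. It is illustrative only, not the PR's trimmers.rs implementation: the function names phred_to_error_prob and best_segment are invented here, and cutoff is assumed to already be the error-probability form of the --cutoff value (the numeric score is converted before the strategy runs).

// Step 1: convert a numeric Phred score Q into an error probability 10^(-Q/10).
fn phred_to_error_prob(q: u8) -> f64 {
    10f64.powf(-(q as f64) / 10.0)
}

// Steps 2-4: Kadane's algorithm over per-base scores (cutoff - error probability);
// returns the half-open (start, end) range of the highest-scoring segment, if any.
fn best_segment(numeric_quals: &[u8], cutoff: f64) -> Option<(usize, usize)> {
    let mut best_score = 0.0_f64;
    let mut best_range: Option<(usize, usize)> = None;
    let mut current = 0.0_f64;
    let mut start = 0usize;
    for (i, &q) in numeric_quals.iter().enumerate() {
        let score = cutoff - phred_to_error_prob(q); // negative => insufficient quality
        if current < 0.0 {
            current = 0.0; // drop the running segment and restart here
            start = i;
        }
        current += score;
        if current > best_score {
            best_score = current;
            best_range = Some((start, i + 1));
        }
    }
    best_range
}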

Types of changes

New feature

  • A new trimming approach was implemented, and the TrimStrategy trait was created to standardize the logic across different strategies, making the system easier to extend and maintain.
  • The --trim-approach flag was added to allow users to select one of the available trimming strategies, while ensuring only one strategy is applied at a time.
  • The --cutoff flag was added to define a quality threshold for strategies that require it (e.g., trim-by-quality and best-subread).
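
The sketch below illustrates the intended shape of this design. It is a simplified assumption, not the PR's actual src/trimmers.rs code: the real trait operates on fastq records and the real fixed-length implementation is FixedCropStrategy, whereas the slice-based signature and the FixedCrop struct here are made up for the example.

use std::sync::Arc;

// Each strategy answers one question: which (start, end) range of the read to keep?
trait TrimStrategy: Send + Sync {
    fn trim(&self, seq: &[u8], quals: &[u8]) -> Option<(usize, usize)>;
}

struct FixedCrop {
    head_crop: usize,
    tail_crop: usize,
}

impl TrimStrategy for FixedCrop {
    fn trim(&self, seq: &[u8], _quals: &[u8]) -> Option<(usize, usize)> {
        // saturating_sub avoids underflow when tail_crop exceeds the read length
        let end = seq.len().saturating_sub(self.tail_crop);
        (end > self.head_crop).then_some((self.head_crop, end))
    }
}

fn main() {
    // One strategy is chosen up front and shared across worker threads.
    let strategy: Arc<dyn TrimStrategy> = Arc::new(FixedCrop { head_crop: 2, tail_crop: 2 });
    assert_eq!(strategy.trim(b"ACGTACGTAC", &[40; 10]), Some((2, 8)));
}

Holding the chosen strategy behind Arc<dyn TrimStrategy>, as the PR does, is also why the trait is bounded by Send + Sync: the same strategy object can then be used safely from multiple threads.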

Additional implementations

  • A parse_gc_value function was added to parse GC content parameters and validate that their values fall within the [0, 1] range.
  • A parse_usize_or_inf function was added to parse the maximum read length, allowing the terminal to display INF (instead of usize::MAX) as the default value.
  • The build_trimming_approach function was added to construct the appropriate trimming strategy and validate that all required parameters (--cutoff, --headcrop, --tailcrop) have been provided for the selected strategy.
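
As a rough illustration of the two parsing helpers, here is a hedged sketch; the actual implementations in the PR may differ, and the &str -> Result signatures below are an assumption based on how clap-style custom value parsers usually look.

// Parse a GC-content argument and require it to lie in [0, 1].
fn parse_gc_value(s: &str) -> Result<f64, String> {
    let v: f64 = s.parse().map_err(|e| format!("invalid GC value '{s}': {e}"))?;
    if (0.0..=1.0).contains(&v) {
        Ok(v)
    } else {
        Err(format!("GC content must be within [0, 1], got {v}"))
    }
}

// Parse the maximum read length, accepting "INF" as usize::MAX so the default can
// be displayed as INF instead of the raw usize::MAX value.
fn parse_usize_or_inf(s: &str) -> Result<usize, String> {
    if s.eq_ignore_ascii_case("inf") {
        Ok(usize::MAX)
    } else {
        s.parse::<usize>().map_err(|e| format!("invalid length '{s}': {e}"))
    }
}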

Refactoring

  • The --trim flag was removed and replaced by --cutoff, which must now be used together with the --trim-approach trim-by-quality flag.
  • The new --trim-approach flag allows selecting one of three strategies: fixed-length trimming (fixed-crop), quality-based trimming (trim-by-quality), and best subread extraction (best-subread).
  • CLI argument presentation was reorganized and grouped into three categories in the help output: Filtering options, Trimming options, and Setup options.
  • Trimming logic was extracted from the write_record function and encapsulated into dedicated structs implementing the TrimStrategy trait.
  • The write_record function was simplified to focus solely on writing reads to output.
  • The filter function was refactored to improve readability and modularity.
  • All tests using the Cli structure were updated accordingly.

Bugfix

  • Validation for --gcmin and --gcmax was added to the filter function when the --inverse flag is used.

Testing

  • All existing tests were verified and passed successfully.
  • The tool was tested on a Linux system (Debian 12) and compared against version 0.10.0 installed via Conda.
  • Multiple FASTQ files were used to confirm consistent results. Identical output was observed in all cases, except when using the --inverse flag with --gcmin and/or --gcmax, where the new logic now takes the GC limits into account.

Benchmark

To evaluate performance, chopper v0.10.0 (installed via Conda) was compared with the new version using an 8.9 GB FASTQ file. The following parameters were used:

# Old version
chopper -t <THREADS> --trim 25 --mingc 0.2 --maxgc 0.8 -q 25 -i <TEST_FILE>

# New version
<PATH_TO_CHOPPER_RELEASE> -t <THREADS> --trim-approach trim-by-quality --cutoff 25 --mingc 0.2 --maxgc 0.8 -q 25 -i <TEST_FILE>

Each version was executed 20 times using 1 and 4 threads, and the average execution time (in seconds) was recorded. The table below shows similar performance between both versions, with the new version being slightly faster on average.

Version              Avg time (1 thread)   Avg time (4 threads)
chopper_v0.10.0      95.1797 s             39.2535 s
chopper_new_version  93.4072 s             39.2436 s

MillarCD added 7 commits June 3, 2025 23:39
Introduced a new TrimStrategy trait to support pluggable trimming strategies. Also implemented HighestQualityTrimStrategy, which applies a modified Mott algorithm to extract the highest-quality sub-read based on cumulative error probability.
…trategy` implementation

- Introduced a new CLI argument to select the trimming strategy
- Added validation logic to ensure required parameters are provided for each strategy
- Implemented `HighestQualityTrimStrategy` as a new trimming option
- Updated tests for the `filter` function to reflect strategy-based trimming behavior
…ng into write_record

- Implemented FixedCropStrategy and TrimByQualityStrategy trimming strategies
- Integrated trimming strategies into the write_record function
…trategy::trim

- Reorganized CLI flags into logical categories to improve readability
- Updated README.md to reflect the new flag structure and document filtering and trimming options
- Removed an unused variable from the `HighestQualityTrimStrategy::trim` method
…date tests

- Manually parsed the maximum read length to display 'INF' in the CLI
- Added parser for GC content arguments to validate values are within [0, 1]
- Renamed `check_trimming_approach_args` to `build_trimming_approach` and changed its return type to `Option<Arc<dyn TrimStrategy>>`
- Removed trimming parameter validation from
- Refactored `filter` function for simplicity and integrated trimming
- Refactored `write_record` function to remove trimming logic
- Created helper functions to validate each filter option independently
- Updated tests accordingly
- Constrained `TrimStrategy` trait to require `Send` and `Sync`
@wdecoster
Owner

Hi @MillarCD,

Wow! Thanks for contributing! I will need some time to go through this. Looks like a great addition, while making things a bit more complicated. I haven't worked with Traits much, so that's nice too. Also, thanks for the bug fix of the GC content.

One thing that I am conflicted about is how you named the strategy. A 'subread' is terminology used by PacBio for individual passes of the polymerase around the fragment, before merging them all in a ccs read. Can you think of an alternative?

Best,
Wouter

@MillarCD
Contributor Author

MillarCD commented Jul 3, 2025

Thanks for the clarification!

I hadn’t considered the specific meaning of subread in the PacBio context.
Some alternative names that come to mind are "read segment" or "read region", so perhaps the strategy could be renamed to something like best-read-segment or best-read-region.
Let me know what you think.

@wdecoster
Owner

best-read-segment sounds good to me!

@wdecoster
Owner

I have a deadline to reach early next week, and will pick up on this after that.

@MillarCD
Contributor Author

MillarCD commented Jul 4, 2025

Don't worry, good luck with the deadline!

@wdecoster
Owner

wdecoster commented Jul 31, 2025

I took a look at your code and made two suggestions (and then figured out that one of those was already taken care of). Could you also still rename the strategy? Thanks!

src/main.rs Outdated
},
TrimApproach::BestSubread => {
    if let Some(cutoff) = args.cutoff {
        Some(Arc::new(HighestQualityTrimStrategy::new(phred_score_to_probability(cutoff + 33))))
Owner

I am not sure about the +33 here?

Contributor Author

Yes, it's a bit confusing. This happens because the phred_score_to_probability function assumes that the phred parameter is the decimal value of the ASCII character representing quality (for example, "!", which corresponds to 33 in decimal, represents Q0). This function is responsible for converting that ASCII value to its numeric equivalent by subtracting 33.

This happened because I extracted phred_score_to_probability from the ave_qual function, where this conversion was originally done.

Later on, since cutoff represents a numeric Phred score, I added 33 to convert it to its ASCII character before passing it to phred_score_to_probability, which again performed the reverse conversion.

At the time, I didn’t think much of it, but now I believe it would be better to refactor phred_score_to_probability so that it works directly with the numeric Phred score. That way, it could receive cutoff without any transformation, and ave_qual would be responsible for converting from the ASCII representation to the numeric value required by phred_score_to_probability.
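
To make the difference concrete, roughly (a sketch of the two conventions, not the actual code; error probability taken as 10^(-Q/10)):

// Before: expects the ASCII-encoded quality byte, so it subtracts the 33 offset itself.
fn phred_score_to_probability_ascii(ascii_qual: u8) -> f64 {
    10f64.powf(-((ascii_qual - 33) as f64) / 10.0)
}

// After the proposed refactor: takes the numeric Phred score directly, so cutoff can be
// passed in without the + 33 round trip, and ave_qual handles the ASCII conversion.
fn phred_score_to_probability(q: u8) -> f64 {
    10f64.powf(-(q as f64) / 10.0)
}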

What do you think?

Owner

yes, sounds good!

Contributor Author

Done!

0.5
};

if args.headcrop + args.tailcrop < read_len {
Owner

do we still need something like this - checking that the read is not fully trimmed away?

Owner

Never mind, I see this is handled in trimmers.rs in the trim implementation of FixedCropStrategy

@wdecoster
Owner

Oh, and I see in trimmers.rs a TrimByQualitytrategy which should presumably be TrimByQualityStrategy :)

… fix TrimByQualityStrategy typo

- Updated documentation to replace sub-read with "read segment"
- Updated README.md to reflect the renamed strategy in the help message
@MillarCD
Contributor Author

MillarCD commented Aug 1, 2025

I’ve completed the name corrections. Let me know if you have any questions or suggestions. :)

…as input, replacing previous handling of ASCII-encoded FASTQ quality characters
wdecoster requested a review from Copilot August 3, 2025 18:46
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a new "Highest Quality Subread" trimming approach to Chopper, implementing a Phred-based algorithm for extracting the highest-quality read segment. The implementation refactors the existing trimming logic into a trait-based system and reorganizes the CLI interface for better usability.

Key changes include:

  • Addition of a new trimming strategy using Kadane's algorithm to find optimal quality subreads
  • Refactoring of trimming logic into a TrimStrategy trait with three implementations
  • Reorganization of CLI arguments into categorized help sections with improved parameter validation

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File             Description
src/trimmers.rs  New file implementing the TrimStrategy trait and three trimming strategies including the new highest-quality approach
src/main.rs      Major refactoring of CLI structure, argument validation, and filter logic to support the new trimming system
README.md        Updated documentation reflecting the new trimming approaches and reorganized CLI help structure
Comments suppressed due to low confidence (1)

src/main.rs:109

  • [nitpick] The enum variant name BestReadSegment is inconsistent with the CLI value best-read-segment shown in the help text. Consider renaming to BestSubread or HighestQuality to better match the feature description.
    BestReadSegment

Comment on lines +51 to +57
let prob_error = self.cutoff - phred_score_to_probability(*phred_qual - 33);
if current_cumulative_error < 0.0 {
    current_cumulative_error = 0.0;
    current_start = i;
}

current_cumulative_error += prob_error;
Copilot AI Aug 3, 2025

The variable name prob_error is misleading as it represents a score difference, not an error probability. Consider renaming to quality_score or score_diff to better reflect its purpose in the algorithm.

Suggested change
let prob_error = self.cutoff - phred_score_to_probability(*phred_qual - 33);
if current_cumulative_error < 0.0 {
    current_cumulative_error = 0.0;
    current_start = i;
}
current_cumulative_error += prob_error;

let score_diff = self.cutoff - phred_score_to_probability(*phred_qual - 33);
if current_cumulative_error < 0.0 {
    current_cumulative_error = 0.0;
    current_start = i;
}
current_cumulative_error += score_diff;

src/trimmers.rs Outdated
Comment on lines 133 to 136
if record.seq().len() - self.tail_crop <= self.head_crop {
    return None;
}
Some((self.head_crop, record.seq().len() - self.tail_crop))
Copilot AI Aug 3, 2025

Potential integer underflow if self.tail_crop is greater than record.seq().len(). This could cause a panic in debug mode or wraparound in release mode. Check that self.tail_crop <= record.seq().len() before performing the subtraction.

Suggested change
if record.seq().len() - self.tail_crop <= self.head_crop {
    return None;
}
Some((self.head_crop, record.seq().len() - self.tail_crop))

let read_len = record.seq().len();
if self.tail_crop > read_len {
    return None;
}
if read_len - self.tail_crop <= self.head_crop {
    return None;
}
Some((self.head_crop, read_len - self.tail_crop))

Owner

This seems a valid concern, right @MillarCD?

Contributor Author

Yes, you're right. That’s my mistake. I didn’t cover that case in the unit tests. Using the saturating_* function should handle it correctly:

if record.seq().len().saturating_sub(self.tail_crop) <= self.head_crop {
    return None;
}

@wdecoster
Owner

(looks good to me but I wonder about what use copilot is for things like this, let's see what it thinks... Will merge soon!)

wdecoster and others added 2 commits August 3, 2025 20:48
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@wdecoster
Owner

It came up with two spelling fixes and one potential underflow, which appears a valid concern to me, do you agree?

…crop < 0`

- Use `saturating_sub` to prevent underflow when `tail_crop` exceeds sequence length
- Add test case to cover this scenario
@MillarCD
Contributor Author

MillarCD commented Aug 3, 2025

> It came up with two spelling fixes and one potential underflow, which appears a valid concern to me, do you agree?

Yes, it was a small detail that was easy to fix but with a significant impact.

wdecoster merged commit daf1351 into wdecoster:master Aug 4, 2025
1 check passed
@wdecoster
Owner

Thanks again!
