Add new trimming approach: Highest Quality Subread #52
Conversation
Introduced a new `TrimStrategy` trait to support pluggable trimming strategies. Also implemented `HighestQualityTrimStrategy`, which applies a modified Mott algorithm to extract the highest-quality sub-read based on cumulative error probability.
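A minimal sketch of what such a pluggable trait might look like. The names follow the PR discussion (`TrimStrategy`, `FixedCropStrategy`, the `Send + Sync` bound), but the exact signatures are assumptions: the real `trim` operates on a FASTQ record, simplified here to a quality slice.

```rust
use std::sync::Arc;

// Hypothetical sketch of the pluggable trimming-strategy trait; not the
// actual Chopper code. Send + Sync lets strategies be shared across threads.
trait TrimStrategy: Send + Sync {
    /// Returns the (start, end) indices of the kept segment, or None to drop the read.
    fn trim(&self, quals: &[u8]) -> Option<(usize, usize)>;
}

/// Fixed-length cropping, one of the three strategies mentioned in the PR.
struct FixedCropStrategy {
    head_crop: usize,
    tail_crop: usize,
}

impl TrimStrategy for FixedCropStrategy {
    fn trim(&self, quals: &[u8]) -> Option<(usize, usize)> {
        // saturating_sub prevents underflow when tail_crop exceeds the read length
        let end = quals.len().saturating_sub(self.tail_crop);
        if end <= self.head_crop {
            return None; // read would be fully trimmed away
        }
        Some((self.head_crop, end))
    }
}

fn main() {
    // Strategies are stored behind Arc<dyn TrimStrategy> so they can be
    // selected at runtime and shared between worker threads.
    let strategy: Arc<dyn TrimStrategy> =
        Arc::new(FixedCropStrategy { head_crop: 2, tail_crop: 3 });
    assert_eq!(strategy.trim(&[30u8; 10]), Some((2, 7)));
    assert_eq!(strategy.trim(&[30u8; 4]), None);
    println!("ok");
}
```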
- …trategy` implementation
  - Introduced a new CLI argument to select the trimming strategy
  - Added validation logic to ensure required parameters are provided for each strategy
  - Implemented `HighestQualityTrimStrategy` as a new trimming option
  - Updated tests for the `filter` function to reflect strategy-based trimming behavior
- …ng into write_record
  - Implemented the `FixedCropStrategy` and `TrimByQualityStrategy` trimming strategies
  - Integrated the trimming strategies into the `filter` function
- …trategy::trim
  - Reorganized CLI flags into logical categories to improve readability
  - Updated README.md to reflect the new flag structure and document filtering and trimming options
  - Removed an unused variable from the `HighestQualityTrimStrategy::trim` method
- …date tests
  - Manually parsed the maximum read length to display 'INF' in the CLI
  - Added a parser for GC content arguments to validate values are within [0, 1]
  - Renamed `check_trimming_approach_args` to `build_trimming_approach` and changed its return type to `Option<Arc<dyn TrimStrategy>>`
  - Removed trimming parameter validation from …
  - Refactored the `filter` function for simplicity and integrated trimming
  - Refactored the `write_record` function to remove trimming logic
  - Created helper functions to validate each filter option independently
  - Updated tests accordingly
  - Constrained the `TrimStrategy` trait to require `Send` and `Sync`
Hi @MillarCD,

Wow! Thanks for contributing! I will need some time to go through this. It looks like a great addition, though it makes things a bit more complicated. I haven't worked with traits much, so that's nice too. Also, thanks for the bug fix of the GC content.

One thing that I am conflicted about is how you named the strategy. A 'subread' is terminology used by PacBio for the individual passes of the polymerase around the fragment, before merging them all into a ccs read. Can you think of an alternative?

Best,

Thanks for the clarification! I hadn't considered the specific meaning of subread in the PacBio context.

I have a deadline to reach early next week, and will pick up on this after that.

Don't worry, good luck with the deadline!

I took a look at your code and made two suggestions (and then figured out that one of those was already taken care of). Could you also still rename the strategy? Thanks!
src/main.rs (Outdated)

```rust
},
TrimApproach::BestSubread => {
    if let Some(cutoff) = args.cutoff {
        Some(Arc::new(HighestQualityTrimStrategy::new(phred_score_to_probability(cutoff + 33))))
```
I am not sure about the +33 here?
Yes, it's a bit confusing. This happens because the `phred_score_to_probability` function assumes that the `phred` parameter is the decimal value of the ASCII character representing quality (for example, `!`, which corresponds to 33 in decimal, represents Q0). The function is responsible for converting that ASCII value to its numeric equivalent by subtracting 33.

This happened because I extracted `phred_score_to_probability` from the `ave_qual` function, where this conversion was originally done.

Later on, since `cutoff` represents a numeric Phred score, I added 33 to convert it to its ASCII character before passing it to `phred_score_to_probability`, which then performed the reverse conversion.

At the time, I didn't think much of it, but now I believe it would be better to refactor `phred_score_to_probability` so that it works directly with the numeric Phred score. That way, it could receive `cutoff` without any transformation, and `ave_qual` would be responsible for converting from the ASCII representation to the numeric value required by `phred_score_to_probability`.
What do you think?
yes, sounds good!
Done!
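The refactor discussed in this thread might look like the following sketch. The function body assumes the standard Phred formula p = 10^(-Q/10); it is not the actual Chopper implementation.

```rust
// Sketch of the agreed refactor: phred_score_to_probability takes the numeric
// Phred score directly, and callers holding ASCII-encoded FASTQ qualities
// subtract 33 themselves. Assumed implementation, not Chopper's source.
fn phred_score_to_probability(phred: u8) -> f64 {
    10f64.powf(-(phred as f64) / 10.0)
}

fn main() {
    // '!' (ASCII 33) encodes Q0, i.e. an error probability of 1.0
    let ascii_qual = b'!';
    assert!((phred_score_to_probability(ascii_qual - 33) - 1.0).abs() < 1e-12);
    // Q20 corresponds to a 1% error probability
    assert!((phred_score_to_probability(20) - 0.01).abs() < 1e-12);
    println!("ok");
}
```

With this signature, `cutoff` (already a numeric Phred score) can be passed in without the confusing `+ 33` round trip.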
```rust
    0.5
};
```

```rust
if args.headcrop + args.tailcrop < read_len {
```
do we still need something like this - checking that the read is not fully trimmed away?
Never mind, I see this is handled in trimmers.rs in the trim implementation of FixedCropStrategy
Oh, and I see in trimmers.rs a typo in `TrimByQualityStrategy`.
- … fix TrimByQualityStrategy typo
  - Updated documentation to replace "sub-read" with "read segment"
  - Updated README.md to reflect the renamed strategy in the help message
I've completed the name corrections. Let me know if you have any questions or suggestions. :)

- …as input, replacing previous handling of ASCII-encoded FASTQ quality characters
Pull Request Overview
This PR introduces a new "Highest Quality Subread" trimming approach to Chopper, implementing a Phred-based algorithm for extracting the highest-quality read segment. The implementation refactors the existing trimming logic into a trait-based system and reorganizes the CLI interface for better usability.
Key changes include:
- Addition of a new trimming strategy using Kadane's algorithm to find optimal quality subreads
- Refactoring of trimming logic into a `TrimStrategy` trait with three implementations
- Reorganization of CLI arguments into categorized help sections with improved parameter validation
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/trimmers.rs | New file implementing the TrimStrategy trait and three trimming strategies including the new highest-quality approach |
| src/main.rs | Major refactoring of CLI structure, argument validation, and filter logic to support the new trimming system |
| README.md | Updated documentation reflecting the new trimming approaches and reorganized CLI help structure |
Comments suppressed due to low confidence (1)

src/main.rs:109

- [nitpick] The enum variant name `BestReadSegment` is inconsistent with the CLI value `best-read-segment` shown in the help text. Consider renaming to `BestSubread` or `HighestQuality` to better match the feature description.
```rust
let prob_error = self.cutoff - phred_score_to_probability(*phred_qual - 33);
if current_cumulative_error < 0.0 {
    current_cumulative_error = 0.0;
    current_start = i;
}
current_cumulative_error += prob_error;
```
Copilot AI, Aug 3, 2025
The variable name prob_error is misleading as it represents a score difference, not an error probability. Consider renaming to quality_score or score_diff to better reflect its purpose in the algorithm.
Suggested change:

```diff
-let prob_error = self.cutoff - phred_score_to_probability(*phred_qual - 33);
+let score_diff = self.cutoff - phred_score_to_probability(*phred_qual - 33);
 if current_cumulative_error < 0.0 {
     current_cumulative_error = 0.0;
     current_start = i;
 }
-current_cumulative_error += prob_error;
+current_cumulative_error += score_diff;
```
src/trimmers.rs (Outdated)

```rust
if record.seq().len() - self.tail_crop <= self.head_crop {
    return None;
}
Some((self.head_crop, record.seq().len() - self.tail_crop))
```
Copilot AI, Aug 3, 2025
Potential integer underflow if self.tail_crop is greater than record.seq().len(). This could cause a panic in debug mode or wraparound in release mode. Check that self.tail_crop <= record.seq().len() before performing the subtraction.
Suggested change:

```diff
-if record.seq().len() - self.tail_crop <= self.head_crop {
-    return None;
-}
-Some((self.head_crop, record.seq().len() - self.tail_crop))
+let read_len = record.seq().len();
+if self.tail_crop > read_len {
+    return None;
+}
+if read_len - self.tail_crop <= self.head_crop {
+    return None;
+}
+Some((self.head_crop, read_len - self.tail_crop))
```
This seems a valid concern, right @MillarCD?
Yes, you're right. That's my mistake; I didn't cover that case in the unit tests. Using the `saturating_sub` function should handle it correctly:

```rust
if record.seq().len().saturating_sub(self.tail_crop) <= self.head_crop {
    return None;
}
```
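A small illustration of why `saturating_sub` sidesteps the flagged underflow: plain `usize` subtraction panics in debug builds (and wraps around in release builds) when `tail_crop` exceeds the read length, while `saturating_sub` clamps at zero. The variable names below are only for demonstration.

```rust
fn main() {
    let read_len: usize = 10;
    let tail_crop: usize = 15;

    // `read_len - tail_crop` would underflow here (panic in debug mode,
    // wraparound in release mode); saturating_sub clamps the result at 0.
    assert_eq!(read_len.saturating_sub(tail_crop), 0);

    // When no underflow occurs, it behaves like ordinary subtraction.
    assert_eq!(read_len.saturating_sub(3), 7);
    println!("ok");
}
```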
(Looks good to me, but I wonder what use Copilot is for things like this; let's see what it thinks... Will merge soon!)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
It came up with two spelling fixes and one potential underflow, which appears to be a valid concern to me. Do you agree?
- …crop < 0`
  - Use `saturating_sub` to prevent underflow when `tail_crop` exceeds sequence length
  - Add test case to cover this scenario
Yes, it was a small detail that was easy to fix, but with a significant impact.

Thanks again!
Hi @wdecoster,

I've used Chopper a couple of times and found it to be a very useful and easy-to-use trimming tool, so I wanted to add my own small contribution. I apologize in advance: I initially started with a simple idea, but along the way I ended up making additional changes, which might make it a bit harder to review everything in detail. Here's a brief overview of my contribution:
What's new?

A new trimming strategy called Highest Quality Subread was added. It is based on the trimming algorithm used by Phred (a modified Mott algorithm), which identifies the highest-quality subread given a cutoff value (`--cutoff`).

How does the algorithm work?

The behavior of this new trimming approach can be summarized in four steps:
1. The quality of each base is converted to an error probability.
2. Each base is scored as the difference between the `--cutoff` value and its error probability; bases with a negative score are considered to have insufficient quality.
3. A cumulative score is tracked along the read and reset to zero whenever it drops below zero.
4. The contiguous segment with the highest cumulative score is returned as the trimmed read.

Types of changes
New feature
- A `TrimStrategy` trait was created to standardize the logic across different strategies, making the system easier to extend and maintain.
- A `--trim-approach` flag was added to allow users to select one of the available trimming strategies, while ensuring only one strategy is applied at a time.
- A `--cutoff` flag was added to define a quality threshold for strategies that require it (e.g., `trim-by-quality` and `best-subread`).

Additional implementations
- A `parse_gc_value` function was added to parse GC content parameters and validate that their values fall within the [0, 1] range.
- A `parse_usize_or_inf` function was added to parse the maximum read length, allowing the terminal to display `INF` (instead of `usize::MAX`) as the default value.
- A `build_trimming_approach` function was added to construct the appropriate trimming strategy and validate that all required parameters (`--cutoff`, `--headcrop`, `--tailcrop`) have been provided for the selected strategy.

Refactoring
- The `--trim` flag was removed and replaced by `--cutoff`, which must now be used together with the `--trim-approach trim-by-quality` flag.
- The `--trim-approach` flag allows selecting one of three strategies: fixed-length trimming (`fixed-crop`), quality-based trimming (`trim-by-quality`), and best subread extraction (`best-subread`).
- Trimming logic was extracted from the `write_record` function and encapsulated into dedicated structs implementing the `TrimStrategy` trait.
- The `write_record` function was simplified to focus solely on writing reads to output.
- The `filter` function was refactored to improve readability and modularity.
- The `Cli` structure and related tests were updated accordingly.

Bugfix
- Evaluation of `--gcmin` and `--gcmax` was added to the `filter` function when the `--inverse` flag is used.

Testing
- Results were compared against version `0.10.0` installed via Conda.
- Tests were added for the `--inverse` flag combined with `--gcmin` and/or `--gcmax`, where the new logic now takes the GC limits into account when the `--inverse` flag is set.

Benchmark
To evaluate performance, `chopper v0.10.0` (installed via Conda) was compared with the new version using an 8.9 GB FASTQ file. The following parameters were used: […]. Each version was executed 20 times using 1 and 4 threads, and the average execution time (in seconds) was recorded. The table below shows similar performance between both versions, with the new version being slightly faster on average.
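The four-step best-segment search described in the overview can be sketched with Kadane's algorithm over per-base scores. This is an illustrative reimplementation, not the actual Chopper code; the function names and the simplified input (a quality slice rather than a FASTQ record) are assumptions.

```rust
// Sketch of the Highest Quality Subread search: score each base as
// (cutoff error probability - base error probability) and find the
// contiguous segment with the maximum cumulative score (Kadane's algorithm).
fn phred_to_error_prob(q: u8) -> f64 {
    10f64.powf(-(q as f64) / 10.0)
}

/// Returns (start, end) of the highest-scoring segment, or None if every
/// base scores below the cutoff.
fn best_segment(quals: &[u8], cutoff_prob: f64) -> Option<(usize, usize)> {
    let mut best = 0.0_f64;
    let mut best_range: Option<(usize, usize)> = None;
    let mut current = 0.0_f64;
    let mut start = 0;
    for (i, q) in quals.iter().enumerate() {
        // Reset whenever the running score goes negative (step 3).
        if current < 0.0 {
            current = 0.0;
            start = i;
        }
        current += cutoff_prob - phred_to_error_prob(*q);
        if current > best {
            best = current;
            best_range = Some((start, i + 1));
        }
    }
    best_range
}

fn main() {
    // Low-quality flanks (Q2) around a high-quality core (Q30), cutoff Q10:
    // the core at indices 2..5 is selected.
    let quals: [u8; 7] = [2, 2, 30, 30, 30, 2, 2];
    let cutoff = phred_to_error_prob(10);
    assert_eq!(best_segment(&quals, cutoff), Some((2, 5)));
    println!("ok");
}
```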