Description
Querying human reads against the human genome (ref) is very slow: about 500 RPM (reads per minute), a 1000x slowdown compared to querying bacterial reads against the same index.
Technical reason
This is due to highly repeated k-mers in the index, which cause very large portions of the suffix array (SA) to be checked by the function bwa_sa2pos. Re-assembling with prophyle_assembler before creating the index brings the speed up to 25k RPM.
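To illustrate why repeats are costly, here is a toy sketch in Python (not ProPhex code; the naive suffix array and the helper names `build_suffix_array` and `sa_interval` are assumptions made for this example). It shows that the SA interval of a repeated k-mer contains one entry per occurrence, so any step that converts every entry to a text position does work proportional to the repeat count.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    # Naive construction by sorting suffixes; only suitable for a toy example.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, kmer):
    # Half-open interval [lo, hi) of suffixes that start with kmer.
    # Truncating sorted suffixes to len(kmer) keeps them sorted,
    # so a binary search over the truncated prefixes is valid.
    prefixes = [text[i:i + len(kmer)] for i in sa]
    return bisect_left(prefixes, kmer), bisect_right(prefixes, kmer)


# A highly repetitive sequence followed by a short unique tail.
repetitive = "ACGT" * 50 + "TTGACCA"
sa = build_suffix_array(repetitive)

lo, hi = sa_interval(repetitive, sa, "ACGT")
repeat_occurrences = hi - lo   # one SA entry per copy of the repeat

lo, hi = sa_interval(repetitive, sa, "GACC")
unique_occurrences = hi - lo   # a k-mer that occurs once

print(repeat_occurrences, unique_occurrences)
```

In a real BWT-based index the interval itself is cheap to compute; the expensive part is the per-entry position lookup, which this sketch only counts rather than simulates.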
The number of calls to each function, obtained with gprof while querying 1000 simulated reads with (_compact) and without (_repeat) pre-assembly, are attached.
human_compact.pdf
human_repeat.pdf
Suggested solutions
Since the main use case would be read filtering, one possible solution is to implement a new command, prophex filter, which stops querying k-mers from a read as soon as a given threshold of k-mer hits is reached.
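A minimal sketch of the early-stopping idea, in Python. The function `passes_filter`, its parameters, and the set-based stand-in for the index are hypothetical and only illustrate the control flow; they are not an existing ProPhex interface.

```python
def passes_filter(read, k, threshold, kmer_in_index):
    """Return True as soon as `threshold` k-mers of `read` hit the index."""
    hits = 0
    for i in range(len(read) - k + 1):
        if kmer_in_index(read[i:i + k]):
            hits += 1
            if hits >= threshold:
                return True   # stop: the remaining k-mers are never queried
    return False


# Stand-in for the real index: a plain set of k-mers (assumption).
index = {"ACGT", "CGTA", "GTAC"}
queries = []


def lookup(kmer):
    # Record each query so the number of index lookups can be inspected.
    queries.append(kmer)
    return kmer in index


read = "ACGTACGTACGT"          # contains 9 overlapping 4-mers
result = passes_filter(read, 4, 2, lookup)
queries_for_match = len(queries)

queries.clear()
no_match = passes_filter("TTTTTTTT", 4, 2, lookup)
queries_for_no_match = len(queries)
```

With a threshold of 2, the matching read is accepted after two lookups instead of nine, while a read with no hits still requires querying all of its k-mers.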
Furthermore, we could check only that the SA interval is non-empty instead of retrieving the positions, but this would create some false positives due to k-mers spanning the border of two contigs.
Alternatively, once the interval is computed, we can stop retrieving positions in the text as soon as one occurrence of the k-mer is verified using bwa_sa2pos.
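The last two ideas can be sketched together in Python. This is a toy model under stated assumptions: contigs are concatenated into one text, `occurrences` stands in for the SA interval, and `within_one_contig` stands in for the position check; all names are hypothetical. It shows a k-mer that exists only across a contig border (a false positive if only interval emptiness is tested) and a repeated k-mer that is confirmed after a single position lookup.

```python
from bisect import bisect_right

contigs = ["AAACG", "TTTGA", "ACACACACAC"]
text = "".join(contigs)

# Start and end offset of each contig within the concatenated text.
starts, ends, offset = [], [], 0
for contig in contigs:
    starts.append(offset)
    offset += len(contig)
    ends.append(offset)


def within_one_contig(pos, k):
    # Find the contig containing pos and check the k-mer does not run past it.
    c = bisect_right(starts, pos) - 1
    return pos + k <= ends[c]


def occurrences(kmer):
    # Stand-in for the SA interval: every start position of kmer in the text.
    return [i for i in range(len(text) - len(kmer) + 1)
            if text.startswith(kmer, i)]


def kmer_present(kmer):
    """Return (found, positions_checked), stopping at the first valid hit."""
    checked = 0
    for pos in occurrences(kmer):
        checked += 1               # models one position lookup
        if within_one_contig(pos, len(kmer)):
            return True, checked
    return False, checked


border = kmer_present("ACGT")      # exists only across contigs 1 and 2
repeat = kmer_present("ACAC")      # occurs four times inside contig 3
```

Here "ACGT" has a non-empty interval but no valid position, and "ACAC" is confirmed after checking one of its four occurrences. In the worst case, when every occurrence spans a border, the early exit saves nothing.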