Slow queries for human genome #13

@simonepignotti

Querying human reads against the human genome (ref) is very slow: ~500 RPM, a 1000x slowdown compared to querying bacterial reads against the same index.

Technical reason

This is due to highly repeated k-mers in the index, which cause very large portions of the SA to be checked by the function bwa_sa2pos. Re-assembling with prophyle_assembler before creating the index brings the speed up to 25k RPM.

The call counts for each function, obtained with gprof while querying 1000 simulated reads with (_compact) and without (_repeat) pre-assembly, are attached.
human_compact.pdf
human_repeat.pdf
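
To make the cost of repeated k-mers concrete, here is a minimal sketch. It assumes that the width of a k-mer's SA interval equals its number of occurrences in the indexed text, and that every position in the interval is resolved. The function names are hypothetical and this is plain Python, not the ProPhex or BWA code.

```python
from collections import Counter


def kmer_multiplicities(text, k):
    """Count how many times each k-mer occurs in the indexed text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def lookups_needed(read, k, multiplicities):
    """Positions to resolve if every occurrence of every read k-mer is located.

    Assumption: one position lookup per occurrence, i.e. the whole
    SA interval of each k-mer is walked.
    """
    return sum(
        multiplicities.get(read[i:i + k], 0)
        for i in range(len(read) - k + 1)
    )


# A toy, highly repetitive "genome": the same 4-mer repeated 50 times.
repetitive = "ACGT" * 50
mult = kmer_multiplicities(repetitive, 4)
print(lookups_needed("ACGTACGT", 4, mult))
```

The read contains only five k-mers, yet the number of lookups grows with the multiplicity of each of them in the text, which is what makes a repeat-rich index expensive to query.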

Suggested solutions

Since the main use of this would be read filtering, one possible solution is to implement a new command, prophex filter, which stops querying k-mers as soon as a given threshold of k-mer hits is reached.
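
A rough sketch of the early-stopping logic such a command could use. Everything here is an assumption for illustration: `filter_read` and `index_has_kmer` are hypothetical names, and a Python set stands in for the real index.

```python
def filter_read(read, k, threshold, index_has_kmer):
    """Return (passed, kmers_queried) for one read.

    Stops as soon as `threshold` k-mer hits are reached, so the
    remaining k-mers of the read are never queried.
    `index_has_kmer` is a hypothetical membership test on the index;
    `threshold` is assumed to be at least 1.
    """
    hits = 0
    queried = 0
    for i in range(len(read) - k + 1):
        queried += 1
        if index_has_kmer(read[i:i + k]):
            hits += 1
            if hits >= threshold:
                return True, queried
    return False, queried


# Toy index: a set of 3-mers standing in for the real data structure.
toy_index = {"ACG", "CGT", "GTA"}
print(filter_read("ACGTACGT", 3, 2, toy_index.__contains__))
print(filter_read("TTTTTT", 3, 2, toy_index.__contains__))
```

A read that matches stops after a few k-mers, while a read with no hits still costs one query per k-mer, so the gain depends on how many reads are expected to pass the filter.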

Furthermore, we could check only that the SA interval is non-empty instead of retrieving the positions, but this would create some false positives due to k-mers spanning the border of two contigs.
Alternatively, once the interval is computed, we can stop retrieving positions in the text as soon as one occurrence of the k-mer is verified using bwa_sa2pos.
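
The two alternatives can be compared on a toy example. This is only a sketch under stated assumptions: a naive sorted suffix array replaces the real index, contigs are concatenated without separators, the reverse strand is ignored, contigs are non-empty, and all function names are hypothetical.

```python
import bisect


def build_index(contigs):
    """Concatenate contigs and build a naive suffix array (toy sizes only)."""
    text = "".join(contigs)
    starts, offset = [], 0
    for contig in contigs:
        starts.append(offset)  # start offset of each contig in text
        offset += len(contig)
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, sa


def sa_interval(text, sa, kmer):
    """Half-open SA interval [lo, hi) of suffixes that start with kmer."""
    k = len(kmer)
    # Truncated suffixes stay sorted because the suffixes are sorted.
    prefixes = [text[i:i + k] for i in sa]
    return bisect.bisect_left(prefixes, kmer), bisect.bisect_right(prefixes, kmer)


def within_one_contig(pos, k, starts, total_len):
    """True if the k-mer at pos does not cross a contig border."""
    idx = bisect.bisect_right(starts, pos) - 1
    end = starts[idx + 1] if idx + 1 < len(starts) else total_len
    return pos + k <= end


def kmer_present(text, starts, sa, kmer, verify):
    """Return (present, positions_checked).

    verify=False: only test that the interval is non-empty.
    verify=True: resolve positions, stopping at the first verified one.
    """
    lo, hi = sa_interval(text, sa, kmer)
    if not verify:
        return hi > lo, 0
    checked = 0
    for rank in range(lo, hi):
        checked += 1
        if within_one_contig(sa[rank], len(kmer), starts, len(text)):
            return True, checked
    return False, checked


text, starts, sa = build_index(["AAAC", "GTTT"])
print(kmer_present(text, starts, sa, "ACG", verify=False))  # border k-mer
print(kmer_present(text, starts, sa, "ACG", verify=True))
```

In this toy case "ACG" exists only across the border of the two contigs, so the interval-only test reports it while the verified test rejects it. For a highly repeated k-mer, the verified variant usually resolves a single position instead of the whole interval.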
