Slow queries for human genome

Querying human reads against the human genome ([ref](ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz)) is very slow: ~500 RPM, a **1000x slowdown** compared to querying bacterial reads against the same index.

## Technical reason 
This is due to highly repeated k-mers in the index, which cause very large portions of the suffix array (SA) to be checked by the function `bwa_sa2pos`. Re-assembling with `prophyle_assembler` before creating the index brings the speed up to 25k RPM.

The numbers of calls to each function, obtained with `gprof` while querying 1000 simulated reads with (`_compact`) and without (`_repeat`) pre-assembly, are attached.
[human_compact.pdf](https://github.com/prophyle/prophex/files/1963965/human_compact.pdf)
[human_repeat.pdf](https://github.com/prophyle/prophex/files/1963966/human_repeat.pdf)

## Suggested solutions
Since the main use of a human genome index would be read filtering, one possible solution is to implement a new command `prophex filter` which stops querying k-mers as soon as a given threshold of k-mer hits is reached.
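
As a rough illustration of the early-stop idea, here is a minimal Python sketch. It is not ProPhex code: the function name `passes_filter` is hypothetical, a plain Python set of k-mers stands in for the BWT index, and reverse complements are ignored. It only shows that a read can be accepted without looking up its remaining k-mers.

```python
def passes_filter(read, index_kmers, k, threshold):
    """Decide whether `read` reaches `threshold` k-mer hits (threshold >= 1).

    Returns (accepted, lookups), where `lookups` is the number of k-mers
    that were actually queried before a decision was made.
    `index_kmers` is a set standing in for the real index (assumption).
    """
    hits = 0
    lookups = 0
    for i in range(len(read) - k + 1):
        lookups += 1
        if read[i:i + k] in index_kmers:
            hits += 1
            if hits >= threshold:
                # Enough evidence: skip the remaining k-mers of this read.
                return True, lookups
    return False, lookups


if __name__ == "__main__":
    toy_index = {"ACG", "CGT", "GTA"}
    print(passes_filter("ACGTACGT", toy_index, k=3, threshold=2))
```

For reads that clearly belong to the indexed genome, the number of lookups is then bounded by roughly the threshold rather than by the read length.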

Furthermore, we could check only that the SA interval is non-empty instead of retrieving the positions, but this would create some false positives due to k-mers spanning the border of two contigs.
Alternatively, once the interval is computed, we could stop retrieving positions in the text as soon as one occurrence of the k-mer is verified using `bwa_sa2pos`.
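
The difference between the two options can be sketched on a toy example. This is only an illustration under simplifying assumptions: a naively sorted suffix array over concatenated contigs replaces the real BWT index, all helper names are hypothetical, and reverse complements are ignored. With `verify=False` only the interval is tested; with `verify=True` positions are resolved one at a time and the loop stops at the first occurrence that lies inside a single contig.

```python
from bisect import bisect_left, bisect_right


def build_toy_index(contigs):
    """Concatenate contigs and build a naive suffix array (toy sizes only)."""
    text = "".join(contigs)
    starts = [0]  # contig start offsets, with the total length as sentinel
    for c in contigs:
        starts.append(starts[-1] + len(c))
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, sa


def sa_interval(text, sa, kmer):
    """Half-open SA interval [lo, hi) of suffixes starting with `kmer`."""
    k = len(kmer)
    prefixes = [text[i:i + k] for i in sa]  # sorted because the suffixes are
    return bisect_left(prefixes, kmer), bisect_right(prefixes, kmer)


def within_one_contig(starts, pos, k):
    """True if the k-mer at `pos` does not cross a contig border."""
    c = bisect_right(starts, pos) - 1
    return pos + k <= starts[c + 1]


def kmer_present(text, starts, sa, kmer, verify=True):
    lo, hi = sa_interval(text, sa, kmer)
    if not verify:
        return lo < hi  # cheap, but border-spanning k-mers count as hits
    for r in range(lo, hi):
        if within_one_contig(starts, sa[r], len(kmer)):
            return True  # first verified occurrence is enough
    return False


if __name__ == "__main__":
    text, starts, sa = build_toy_index(["ACGT", "TTGA"])
    # "GTT" exists only across the border between the two contigs.
    print(kmer_present(text, starts, sa, "GTT", verify=False))
    print(kmer_present(text, starts, sa, "GTT", verify=True))
```

In the worst case (every occurrence spans a border) the verified variant still walks the whole interval, but for highly repeated k-mers it would typically stop after the first position.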
