Skip to content

Optimizations for re filter #355

@dkharms

Description

@dkharms

In #348, we decided to use anchored regular expressions by default.
This change was necessary to enable several optimizations for regular expression matching.

Prefix and Suffix Literals

Before matching all tokens for a field from a search query, we can extract two literals from the regular expression: prefix and suffix literals. Using these two literals, we can apply two key optimizations:

  • Reduce the number of token blocks to examine by performing a binary search over token blocks using the (strings|bytes).HasPrefix() function (as we already do for literals);
  • For each token we match against, use (strings|bytes).HasSuffix() before running the regular expression NFA.

Set Lookups

Sometimes we can transform a regular expression into a basic search query. For example, consider this query using the re filter: k8s_pod:re("pod-1|pod-2"). It is easy to see that we can rewrite it as k8s_pod:'pod-1' OR k8s_pod:'pod-2'.

I am pretty sure there are many more optimizations we can implement, so this is a topic worth researching further.

Metadata

Metadata

Assignees

Labels

performanceFeatures or improvements that positively affect seq-db performance

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions