Skip to content

idea: optimize rule matching via bit sets #3074

@williballenthin

Description

@williballenthin

here's a report created as the result of a discussion between me and Claude, on how to improve the performance of rule matching (especially relevant for a Rust implementation of capa). focus on the first idea below.


Bitset-based rule matching for capa

Two ideas, often conflated

The phrase "represent rules as bitsets" packs together two independent techniques. They compose, but they're worth seeing separately first.

Idea 1 — Compact truth values. Boolean intermediates (whether a leaf predicate is satisfied, whether a sub-tree is satisfied) live in single bits inside machine words rather than as Python objects or Rust bools in heap-allocated nodes. The CPU can then combine them with AND/OR/XOR/NOT instructions that each touch 64 truth values per cycle. This is a representation change. No SIMD required — u64-level scalar bit ops on a modern CPU are already enormously parallel relative to per-node interpretation.

Idea 2 — Cross-axis parallelism. Once truth values are bits, you can evaluate many things at once by packing them into wide registers (AVX2 = 256 bits, AVX-512 = 512 bits) and running one shape of bitwise operation across all of them. The "many things" can be many rules, many scopes, or many leaf evaluations — different choices give different layouts.

Idea 1 is mature, well-understood, and probably gives capa most of the win. Idea 2 is what shows up in the QuickScorer/Treelite literature; for capa it's a smaller marginal gain on top of Idea 1 because per-rule work is already tiny once Idea 1 is in place.

I'll build Idea 1 carefully and end with where Idea 2 fits.


Step 1: leaves as bits

Number all the leaves of one rule 0..K. For a rule with 5 leaves you have 5 leaf predicates; allocate one u8 (or u64 if K ≤ 64) to hold their truth values. After leaf evaluation:

bit 4 3 2 1 0
    L4 L3 L2 L1 L0

For the moment, treat leaf evaluation as a separate phase — for each leaf, do whatever feature lookup is needed (bitset/perfect-hash/regex) and OR the result into the correct bit position. We'll come back to making this phase fast in a minute.

The point is: after this phase, the entire state of the rule's truth values is one u64. Everything downstream is bit math.

Step 2: pure AND — the easy case

Suppose the rule is L0 AND L1 AND L2 AND L4 (note: L3 not used). Build one mask once at rule-load time:

mask     = 0b10111   // bits 0,1,2,4 set; bit 3 unused

Match test:

let matched = (leaves & mask) == mask;

One AND, one compare. ~2 cycles. This is the canonical "feature set contains required features" check, and it's already much faster than walking a tree.

Step 3: AND of ORs — the common capa shape

Most capa rules are AND-of-OR-of-leaves at the top: every clause must hold, each clause is satisfied by any of its alternatives. Two integer masks per clause, one comparison:

Rule: (L0 OR L1) AND (L2 OR L3) AND L4

clause_0_mask = 0b00011   // L0, L1
clause_1_mask = 0b01100   // L2, L3
clause_2_mask = 0b10000   // L4
let matched = (leaves & clause_0_mask) != 0
           && (leaves & clause_1_mask) != 0
           && (leaves & clause_2_mask) != 0;

Per clause: one AND, one compare-against-zero (which on x86 becomes a TEST instruction, no register needed for the zero). Three clauses = ~6 cycles. The whole rule body is roughly the cost of a single hash lookup in the Python version.

This is the shape that most existing capa rules will compile to directly. There's no need for CNF conversion or any algebraic transformation — they're already in this form because that's how malware analysts naturally write them.

Step 4: NOT is a polarity flip

A NOT(Li) in a clause means: the clause is satisfied by Li being false.

The trick is to maintain two pieces of information per clause:

  • clause_mask — every leaf the clause cares about (positive or negative)
  • clause_polarity — for each leaf in the clause, what value satisfies it (0 = positive, 1 = negative). Bits not in the clause don't matter; we'll mask them off.

Then one leaf in the clause being in its satisfying state means the clause holds:

let satisfied = ((leaves ^ clause_polarity) & clause_mask) != 0;

Walk through it: the XOR flips the bits of leaves that appear negatively in the clause, leaving positive-occurring leaves alone. After the XOR, every literal that's currently satisfied shows up as a 1 in its bit position. The AND with clause_mask discards bits outside this clause. If anything remains, the clause is satisfied.

Leaf In clause? Polarity leaves bit After XOR Counts toward satisfaction?
Positive lit, present yes 0 1 1
Positive lit, absent yes 0 0 0
Negative lit, present yes 1 1 0
Negative lit, absent yes 1 0 1
Not in clause no any any masked out

The whole machine for AND-of-clauses with arbitrary positive/negative literals is now:

fn matches(leaves: u64, rule: &Rule) -> bool {
    for clause in &rule.clauses {
        if ((leaves ^ clause.polarity) & clause.mask) == 0 {
            return false;
        }
    }
    true
}

For a typical capa rule, that's a short loop with ~3–10 iterations, each ~3 cycles. The compiler unrolls it. Branch prediction is nearly perfect because rules either match consistently or don't.

Step 5: a top-level OR (or a nested clause that doesn't fit)

So far we've handled "AND of (OR-clauses with optional NOTs on leaves)." What about a top-level OR, or deeper nesting like (A AND B) OR (C AND D)?

You have three options:

5a. Expand into multiple per-rule programs. For (A AND B) OR (C AND D), store two clause programs, one per top-level disjunct, and the rule matches if either succeeds. This works well when there are few alternatives but blows up exponentially for (A OR B) AND (C OR D) AND (E OR F) → 8 conjunctions. Don't expand blindly; only expand when the result has, say, ≤ 8 disjuncts.

5b. Compile to a tiny straight-line bytecode of bit ops. When expansion is too big, lower the tree to a sequence of operations on a "bit register file." Each register holds one bit; the program is straight-line.

Rule: (L0 OR L1) AND (L2 OR L3 OR NOT L4) AND (L5 AND NOT L6)

Compiled program:
  r0 = leaves & 0b0000011        ; OR-clause 1: any of L0,L1
  r0 = r0 != 0
  r1 = (leaves ^ 0b0010000) & 0b0011100  ; OR-clause 2: L2 or L3 or !L4
  r1 = r1 != 0
  r2 = leaves & 0b0100000        ; L5
  r2 = r2 != 0
  r3 = leaves & 0b1000000        ; L6
  r3 = r3 == 0
  result = r0 & r1 & r2 & r3

This is dense, branchless, and trivially compiled to a Vec<Op> where each Op is a few-byte enum variant. The dispatch loop is a match on the opcode; modern Rust lowers this to a jump table. With ~10 opcodes per rule and ~3 cycles each, a rule evaluates in ~30 cycles — and there's no per-rule call/return overhead, no allocation, no pointer chasing.

This is the "bytecode interpreter on bits" design. It's what you should reach for when a rule won't fit cleanly into the AND-of-clauses pattern.

5c. Truth-table lookup for tiny rules. If a rule has K ≤ 12 leaves, its truth function is determined by 2^K ≤ 4096 bits of lookup table. Pack those into one 4096-bit array (512 bytes), index by the leaf bits, get the answer in one load:

fn match_tiny_rule(leaves: u64, rule: &TinyRule) -> bool {
    let idx = leaves as usize & rule.leaf_mask_as_usize;
    rule.truth_table.get(idx)
}

For very small rules (K ≤ 6 → 8 bytes of table), the whole table fits in a u64 and lookup is (table >> leaves) & 1 — ~2 cycles, fully branchless, regardless of the rule's logical complexity. capa rules with ≤ 6 leaves are extremely common; this is a real optimization, not a curiosity.

The choice between 5a, 5b, 5c is made at rule-load time based on the rule's shape. None of them require codegen.

Step 6: N-of-M without branches

n_or_more constraints (at least N of M children must be true) have a beautiful representation when the children are leaves:

let satisfied_count = (leaves & member_mask).count_ones();
let matched = satisfied_count >= n;

That's one POPCNT instruction (native since Nehalem/Bobcat). If the children are sub-expressions rather than leaves, evaluate each into a temporary bit (as in step 5b), accumulate into a small bitset, then popcount. Cost: O(M) ops then one popcount.

This is so much faster than the tree-walking version that it's worth checking whether your existing Python implementation's n_or_more evaluator is one of the hot spots.

Step 7: making leaf evaluation cheap

Steps 1–6 assumed leaves were already evaluated into bits. That's the part that actually dominates cost in practice. Two layers help.

Layer 1 — Interning. Every feature any rule references gets a u32 ID at rule-load time. The total ID space is the union across all rules, which for capa is in the low thousands. Per-scope features (the ones extracted from the binary) are also translated through the same interning table; features that nobody references get discarded at extraction.

Layer 2 — A global feature bitset per scope. Length N_referenced bits, ~2 KB for a few thousand interned features — fits trivially in L1. Membership test is:

#[inline(always)]
fn has(bv: &[u64], id: u32) -> bool {
    (bv[(id >> 6) as usize] >> (id & 63)) & 1 != 0
}

3–5 cycles. One cache line.

Gathering leaf bits with PEXT. Now the interesting move: if your rule has K leaves and you've assigned each leaf-i an interned feature ID f_i, you'd ideally like a u64 where bit i = has(global_bv, f_i). The BMI2 PEXT instruction does exactly this: given a source word and a mask, it gathers the masked bits into the low bits of the result.

// Precomputed at rule-load: a 64-bit mask M_r where bit f_i is set, for each leaf i of rule r.
// Constraint: all f_i fit in one 64-bit word (i.e., they live in the same chunk of global_bv).
let leaves_for_rule = _pext_u64(global_bv_word, rule.pext_mask);

PEXT is 3 cycles on Zen3+, 1 cycle on Intel (older AMD pre-Zen3 was microcoded and slow — that's the gotcha). After PEXT, the rule's leaves live in the low K bits of a u64, in the order you chose, and steps 1–6 apply directly.

For rules where leaves span multiple 64-bit words of the global bitset (rare — would require >64 distinct features in one rule), do two PEXTs and OR-shift. Or just lay out the global feature space so that hot rules' features cluster into single words; rule-load-time analysis decides this.

For non-x86 platforms without BMI2 (AArch64, RISC-V), the fallback is a manual gather using clz/ctz and shifts; on AArch64 NEON BIT/BIF/TBL give you a different but workable path. The performance gap is small because the values are in registers throughout.

Step 8: a worked example end-to-end

Take this rule (lightly stylized but realistic):

rule:
  meta:
    name: encrypt data using AES via WinCrypt
  features:
    - and:
      - or:
        - api: advapi32.CryptEncrypt
        - api: bcrypt.BCryptEncrypt
      - or:
        - string: "AES"
        - number: 0x6610  # CALG_AES_128
        - number: 0x660E
        - number: 0x660F
      - not:
        - characteristic: thunk

At rule-load time, we assign interned feature IDs:

f0 = id("api:advapi32.CryptEncrypt")  = 142
f1 = id("api:bcrypt.BCryptEncrypt")   = 893
f2 = id("string:AES")                 = 2104
f3 = id("number:0x6610")              = 3771
f4 = id("number:0x660E")              = 3769
f5 = id("number:0x660F")              = 3770
f6 = id("characteristic:thunk")       = 17

The rule has 7 leaves. Pick a leaf-bit assignment 0..6 within the rule's local u64. Build:

pext_mask_word_0 = (1<<17)                              // for f6
pext_mask_word_2 = (1<<(142-128)) | (1<<(2104-2048))    // f0 and f2 happen to be in word_? not this one — illustrative
... etc.

In practice, you'd pre-shuffle global bitset layout so that all this rule's feature IDs land in the same global bitset word — this is a separate rule-load-time optimization called bit-packing for locality — and then a single PEXT gathers all 7 bits at once.

Three clauses in CNF form:

clause_0: (L0 OR L1)               // mask = 0b0000011, polarity = 0
clause_1: (L2 OR L3 OR L4 OR L5)   // mask = 0b0111100, polarity = 0
clause_2: NOT L6                   // mask = 0b1000000, polarity = 0b1000000

The compiled per-rule struct is roughly:

struct Rule {
    pext_mask: u64,            // gathers global → local
    clauses: [Clause; 3],      // const-sized when small
}
struct Clause { mask: u64, polarity: u64 }

Match function:

fn matches(rule: &Rule, global_bv_word: u64) -> bool {
    let leaves = unsafe { _pext_u64(global_bv_word, rule.pext_mask) };
    rule.clauses.iter().all(|c| ((leaves ^ c.polarity) & c.mask) != 0)
}

Per rule, on Intel with BMI2: ~1 cycle PEXT + 3×(2 cycles XOR/AND/test) = ~7 cycles. For 1000 rules in a scope: 7000 cycles = ~2 microseconds. Compared to Python's tree walk per rule at ~100 microseconds, that's the ~50× per-rule speedup the QuickScorer/Treelite numbers prepared you for — and it's pure scalar code, no SIMD yet.

Step 9: Where SIMD actually fits

Up to here, "bitset matching" has bought you huge wins through scalar u64 ops alone. Three places where stepping up to SIMD pays:

9a. Batch leaf evaluation across many candidate features. When extracting features from a function, you might be checking thousands of operands against thousands of "constant value" features. Pack the candidate-value list into AVX-512 vectors, compare against the next reference value, OR into a presence bitset. This is "feature extraction is faster," not "rule matching is faster" — but it feeds the matcher.

9b. Column-major rule evaluation. Lay out clauses across rules as 512-bit lanes: instead of "rule r's clause c," think "clause c of rules r, r+8, r+16, …" packed into one AVX-512 register. The (leaves ^ polarity) & mask != 0 operation becomes a single vector op evaluating 8 rules' clauses in parallel. This requires that the rules-in-a-batch have similar leaf layouts, which means clustering rules by their feature set at load time. Real speedup: ~4–8× on top of scalar bitset matching. Engineering cost: substantial. Worth it only if the bitset matcher itself is the bottleneck — which is unlikely once you've also done pre-filtering.

9c. The QuickScorer-style scan. This is the technique you've been seeing referenced. For regression decision trees, QuickScorer precomputes, per internal node, a bitmask of "which leaves are still reachable if this node takes the false branch." At inference time, evaluate every internal node's threshold check (these are independent — no tree-walk dependency!), AND together the masks of false-going nodes, find the lowest-set bit = matched leaf. The whole tree becomes a linear scan of node tests plus a bitwise reduction, perfect for SIMD.

For capa, the direct adaptation doesn't quite apply because capa's "leaves" are predicate results, not threshold checks, and the "answer" is one bit, not a leaf index. But there's a useful adapted variant: for each rule, precompute per-clause a bitmask of "what leaf assignments satisfy this clause." A clause over K leaves has a satisfaction set ⊆ {0,1}^K. Store the indicator bitset of that set (4 KB at K=12). Rule matches iff every clause's bitset contains leaves as an index. This is the "truth table" idea from step 5c generalized per clause rather than per rule, useful when individual clauses are complex but rules have many of them.

In honest engineering terms: 9a and 9c are worth investigating once the basics are in place; 9b is overkill until profiling demands it.

Putting it all together: the compilation pipeline

At rule-load time:

  1. Parse YAML rules into AST.
  2. Walk every rule, collect referenced features, intern to u32 IDs.
  3. For each rule, classify by shape:
    • K ≤ 6 leaves → emit truth-table form (8 bytes per rule)
    • AND-of-OR-of-(possibly-negated)-leaves → emit clause-mask form
    • everything else → emit small bitset bytecode
  4. Compute pre-filter witness per rule (rarest required feature).
  5. Build inverted index feature_id → Vec<rule_id> for witness-based pre-filtering.
  6. Lay out the global feature bitset to cluster co-referenced features into the same 64-bit chunks (this is a graph-partitioning subproblem; greedy heuristics work fine).
  7. Compile per-rule pext_masks for fast leaf gathering.

At scan time, per scope:

  1. Extract features from the binary into the global feature bitset (this is your real cost; it's mostly disassembly + string scanning, both already-fast Rust crates).
  2. Walk the inverted index to find candidate rules — most rules eliminated here.
  3. For each candidate, dispatch to truth-table / clause-mask / bytecode evaluator by shape tag.
  4. Collect matches; do a second pass for OPTIONAL evidence gathering on matched rules.

Nothing in this pipeline requires runtime codegen. There's no JIT, no Cranelift, no LLVM. The "compilation" is data-structure construction in plain Rust.

What this is actually worth for capa

The honest accounting, relative to the Python tree walker today:

  • Rust + interning + bitset feature lookup: 5–10× from removing Python overhead and per-node allocation.
  • AND-of-clause-mask matcher per rule: 3–8× on the rule-evaluation phase itself, if dispatch was previously the bottleneck. (Amdahl-bound by feature-extraction time.)
  • Witness-based pre-filter: 2–4× by skipping non-candidate rules entirely. (Capa perf: don't try to match rules that will never match #830 measured ~50% on Python already; the relative gain in Rust will be similar.)
  • Truth-table form for small rules: ~2× for those rules, which may be most of the corpus.
  • SIMD packing (Idea 2): another 2–4× on the matcher if you do it, but only if the matcher is still the bottleneck after the above. Often it isn't.

The compound number is plausibly in the 30–100× range on the matcher itself relative to current Python, of which something like 20–50× is realized as end-to-end wall-clock improvement (because feature extraction doesn't speed up by the same ratio). And critically: none of this requires JIT compilation, none of it requires Cranelift, none of it ships executable code at runtime. It's all Vec<u64> plus a small match-dispatched bytecode for the irregular rules.

That makes it the right first thing to build for the Rust port. After it lands and you profile, you'll know with measurements whether anything more exotic is worth pursuing. My strong prior — and I'd happily revise it on data — is that you won't need anything more exotic.

Sanity-check terms to search for

If you want to chase any of this further, the most productive references are:

  • QuickScorer (Lucchese et al., SIGIR 2015) and the RapidScorer follow-up — for the "tree as bitmask layout" angle, even though capa's adaptation is partial.
  • Aho-Corasick + Teddy (BurntSushi's aho-corasick crate docs and the Hyperscan paper) — for fast literal-string feature extraction, which is the bottleneck more often than rule matching itself.
  • BDD (binary decision diagram) literature for the formal version of what "lowering boolean expressions to canonical bit-shuffling" looks like at the extreme. You probably don't need BDDs — they're overkill — but understanding them sharpens intuition.
  • Database column-store predicate evaluation (DuckDB and Polars source code) — the cleanest open-source examples of vectorized boolean predicate evaluation in modern languages, including the column-major rule-batching idea from §9b.
  • Compiling Pattern Matching to Good Decision Trees (Maranget, ML 2008) — for the rigorous version of clause ordering and predicate sharing across rules.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions