idea: optimize rule matching via bit sets

here's a report created as the result of a discussion between me and Claude, on how to improve the performance of rule matching (especially relevant for a Rust implementation of capa). focus on the first idea below.

---

# Bitset-based rule matching for capa

## Two ideas, often conflated

The phrase "represent rules as bitsets" packs together two independent techniques. They compose, but they're worth seeing separately first.

**Idea 1 — Compact truth values.** Boolean intermediates (whether a leaf predicate is satisfied, whether a sub-tree is satisfied) live in single bits inside machine words rather than as Python objects or Rust `bool`s in heap-allocated nodes. The CPU can then combine them with `AND`/`OR`/`XOR`/`NOT` instructions that each touch 64 truth values per cycle. This is a *representation* change. No SIMD required — `u64`-level scalar bit ops on a modern CPU are already enormously parallel relative to per-node interpretation.

**Idea 2 — Cross-axis parallelism.** Once truth values are bits, you can evaluate many things at once by packing them into wide registers (AVX2 = 256 bits, AVX-512 = 512 bits) and running one shape of bitwise operation across all of them. The "many things" can be many rules, many scopes, or many leaf evaluations — different choices give different layouts.

Idea 1 is mature, well-understood, and probably gives capa most of the win. Idea 2 is what shows up in the QuickScorer/Treelite literature; for capa it's a smaller marginal gain on top of Idea 1 because per-rule work is already tiny once Idea 1 is in place.

I'll build Idea 1 carefully and end with where Idea 2 fits.

---

## Step 1: leaves as bits

Number all the leaves of *one rule* `0..K`. For a rule with 5 leaves you have 5 leaf predicates; allocate one `u8` (or `u64` if K ≤ 64) to hold their truth values. After leaf evaluation:

```
bit 4 3 2 1 0
    L4 L3 L2 L1 L0
```

For the moment, treat leaf evaluation as a separate phase — for each leaf, do whatever feature lookup is needed (bitset/perfect-hash/regex) and OR the result into the correct bit position. We'll come back to making this phase fast in a minute.

The point is: after this phase, the entire state of the rule's truth values is one `u64`. Everything downstream is bit math.

## Step 2: pure AND — the easy case

Suppose the rule is `L0 AND L1 AND L2 AND L4` (note: L3 not used). Build one mask once at rule-load time:

```
mask     = 0b10111   // bits 0,1,2,4 set; bit 3 unused
```

Match test:

```rust
let matched = (leaves & mask) == mask;
```

One `AND`, one compare. ~2 cycles. This is the canonical "feature set contains required features" check, and it's already much faster than walking a tree.

## Step 3: AND of ORs — the common capa shape

Most capa rules are AND-of-OR-of-leaves at the top: every clause must hold, each clause is satisfied by any of its alternatives. Two integer masks per clause, one comparison:

```
Rule: (L0 OR L1) AND (L2 OR L3) AND L4

clause_0_mask = 0b00011   // L0, L1
clause_1_mask = 0b01100   // L2, L3
clause_2_mask = 0b10000   // L4
```

```rust
let matched = (leaves & clause_0_mask) != 0
           && (leaves & clause_1_mask) != 0
           && (leaves & clause_2_mask) != 0;
```

Per clause: one `AND`, one compare-against-zero (which on x86 becomes a `TEST` instruction, no register needed for the zero). Three clauses = ~6 cycles. The whole rule body is roughly the cost of a single hash lookup in the Python version.

This is the shape that most existing capa rules will compile to directly. There's no need for CNF conversion or any algebraic transformation — they're already in this form because that's how malware analysts naturally write them.

## Step 4: NOT is a polarity flip

A `NOT(Li)` in a clause means: the clause is satisfied by `Li` being **false**.

The trick is to maintain two pieces of information per clause:

- `clause_mask` — every leaf the clause cares about (positive *or* negative)
- `clause_polarity` — for each leaf in the clause, what value satisfies it (0 = positive, 1 = negative). Bits *not in the clause* don't matter; we'll mask them off.

Then *one* leaf in the clause being in its satisfying state means the clause holds:

```rust
let satisfied = ((leaves ^ clause_polarity) & clause_mask) != 0;
```

Walk through it: the XOR flips the bits of leaves that appear negatively in the clause, leaving positive-occurring leaves alone. After the XOR, every literal that's currently *satisfied* shows up as a `1` in its bit position. The AND with `clause_mask` discards bits outside this clause. If anything remains, the clause is satisfied.

| Leaf                  | In clause? | Polarity | leaves bit | After XOR | Counts toward satisfaction? |
| --------------------- | ---------- | -------- | ---------- | --------- | --------------------------- |
| Positive lit, present | yes        | 0        | 1          | 1         | ✓                           |
| Positive lit, absent  | yes        | 0        | 0          | 0         | —                           |
| Negative lit, present | yes        | 1        | 1          | 0         | —                           |
| Negative lit, absent  | yes        | 1        | 0          | 1         | ✓                           |
| Not in clause         | no         | —        | any        | any       | masked out                  |

The whole machine for AND-of-clauses with arbitrary positive/negative literals is now:

```rust
fn matches(leaves: u64, rule: &Rule) -> bool {
    for clause in &rule.clauses {
        if ((leaves ^ clause.polarity) & clause.mask) == 0 {
            return false;
        }
    }
    true
}
```

For a typical capa rule, that's a short loop with ~3–10 iterations, each ~3 cycles. The compiler unrolls it. Branch prediction is nearly perfect because rules either match consistently or don't.

## Step 5: a top-level OR (or a nested clause that doesn't fit)

So far we've handled "AND of (OR-clauses with optional NOTs on leaves)." What about a top-level OR, or deeper nesting like `(A AND B) OR (C AND D)`?

You have three options:

**5a. Expand into multiple per-rule programs.** For `(A AND B) OR (C AND D)`, store *two* clause programs, one per top-level disjunct, and the rule matches if either succeeds. This works well when there are few alternatives but blows up exponentially for `(A OR B) AND (C OR D) AND (E OR F)` → 8 conjunctions. Don't expand blindly; only expand when the result has, say, ≤ 8 disjuncts.

**5b. Compile to a tiny straight-line bytecode of bit ops.** When expansion is too big, lower the tree to a sequence of operations on a "bit register file." Each register holds one bit; the program is straight-line.

```
Rule: (L0 OR L1) AND (L2 OR L3 OR NOT L4) AND (L5 AND NOT L6)

Compiled program:
  r0 = leaves & 0b0000011        ; OR-clause 1: any of L0,L1
  r0 = r0 != 0
  r1 = (leaves ^ 0b0010000) & 0b0011100  ; OR-clause 2: L2 or L3 or !L4
  r1 = r1 != 0
  r2 = leaves & 0b0100000        ; L5
  r2 = r2 != 0
  r3 = leaves & 0b1000000        ; L6
  r3 = r3 == 0
  result = r0 & r1 & r2 & r3
```

This is dense, branchless, and trivially compiled to a `Vec<Op>` where each Op is a few-byte enum variant. The dispatch loop is a `match` on the opcode; modern Rust lowers this to a jump table. With ~10 opcodes per rule and ~3 cycles each, a rule evaluates in ~30 cycles — and there's no per-rule call/return overhead, no allocation, no pointer chasing.

This is the "bytecode interpreter on bits" design. It's what you should reach for when a rule won't fit cleanly into the AND-of-clauses pattern.

**5c. Truth-table lookup for tiny rules.** If a rule has K ≤ 12 leaves, its truth function is determined by `2^K ≤ 4096` bits of lookup table. Pack those into one 4096-bit array (512 bytes), index by the leaf bits, get the answer in one load:

```rust
fn match_tiny_rule(leaves: u64, rule: &TinyRule) -> bool {
    let idx = leaves as usize & rule.leaf_mask_as_usize;
    rule.truth_table.get(idx)
}
```

For very small rules (K ≤ 6 → 8 bytes of table), the whole table fits in a `u64` and lookup is `(table >> leaves) & 1` — ~2 cycles, fully branchless, regardless of the rule's logical complexity. capa rules with ≤ 6 leaves are extremely common; this is a real optimization, not a curiosity.

The choice between 5a, 5b, 5c is made at rule-load time based on the rule's shape. None of them require codegen.

## Step 6: N-of-M without branches

`n_or_more` constraints (at least N of M children must be true) have a beautiful representation when the children are leaves:

```rust
let satisfied_count = (leaves & member_mask).count_ones();
let matched = satisfied_count >= n;
```

That's one `POPCNT` instruction (native since Nehalem/Bobcat). If the children are sub-expressions rather than leaves, evaluate each into a temporary bit (as in step 5b), accumulate into a small bitset, then `popcount`. Cost: O(M) ops then one popcount.

This is so much faster than the tree-walking version that it's worth checking whether your existing Python implementation's `n_or_more` evaluator is one of the hot spots.

## Step 7: making leaf evaluation cheap

Steps 1–6 assumed leaves were already evaluated into bits. That's the part that actually dominates cost in practice. Two layers help.

**Layer 1 — Interning.** Every feature any rule references gets a `u32` ID at rule-load time. The total ID space is the union across all rules, which for capa is in the low thousands. Per-scope features (the ones extracted from the binary) are also translated through the same interning table; features that nobody references get discarded at extraction.

**Layer 2 — A global feature bitset per scope.** Length `N_referenced` bits, ~2 KB for a few thousand interned features — fits trivially in L1. Membership test is:

```rust
#[inline(always)]
fn has(bv: &[u64], id: u32) -> bool {
    (bv[(id >> 6) as usize] >> (id & 63)) & 1 != 0
}
```

3–5 cycles. One cache line.

**Gathering leaf bits with PEXT.** Now the interesting move: if your rule has K leaves and you've assigned each leaf-i an interned feature ID `f_i`, you'd ideally like a u64 where bit i = `has(global_bv, f_i)`. The BMI2 `PEXT` instruction does exactly this: given a source word and a mask, it gathers the masked bits into the low bits of the result.

```rust
// Precomputed at rule-load: a 64-bit mask M_r where bit f_i is set, for each leaf i of rule r.
// Constraint: all f_i fit in one 64-bit word (i.e., they live in the same chunk of global_bv).
let leaves_for_rule = _pext_u64(global_bv_word, rule.pext_mask);
```

`PEXT` is 3 cycles on Zen3+, 1 cycle on Intel (older AMD pre-Zen3 was microcoded and slow — that's the gotcha). After PEXT, the rule's leaves live in the low K bits of a `u64`, in the order you chose, and steps 1–6 apply directly.

For rules where leaves span multiple 64-bit words of the global bitset (rare — would require >64 distinct features in one rule), do two PEXTs and OR-shift. Or just lay out the global feature space so that hot rules' features cluster into single words; rule-load-time analysis decides this.

For non-x86 platforms without BMI2 (AArch64, RISC-V), the fallback is a manual gather using `clz`/`ctz` and shifts; on AArch64 NEON `BIT`/`BIF`/`TBL` give you a different but workable path. The performance gap is small because the values are in registers throughout.

## Step 8: a worked example end-to-end

Take this rule (lightly stylized but realistic):

```yaml
rule:
  meta:
    name: encrypt data using AES via WinCrypt
  features:
    - and:
      - or:
        - api: advapi32.CryptEncrypt
        - api: bcrypt.BCryptEncrypt
      - or:
        - string: "AES"
        - number: 0x6610  # CALG_AES_128
        - number: 0x660E
        - number: 0x660F
      - not:
        - characteristic: thunk
```

At rule-load time, we assign interned feature IDs:

```
f0 = id("api:advapi32.CryptEncrypt")  = 142
f1 = id("api:bcrypt.BCryptEncrypt")   = 893
f2 = id("string:AES")                 = 2104
f3 = id("number:0x6610")              = 3771
f4 = id("number:0x660E")              = 3769
f5 = id("number:0x660F")              = 3770
f6 = id("characteristic:thunk")       = 17
```

The rule has 7 leaves. Pick a leaf-bit assignment 0..6 within the rule's local u64. Build:

```
pext_mask_word_0 = (1<<17)                              // for f6
pext_mask_word_2 = (1<<(142-128)) | (1<<(2104-2048))    // f0 and f2 happen to be in word_? not this one — illustrative
... etc.
```

In practice, you'd pre-shuffle global bitset layout so that all this rule's feature IDs land in the same global bitset word — this is a separate rule-load-time optimization called *bit-packing for locality* — and then a single PEXT gathers all 7 bits at once.

Three clauses in CNF form:

```
clause_0: (L0 OR L1)               // mask = 0b0000011, polarity = 0
clause_1: (L2 OR L3 OR L4 OR L5)   // mask = 0b0111100, polarity = 0
clause_2: NOT L6                   // mask = 0b1000000, polarity = 0b1000000
```

The compiled per-rule struct is roughly:

```rust
struct Rule {
    pext_mask: u64,            // gathers global → local
    clauses: [Clause; 3],      // const-sized when small
}
struct Clause { mask: u64, polarity: u64 }
```

Match function:

```rust
fn matches(rule: &Rule, global_bv_word: u64) -> bool {
    let leaves = unsafe { _pext_u64(global_bv_word, rule.pext_mask) };
    rule.clauses.iter().all(|c| ((leaves ^ c.polarity) & c.mask) != 0)
}
```

Per rule, on Intel with BMI2: ~1 cycle PEXT + 3×(2 cycles XOR/AND/test) = ~7 cycles. For 1000 rules in a scope: 7000 cycles = ~2 microseconds. Compared to Python's tree walk per rule at ~100 microseconds, that's the ~50× per-rule speedup the QuickScorer/Treelite numbers prepared you for — and it's pure scalar code, no SIMD yet.

## Step 9: Where SIMD actually fits

Up to here, "bitset matching" has bought you huge wins through scalar `u64` ops alone. Three places where stepping up to SIMD pays:

**9a. Batch leaf evaluation across many candidate features.** When extracting features from a function, you might be checking thousands of operands against thousands of "constant value" features. Pack the candidate-value list into AVX-512 vectors, compare against the next reference value, OR into a presence bitset. This is "feature extraction is faster," not "rule matching is faster" — but it feeds the matcher.

**9b. Column-major rule evaluation.** Lay out clauses across rules as 512-bit lanes: instead of "rule r's clause c," think "clause c of rules r, r+8, r+16, …" packed into one AVX-512 register. The `(leaves ^ polarity) & mask != 0` operation becomes a single vector op evaluating 8 rules' clauses in parallel. This requires that the rules-in-a-batch have similar leaf layouts, which means clustering rules by their feature set at load time. Real speedup: ~4–8× on top of scalar bitset matching. Engineering cost: substantial. Worth it only if the bitset matcher itself is the bottleneck — which is unlikely once you've also done pre-filtering.

**9c. The QuickScorer-style scan.** This is the technique you've been seeing referenced. For *regression* decision trees, QuickScorer precomputes, per internal node, a bitmask of "which leaves are still reachable if this node takes the false branch." At inference time, evaluate every internal node's threshold check (these are independent — no tree-walk dependency!), AND together the masks of false-going nodes, find the lowest-set bit = matched leaf. The whole tree becomes a linear scan of node tests plus a bitwise reduction, perfect for SIMD.

For capa, the direct adaptation doesn't quite apply because capa's "leaves" are predicate results, not threshold checks, and the "answer" is one bit, not a leaf index. But there's a useful adapted variant: for each rule, precompute *per-clause* a bitmask of "what leaf assignments satisfy this clause." A clause over K leaves has a satisfaction set ⊆ {0,1}^K. Store the indicator bitset of that set (4 KB at K=12). Rule matches iff every clause's bitset contains `leaves` as an index. This is the "truth table" idea from step 5c generalized per clause rather than per rule, useful when individual clauses are complex but rules have many of them.

In honest engineering terms: 9a and 9c are worth investigating once the basics are in place; 9b is overkill until profiling demands it.

## Putting it all together: the compilation pipeline

At rule-load time:

1. Parse YAML rules into AST.
2. Walk every rule, collect referenced features, intern to `u32` IDs.
3. For each rule, classify by shape:
   - K ≤ 6 leaves → emit truth-table form (8 bytes per rule)
   - AND-of-OR-of-(possibly-negated)-leaves → emit clause-mask form
   - everything else → emit small bitset bytecode
4. Compute pre-filter witness per rule (rarest required feature).
5. Build inverted index `feature_id → Vec<rule_id>` for witness-based pre-filtering.
6. Lay out the global feature bitset to cluster co-referenced features into the same 64-bit chunks (this is a graph-partitioning subproblem; greedy heuristics work fine).
7. Compile per-rule `pext_mask`s for fast leaf gathering.

At scan time, per scope:

1. Extract features from the binary into the global feature bitset (this is your real cost; it's mostly disassembly + string scanning, both already-fast Rust crates).
2. Walk the inverted index to find candidate rules — most rules eliminated here.
3. For each candidate, dispatch to truth-table / clause-mask / bytecode evaluator by shape tag.
4. Collect matches; do a second pass for OPTIONAL evidence gathering on matched rules.

Nothing in this pipeline requires runtime codegen. There's no JIT, no Cranelift, no LLVM. The "compilation" is data-structure construction in plain Rust.

## What this is actually worth for capa

The honest accounting, relative to the Python tree walker today:

- Rust + interning + bitset feature lookup: **5–10× from removing Python overhead and per-node allocation**.
- AND-of-clause-mask matcher per rule: **3–8× on the rule-evaluation phase itself**, *if* dispatch was previously the bottleneck. (Amdahl-bound by feature-extraction time.)
- Witness-based pre-filter: **2–4×** by skipping non-candidate rules entirely. (Capa #830 measured ~50% on Python already; the relative gain in Rust will be similar.)
- Truth-table form for small rules: ~2× *for those rules*, which may be most of the corpus.
- SIMD packing (Idea 2): another 2–4× *on the matcher* if you do it, but only if the matcher is still the bottleneck after the above. Often it isn't.

The compound number is plausibly in the 30–100× range on the matcher itself relative to current Python, of which something like 20–50× is realized as end-to-end wall-clock improvement (because feature extraction doesn't speed up by the same ratio). And critically: **none of this requires JIT compilation, none of it requires Cranelift, none of it ships executable code at runtime**. It's all `Vec<u64>` plus a small `match`-dispatched bytecode for the irregular rules.

That makes it the right *first* thing to build for the Rust port. After it lands and you profile, you'll know with measurements whether anything more exotic is worth pursuing. My strong prior — and I'd happily revise it on data — is that you won't need anything more exotic.

## Sanity-check terms to search for

If you want to chase any of this further, the most productive references are:

- *QuickScorer* (Lucchese et al., SIGIR 2015) and the *RapidScorer* follow-up — for the "tree as bitmask layout" angle, even though capa's adaptation is partial.
- *Aho-Corasick* + *Teddy* (BurntSushi's `aho-corasick` crate docs and the Hyperscan paper) — for fast literal-string feature extraction, which is the bottleneck more often than rule matching itself.
- BDD (binary decision diagram) literature for the formal version of what "lowering boolean expressions to canonical bit-shuffling" looks like at the extreme. You probably don't need BDDs — they're overkill — but understanding them sharpens intuition.
- Database column-store predicate evaluation (DuckDB and Polars source code) — the cleanest open-source examples of vectorized boolean predicate evaluation in modern languages, including the column-major rule-batching idea from §9b.
- *Compiling Pattern Matching to Good Decision Trees* (Maranget, ML 2008) — for the rigorous version of clause ordering and predicate sharing across rules.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

idea: optimize rule matching via bit sets #3074

Bitset-based rule matching for capa

Two ideas, often conflated

Step 1: leaves as bits

Step 2: pure AND — the easy case

Step 3: AND of ORs — the common capa shape

Step 4: NOT is a polarity flip

Step 5: a top-level OR (or a nested clause that doesn't fit)

Step 6: N-of-M without branches

Step 7: making leaf evaluation cheap

Step 8: a worked example end-to-end

Step 9: Where SIMD actually fits

Putting it all together: the compilation pipeline

What this is actually worth for capa

Sanity-check terms to search for

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Leaf	In clause?	Polarity	leaves bit	After XOR	Counts toward satisfaction?
Positive lit, present	yes	0	1	1	✓
Positive lit, absent	yes	0	0	0	—
Negative lit, present	yes	1	1	0	—
Negative lit, absent	yes	1	0	1	✓
Not in clause	no	—	any	any	masked out

idea: optimize rule matching via bit sets #3074

Description

Bitset-based rule matching for capa

Two ideas, often conflated

Step 1: leaves as bits

Step 2: pure AND — the easy case

Step 3: AND of ORs — the common capa shape

Step 4: NOT is a polarity flip

Step 5: a top-level OR (or a nested clause that doesn't fit)

Step 6: N-of-M without branches

Step 7: making leaf evaluation cheap

Step 8: a worked example end-to-end

Step 9: Where SIMD actually fits

Putting it all together: the compilation pipeline

What this is actually worth for capa

Sanity-check terms to search for

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions