Commit d274a85

feat(host): add trie-based host matching for MSSP scalability
Implement a reverse domain trie for efficient host pattern matching, designed to scale for MSSP deployments with hundreds/thousands of hosts.

Changes:
- Add domainTrie data structure with O(m) lookup complexity
- Hybrid approach: trie for simple patterns, filepath.Match fallback for complex
- Priority system ensures most-specific-first matching behavior
- Comprehensive tests and benchmarks

Benchmark results (4 mixed lookups per iteration):

| Hosts  | Slice (old)  | Trie (new) | Speedup       |
|--------|--------------|------------|---------------|
| 10     | 4,901 ns     | 432 ns     | 11x faster    |
| 100    | 53,221 ns    | 419 ns     | 127x faster   |
| 1,000  | 414,463 ns   | 428 ns     | 968x faster   |
| 10,000 | 3,835,689 ns | 453 ns     | 8,468x faster |

Note: For small deployments (1-4 hosts), the existing cache provides sufficient performance. The trie optimization primarily benefits large-scale MSSP deployments.
1 parent 4e2cef1 commit d274a85

5 files changed

+1107
-15
lines changed

pkg/host/TRIE_IMPLEMENTATION.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Host Trie Implementation

## Overview

This document describes the trie-based optimization for host pattern matching, designed to scale efficiently for MSSP deployments with hundreds or thousands of host configurations.

## Problem Statement

The original implementation stored hosts in a slice and ran a linear search with `filepath.Match()` on every request. Each lookup is therefore O(n) in the number of configured hosts, which does not scale to large host counts.
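
For reference, the linear approach described above can be sketched roughly as follows. This is an illustration only: `matchLinear` and its signature are hypothetical names chosen for this example, not the functions used in the codebase.

```go
package main

import "path/filepath"

// matchLinear is a hypothetical sketch of the slice-based lookup:
// it tests every pattern in order and returns the first one that matches.
// The cost grows linearly with len(patterns).
func matchLinear(patterns []string, hostname string) (string, bool) {
	for _, p := range patterns {
		ok, err := filepath.Match(p, hostname)
		if err != nil {
			continue // malformed pattern; skipped in this sketch
		}
		if ok {
			return p, true
		}
	}
	return "", false
}
```

Because the scan stops at the first match, the slice would have to be kept sorted most-specific-first for this to return the intended host.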

## Solution: Reverse Domain Trie

We implemented a **reverse domain trie** that provides O(m) lookup complexity where m is the depth of the domain (typically 2-4 levels), independent of the total number of hosts.

### How It Works

#### Domain Reversal

Domains are reversed before insertion to enable efficient prefix matching:
- `www.example.com` → `["com", "example", "www"]`
- `*.example.com` → `["com", "example", "*"]`
- `*` → `["*"]`

This allows patterns like `*.example.com` to share the common `com → example` path with other patterns for the same domain.
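
A minimal sketch of the reversal step, assuming patterns are plain dot-separated strings. The helper name `reverseLabels` is hypothetical and is not taken from the commit.

```go
package main

import "strings"

// reverseLabels splits a host pattern on "." and returns the labels in
// reverse order, so the top-level domain comes first.
func reverseLabels(pattern string) []string {
	labels := strings.Split(pattern, ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	return labels
}
```

With this helper, `reverseLabels("*.example.com")` yields the labels `com`, `example`, `*`, which is the insertion path shown in the list above.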

#### Trie Structure

```
root
├── com (exact)
│   └── example (exact)
│       ├── www (exact) → Host: www.example.com
│       └── * (wildcard) → Host: *.example.com
└── * (wildcard) → Host: * (catch-all)
```
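
One way such a structure could be represented is a node with a map of children keyed by label. The sketch below is an assumption for illustration: the type `trieNode`, its fields, and the `insert` method are hypothetical and may differ from the `domainTrie` in the commit.

```go
package main

// trieNode is a hypothetical node type: children are keyed by domain
// label (or "*"), and host is set when a pattern terminates at this node.
type trieNode struct {
	children map[string]*trieNode
	host     string
}

func newTrieNode() *trieNode {
	return &trieNode{children: make(map[string]*trieNode)}
}

// insert walks the already-reversed labels from the root, creating
// missing nodes along the way, and records the host on the final node.
func (n *trieNode) insert(reversed []string, host string) {
	cur := n
	for _, label := range reversed {
		next, ok := cur.children[label]
		if !ok {
			next = newTrieNode()
			cur.children[label] = next
		}
		cur = next
	}
	cur.host = host
}
```

Inserting the three patterns from the diagram produces two children under the root (`com` and `*`) and two children under `example` (`www` and `*`).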

#### Matching Algorithm

The `findMatches` function traverses the trie recursively:

1. **Exact match first**: Try to match the current domain segment exactly
2. **Wildcard fallback**: If no exact match found, try the wildcard child
3. **Priority comparison**: When multiple matches are possible, the highest priority wins
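
The steps above can be sketched as a small recursive function. This is a simplified illustration, not the commit's code: the `node` and `best` types and the `lookup` helper are hypothetical, priorities are assumed to be stored on the node, and a `*` child is assumed to absorb all remaining labels (so suffix wildcards such as `example.*` are not handled here). The sketch always considers the wildcard child and lets the priority comparison decide.

```go
package main

import "math"

// node is a hypothetical trie node; pattern is non-empty when a host
// pattern terminates here, and priority is that pattern's priority.
type node struct {
	children map[string]*node
	pattern  string
	priority int
}

// best holds the winning candidate; it is passed by pointer so the
// recursion does not need to build or return slices.
type best struct {
	pattern  string
	priority int
}

// consider replaces the current best if n holds a higher-priority pattern.
func consider(n *node, b *best) {
	if n.pattern != "" && n.priority > b.priority {
		b.pattern, b.priority = n.pattern, n.priority
	}
}

// findMatches descends along exact labels first, then considers the
// wildcard child at each level as a lower-priority candidate.
func findMatches(n *node, labels []string, b *best) {
	if n == nil {
		return
	}
	if len(labels) == 0 {
		consider(n, b)
		return
	}
	findMatches(n.children[labels[0]], labels[1:], b)
	if w := n.children["*"]; w != nil {
		consider(w, b) // assumption: "*" absorbs all remaining labels
	}
}

// lookup returns the best matching pattern for reversed labels,
// or "" when nothing matches.
func lookup(root *node, reversed []string) string {
	b := best{priority: math.MinInt}
	findMatches(root, reversed, &b)
	return b.pattern
}
```

Starting the comparison at `math.MinInt` matters because wildcard patterns have negative priorities under the scheme in this document; an initial value of 0 would never let them win.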

### Priority System

Priority determines which pattern wins when multiple patterns could match:

| Factor | Impact |
|--------|--------|
| Exact match (no wildcards) | +10,000 |
| Pattern length | +10 per character |
| Each wildcard character | -1,000 |

Examples:
- `www.example.com` → 10,000 + 150 = **10,150**
- `*.example.com` → 0 + 130 - 1,000 = **-870**
- `*` → 0 + 10 - 1,000 = **-990**
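
The table translates directly into a small scoring function. The sketch below is illustrative: `patternPriority` is a hypothetical name, and it assumes ASCII patterns so that byte length equals character count.

```go
package main

import "strings"

// patternPriority scores a host pattern using the factors from the table:
// +10 per character, -1,000 per "*", and +10,000 when there is no wildcard.
func patternPriority(pattern string) int {
	wildcards := strings.Count(pattern, "*")
	score := 10*len(pattern) - 1000*wildcards // len counts bytes (ASCII assumed)
	if wildcards == 0 {
		score += 10000
	}
	return score
}
```

Under these weights any exact pattern outranks any wildcard pattern of realistic length, and among wildcard patterns the longer (more specific) one scores higher.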

### Pattern Classification

**Simple patterns** (handled efficiently by trie):
- Exact: `www.example.com`
- Prefix wildcard: `*.example.com`
- Suffix wildcard: `example.*`
- Catch-all: `*`

**Complex patterns** (fallback to `filepath.Match`):
- Middle wildcards: `example.*.com`
- Partial wildcards: `*example.com`, `www*.example.com`
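
A classifier for the two groups could look roughly like the following. This is a sketch under assumptions: `isSimplePattern` is a hypothetical name, patterns with more than one `*` are treated as complex, and patterns using other `filepath.Match` metacharacters (`?`, `[`, `\`) are also routed to the fallback, which the lists above do not state explicitly.

```go
package main

import "strings"

// isSimplePattern reports whether a pattern fits one of the trie-friendly
// shapes: exact, "*.suffix", "prefix.*", or the bare "*".
func isSimplePattern(pattern string) bool {
	// Assumption: other glob metacharacters always need filepath.Match.
	if strings.ContainsAny(pattern, "?[\\") {
		return false
	}
	switch strings.Count(pattern, "*") {
	case 0:
		return true // exact host name
	case 1:
		return pattern == "*" ||
			strings.HasPrefix(pattern, "*.") ||
			strings.HasSuffix(pattern, ".*")
	default:
		return false // multiple wildcards: treated as complex in this sketch
	}
}
```

Patterns for which this returns false would go to the `filepath.Match` fallback rather than into the trie.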

## Performance Characteristics

| Operation | Complexity |
|-----------|------------|
| Lookup | O(m) where m = domain depth |
| Insert | O(m) |
| Delete | O(m) |
| Space | O(n × m) where n = number of hosts |

For typical domains (3-4 segments), lookup is effectively O(1) regardless of the number of hosts stored.

## API

The implementation is transparent - no changes needed to existing code:

```go
manager := host.NewManager(logger)
manager.addHost(host)    // Adds to trie or complexPatterns
manager.removeHost(host) // Removes from trie
matched := manager.MatchFirstHost("api.example.com") // Uses trie for lookup
```

## Key Improvements (v2)

1. **Removed dead code**: Eliminated unused `getAllHosts()` and `collectHosts()` functions
2. **Fixed priority bug**: Priority comparison now uses `math.MinInt` as the initial value
3. **Zero allocations in hot path**: `findMatches` uses pointers instead of returning slices
4. **Better documentation**: Comprehensive comments explaining the algorithm
5. **Cleaner node structure**: Removed unused `priority` field from nodes (calculated on demand)
6. **Edge case handling**: Proper nil/empty checks throughout

## Testing

The implementation includes comprehensive tests covering:
- Single host matching
- Multiple hosts with priority ordering
- Wildcard patterns (prefix, suffix, catch-all)
- Complex wildcard patterns
- Host removal
- Cache behavior
- Edge cases (no hosts, no match)
