| 1 | +# Host Trie Implementation |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This document describes the trie-based optimization for host pattern matching, designed to scale efficiently for MSSP deployments with hundreds or thousands of host configurations. |
| 6 | + |
| 7 | +## Problem Statement |
| 8 | + |
| 9 | +The original implementation stored hosts in a slice and, for each request, searched it linearly, calling `filepath.Match()` on one pattern after another. Each lookup therefore costs O(n) in the number of configured hosts, which does not scale to large host counts. |
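For reference, the linear approach described above can be sketched roughly as follows. The function name `matchLinear` and the use of a plain `[]string` are assumptions made for illustration; the original code is not shown in this document and may have differed.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// matchLinear is a hypothetical reconstruction of the slice-based lookup:
// it tests every pattern in order and returns the first one that matches.
func matchLinear(patterns []string, name string) string {
	for _, p := range patterns {
		// filepath.Match reports an error only for malformed patterns.
		if ok, err := filepath.Match(p, name); err == nil && ok {
			return p
		}
	}
	return "" // no pattern matched
}

func main() {
	patterns := []string{"www.example.com", "*.example.com"}
	fmt.Println(matchLinear(patterns, "api.example.com"))
}
```

Every request walks the whole slice in the worst case, which is the cost the trie is meant to avoid.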
| 10 | + |
| 11 | +## Solution: Reverse Domain Trie |
| 12 | + |
| 13 | +We implemented a **reverse domain trie** that provides O(m) lookup complexity where m is the depth of the domain (typically 2-4 levels), independent of the total number of hosts. |
| 14 | + |
| 15 | +### How It Works |
| 16 | + |
| 17 | +#### Domain Reversal |
| 18 | + |
| 19 | +Each domain is split into labels on `.` and the label order is reversed before insertion, so that patterns under the same parent domain share a trie prefix: |
| 20 | +- `www.example.com` → `["com", "example", "www"]` |
| 21 | +- `*.example.com` → `["com", "example", "*"]` |
| 22 | +- `*` → `["*"]` |
| 23 | + |
| 24 | +This allows patterns like `*.example.com` to share the common `com → example` path with other patterns for the same domain. |
| 25 | + |
| 26 | +#### Trie Structure |
| 27 | + |
| 28 | +``` |
| 29 | +root |
| 30 | +├── com (exact) |
| 31 | +│ └── example (exact) |
| 32 | +│ ├── www (exact) → Host: www.example.com |
| 33 | +│ └── * (wildcard) → Host: *.example.com |
| 34 | +└── * (wildcard) → Host: * (catch-all) |
| 35 | +``` |
| 36 | + |
| 37 | +#### Matching Algorithm |
| 38 | + |
| 39 | +The `findMatches` function traverses the trie recursively: |
| 40 | + |
| 41 | +1. **Exact match first**: Try to match the current domain segment exactly |
| 42 | +2. **Wildcard fallback**: If no exact match found, try the wildcard child |
| 43 | +3. **Priority comparison**: When multiple matches are possible, the highest priority wins |
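The structure and the three steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual `findMatches`: the types `host` and `node`, the helpers `reversed`, `consider`, `findBest`, and `match`, and the precomputed `priority` field are all hypothetical. The sketch also assumes that a pattern ending at a wildcard node absorbs every remaining label, and it always explores the wildcard child after the exact child and lets priority decide, which may differ from the real traversal.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// host is a hypothetical stand-in for the real host configuration type.
type host struct {
	pattern  string
	priority int // assumed precomputed from the Priority System rules
}

// node is one label in the reversed-domain trie.
type node struct {
	children map[string]*node // exact-label children
	wildcard *node            // child for the "*" label, if any
	host     *host            // set when a pattern ends at this node
}

func newNode() *node { return &node{children: map[string]*node{}} }

// reversed splits s on "." and reverses the label order.
func reversed(s string) []string {
	labels := strings.Split(s, ".")
	for i, j := 0, len(labels)-1; i < j; i, j = i+1, j-1 {
		labels[i], labels[j] = labels[j], labels[i]
	}
	return labels
}

// insert walks or creates one node per label and stores h at the end.
func (n *node) insert(h *host) {
	cur := n
	for _, label := range reversed(h.pattern) {
		if label == "*" {
			if cur.wildcard == nil {
				cur.wildcard = newNode()
			}
			cur = cur.wildcard
			continue
		}
		next, ok := cur.children[label]
		if !ok {
			next = newNode()
			cur.children[label] = next
		}
		cur = next
	}
	cur.host = h
}

// consider keeps h in *best if it outranks the current candidate.
func consider(h *host, best **host, bestPrio *int) {
	if h != nil && h.priority > *bestPrio {
		*best, *bestPrio = h, h.priority
	}
}

// findBest tries the exact child first, then the wildcard child.
// Results are written through pointers, so no slices are allocated.
func findBest(n *node, labels []string, best **host, bestPrio *int) {
	if len(labels) == 0 {
		consider(n.host, best, bestPrio)
		return
	}
	if c, ok := n.children[labels[0]]; ok {
		findBest(c, labels[1:], best, bestPrio)
	}
	if w := n.wildcard; w != nil {
		// A pattern that ends at the wildcard node (e.g. *.example.com)
		// absorbs all remaining labels in this sketch.
		consider(w.host, best, bestPrio)
		// Otherwise the wildcard stands for exactly one label.
		findBest(w, labels[1:], best, bestPrio)
	}
}

// match returns the highest-priority matching pattern, or "" if none.
func match(root *node, name string) string {
	var best *host
	bestPrio := math.MinInt // wildcard priorities are negative
	findBest(root, reversed(name), &best, &bestPrio)
	if best == nil {
		return ""
	}
	return best.pattern
}

func main() {
	root := newNode()
	root.insert(&host{"www.example.com", 10150})
	root.insert(&host{"*.example.com", -870})
	root.insert(&host{"*", -990})
	for _, name := range []string{"www.example.com", "api.example.com", "other.org"} {
		fmt.Println(name, "->", match(root, name))
	}
}
```

The work per lookup depends on the number of labels in the requested name and the branches explored along the way, not on how many hosts are stored.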
| 44 | + |
| 45 | +### Priority System |
| 46 | + |
| 47 | +Priority determines which pattern wins when multiple patterns could match: |
| 48 | + |
| 49 | +| Factor | Impact | |
| 50 | +|--------|--------| |
| 51 | +| Exact match (no wildcards) | +10,000 | |
| 52 | +| Pattern length | +10 per character | |
| 53 | +| Each wildcard character | -1,000 | |
| 54 | + |
| 55 | +Examples: |
| 56 | +- `www.example.com` → 10,000 + 150 = **10,150** |
| 57 | +- `*.example.com` → 0 + 130 - 1,000 = **-870** |
| 58 | +- `*` → 0 + 10 - 1,000 = **-990** |
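The arithmetic in these examples can be reproduced with a short function. This is a sketch that assumes `*` is the only wildcard character counted and that pattern length is measured in bytes; the name `priority` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// priority scores a host pattern using the three factors in the table.
// Hypothetical helper; the real calculation may differ in detail.
func priority(pattern string) int {
	wildcards := strings.Count(pattern, "*")
	score := 10 * len(pattern) // +10 per character
	if wildcards == 0 {
		score += 10000 // exact-match bonus
	}
	score -= 1000 * wildcards // -1,000 per wildcard character
	return score
}

func main() {
	for _, p := range []string{"www.example.com", "*.example.com", "*"} {
		fmt.Printf("%-16s %d\n", p, priority(p))
	}
}
```

Because the exact-match bonus is an order of magnitude larger than any realistic length term, an exact pattern always outranks a wildcard pattern for the same host.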
| 59 | + |
| 60 | +### Pattern Classification |
| 61 | + |
| 62 | +**Simple patterns** (handled efficiently by trie): |
| 63 | +- Exact: `www.example.com` |
| 64 | +- Prefix wildcard: `*.example.com` |
| 65 | +- Suffix wildcard: `example.*` |
| 66 | +- Catch-all: `*` |
| 67 | + |
| 68 | +**Complex patterns** (fallback to `filepath.Match`): |
| 69 | +- Middle wildcards: `example.*.com` |
| 70 | +- Partial wildcards: `*example.com`, `www*.example.com` |
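One way to express this classification is sketched below. The name `isSimple` is hypothetical. Two choices are assumptions of the sketch rather than facts from this document: patterns with a wildcard label at both ends are treated as complex, and the other `filepath.Match` metacharacters `?` and `[` also make a pattern complex.

```go
package main

import (
	"fmt"
	"strings"
)

// isSimple reports whether a pattern can be stored in the trie directly.
// Hypothetical helper for illustration only.
func isSimple(pattern string) bool {
	labels := strings.Split(pattern, ".")
	wildcards := 0
	for i, label := range labels {
		if label == "*" {
			if i != 0 && i != len(labels)-1 {
				return false // middle wildcard, e.g. example.*.com
			}
			wildcards++
			continue
		}
		if strings.ContainsAny(label, "*?[") {
			return false // partial wildcard, e.g. www*.example.com
		}
	}
	// Assumption: at most one whole-label wildcard per simple pattern.
	return wildcards <= 1
}

func main() {
	for _, p := range []string{
		"www.example.com", "*.example.com", "example.*", "*",
		"example.*.com", "*example.com", "www*.example.com",
	} {
		fmt.Printf("%-18s simple=%v\n", p, isSimple(p))
	}
}
```

Treating doubtful patterns as complex is the conservative choice, since the `filepath.Match` fallback handles any valid pattern.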
| 71 | + |
| 72 | +## Performance Characteristics |
| 73 | + |
| 74 | +| Operation | Complexity | |
| 75 | +|-----------|------------| |
| 76 | +| Lookup | O(m) where m = domain depth | |
| 77 | +| Insert | O(m) | |
| 78 | +| Delete | O(m) | |
| 79 | +| Space | O(n × m) where n = number of hosts | |
| 80 | + |
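| 81 | +Because m is small and bounded for typical domains (3-4 segments), trie lookup time is effectively constant regardless of the number of hosts stored. These figures apply to simple patterns held in the trie; complex patterns still go through the `filepath.Match` fallback. |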
| 82 | + |
| 83 | +## API |
| 84 | + |
| 85 | +The change is internal to the host manager, so existing callers need no changes: |
| 86 | + |
| 87 | +```go |
| 88 | +manager := host.NewManager(logger) |
| 89 | +manager.addHost(h) // h is a host value; stored in the trie or in complexPatterns |
| 90 | +manager.removeHost(h) // Removes h from wherever it was stored |
| 91 | +matched := manager.MatchFirstHost("api.example.com") // Uses the trie for lookup |
| 92 | +``` |
| 93 | + |
| 94 | +## Key Improvements (v2) |
| 95 | + |
| 96 | +1. **Removed dead code**: Eliminated unused `getAllHosts()` and `collectHosts()` functions |
| 97 | +2. **Fixed priority bug**: Priority comparison now uses `math.MinInt` as the initial value, so patterns with negative priority (such as `*.example.com` at -870) can still win the comparison |
| 98 | +3. **Zero allocations in hot path**: `findMatches` uses pointers instead of returning slices |
| 99 | +4. **Better documentation**: Comprehensive comments explaining the algorithm |
| 100 | +5. **Cleaner node structure**: Removed unused `priority` field from nodes (calculated on demand) |
| 101 | +6. **Edge case handling**: Proper nil/empty checks throughout |
| 102 | + |
| 103 | +## Testing |
| 104 | + |
| 105 | +The implementation includes comprehensive tests covering: |
| 106 | +- Single host matching |
| 107 | +- Multiple hosts with priority ordering |
| 108 | +- Wildcard patterns (prefix, suffix, catch-all) |
| 109 | +- Complex wildcard patterns |
| 110 | +- Host removal |
| 111 | +- Cache behavior |
| 112 | +- Edge cases (no hosts, no match) |