Commit f0317b6
Perf docs (#95)

* dpapi masterkey comment
* more verbose CLI errors, ignore swp files
* update enrichment modules doc
* update traefik
* Infra and perf improvements
  - Switched to Debian postgres image after benchmarking it faster
  - Upgraded PostgreSQL to v18.1
  - Made postgres port configurable
  - Upgraded RabbitMQ to v4.2.0
  - Added RabbitMQ detailed rates for better prom metrics
  - Jaeger config changes to reduce memory consumption over time
* perf docs, jaeger fix

1 parent 726ebc6 commit f0317b6

File tree

16 files changed: +129 −101 lines

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -182,3 +182,9 @@ projects/dotnet_service/bin/
 projects/jupyter/notebooks/*.csv
 
 projects/frontend/public/env.js
+
+# Vim swap files
+*.swp
+*.swo
+*.swn
+*~

compose.prod.build.yaml

Lines changed: 3 additions & 0 deletions
@@ -31,12 +31,15 @@ services:
       dockerfile: ./projects/file_enrichment/Dockerfile
       target: prod
 
+  ### Replica #1
   # file-enrichment-1:
   #   <<: *file-enrichment-build
 
+  ### Replica #2
   # file-enrichment-2:
   #   <<: *file-enrichment-build
 
+  ### Replica #3
   # file-enrichment-3:
   #   <<: *file-enrichment-build
 

compose.yaml

Lines changed: 17 additions & 6 deletions
@@ -324,6 +324,7 @@ services:
       rabbitmq: { condition: service_healthy }
     network_mode: "service:file-enrichment"
 
+  ### Replica #1
   # file-enrichment-1:
   #   <<: *file-enrichment-template
   #   environment:
@@ -338,6 +339,7 @@ services:
   #     rabbitmq: { condition: service_healthy }
   #   network_mode: "service:file-enrichment-1"
 
+  ### Replica #2
   # file-enrichment-2:
   #   <<: *file-enrichment-template
   #   environment:
@@ -352,6 +354,7 @@ services:
   #     rabbitmq: { condition: service_healthy }
   #   network_mode: "service:file-enrichment-2"
 
+  ### Replica #3
   # file-enrichment-3:
   #   <<: *file-enrichment-template
   #   environment:
@@ -548,7 +551,7 @@ services:
       # - TIKA_OCR_LANGUAGES=${TIKA_OCR_LANGUAGES:-eng chi_sim chi_tra jpn rus deu spa}
       # Note: each package installed will increase the image size!
       - TIKA_OCR_LANGUAGES=${TIKA_OCR_LANGUAGES:-eng}
-      - MAX_PARALLEL_WORKFLOWS=${DOCUMENTCONVERSION_WORKERS:-5}
+      - MAX_PARALLEL_WORKFLOWS=${DOCUMENTCONVERSION_MAX_PARALLEL_WORKFLOWS:-5}
       - MAX_WORKFLOW_EXECUTION_TIME=${MAX_WORKFLOW_EXECUTION_TIME:-300}
       - LOG_LEVEL=${LOG_LEVEL:-INFO}
       - OMP_THREAD_LIMIT=1 # Limit the number of tesseract instances
@@ -707,7 +710,7 @@ services:
     restart: "no"
 
   rabbitmq:
-    image: rabbitmq:4.1.2-management
+    image: rabbitmq:4.2.0-management
     hostname: rabbitmq-node # have to do this for persistence reasons
     environment:
       - RABBITMQ_DEFAULT_PASS=${RABBITMQ_PASSWORD:?}
@@ -717,6 +720,7 @@ services:
     volumes:
       - rabbitmq_data:/var/lib/rabbitmq
       - ./infra/rabbitmq/enabled_plugins:/etc/rabbitmq/enabled_plugins:ro
+      - ./infra/rabbitmq/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf:ro
     healthcheck:
       test: ["CMD", "rabbitmq-diagnostics", "check_port_connectivity"]
       interval: 10s
@@ -730,7 +734,7 @@ services:
       - "traefik.http.routers.rabbitmq-ui.rule=PathPrefix(`/rabbitmq`)"
 
   postgres:
-    image: postgres:17.6-alpine
+    image: postgres:18.1
     command:
       [
         "postgres",
@@ -740,14 +744,14 @@ services:
         "-c", "pg_stat_statements.track=all"
       ]
     ports:
-      - "5432:5432"
+      - "${POSTGRES_EXTERNAL_PORT:-}5432"
     environment:
       POSTGRES_DB: ${POSTGRES_DB:-enrichment}
       POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?}
       POSTGRES_USER: ${POSTGRES_USER:?}
     volumes:
      - ./infra/postgres:/docker-entrypoint-initdb.d:ro
-      - postgres_data:/var/lib/postgresql/data
+      - postgres_data:/var/lib/postgresql
     healthcheck:
       test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d enrichment"]
       interval: 5s
@@ -792,7 +796,7 @@ services:
       - "traefik.http.routers.hasura.tls=true"
 
   traefik:
-    image: traefik:v3.4.4
+    image: traefik:v3.6.1
     command:
       - "--api.insecure=true"
       # - "--log.level=DEBUG"
@@ -882,8 +886,15 @@ services:
     profiles: ["monitoring"]
     image: jaegertracing/jaeger:latest # v2.x image
     user: "0:0"
+    deploy:
+      resources:
+        limits:
+          memory: 2G
+        reservations:
+          memory: 1G
     environment:
       - QUERY_BASE_PATH=/jaeger
+      - GOGC=80
     volumes:
       - ./infra/jaeger/jaeger-config.yaml:/etc/jaeger/config.yaml
       - jaeger_data:/badger # Persist trace data
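Since this commit bumps PostgreSQL, RabbitMQ, and Traefik in one go, a quick hedged way to confirm a running stack picked up the new tags (a sketch; output format varies across Compose versions):

```bash
# Show which image tags the project's containers are currently running
docker compose images | grep -Ei 'postgres|rabbitmq|traefik'
```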

docs/enrichment_configuration.md

Lines changed: 9 additions & 9 deletions
@@ -30,20 +30,20 @@ Currently the PII module detects the following entity types: `CREDIT_CARD`, `US_
 
 The [Document Conversion service](https://github.com/SpecterOps/Nemesis/tree/main/projects/document_conversion) has several ENV variables that can be passed through from the environment launching Nemesis, or modified in [compose.yaml](https://github.com/SpecterOps/Nemesis/blob/main/compose.yaml):
 
-| ENV Variable                  | Default Value | Description                                                      |
-| ----------------------------- | ------------- | ---------------------------------------------------------------- |
-| `MAX_PARALLEL_WORKFLOWS`      | 5             | Maxmimum number of parallel conversion workflows allows          |
-| `MAX_WORKFLOW_EXECUTION_TIME` | 300           | Maximum time (in seconds) workflows can run before being killed  |
-| `TIKA_USE_OCR`                | false         | Set to `true` to enable OCR support via Tessaract                |
-| `TIKA_OCR_LANGUAGES`          | eng           | Tika/Tesseract OCR languages supported.                          |
+| ENV Variable                                | Default Value | Description                                                      |
+|---------------------------------------------|---------------|------------------------------------------------------------------|
+| `DOCUMENTCONVERSION_MAX_PARALLEL_WORKFLOWS` | 5             | Maximum number of parallel conversion workflows allowed          |
+| `MAX_WORKFLOW_EXECUTION_TIME`               | 300           | Maximum time (in seconds) workflows can run before being killed  |
+| `TIKA_USE_OCR`                              | false         | Set to `true` to enable OCR support via Tesseract                |
+| `TIKA_OCR_LANGUAGES`                        | eng           | Tika/Tesseract OCR languages supported.                          |
 
-If you want to have additional language packs supported (see https://github.com/tesseract-ocr/tessdata for a full list), run something like this before launching Nemesis or set the value in your .env :
+If you want additional language packs supported (see https://github.com/tesseract-ocr/tessdata for a full list), run something like this before launching Nemesis or set the value in your `.env` file:
 
 ```bash
 export TIKA_OCR_LANGUAGES="eng chi_sim chi_tra jpn rus deu spa"
 ```
 
-**NOTE:** due to Docker's ENV variable substitution, setting `TIKA_USE_OCR=false` will be interpreted as true - either removing `TIKA_USE_OCR` from an .env file or setting `TIKA_USE_OCR=""` will disable OCR (the default).
+**NOTE:** due to Docker's ENV variable substitution, setting `TIKA_USE_OCR=false` will be interpreted as true - either removing `TIKA_USE_OCR` from an .env file or setting `TIKA_USE_OCR=""` will disable OCR (the default). Enabling OCR significantly increases CPU usage, as it will OCR standalone images as well as all images embedded in documents.
 
 ## Nosey Parker
 
@@ -52,7 +52,7 @@ export TIKA_OCR_LANGUAGES="eng chi_sim chi_tra jpn rus deu spa"
 The [Nosey Parker scanner service](https://github.com/SpecterOps/Nemesis/tree/main/projects/noseyparker_scanner) has several ENV variables that can be passed through from the environment launching Nemesis, or modified in [compose.yaml](https://github.com/SpecterOps/Nemesis/blob/main/compose.yaml):
 
 | ENV Variable           | Default Value | Description                                                                      |
-| ---------------------- | ------------- | -------------------------------------------------------------------------------- |
+|------------------------|---------------|----------------------------------------------------------------------------------|
 | `SNIPPET_LENGTH`       | 512           | Bytes of context length around Nosey Parker matches to pull in for findings      |
 | `MAX_CONCURRENT_FILES` | 2             | Maximum number of concurrent files to scan (raising increases resources needed)  |
 | `MAX_FILE_SIZE_MB`     | 200           | Maximum file size to scan (in megabytes)                                         |
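Circling back to the OCR note above: a minimal sketch of toggling `TIKA_USE_OCR` from the shell that launches Nemesis, given Docker's substitution behavior:

```bash
# Any non-empty value (even "false") enables OCR under Docker's substitution rules
export TIKA_USE_OCR=true

# Disable OCR (the default): unset the variable or set it to the empty string
export TIKA_USE_OCR=""
```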

docs/file_enrichment_modules.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ Then in this folder, run `poetry add X` to add a new library. The dynamic module
 
 ## Tips / Tricks
 
-The async `should_process()` function determines if the module should run on a file. You can either check the name or any other component of the base enriched file with `file_enriched = get_file_enriched(object_id)`:
+The async `should_process()` function determines if the module should run on a file. You can check the name or any other component of the base enriched file with `file_enriched = await get_file_enriched(object_id, self.asyncpg_pool)`:
 
 ```python
 ...

docs/performance.md

Lines changed: 47 additions & 41 deletions
@@ -1,59 +1,65 @@
 # Nemesis Performance Tuning
 
-This document details different ways to monitor and tune Nemesis's performance. Nemesis's performs differently depending on a variety of factors, including the host's architecture and resources (particularly CPU and RAM) and the workload (e.g. the number of files and imbalances in the the number of documents, .NET assemblies, source code, etc.).
+This document details different ways to monitor and tune Nemesis's performance. Nemesis performs differently depending on a variety of factors, including the host's architecture and resources (particularly CPU, RAM, and disk speed) and the workload (e.g. the number of files and imbalances in the number of documents, .NET assemblies, source code, etc.).
 
-If workflows begin to fail, or you are experiencing major performance issues (as diagnosed by the [Troubleshooting](troubleshooting.md) document) there are a few tunable parameters that can help. Alternatively, if your performance is fine already and you want to potentially increase performance more or potentially reduce CPU/RAM usage (to save $$$), you can adjust these values. Most/all of these values involve altering behaviors the docker services responsible for file enrichment, namely the `file-enrichment`, `document-conversion`, and `noseyparker-scanner` services.
+If workflows begin to fail, or you are experiencing major performance issues, there are a few tunable settings that can help. Alternatively, if performance is already fine and you want to squeeze out more throughput or reduce CPU/RAM usage (to save $$$), you can adjust the same values. This document primarily focuses on increasing performance, but you can of course adjust the settings down to decrease resource usage.
+
+# Hardware Resourcing
+The first thing to check is whether Nemesis has enough hardware resources.
+
+## CPU
+
+Under load, monitor CPU usage (e.g. with `top`/`htop`, or the "Node Exporter" Grafana dashboard if monitoring is enabled in Nemesis). If the cores are not all maxed out, continue through this guide. Otherwise, you'll need to increase CPU resources, since Nemesis is primarily CPU bound.
+
+## RAM
+Under load, monitor RAM usage (e.g. with `top`/`htop`, `free -h`, or the "Node Exporter" Grafana dashboard if monitoring is enabled in Nemesis). Ensure that memory is not fully consumed; otherwise, you will need to increase RAM.
+
+Note that Nemesis will buffer/cache data in memory when it can. Minio in particular will use any available RAM to cache file data. This memory is reclaimable, and therefore still usable by other services/applications. We recommend having at least 1 GB of cache memory available. More may improve performance, but for the most part Nemesis is CPU bound, not RAM bound. You can apply [docker compose memory limits](https://docs.docker.com/reference/compose-file/deploy/#resources) to specific services if you want to constrain how much RAM Minio consumes.
+
+## Disk
+Disk requirements vary widely depending on your workload size. A general rule of thumb is 3x the total size of the files being uploaded (e.g. plan for ~300 GB of free space if you expect ~100 GB of uploads). Use SSDs if possible.
 
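For example, a few quick host-level checks (standard tools, nothing Nemesis-specific):

```bash
# Overall CPU saturation and load
htop                      # or: top

# How much memory is truly used vs. reclaimable buffer/cache
free -h

# Per-container CPU/RAM, to spot which Nemesis service is the hot spot
docker stats --no-stream
```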
 # Analyzing Your Workload
 ## Analyzing Queues
-Normally people realize Nemesis isn't going fast enough after uploading a bunch of files and it taking forever to process. Usually this is indicative that a bunch of files get queued up for processing, but aren't be processed fast enough. You can confirm this by [analyzing the message queues](./troubleshooting.md#analyze-message-queues).
+Normally people realize Nemesis isn't going fast enough after uploading a bunch of files and watching them take forever to process. Usually this means files are being queued for processing faster than they can be consumed. You can confirm this by [analyzing the message queues](./troubleshooting.md#analyze-message-queues) in Nemesis/RabbitMQ.
+
+In RabbitMQ, `Ready` counts signify messages waiting to be processed, and the `delivery / get` rates (messages per second) give you an idea of the processing speed. The following table shows the service-to-queue mappings:
+
+| Docker Service      | Queue Name                      | Description                                                                                 |
+|---------------------|---------------------------------|---------------------------------------------------------------------------------------------|
+| file_enrichment     | files-new_file                  | Uploaded files that haven't begun processing                                                |
+| document_conversion | files-document_conversion_input | Files waiting to go through document_conversion (strings, text extraction, PDF conversion)  |
+| dotnet_service      | dotnet-dotnet_input             | Files waiting for .NET decompilation and assembly inspection                                |
+| noseyparker-scanner | noseyparker-noseyparker_input   | Files waiting to be scanned by noseyparker                                                  |
+
+If the queue message rates are too slow, you can adjust some settings to try to increase performance. The following sections detail the best bang-for-the-buck service-specific adjustments you can make.
 
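As a quick alternative to the UI, a hedged sketch of pulling queue depths and rates from the RabbitMQ management API (the `/rabbitmq` path prefix matches the Traefik rule in compose.yaml; the username and port here are assumptions, so substitute whatever your `.env` defines):

```bash
# Ready counts and delivery rates per queue via the management API
curl -sk -u "admin:${RABBITMQ_PASSWORD}" \
  "https://localhost:7443/rabbitmq/api/queues" |
  jq -r '.[] | [.name, .messages_ready,
                (.message_stats.deliver_get_details.rate // 0)] | @tsv'
```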
-The message queue Ready counts (messages waiting to be processed) and rate of delivery will give you an idea of
--
+### file_enrichment
+Every uploaded file is first placed on the `files-new_file` queue. The file_enrichment service consumes files from the queue and processes each one with the [applicable enrichment modules](https://github.com/SpecterOps/Nemesis/tree/main/libs/file_enrichment_modules). To improve file_enrichment performance, analyze its CPU usage with `docker compose stats file-enrichment` or in the "Docker Monitoring" dashboard in Grafana.
 
+The first thing to tune is making sure file_enrichment is efficiently using a single core (currently, the service does not take full advantage of parallelism). Good utilization looks like ~90-110% CPU usage, i.e. the worker thread is taking full advantage of a single core. If CPU utilization is low, increase the number of workers with the `ENRICHMENT_MAX_PARALLEL_WORKFLOWS` environment variable (default is 5, meaning 5 workers). You'll also want to make sure this isn't set too high, causing workers to compete with each other for CPU. If you increase to ~100 workers, you'll also need to adjust Dapr's RabbitMQ `prefetchCount` setting in [files.yaml](https://github.com/SpecterOps/Nemesis/blob/main/infra/dapr/components/pubsub/files.yaml).
 
+If additional cores are available, you can scale the file_enrichment container by adding replicas. Do this by modifying both the [compose.yaml](https://github.com/SpecterOps/Nemesis/blob/main/compose.yaml#L327) and [compose.prod.build.yaml](https://github.com/SpecterOps/Nemesis/blob/main/compose.prod.build.yaml#L34) files, uncommenting the disabled `file-enrichment-###` placeholder replicas therein. Feel free to add more replicas by following the same pattern, if needed.
 
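As a sketch of the `ENRICHMENT_MAX_PARALLEL_WORKFLOWS` tuning described above (the value is illustrative, not a recommendation):

```bash
# Hypothetical example: raise the file-enrichment worker count
echo 'ENRICHMENT_MAX_PARALLEL_WORKFLOWS=10' >> .env
docker compose up -d file-enrichment

# Watch CPU: ~90-110% means the single worker core is well utilized
docker compose stats file-enrichment
```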
-# File Submission
+### document_conversion
+Every file is added to the `files-document_conversion_input` queue. The document_conversion service consumes files from the queue and extracts text, runs `strings` on the file, and converts documents to PDFs. To improve document_conversion performance, analyze its CPU usage with `docker compose stats document-conversion` or in the "Docker Monitoring" dashboard in Grafana. The document_conversion service can take full advantage of parallelism, so adding replicas is not necessary since a single instance can utilize multiple cores. However, [compose.yaml](https://github.com/SpecterOps/Nemesis/blob/main/compose.yaml#L565) has [resource limits](https://docs.docker.com/reference/compose-file/deploy/#resources) that restrict the document-conversion service to 2 cores by default (adjust if needed). In addition, you can set the `DOCUMENTCONVERSION_MAX_PARALLEL_WORKFLOWS` environment variable to change the number of workers (5 by default).
 
+### noseyparker-scanner
+Every text file is added to the `noseyparker-noseyparker_input` queue. The noseyparker-scanner service consumes files from the queue and scans them with noseyparker. To improve noseyparker-scanner performance, analyze its CPU usage with `docker compose stats noseyparker-scanner` or in the "Docker Monitoring" dashboard in Grafana. The noseyparker-scanner service can take full advantage of parallelism, so adding replicas is not necessary since a single instance can utilize multiple cores. However, [compose.yaml](https://github.com/SpecterOps/Nemesis/blob/main/compose.yaml#L129) has [resource limits](https://docs.docker.com/reference/compose-file/deploy/#resources) that restrict the noseyparker-scanner service to 2 cores by default (adjust if needed). In addition, you can set the `NOSEYPARKER_MAX_CONCURRENT_FILES` environment variable to change the number of workers (2 by default).
 
 
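A combined sketch of the two knobs above (illustrative values; remember the CPU limits in compose.yaml cap what extra workers can actually do):

```bash
# Hypothetical .env adjustments for the two services' parallelism knobs
echo 'DOCUMENTCONVERSION_MAX_PARALLEL_WORKFLOWS=8' >> .env
echo 'NOSEYPARKER_MAX_CONCURRENT_FILES=4' >> .env
docker compose up -d document-conversion noseyparker-scanner
```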
+# Dapr Scaling
+Nemesis uses [Dapr Workflows](https://docs.dapr.io/developing-applications/building-blocks/workflow/workflow-overview/) to build durable and reliable enrichment pipelines. Underneath, the workflows are managed by Dapr's scheduler service, which shares the Postgres database with Nemesis.
 
-# Useful Prometheus Metrics
-Minio
-```
-minio_cluster_usage_objects_count{}
-```
+You may need to scale the Dapr infrastructure if you considerably increase the performance of the file_enrichment and/or document_conversion services. Scaling Dapr is beyond the scope of this document, but here are some indicators that you may need to:
+- Significant sustained CPU usage (> 80-90%) by the scheduler container and/or Postgres container (see the quick check below).
+- Workflows begin failing frequently.
+- You notice frequent activity failures/retries in Jaeger traces.
 
+If this is the case, first try increasing the number of scheduler instances ([example](https://github.com/olitomlinson/dapr-workflow-testing/blob/main/compose-1-3.yml#L111-L152)). Dapr does not support more than 3 scheduler instances unless you migrate to [an external etcd store](https://docs.dapr.io/concepts/dapr-services/scheduler/#external-etcd-database). If Postgres becomes the bottleneck, consider using a separate Postgres instance to store Dapr state.
 
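A quick, hedged way to check the first indicator (the grep pattern assumes the scheduler/Postgres container names contain those strings; confirm with `docker ps`):

```bash
# Snapshot CPU/MEM for the Dapr scheduler and Postgres containers
docker stats --no-stream | grep -Ei 'scheduler|postgres'
```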
-# Jaeger
-See how long a particular activity takes:
-```
-curl -sk --user 'n:n' "https://localhost:7443/jaeger/api/traces?service=file-enrichment&operation=activity%7C%7Crun_enrichment_modules&limit=2000" | jq -r '
-[
-  .data[]
-  .spans[]
-  | select(.operationName == "activity||run_enrichment_modules")
-  | .duration
-] as $durs
-| {
-    count: ($durs | length),
-    min_ms: ($durs | min / 1000),
-    max_ms: ($durs | max / 1000),
-    avg_ms: (($durs | add / ($durs | length)) / 1000)
-  }
-'
-```
+Additional resources:
+- [Tuning Dapr Scheduler for Production](https://www.diagrid.io/blog/tuning-dapr-scheduler-for-production)
+- [Dapr Scheduler control plane service overview](https://docs.dapr.io/concepts/dapr-services/scheduler/)
 
 
-# TODO: Need to document these
-- Adjust enrichment workers
-- Add some metrics around throughput and RAM/CPU consumption
-- Adjust dapr workflow/activity concurrency settings
-- How to tune your system to the right settings
-  - Seeing gaps between activities in Jaeger
-- Useful Grafana/prometheus metrics
-- Scaling the Dapr scheduler:
-  - How to determine if scheduler is slow down
-  - Creating scheduler replicas
-- Disk performance recommendations (Fast/ramdisk suggestion)
-- Use separate DB instance from app
