Code, prompts, and artifacts for our DFDS (co-located with DFRWS EU 2026, Sweden) paper evaluating local LLMs on long chat corpora using a two-stage, context-window-aware pipeline to produce structured investigative reports, with reproducible scoring and ACM-published results.


Global vs Local LLM — Reproduction Guide

This repository contains the code, prompts, and instructions to reproduce the experiments behind our paper submitted to DFRWS EU 2026.

Order of operations: Experiment 1 (ground truth) → Dataset split → Experiment 2 / Stage 1 → Experiment 2 / Stage 2 → Evaluation → Figures → Appendices.


Introduction (adapted from the paper’s abstract)

We compare global (frontier) and local LLMs on a realistic digital-forensics task: creating a structured investigative report from a large multi-device chat corpus. To handle long evidence sequences on local models with limited context windows, we implement a two-stage pipeline: (1) per-part extraction of investigation-relevant records and (2) cross-part synthesis into a full report with an individuals/roles table and a chronological timeline that cites verbatim Trace IDs. We construct a ground truth using a capable global model and evaluate model reports with a combination of LLM-assisted grading and deterministic validation of Trace IDs (including an equivalence map for near-duplicates).

Our results highlight a practical trade-off between quality and cost/latency: global models deliver the strongest end-to-end accuracy out-of-the-box, while carefully prompted local models, run via an OpenAI-compatible REST endpoint, can approach useful performance at significantly lower cost and with greater controllability—particularly when the pipeline enforces strong constraints on evidence citation and timeline consistency. We release our prompts, runner, and validation scripts to enable fully reproducible comparisons across model choices and hardware setups.


Contents


Prerequisites

  • Python 3.10+
  • (Optional) LM Studio or any OpenAI-compatible REST endpoint to run local models.
  • A .env file (optional) with HF_TOKEN if you use tokenization via Hugging Face in the splitter.
python -m venv .venv
source .venv/bin/activate   # (Windows: .venv\Scripts\activate)
pip install -r requirements.txt

Data & Layout

Other key scripts and artifacts:

Folders used by the pipeline:

  • Split parts for Stage 1 are expected (by default) in split_chatconversations/ with pattern *_*.txt.
  • Stage-1 outputs: output/stage-1/<model>/...
  • Stage-2 outputs: output/stage-2/...

Experiment 1 — Ground truth creation

1.1 Generate the ground-truth report

Run your chosen capable model (e.g., hosted or local) with the contents of prompt_1.md and the full corpus as instructed in the prompt. Save the result to ground_truth_report.md at repo root.
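If you prefer to script this step rather than paste into a UI, a minimal sketch along these lines works against any OpenAI-compatible endpoint (the endpoint URL and model name below are placeholders for whatever capable model you choose):

import requests
from pathlib import Path

# Placeholder endpoint and model name; use whichever capable model you chose.
API_URL = "http://localhost:1234/v1/chat/completions"
MODEL = "your-capable-model"

prompt = Path("prompt_1.md").read_text(encoding="utf-8")
corpus = Path("chat_conversations/crystalclear_chats_10.v3.txt").read_text(encoding="utf-8")

resp = requests.post(API_URL, json={
    "model": MODEL,
    "messages": [
        {"role": "system", "content": prompt},
        {"role": "user", "content": corpus},
    ],
    "temperature": 0,
})
resp.raise_for_status()
Path("ground_truth_report.md").write_text(
    resp.json()["choices"][0]["message"]["content"], encoding="utf-8")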

1.2 Create Trace-ID Equivalence Map (dedup near-duplicates)

This consolidates duplicate or near-duplicate trace IDs across devices/exports for fair scoring downstream.
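For intuition, downstream scoring can canonicalise IDs through such a map before matching; the sketch below is illustrative only (the map entries and helper names are made up, not the repository's exact format):

# Illustrative only: canonicalise trace IDs via an equivalence map before
# comparing a model report against the ground truth.
EQUIVALENCE = {
    # near-duplicate ID (same message seen on another device/export) -> canonical ID
    "trace-0042-b": "trace-0042",
    "trace-0042-c": "trace-0042",
}

def canonical(trace_id: str) -> str:
    """Map a trace ID to its canonical representative (identity if unmapped)."""
    return EQUIVALENCE.get(trace_id, trace_id)

def matched_ids(report_ids, ground_truth_ids):
    """Ground-truth IDs that a report cites, counted up to equivalence."""
    cited = {canonical(t) for t in report_ids}
    return [t for t in ground_truth_ids if canonical(t) in cited]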

1.3 Generate a validation script for trace IDs

This script cross-checks every ID from the equivalence tables against the full corpus.

python ground_truth_trace_id_validation_and_equivalence/validate_trace_ids.py \
  --equivalence ground_truth_trace_id_validation_and_equivalence/equivalence_map_tables.md \
  --corpus chat_conversations/crystalclear_chats_10.v3.txt \
  --out output/evaluation/validation_results.csv
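Conceptually, the validation is a membership check: every ID listed in the equivalence tables must occur verbatim in the full corpus. A simplified Python sketch of that idea (the ID regex is a placeholder; the real script parses the Markdown tables):

import csv
import re
from pathlib import Path

corpus = Path("chat_conversations/crystalclear_chats_10.v3.txt").read_text(encoding="utf-8")
tables = Path("ground_truth_trace_id_validation_and_equivalence/equivalence_map_tables.md").read_text(encoding="utf-8")

# Placeholder pattern: the real script extracts the IDs from the Markdown table cells.
ids = sorted(set(re.findall(r"\btrace[-_][A-Za-z0-9_-]+\b", tables)))

out_dir = Path("output/evaluation")
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "validation_results.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["trace_id", "found_in_corpus"])
    for trace_id in ids:
        writer.writerow([trace_id, trace_id in corpus])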

Dataset split (prep for Exp. 2)

We split the full corpus into token-limited parts for local-context models.

  • Script: split_chat_conversations.py
  • Default tokenizer: The script loads a Hugging Face tokenizer (see code). Provide HF_TOKEN in .env if needed.
  • Outputs: numbered parts. You have two options to match experiment_2.py’s default pattern:

Option A — Write parts into split_chatconversations/ (no extra flags later)

mkdir -p split_chatconversations
python split_chat_conversations.py \
  --split \
  --file chat_conversations/crystalclear_chats_10.v3.txt \
  --max-tokens 29000 \
  --outdir split_chatconversations

Option B — Keep parts in chat_conversations/ and tell the runner where they are

python split_chat_conversations.py \
  --split \
  --file chat_conversations/crystalclear_chats_10.v3.txt \
  --max-tokens 29000

# later: add --pattern "chat_conversations/crystalclear_chats_10.v3_*.txt" to experiment_2.py
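Under the hood, splitting amounts to greedily packing conversation blocks until the token budget is reached. A rough sketch of that idea, assuming a Hugging Face tokenizer (the model name here is an example; see split_chat_conversations.py for the actual tokenizer and boundary handling):

import os
from transformers import AutoTokenizer

# Example tokenizer; the script configures its own. HF_TOKEN from .env is
# passed through for gated models.
tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it", token=os.getenv("HF_TOKEN"))

def split_by_tokens(blocks, max_tokens=29000):
    """Greedily pack conversation blocks into parts that stay under max_tokens."""
    parts, current, current_tokens = [], [], 0
    for block in blocks:
        n = len(tok.encode(block))
        if current and current_tokens + n > max_tokens:
            parts.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += n
    if current:
        parts.append("\n\n".join(current))
    return parts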

Experiment 2 — Two-stage pipeline

The runner script does both stages in one invocation. It reads prompts/stage_1.md for Stage 1 and prompts/stage_2.md for Stage 2. Models are a fixed list inside the script:

["google_gemma-3-12b-it-qat", "qwen3-14b", "phi-4-reasoning", "gpt-oss-20b", "google_gemma-3-27b-it-qat"]

Stage 1 — Per-part extraction (inside the runner)

  • Runner: experiment_2.py
  • Prompt: prompts/stage_1.md
  • Input default pattern: split_chatconversations/*_*.txt (override with --pattern)
  • Output: output/stage-1/<model>/<split_base>_summary.txt (+ optional <split_base>_thoughts.txt)
  • Stats CSV: output/stage-1/stage-1_summary_stats.csv with columns
    model,input_file,summary_file,thoughts_file,seconds,tokens

What it does: For each model and each split file, it posts {system: stage_1.md, user: <split text>} to the REST API, extracts an optional <think>…</think> block, writes the summary, and appends a row to the Stage‑1 CSV. Recovery feature: if a summary already exists, that file is skipped.
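In outline, the Stage-1 loop per (model, split file) pair looks roughly like the sketch below (simplified; it omits the thoughts file and the stats CSV that experiment_2.py also writes):

import glob
import time
from pathlib import Path

import requests

def run_stage_1(model, api_url, pattern="split_chatconversations/*_*.txt"):
    system_prompt = Path("prompts/stage_1.md").read_text(encoding="utf-8")
    out_dir = Path("output/stage-1") / model
    out_dir.mkdir(parents=True, exist_ok=True)
    for part in sorted(glob.glob(pattern)):
        out_file = out_dir / f"{Path(part).stem}_summary.txt"
        if out_file.exists():  # recovery: skip parts that already have a summary
            continue
        start = time.time()
        resp = requests.post(api_url, json={
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": Path(part).read_text(encoding="utf-8")},
            ],
        })
        resp.raise_for_status()
        summary = resp.json()["choices"][0]["message"]["content"]
        out_file.write_text(summary, encoding="utf-8")
        print(f"{model}: {part} -> {out_file.name} ({time.time() - start:.0f}s)")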

Stage 2 — Cross-part synthesis (inside the runner)

  • Prompt: prompts/stage_2.md
  • Inputs: All *_summary.txt files from Stage 1 for the current model (concatenated in-order with clear START/END markers)
  • Output: output/stage-2/<model>_<YYYYMMDDHHMM>_report.md (+ optional _thoughts.md)
  • Stats CSV: output/stage-2/stage-2_report_stats.csv with columns
    model,report_file,thoughts_file,tokens,has_thoughts,duration

Recovery feature: If a report for a model already exists in output/stage-2/ (wildcard <model>_*_report.md), that model is skipped in Stage 2.
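The Stage-2 input is just the in-order concatenation of that model's Stage-1 summaries wrapped in markers, roughly as sketched here (the marker wording is illustrative):

from pathlib import Path

def build_stage_2_input(model):
    """Concatenate a model's Stage-1 summaries, in order, with START/END markers."""
    summaries = sorted((Path("output/stage-1") / model).glob("*_summary.txt"))
    chunks = []
    for summary in summaries:
        chunks.append(f"===== START {summary.name} =====")
        chunks.append(summary.read_text(encoding="utf-8"))
        chunks.append(f"===== END {summary.name} =====")
    return "\n".join(chunks)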

Example run (default pattern in split_chatconversations/)

python experiment_2.py \
  --api-url "http://localhost:1234/v1/chat/completions"

Example run (if your split parts are in chat_conversations/)

python experiment_2.py \
  --api-url "http://localhost:1234/v1/chat/completions" \
  --pattern "chat_conversations/crystalclear_chats_10.v3_*.txt"

Gemini 2.5 Pro — Two-stage (manual) comparison

To compare a global model using the same two-stage logic, we ran Gemini 2.5 Pro manually (outside experiment_2.py):

  1. Stage 1 (manual): Use prompts/stage_1.md over the full corpus (or your preferred chunking in the UI) and save a single concatenated file:

  2. Stage 2 (manual): Use prompts/stage_2.md with that concatenated summary as input and save the final report under output/stage-2/:

We keep the same output structure so that evaluation treats Gemini 2.5 Pro like any local model.


Evaluation

Whole-report scoring

Timeline-only scoring

Recompute/verify evaluation math

python validate_evaluation_scores.py

This re-computes key metrics and the final weighted score to sanity-check the LLM-graded outputs in output/evaluation/report_evaluation.csv.
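As a mental model, the final score is a weighted average over per-criterion scores; the criterion names and weights below are placeholders, not the paper's actual weighting (see validate_evaluation_scores.py for the real definitions):

# Placeholder criteria and weights, for illustration only.
WEIGHTS = {"individuals_roles": 0.4, "timeline": 0.4, "trace_id_validity": 0.2}

def weighted_score(scores: dict) -> float:
    """Recompute a final score as the weighted mean of per-criterion scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / sum(WEIGHTS.values())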

Tables

Below are the Markdown tables generated from the pipeline and used in the paper (kept under LaTeX assets).


Prompts — quick index

Core prompts (pipeline):


Prompts used for figures

Figure ordering: Figure 4 = Roles classification, Figure 5 = Scatterplot.


Appendix A — Dataset

Split parts in chat_conversations/:

Each part corresponds to a token-bounded slice (~29k) while preserving device and conversation boundaries where possible.


Appendix B — Per-model artifacts

For each model below, Stage 1 writes per-part summaries to output/stage-1/<model>/ and Stage 2 writes the final report to output/stage-2/. Exact filenames from our runs are linked below.

google_gemma-3-12b-it-qat

qwen3-14b

phi-4-reasoning

gpt-oss-20b

google_gemma-3-27b-it-qat

gemini-2.5-pro (manual two-stage)

Notes: The runner also writes _thoughts.txt/_thoughts.md files if a model emits a <think>…</think> block; these are not used in Stage 2 or evaluation.


Appendix C — Hardware & runtime notes

  • GPU: NVIDIA RTX 4500 Ada (24 GB VRAM). This Ada-generation workstation GPU comfortably serves 12–20B class instruction-tuned models at interactive speeds with 4-bit quantization; larger 27B models run with reduced throughput. Your mileage may vary based on CPU, RAM, and storage bandwidth.
  • Serving: LM Studio with REST API enabled, version 0.3.25 (Build 2). We used the default OpenAI-compatible endpoint at http://localhost:1234/v1/chat/completions.
  • Batching: The runner sends one split file per request per model. Stage-2 concatenates Stage-1 summaries with START/END markers for deterministic parsing.

Gotchas & notes

  • Paths & patterns: The runner defaults to split_chatconversations/*_*.txt. Either place your split parts there (Option A) or pass --pattern to point at where your parts live (Option B).
  • Token budget: Stage-1 splitting was tuned to ~29k tokens per part. If your local models differ, adjust --max-tokens during splitting to stay within their context window.
  • Trace IDs: Stage-1 and Stage-2 prompts require that every timeline/event entry references verbatim Trace IDs. This is critical for evaluation alignment.
  • Thought tags: If a model emits <think>…</think>, the runner saves it separately and excludes it from reports.
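That thought handling boils down to pulling an optional <think>…</think> block out of the response before saving the summary; a minimal sketch, assuming well-formed tags:

import re

def split_thoughts(text):
    """Separate an optional <think>...</think> block from the visible answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return match.group(1).strip(), answer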

© 2025 — Reproduction guide for the “Global vs Local LLM” experiments.
