This repository contains the code, prompts, and instructions to reproduce the experiments behind our paper submitted to DFRWS EU 2026.
Order of operations: Experiment 1 (ground truth) → Dataset split → Experiment 2 / Stage 1 → Experiment 2 / Stage 2 → Evaluation → Figures → Appendices.
- 📄 Draft paper (PDF): `DFRWS2026EU.pdf`
We compare global (frontier) and local LLMs on a realistic digital-forensics task: creating a structured investigative report from a large multi-device chat corpus. To handle long evidence sequences on local models with limited context windows, we implement a two-stage pipeline: (1) per-part extraction of investigation-relevant records and (2) cross-part synthesis into a full report with an individuals/roles table and a chronological timeline that cites verbatim Trace IDs. We construct a ground truth using a capable global model and evaluate model reports with a combination of LLM-assisted grading and deterministic validation of Trace IDs (including an equivalence map for near-duplicates).
Our results highlight a practical trade-off between quality and cost/latency: global models deliver the strongest end-to-end accuracy out-of-the-box, while carefully prompted local models, run via an OpenAI-compatible REST endpoint, can approach useful performance at significantly lower cost and with greater controllability, particularly when the pipeline enforces strong constraints on evidence citation and timeline consistency. We release our prompts, runner, and validation scripts to enable fully reproducible comparisons across model choices and hardware setups.
- Prerequisites
- Data & Layout
- Experiment 1 — Ground truth creation
- Dataset split (prep for Exp. 2)
- Experiment 2 — Two-stage pipeline
- Evaluation
- Prompts — quick index
- Prompts used for figures
- Appendix A — Dataset
- Appendix B — Per-model artifacts
- Appendix C — Hardware & runtime notes
- Gotchas & notes
- Python 3.10+
- (Optional) LM Studio or any OpenAI-compatible REST endpoint to run local models.
- A `.env` file (optional) with `HF_TOKEN` if you use tokenization via Hugging Face in the splitter.
```bash
python -m venv .venv
source .venv/bin/activate   # (Windows: .venv\Scripts\activate)
pip install -r requirements.txt
```

- Corpus (full): `chat_conversations/crystalclear_chats_10.v3.txt`
Other key scripts and artifacts:
- Runner: `experiment_2.py`
- Splitter: `split_chat_conversations.py`
- Ground truth report (generated in Exp. 1): `ground_truth_report.md`
- Evaluation helper (numeric check): `validate_evaluation_scores.py`
Folders used by the pipeline:
- Split parts for Stage 1 are expected (by default) in `split_chatconversations/` with pattern `*_*.txt`.
- Stage-1 outputs: `output/stage-1/<model>/...`
- Stage-2 outputs: `output/stage-2/...`
- Prompt: `prompts/prompt_1.md`
- Output: `ground_truth_report.md`
- Model thoughts (optional): `ground_truth_report_thoughts.md`
Run your chosen capable model (e.g., hosted or local) with the contents of `prompts/prompt_1.md` and the full corpus as instructed in the prompt. Save the result to `ground_truth_report.md` at the repo root.
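If you prefer to drive this step from a script rather than a chat UI, a minimal sketch using the `openai` Python client against any OpenAI-compatible endpoint could look like the following (model name and client configuration are placeholders, not the setup used for the paper):

```python
# Minimal sketch: generate the ground-truth report programmatically.
# Model name and client configuration are placeholders; adapt to your provider.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompt = Path("prompts/prompt_1.md").read_text(encoding="utf-8")
corpus = Path("chat_conversations/crystalclear_chats_10.v3.txt").read_text(encoding="utf-8")

resp = client.chat.completions.create(
    model="your-capable-model",  # placeholder
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": corpus},
    ],
)
Path("ground_truth_report.md").write_text(resp.choices[0].message.content, encoding="utf-8")
```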
- Prompt: `ground_truth_trace_id_validation_and_equivalence/prompt 1 Create a Trace ID Equivalence Map.md`
- Inputs: `ground_truth_report.md` + the full chat corpus
- Output: `ground_truth_trace_id_validation_and_equivalence/equivalence_map_tables.md`
This consolidates duplicate or near-duplicate trace IDs across devices/exports for fair scoring downstream.
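Conceptually, the map reduces to alias → canonical pairs that scoring applies before comparing ID sets; a minimal sketch of that normalization (the pairs below are illustrative placeholders, the real ones live in `equivalence_map_tables.md`):

```python
# Minimal sketch: normalize cited trace IDs via an alias -> canonical map before scoring.
# The pairs below are illustrative placeholders, not the real equivalence map.
EQUIVALENCE = {
    "TRACE-0001-B": "TRACE-0001",     # hypothetical near-duplicate from another device export
    "TRACE-0042-SMS": "TRACE-0042",
}

def canonical_id(trace_id: str) -> str:
    """Return the canonical form of a trace ID, or the ID itself if unmapped."""
    return EQUIVALENCE.get(trace_id.strip(), trace_id.strip())

def normalize(ids: set[str]) -> set[str]:
    """Map a set of cited trace IDs onto their canonical equivalents."""
    return {canonical_id(i) for i in ids}
```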
- Prompt: `ground_truth_trace_id_validation_and_equivalence/prompt 2 Create trace id validation script.md`
- Output (code): `ground_truth_trace_id_validation_and_equivalence/validate_trace_ids.py`
This script cross-checks every ID from the equivalence tables against the full corpus.
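At its core this is a verbatim membership test of each mapped ID against the corpus text; a minimal sketch of that idea (not the actual `validate_trace_ids.py`, which is invoked as shown below):

```python
# Minimal sketch: check that every trace ID occurs verbatim in the full corpus.
# Not the actual validate_trace_ids.py; the ID list would come from the equivalence tables.
from pathlib import Path

def validate_ids(ids: list[str], corpus_path: str) -> dict[str, bool]:
    corpus = Path(corpus_path).read_text(encoding="utf-8")
    return {trace_id: (trace_id in corpus) for trace_id in ids}

# Example (hypothetical IDs):
# results = validate_ids(["TRACE-0001", "TRACE-0042"],
#                        "chat_conversations/crystalclear_chats_10.v3.txt")
# missing = [i for i, ok in results.items() if not ok]
```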
```bash
python ground_truth_trace_id_validation_and_equivalence/validate_trace_ids.py \
  --equivalence ground_truth_trace_id_validation_and_equivalence/equivalence_map_tables.md \
  --corpus chat_conversations/crystalclear_chats_10.v3.txt \
  --out output/evaluation/validation_results.csv
```

We split the full corpus into token-limited parts for local-context models.
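Conceptually, the splitter counts tokens with a Hugging Face tokenizer and starts a new part once the budget would be exceeded. A rough sketch of that idea (the tokenizer name is a placeholder and the boundary-preserving logic of `split_chat_conversations.py` is omitted):

```python
# Rough sketch: token-bounded splitting of the chat corpus.
# Placeholder tokenizer; the real script loads its own and may need HF_TOKEN from .env.
import os
from pathlib import Path
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", token=os.getenv("HF_TOKEN"))  # placeholder model

def split_by_tokens(path: str, max_tokens: int = 29000) -> list[str]:
    parts, current, count = [], [], 0
    for line in Path(path).read_text(encoding="utf-8").splitlines(keepends=True):
        n = len(tok.encode(line))
        if current and count + n > max_tokens:  # budget exceeded: close the current part
            parts.append("".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        parts.append("".join(current))
    return parts
```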
- Script: `split_chat_conversations.py`
- Default tokenizer: The script loads a Hugging Face tokenizer (see code). Provide `HF_TOKEN` in `.env` if needed.
- Outputs: numbered parts. You have two options to match `experiment_2.py`'s default pattern:
Option A (write parts to the runner's default folder):

```bash
mkdir -p split_chatconversations
python split_chat_conversations.py \
  --split \
  --file chat_conversations/crystalclear_chats_10.v3.txt \
  --max-tokens 29000 \
  --outdir split_chatconversations
```

Option B (keep parts next to the corpus and point the runner at them later):

```bash
python split_chat_conversations.py \
  --split \
  --file chat_conversations/crystalclear_chats_10.v3.txt \
  --max-tokens 29000
# later: add --pattern "chat_conversations/crystalclear_chats_10.v3_*.txt" to experiment_2.py
```

The runner script does both stages in one invocation. It reads `prompts/stage_1.md` for Stage 1 and `prompts/stage_2.md` for Stage 2. Models are a fixed list inside the script:

```python
["google_gemma-3-12b-it-qat", "qwen3-14b", "phi-4-reasoning", "gpt-oss-20b", "google_gemma-3-27b-it-qat"]
```
- Runner: `experiment_2.py`
- Prompt: `prompts/stage_1.md`
- Input default pattern: `split_chatconversations/*_*.txt` (override with `--pattern`)
- Output: `output/stage-1/<model>/<split_base>_summary.txt` (+ optional `<split_base>_thoughts.txt`)
- Stats CSV: `output/stage-1/stage-1_summary_stats.csv` with columns `model,input_file,summary_file,thoughts_file,seconds,tokens`
What it does: For each model and each split file, it posts `{system: stage_1.md, user: <split text>}` to the REST API, extracts an optional `<think>…</think>` block, writes the summary, and appends a row to the Stage-1 CSV. Recovery feature: if a summary already exists, the corresponding split file is skipped.
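The request body is the standard OpenAI chat-completions payload; a stripped-down sketch of one Stage-1 call (without the timing, thought extraction, and CSV logging that the real `experiment_2.py` performs):

```python
# Stripped-down sketch of one Stage-1 request to an OpenAI-compatible endpoint.
# experiment_2.py additionally handles timing, thought extraction, and CSV logging.
from pathlib import Path
import requests

API_URL = "http://localhost:1234/v1/chat/completions"

def summarize_part(model: str, part_path: str, system_prompt: str) -> str:
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": Path(part_path).read_text(encoding="utf-8")},
        ],
    }
    resp = requests.post(API_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```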
- Prompt: `prompts/stage_2.md`
- Inputs: all `*_summary.txt` files from Stage 1 for the current model, concatenated in order with clear START/END markers (sketched below)
- Output: `output/stage-2/<model>_<YYYYMMDDHHMM>_report.md` (+ optional `_thoughts.md`)
- Stats CSV: `output/stage-2/stage-2_report_stats.csv` with columns `model,report_file,thoughts_file,tokens,has_thoughts,duration`
Recovery feature: If a report for a model already exists in `output/stage-2/` (wildcard `<model>_*_report.md`), that model is skipped in Stage 2.
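A minimal sketch of the Stage-2 input assembly and the skip check (the marker wording is an assumption; see `experiment_2.py` for the exact format):

```python
# Minimal sketch: concatenate Stage-1 summaries with START/END markers and skip
# models that already have a Stage-2 report. Marker wording is illustrative only.
from pathlib import Path

def build_stage2_input(model: str) -> str:
    blocks = []
    for f in sorted(Path(f"output/stage-1/{model}").glob("*_summary.txt")):
        text = f.read_text(encoding="utf-8")
        blocks.append(f"=== START {f.name} ===\n{text}\n=== END {f.name} ===")
    return "\n\n".join(blocks)

def stage2_already_done(model: str) -> bool:
    return any(Path("output/stage-2").glob(f"{model}_*_report.md"))
```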
With split parts in the default location (Option A):

```bash
python experiment_2.py \
  --api-url "http://localhost:1234/v1/chat/completions"
```

With split parts elsewhere, pointing the runner at them (Option B):

```bash
python experiment_2.py \
  --api-url "http://localhost:1234/v1/chat/completions" \
  --pattern "chat_conversations/crystalclear_chats_10.v3_*.txt"
```

To compare a global model using the same two-stage logic, we ran Gemini 2.5 Pro manually (outside `experiment_2.py`):
- Stage 1 (manual): Use `prompts/stage_1.md` over the full corpus (or your preferred chunking in the UI) and save a single concatenated file: `output/stage-1/gemini-2.5-pro/gemini-2.5-pro-summaries.md`
- Stage 2 (manual): Use `prompts/stage_2.md` with that concatenated summary as input and save the final report under `output/stage-2/`: `output/stage-2/gemini-2.5-pro-report-from-summaries.md`
We keep the same output structure so that evaluation treats Gemini 2.5 Pro like any local model.
- Prompt: `prompts/prompt_4_evaluation.md`
- Inputs:
  - Ground truth: `ground_truth_report.md`
  - Trace-ID equivalence map: `ground_truth_trace_id_validation_and_equivalence/equivalence_map_tables.md`
- Output:
  - Aggregated CSV: `output/evaluation/report_evaluation.csv`
  - Per-model evaluation markdown:
    - `output/evaluation/google_gemma-3-12b-it-qat_report_evaluation.md`
    - `output/evaluation/google_gemma-3-27b-it-qat_report_evaluation.md`
    - `output/evaluation/gpt-oss-20b_report_evaluation.md`
    - `output/evaluation/phi-4-reasoning_report_evaluation.md`
    - `output/evaluation/qwen3-14b_report_evaluation.md`
    - `output/evaluation/gemini-2.5-pro_report_evaluation.md`
- Prompt: `prompts/prompt_5_generate_model_timeline_evaluation.md`
- Inputs: GT timeline table; model name; model's timeline table
- Output columns: `Model,Groundtruth,Model,Correct,Wrong,Wrong event,Wrong date,Missed` (one possible alignment is sketched below)
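Conceptually, each ground-truth timeline entry is matched against the model's timeline and classified into one of these buckets; a rough sketch of such an alignment (keying the match on trace ID is an assumption, and the paper's grading is LLM-assisted rather than purely rule-based):

```python
# Rough sketch: classify ground-truth timeline entries against a model timeline.
# Keying on trace ID is an assumption; the actual grading is LLM-assisted.
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    trace_id: str
    date: str
    event: str

def classify(gt: list[Entry], model: list[Entry]) -> dict[str, int]:
    by_id = {e.trace_id: e for e in model}
    counts = {"Correct": 0, "Wrong event": 0, "Wrong date": 0, "Missed": 0}
    for g in gt:
        m = by_id.get(g.trace_id)
        if m is None:
            counts["Missed"] += 1
        elif m.date != g.date:
            counts["Wrong date"] += 1
        elif m.event != g.event:
            counts["Wrong event"] += 1
        else:
            counts["Correct"] += 1
    return counts
```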
- Script: `validate_evaluation_scores.py`

```bash
python validate_evaluation_scores.py
```

This re-computes key metrics and the final weighted score to sanity-check the LLM-graded outputs in `output/evaluation/report_evaluation.csv`.
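The check boils down to re-deriving the weighted total from the per-criterion scores in the CSV; a sketch of that idea with hypothetical column names and weights (the real ones are defined in `validate_evaluation_scores.py`):

```python
# Sketch: recompute a weighted total from per-criterion columns of the evaluation CSV.
# Column names and weights here are hypothetical placeholders, not the paper's values.
import csv

WEIGHTS = {
    "entity_accuracy": 0.2,
    "trace_id_retrieval": 0.3,
    "timeline_consistency": 0.3,
    "reasoning_quality": 0.2,
}

def weighted_totals(csv_path: str) -> dict[str, float]:
    totals = {}
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            totals[row["model"]] = sum(w * float(row[col]) for col, w in WEIGHTS.items())
    return totals
```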
Below are the Markdown tables generated from the pipeline and used in the paper (kept under LaTeX assets).
- Table 1 — Dataset overview: dataset size, #devices, #conversations, and split statistics.
  `latex_figures_and_tables/tables/table_1_dataset.md`
- Table 2 — Models compared: model names, parameter sizes, quantization, and serving details used in our runs.
  `latex_figures_and_tables/tables/table_2_modesl.md`
- Table 4 — Stage-1 and Stage-2 outputs: counts of extracted records, unique trace IDs, and final report lengths per model.
  `latex_figures_and_tables/tables/table_4_stage1_and_stage2.md`
- Table 4 (timing) — Stage-1 runtime: per-part/runtime aggregates for Stage-1 by model (median, p90, total).
  `latex_figures_and_tables/tables/table_4_stage_1_timing.md`
- Table 5 — Retrieved Trace IDs: recall of Trace IDs relative to ground truth (with equivalence mapping), including graded credit.
  `latex_figures_and_tables/tables/table_5_Retrieved%20IDs.md`
- Table 6 — Timeline comparison: one-to-one alignment results (Correct, Wrong event, Wrong date, Missed) per model.
  `latex_figures_and_tables/tables/table_6_timeline_comparison.md`
- Table 7 — Performance scores only: entity accuracy, trace-ID retrieval, role attribution, timeline and factual consistency, reasoning quality.
  `latex_figures_and_tables/tables/table_7_only_performance_scores.md`
- Table 7 — Performance + timing: final weighted scores alongside runtime metrics.
  `latex_figures_and_tables/tables/table_7_performce_score_and_timing.md`
Core prompts (pipeline):

- Ground truth: `prompts/prompt_1.md` → `ground_truth_report.md`
- Stage 1 (per split part): `prompts/stage_1.md` → `*_summary.txt`
- Stage 2 (per model): `prompts/stage_2.md` → `[model]_YYYYMMDDHHMM_report.md`
- Whole-report evaluation: `prompts/prompt_4_evaluation.md`
- Timeline evaluation: `prompts/prompt_5_generate_model_timeline_evaluation.md`
Figure ordering: Figure 4 = Roles classification, Figure 5 = Scatterplot.
- Figure 4 — Roles classification JSON
  - Prompt: `latex_figures_and_tables/figures/prompt_for_converting_report_roles_to_json_for_figure_4.md`
  - Output: `latex_figures_and_tables/figures/roles_classification.json`
- Figure 5 — Scatterplot (performance vs. time)
  - Prompt file (repo may use one of these names):
- Full corpus: `chat_conversations/crystalclear_chats_10.v3.txt`
- Dataset overview (Table 1): `latex_figures_and_tables/tables/table_1_dataset.md`
Split parts in `chat_conversations/`:

- `chat_conversations/crystalclear_chats_10.v3_01.txt`
- `chat_conversations/crystalclear_chats_10.v3_02.txt`
- `chat_conversations/crystalclear_chats_10.v3_03.txt`
- `chat_conversations/crystalclear_chats_10.v3_04.txt`
- `chat_conversations/crystalclear_chats_10.v3_05.txt`
- `chat_conversations/crystalclear_chats_10.v3_06.txt`
- `chat_conversations/crystalclear_chats_10.v3_07.txt`
- `chat_conversations/crystalclear_chats_10.v3_08.txt`
- `chat_conversations/crystalclear_chats_10.v3_09.txt`
- `chat_conversations/crystalclear_chats_10.v3_10.txt`
Each part corresponds to a token-bounded slice (~29k) while preserving device and conversation boundaries where possible.
For each model below, Stage 1 writes per-part summaries to `output/stage-1/<model>/` and Stage 2 writes the final report to `output/stage-2/`. Exact filenames from our runs are linked below.
- Stage-1 summaries: `output/stage-1/google_gemma-3-12b-it-qat/crystalclear_chats_10.v3_01_summary.txt` · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10
- Stage-2 report: `output/stage-2/google_gemma-3-12b-it-qat_202509171736_report.md`
- Stage-1 summaries: `output/stage-1/qwen3-14b/crystalclear_chats_10.v3_01_summary.txt` · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10
- Stage-1 thoughts (if present): `output/stage-1/qwen3-14b/crystalclear_chats_10.v3_01_thoughts.txt` · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10
- Stage-2 report: `output/stage-2/qwen3-14b_202509171739_report.md` · Thoughts: `output/stage-2/qwen3-14b_202509171739_report_thoughts.md`
- Stage-1 summaries: `output/stage-1/phi-4-reasoning/crystalclear_chats_10.v3_01_summary.txt` · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10
- Stage-1 thoughts (if present): `output/stage-1/phi-4-reasoning/crystalclear_chats_10.v3_01_thoughts.txt` · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10
- Stage-2 report: `output/stage-2/phi-4-reasoning_202509171757_report.md` · Thoughts: `output/stage-2/phi-4-reasoning_202509171757_report_thoughts.md`
- Stage-1 summaries: `output/stage-1/gpt-oss-20b/crystalclear_chats_10.v3_01_summary.txt` · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10
- Stage-1 thoughts (if present): `output/stage-1/gpt-oss-20b/crystalclear_chats_10.v3_01_thoughts.txt` · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10
- Stage-2 report: `output/stage-2/gpt-oss-20b_202509171759_report.md`
- Stage-1 summaries: `output/stage-1/google_gemma-3-27b-it-qat/crystalclear_chats_10.v3_01_summary.txt` · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10
- Stage-2 report: `output/stage-2/google_gemma-3-27b-it-qat_202509171934_report.md`
- Stage-1 concatenated summary: `output/stage-1/gemini-2.5-pro/gemini-2.5-pro-summaries.md`
- Stage-2 report: `output/stage-2/gemini-2.5-pro-report-from-summaries.md`
Notes: The runner also writes `_thoughts.txt` / `_thoughts.md` files if a model emits a `<think>…</think>` block; these are not used in Stage 2 or evaluation.
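Separating such a block from the visible answer can be done with a simple regex; a minimal sketch (assuming at most one `<think>…</think>` block per response):

```python
# Minimal sketch: split an optional <think>...</think> block from the visible answer.
# Assumes at most one such block per response.
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thoughts(text: str) -> tuple[str, str | None]:
    """Return (answer_without_think_block, thoughts_or_None)."""
    m = THINK_RE.search(text)
    if not m:
        return text, None
    answer = (text[:m.start()] + text[m.end():]).strip()
    return answer, m.group(1).strip()
```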
- GPU: NVIDIA RTX 4500 Ada (24 GB VRAM). This Ada-generation workstation GPU comfortably serves 12–20B class instruction-tuned models at interactive speeds with 4-bit quantization; larger 27B models run with reduced throughput. Your mileage may vary based on CPU, RAM, and storage bandwidth.
- Serving: LM Studio with REST API enabled, version 0.3.25 (Build 2). We used the default OpenAI-compatible endpoint at `http://localhost:1234/v1/chat/completions`.
- Batching: The runner sends one split file per request per model. Stage 2 concatenates Stage-1 summaries with START/END markers for deterministic parsing.
- Paths & patterns: The runner defaults to `split_chatconversations/*_*.txt`. Either place your split parts there (Option A) or pass `--pattern` to point at where your parts live (Option B).
- Token budget: Stage-1 splitting was tuned to ~29k tokens per part. If your local models differ, adjust `--max-tokens` during splitting to stay within their context window.
- Trace IDs: Stage-1 and Stage-2 prompts require that every timeline/event entry references verbatim Trace IDs. This is critical for evaluation alignment.
- Thought tags: If a model emits `<think>…</think>`, the runner saves it separately and excludes it from reports.
© 2025 — Reproduction guide for the “Global vs Local LLM” experiments.