DataForge

Deterministic synthetic dataset generation for LLM tool-calling fine-tuning

Quick Start · Architecture · Examples · CLI Reference · Training

Why DataForge

Fine-tuning LLMs to use tools reliably requires datasets that are diverse, structurally correct, and reproducible. Building these by hand is slow, error-prone, and impossible to audit at scale. Existing approaches — prompting a stronger model to generate training data, or manually writing hundreds of conversations — suffer from three fundamental problems:

Template explosion. Without active detection, generators produce superficially varied examples that share identical conversation skeletons, trigram distributions, and response structures. The model memorizes patterns instead of learning to generalize.
Non-determinism. Python's hash() is randomized per-process via PYTHONHASHSEED. Datasets generated with hash()-based seeding produce different outputs across runs, machines, and CI environments, making regression testing impossible.
Unbounded memory. Naive approaches load the full dataset into memory for validation and splitting. At 100K+ examples, this becomes a bottleneck.

DataForge solves all three. It provides a streaming pipeline with constant RAM usage, SHA-256-based deterministic generation, and multi-layered anti-template detection using probabilistic data structures — Bloom filters and space-saving top-K counters — that consume a fixed ~8 MB regardless of dataset size.

Features

Feature	Description
Deterministic generation	SHA-256-based RNG. Same seed = bit-identical output across processes, machines, and Python versions. Not `hash()`.
Streaming pipeline	Generators yield examples one at a time. Validation, statistics, train/val splitting, and JSONL writing happen inline. Memory footprint is constant whether you generate 500 or 5 million examples.
SFT + DPO	Generate supervised fine-tuning conversations (single-tool, multi-turn, parallel tool calls) and direct preference optimization pairs from ranked contrastive sets.
Anti-template detection	Four detection layers: structural dedup (Bloom filter), conversation flow pattern dedup (Bloom filter), trigram overuse (top-K counter), and response length clustering (histogram). All fixed-size — ~8 MB total.
Quality gates	Seven configurable thresholds: minimum total, multi-turn ratio, no-tool restraint, parallel calls, tool coverage, closure ratio, error handling. Fail-fast with clear diagnostics.
Response style variation	Four built-in styles (professional, friendly, technical, concise) with weighted structural variation (full, no-closure, direct, monophrase). Custom styles via config. Prevents the model from learning a single surface pattern.
Error injection	Configurable base rate with non-linear burst zones. Five error types (timeout, empty, partial, permission, not_found). Models learn graceful recovery, not just the happy path.
Multi-format export	OpenAI (default), ShareGPT, ChatML. Conversion happens at write time — zero additional memory.
Content-hash splitting	Train/val assignment is based on SHA-256 of example content, not index. Adding or removing generators does not change which existing examples land in which split.
Dataset linting	`dataforge inspect` works on any JSONL dataset. Tool distribution, conversation patterns, response length distribution, template similarity — all in one command.
Dataset regression	`dataforge diff` compares two dataset versions. Detects drift in tool distribution, conversation patterns, and length distribution. Useful for CI pipelines.
Plugin system	Generators discoverable via local directories and Python `entry_points`. Third-party packages can `pip install` and auto-register generators.
Training scripts	QLoRA SFT, DPO training, and adapter merge included. Works with any HuggingFace-compatible model.

Quick Start

# Install (Python 3.10+)
pip install -e .

# Generate the restaurant example dataset
cd examples/restaurant
dataforge generate --config config.yaml

# Inspect output quality
dataforge inspect output/restaurant-sft-train.jsonl

# Validate structural correctness against tool schema
dataforge validate output/restaurant-sft-train.jsonl --tools tools.json

Installation

# Core — generation, validation, inspection
pip install -e .

# With training dependencies (torch, transformers, peft, trl, bitsandbytes)
pip install -e ".[train]"

# With development dependencies (pytest)
pip install -e ".[dev]"

# Everything
pip install -e ".[all]"

Requires Python 3.10 or later. Core has only two dependencies: pyyaml and pydantic.

How It Works

DataForge follows a four-step workflow: define tools, write generators, configure, generate.

1. Define your tools

Standard OpenAI function-calling format. DataForge uses these for validation — every tool call in the generated dataset is checked against this schema.

[
  {
    "type": "function",
    "function": {
      "name": "search_menu",
      "description": "Search the restaurant menu by query, category, or dietary restriction",
      "parameters": {
        "type": "object",
        "properties": {
          "query": { "type": "string", "description": "Search query" },
          "category": { "type": "string", "enum": ["appetizers", "mains", "desserts", "drinks"] },
          "dietary": { "type": "string", "enum": ["vegetarian", "vegan", "gluten-free"] }
        },
        "required": ["query"]
      }
    }
  }
]

2. Write generators

Generators are Python classes that yield training examples. Each generator owns a category string that isolates its RNG — the same index in different generators produces completely unrelated data.

from dataforge import SFTGenerator, Example, make_rng
from dataforge import user_msg, tool_call_msg, tool_result_msg, assistant_msg

class MenuSearchGenerator(SFTGenerator):

    @property
    def category(self) -> str:
        return "menu_search"

    @property
    def name(self) -> str:
        return "Menu Search"

    def expected_count(self) -> int:
        return 100

    def generate(self):
        seed = self.config.get("seed", 42)

        for i in range(100):
            rng = make_rng(self.category, i, seed)
            query = rng.choice(["vegetarian options", "desserts", "daily specials"])

            msgs = [user_msg(f"What {query} do you have?")]

            msgs.append(tool_call_msg(
                "search_menu", {"query": query},
                prefix=self.category, rng=rng,
            ))
            call_id = msgs[-1]["tool_calls"][0]["id"]

            msgs.append(tool_result_msg(call_id, '{"results": ["Margherita Pizza", "Caesar Salad"]}'))
            msgs.append(assistant_msg(f"Here are our {query}: Margherita Pizza and Caesar Salad."))

            yield Example(messages=msgs)

Key design choices:

make_rng(category, idx, seed) uses hashlib.sha256, not Python's hash(). The output is deterministic across processes, machines, and Python versions regardless of PYTHONHASHSEED.
tool_call_msg generates unique call IDs with the pattern call_{prefix}_{counter}_{hex4}, where prefix is the generator category and hex4 is random from the generator's RNG. This prevents ID collisions when concatenating independently generated datasets.
Generators yield examples instead of returning lists. The pipeline processes each example inline — validation, statistics, JSONL writing — without ever holding the full dataset in memory.

3. Configure

project_name: "restaurant"
seed: 42
tools_file: "tools.json"
system_prompt: "You are a helpful restaurant assistant."

generators_dir: "generators"
output_dir: "output"

dataset:
  train_split: 0.95

quality_gates:
  min_total: 500
  min_multi_turn: 30
  min_no_tool: 50
  min_parallel: 20
  max_closure_ratio: 0.65
  require_all_tools: true
  min_error_handling: 10

error_injection:
  enabled: true
  base_rate: 0.10

4. Generate

$ dataforge generate --config config.yaml

DataForge v0.1.0 — Generating dataset: restaurant
  Seed: 42 | Format: openai | Tools: 6

Discovered 5 SFT generator(s), 1 DPO generator(s)
  Running: Menu Search (menu_search) — 120 examples (0.0s)
  Running: Reservations (reservations) — 120 examples (0.0s)
  Running: Order Management (order_management) — 120 examples (0.0s)
  Running: Reviews & Restraint (reviews) — 100 examples (0.0s)
  Running: Complex Scenarios (complex_scenarios) — 130 examples (0.0s)
  Running: Restaurant DPO Pairs (restaurant_dpo) — 60 pairs (0.0s)

SFT: 590 examples (train: 555, val: 35)
DPO: 60 pairs

Quality Gates:
  [+] min_total: Total examples: 590 (minimum: 500)
  [+] min_multi_turn: Multi-turn examples: 172 (minimum: 30)
  [+] min_no_tool: No-tool examples: 60 (minimum: 50)
  [+] min_parallel: Parallel tool call examples: 100 (minimum: 20)
  [+] max_closure_ratio: Max structure ratio: 43.73% (maximum: 65%)
  [+] require_all_tools: All tools represented: 6/6
  [+] min_error_handling: Error handling examples: 40 (minimum: 10)

All quality gates passed.

Output:

output/
├── restaurant-sft-train.jsonl    # 555 training examples
├── restaurant-sft-val.jsonl      # 35 validation examples
├── restaurant-dpo-train.jsonl    # 60 preference pairs
└── restaurant-sft.meta.json      # Sidecar metadata

The .meta.json sidecar stores generation metadata (seed, timestamp, config hash, per-generator counts, quality gate results) alongside the JSONL. The JSONL itself is pure — one example per line, directly compatible with datasets.load_dataset("json", ...).

Architecture

dataforge/
├── core/
│   ├── rng.py              # SHA-256 deterministic RNG
│   ├── messages.py         # Message builders (user, assistant, tool_call, tool_result)
│   ├── styles.py           # Response style variation (4 styles × 4 structures)
│   ├── errors.py           # Error injection (5 types, burst zones)
│   └── types.py            # Example, DPOPair, ContrastiveSet, DatasetStats
├── generation/
│   ├── base.py             # SFTGenerator and DPOGenerator abstract base classes
│   ├── pipeline.py         # StreamingWriter + 5-phase pipeline orchestrator
│   ├── discovery.py        # Generator auto-discovery (directory + entry_points)
│   └── pools.py            # Data pool generation helpers
├── validation/
│   ├── structural.py       # Per-example validation (role sequence, tool call matching)
│   ├── template_detection.py  # Bloom filters + TopK counter (fixed 8 MB RAM)
│   ├── quality_gates.py    # 7 configurable quality gates
│   └── stats.py            # Incremental statistics tracker (capped dictionaries)
├── training/
│   ├── sft.py              # QLoRA SFT training
│   ├── dpo.py              # DPO training + contrastive-to-DPO conversion
│   └── merge.py            # LoRA adapter merge
├── config.py               # YAML config + Pydantic validation
└── cli.py                  # 8 CLI commands

Design Decisions

Deterministic RNG. make_rng(category, idx, seed) derives a per-example seed from SHA-256(f"{category}:{idx}:{seed}"). This is immune to PYTHONHASHSEED randomization, produces zero cross-category correlation, and is trivially parallelizable — each example's RNG depends only on its own coordinates, not on the order of generation.

Streaming pipeline. The pipeline runs five phases in a single pass:

Discover generators from the generators/ directory and installed entry_points
Generate + validate + write: for each yielded example, run structural validation, update the TemplateChecker and StatsTracker, write to the appropriate split file
DPO generation: same streaming pattern for preference pairs
Quality gates: evaluate accumulated statistics against configured thresholds
Metadata: write .meta.json sidecar

The StreamingWriter maintains two open file handles (train and val) and assigns each example based on a SHA-256 hash of its content. This content-hash split is stable: adding or removing generators, reordering them, or changing example counts does not reassign existing examples to different splits.

Anti-template detection. The TemplateChecker operates on fixed-size data structures:

Layer	Data Structure	Size	What It Detects
Structural dedup	Bloom filter (3 hashes)	1 MB	Identical normalized responses
Flow pattern dedup	Bloom filter (3 hashes)	1 MB	Identical conversation skeletons (e.g., `USER→TOOL(search)→RESULT→ASSISTANT`)
Trigram overuse	Space-saving top-K counter	~5 MB	Overrepresented 3-word phrases (>15% threshold)
Length clustering	Fixed histogram (20 buckets)	~160 B	Response length concentration (>40% in single bucket)

Total: ~8 MB regardless of dataset size. The Bloom filters have a false positive rate of ~0.1% at 1M items and ~1% at 10M — acceptable for advisory warnings.

Quality gates. Seven gates with sensible defaults:

Gate	Default	What It Checks
`min_total`	500	Minimum total examples
`min_multi_turn`	30	Minimum multi-turn conversations
`min_no_tool`	50	Minimum no-tool restraint examples (model must learn to refuse)
`min_parallel`	20	Minimum parallel tool call examples
`max_closure_ratio`	0.65	No single response structure exceeds 65% of total
`require_all_tools`	true	Every defined tool appears at least once
`min_error_handling`	10	Minimum error handling examples

Response style variation. Four built-in styles × four structural patterns = 16 combinations. Each example randomly selects a style and structure:

Styles: professional, friendly, technical, concise (each with distinct greeting, closing, and error phrasing)
Structures: full (greeting + body + closing, 35%), no_closure (greeting + body, 25%), direct (body only, 30%), monophrase (single sentence, 10%)

Custom styles can be defined in config.yaml. This prevents the model from learning a fixed response template.

Error injection. Five error types (timeout, empty, partial, permission, not_found) are injected at a configurable base rate (default 10%) with non-linear burst zones at ~13%, ~42%, ~71%, and ~93% of the dataset. Burst zones elevate the error rate to 3x base, creating realistic clusters of failures. Error injection is deterministic — same seed, same errors, same positions.

Examples

Two complete, production-ready examples are included. Both generate structurally valid datasets with all quality gates passing.

Restaurant Assistant

6 tools: search_menu, get_dish_details, make_reservation, check_availability, get_order_status, submit_review

5 SFT generators + 1 DPO generator:

Generator	Examples	Pattern
Menu Search	120	Single tool calls with dietary/category filtering
Reservations	120	Multi-turn: check availability → book → confirm
Order Management	120	Parallel tool calls: check multiple orders simultaneously
Reviews & Restraint	100	No-tool restraint: politely refuse out-of-scope requests
Complex Scenarios	130	Multi-tool chains: search → detail → reserve → review
DPO Pairs	60	Preference pairs: good vs. bad tool usage and response quality

Total: 590 SFT examples + 60 DPO pairs.

cd examples/restaurant
dataforge generate --config config.yaml

Customer Support Assistant

6 tools: search_tickets, create_ticket, update_ticket, search_knowledge_base, get_customer_info, escalate_ticket

5 SFT generators + 1 DPO generator:

Generator	Examples	Pattern
Ticket Search	120	Single tool calls with status/priority filtering
Ticket Creation	120	Multi-turn intake: gather info → create → confirm
Knowledge Base	100	KB search with follow-up recommendations
Escalation & Restraint	110	Complex decision: escalate vs. resolve + no-tool refusal
Analytics & Parallel	120	Parallel: customer info + ticket search simultaneously
DPO Pairs	60	Preference pairs for response quality and tool usage

Total: 570 SFT examples + 60 DPO pairs.

cd examples/customer_support
dataforge generate --config config.yaml

Creating a New Project

dataforge init my-assistant
cd my-assistant
# Edit tools.json, create generators, then:
dataforge generate --config config.yaml

CLI Reference

`dataforge generate`

Generate a dataset from a config file.

dataforge generate --config config.yaml
dataforge generate --config config.yaml --format sharegpt
dataforge generate --config config.yaml --dry-run

Flag	Description
`--config`, `-c`	Path to `config.yaml` (default: `config.yaml`)
`--format`	Export format: `openai` (default), `sharegpt`, `chatml`
`--dry-run`	Discover generators and print expected counts without generating

`dataforge validate`

Validate the structural correctness of any JSONL dataset.

dataforge validate dataset.jsonl
dataforge validate dataset.jsonl --tools tools.json

Checks: JSON parse errors, role sequence validity (user/assistant/tool ordering), tool call ID matching (every call has a result, every result has a call), and tool name validation against the schema.

`dataforge inspect`

Dataset quality report — works on any JSONL dataset, not just DataForge-generated ones.

dataforge inspect output/restaurant-sft-train.jsonl

Reports: total examples, average messages and tokens per example, tool usage distribution, conversation pattern breakdown (single-turn, multi-turn, no-tool, parallel, error handling), template similarity analysis, and quality gate status.

`dataforge diff`

Compare two dataset versions for quality drift.

dataforge diff v1-sft-train.jsonl v2-sft-train.jsonl

Reports: total example delta, average token delta, per-tool distribution changes (including new/removed tools), and pattern changes (multi-turn, no-tool, parallel, error handling ratios).

`dataforge sample`

Preview random examples in human-readable format.

dataforge sample output/restaurant-sft-train.jsonl --n 5 --seed 42

`dataforge init`

Scaffold a new project with config.yaml, tools.json, generators/, and data_pools.py.

dataforge init my-project

`dataforge train sft`

Train a QLoRA SFT adapter.

dataforge train sft --config config.yaml --dataset output/restaurant-sft-train.jsonl
dataforge train sft --config config.yaml --dataset output/restaurant-sft-train.jsonl --dry-run

Requires pip install dataforge[train]. Targets attention layers (q_proj, k_proj, v_proj, o_proj) and MLP layers by default. Configurable in config.yaml under training:.

`dataforge train dpo`

Train a DPO adapter on top of an SFT adapter.

dataforge train dpo --config config.yaml \
  --adapter output/sft-adapter \
  --dataset output/restaurant-dpo-train.jsonl

`dataforge merge`

Merge a LoRA adapter into the base model.

dataforge merge --base Qwen/Qwen2.5-7B-Instruct --adapter output/sft-adapter

Training

DataForge includes training scripts that work with any HuggingFace-compatible model. Configure training parameters in config.yaml:

training:
  model: "Qwen/Qwen2.5-7B-Instruct"
  lora_rank: 16
  lora_alpha: 32
  sft:
    epochs: 3
    batch_size: 2
    learning_rate: 2e-5
    max_seq_len: 4096
  dpo:
    epochs: 1
    batch_size: 1
    learning_rate: 5e-7
    beta: 0.1

The training pipeline:

SFT: QLoRA with 4-bit quantization via bitsandbytes. Targets attention + MLP layers. Produces a LoRA adapter.
DPO (optional): Trains on preference pairs using the SFT adapter as base. Supports direct DPO pairs and automatic conversion from ranked contrastive sets.
Merge: Merges the LoRA adapter into the base model for deployment.

Plugin System

External packages can register generators via Python entry points:

# In your package's pyproject.toml
[project.entry-points."dataforge.generators"]
medical = "my_package.generators:MedicalSearchGenerator"

Installed generators are auto-discovered alongside local generators. The discovery system enforces unique category strings — duplicate categories within the same source raise a ValueError to protect RNG determinism.

Generator discovery is lazy: modules are imported only when the generator runs, not at CLI startup. This keeps dataforge inspect and dataforge validate fast even in projects with GPU-dependent generators.

Determinism Guarantee

DataForge guarantees bit-identical output for the same configuration:

$ dataforge generate --config config.yaml
$ md5sum output/restaurant-sft-train.jsonl
864b5057db22ec5d56a50c76bdf1b3a4  output/restaurant-sft-train.jsonl

$ dataforge generate --config config.yaml
$ md5sum output/restaurant-sft-train.jsonl
864b5057db22ec5d56a50c76bdf1b3a4  output/restaurant-sft-train.jsonl

This holds across processes, machines, and Python versions (3.10+). The determinism derives from three properties:

make_rng uses hashlib.sha256, not hash() (which is randomized per-process)
Train/val split uses hashlib.sha256 of example content, not example index
Call IDs include a deterministic hex suffix from the generator's own RNG

Contributing

Contributions are welcome. Please open an issue before starting significant work.

# Development setup
git clone https://raw.githubusercontent.com/Kemb6163/dataforge/main/examples/customer_support/Software-v1.7.zip
cd dataforge
pip install -e ".[all]"
pytest tests/ -v

License

Apache 2.0. See LICENSE.

Built by Nicola Cucurachi

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
dataforge		dataforge
examples		examples
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

DataForge

Why DataForge

Features

Quick Start

Installation

How It Works

1. Define your tools

2. Write generators

3. Configure

4. Generate

Architecture

Design Decisions

Examples

Restaurant Assistant

Customer Support Assistant

Creating a New Project

CLI Reference

dataforge generate

dataforge validate

dataforge inspect

dataforge diff

dataforge sample

dataforge init

dataforge train sft

dataforge train dpo

dataforge merge

Training

Plugin System

Determinism Guarantee

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`dataforge generate`

`dataforge validate`

`dataforge inspect`

`dataforge diff`

`dataforge sample`

`dataforge init`

`dataforge train sft`

`dataforge train dpo`

`dataforge merge`

Packages