30 changes: 28 additions & 2 deletions README.md
@@ -120,12 +120,39 @@ python -m dfode_kit.cli.main sample \
--include_mesh
```

### 5. Continue to data preparation and training

Typical next steps are:

```bash
python -m dfode_kit.cli.main augment \
--mech /path/to/mechanisms/CH4/gri30.yaml \
--h5_file /path/to/run/oneD_flame_CH4_phi1/ch4_phi1_sample.h5 \
--output_file /path/to/data/ch4_phi1_aug.npy \
--dataset_num 20000

python -m dfode_kit.cli.main label \
--mech /path/to/mechanisms/CH4/gri30.yaml \
--time 1e-6 \
--source /path/to/data/ch4_phi1_aug.npy \
--save /path/to/data/ch4_phi1_labeled.npy

python -m dfode_kit.cli.main train \
--mech /path/to/mechanisms/CH4/gri30.yaml \
--source_file /path/to/data/ch4_phi1_labeled.npy \
--output_path /path/to/models/ch4_phi1_model.pt
```

See the published data workflow guide for the expected artifacts and stage boundaries:
- https://deepflame-ai.github.io/DFODE-kit/data-workflow/

## Recommended documentation entry points

If you are using the CLI, start with:
- https://deepflame-ai.github.io/DFODE-kit/cli/
- https://deepflame-ai.github.io/DFODE-kit/init/
- https://deepflame-ai.github.io/DFODE-kit/run-case/
- https://deepflame-ai.github.io/DFODE-kit/data-workflow/

If you are working on the repository itself, see:
- `AGENTS.md`
@@ -135,8 +162,7 @@ If you are working on the repository itself, see:

- `dfode_kit/cli/` — CLI entrypoints and subcommands
- `dfode_kit/cases/` — case init, presets, sampling, and DeepFlame/OpenFOAM-facing helpers
- `dfode_kit/data/` — data contracts, HDF5 I/O, and integration helpers
- `dfode_kit/data_operations/` — augmentation and labeling workflows
- `dfode_kit/data/` — data contracts, HDF5 I/O, integration, augmentation, and labeling helpers
- `dfode_kit/models/` — model architectures and registries
- `dfode_kit/training/` — training configuration, registries, training loops, and preprocessing
- `canonical_cases/` — canonical flame case templates
26 changes: 22 additions & 4 deletions docs/architecture.md
@@ -4,8 +4,7 @@

- `dfode_kit/cli/`: CLI entrypoints and subcommands
- `dfode_kit/cases/`: explicit case init, presets, sampling, and DeepFlame-facing helpers
- `dfode_kit/data/`: contracts, HDF5 I/O, and integration utilities
- `dfode_kit/data_operations/`: augmentation and labeling workflows
- `dfode_kit/data/`: contracts, HDF5 I/O, integration, augmentation, and labeling utilities
- `dfode_kit/models/`: model architectures and registries
- `dfode_kit/training/`: training configuration, training loops, registries, and preprocessing
- `docs/agents/`: agent-facing operational and planning docs
@@ -21,11 +20,30 @@ The repository now includes:
- lightweight CI
- documentation topology for agents and maintainers

### 2. Data contracts
A new contracts layer is being used to make HDF5 dataset assumptions explicit and testable.
### 2. Data contracts and workflow boundaries
A contracts layer is used to make HDF5 dataset assumptions explicit and testable.
The canonical `dfode_kit.data` package now also owns the main data-preparation boundary:

- HDF5 sampling outputs
- HDF5-to-NumPy conversion
- perturbation-based augmentation
- CVODE/Cantera labeling
- integration utilities used by downstream workflows

### 3. Config-driven training
The training stack is moving toward explicit config objects and registries so new model architectures and trainer types can be added without editing a monolithic training loop.

### 4. Agent-friendly CLI
The CLI now uses lighter command discovery and deferred heavy imports for improved usability in minimal environments.

## Architectural end state of the recent refactor

The repository has now completed the transition away from the older compatibility layout. In particular, these legacy layers are removed from `main`:

- `dfode_kit/cli_tools/`
- `dfode_kit/df_interface/`
- `dfode_kit/data_operations/`
- `dfode_kit/runtime_config.py`
- legacy `dfode_core` model/train compatibility packages

The current published docs should therefore treat `cli`, `cases`, `data`, `models`, `runtime`, and `training` as the only canonical implementation homes.
41 changes: 40 additions & 1 deletion docs/cli.md
@@ -74,15 +74,52 @@ dfode-kit sample \
### `augment`
Apply perturbation-based dataset augmentation to sampled states.

Example:

```bash
dfode-kit augment \
--mech /path/to/gri30.yaml \
--h5_file /path/to/sample.h5 \
--output_file /path/to/augmented.npy \
--dataset_num 20000
```

### `label`
Generate supervised learning targets using Cantera/CVODE time advancement.

Example:

```bash
dfode-kit label \
--mech /path/to/gri30.yaml \
--time 1e-6 \
--source /path/to/augmented.npy \
--save /path/to/labeled.npy
```

### `train`
Train a neural-network surrogate for chemistry integration.

Example:

```bash
dfode-kit train \
--mech /path/to/gri30.yaml \
--source_file /path/to/labeled.npy \
--output_path /path/to/model.pt
```

### `h52npy`
Convert HDF5 scalar-field datasets into a stacked NumPy array.

Example:

```bash
dfode-kit h52npy \
--source /path/to/sample.h5 \
--save_to /path/to/sample.npy
```

## Current design notes

Recent CLI refactors improved:
@@ -92,7 +129,9 @@ Recent CLI refactors improved:
- lazy command loading for lighter help paths,
- more predictable command dispatch behavior.

The new `init` command already supports machine-readable JSON output for planning/provenance.
The new `init` command already supports machine-readable JSON output for planning/provenance, and `run-case` supports JSON output for preview/apply results.

For the end-to-end artifact flow between `sample`, `augment`, `label`, `h52npy`, and `train`, see [Data Preparation and Training Workflow](data-workflow.md).

Future work should still add:

179 changes: 179 additions & 0 deletions docs/data-workflow.md
@@ -0,0 +1,179 @@
# Data Preparation and Training Workflow

This page documents the currently exposed CLI stages after a case has been initialized and run successfully.

It focuses on the data pipeline from:

1. finished DeepFlame/OpenFOAM case outputs
2. sampled HDF5 state data
3. optional HDF5-to-NumPy conversion
4. augmented state datasets
5. labeled supervised-learning datasets
6. trained surrogate model artifacts

## Stage boundaries

The current CLI presents the data workflow as a sequence of artifact transformations.

### 1. `sample`
Input:
- a finished case directory
- a mechanism file

Output:
- an HDF5 file containing sampled scalar fields
- optionally mesh datasets

Example:

```bash
dfode-kit sample \
--mech /path/to/gri30.yaml \
--case /path/to/run/oneD_flame_CH4_phi1 \
--save /path/to/run/oneD_flame_CH4_phi1/ch4_phi1_sample.h5 \
--include_mesh
```

Typical contents include:
- root metadata such as `mechanism`
- `scalar_fields/` datasets keyed by output time
- optional mesh datasets
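
The layout above can be explored with `h5py`. The snippet below builds a tiny file that mimics the described structure (the exact group and attribute names here are illustrative assumptions, not the tool's contract) and then inspects it the way a downstream consumer might:

```python
import h5py
import numpy as np

# Build a small HDF5 file mimicking the assumed sample layout:
# root "mechanism" metadata plus time-keyed scalar_fields datasets.
with h5py.File("sample_demo.h5", "w") as f:
    f.attrs["mechanism"] = "gri30.yaml"
    grp = f.create_group("scalar_fields")
    grp.create_dataset("0.001", data=np.random.rand(64, 5))
    grp.create_dataset("0.002", data=np.random.rand(64, 5))

# Inspect root metadata and the time-keyed datasets.
with h5py.File("sample_demo.h5", "r") as f:
    print(f.attrs["mechanism"])        # root metadata
    print(sorted(f["scalar_fields"]))  # datasets keyed by output time
```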

### 2. `h52npy`
Input:
- sampled HDF5 file

Output:
- stacked NumPy array of scalar fields

Example:

```bash
dfode-kit h52npy \
--source /path/to/run/oneD_flame_CH4_phi1/ch4_phi1_sample.h5 \
--save_to /path/to/data/ch4_phi1_sample.npy
```

Use this when downstream workflows need a single NumPy array rather than time-indexed HDF5 datasets.
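
The stacking idea can be sketched in plain `h5py`/NumPy. The group and key names below are illustrative assumptions; the real converter's dataset selection and ordering may differ:

```python
import h5py
import numpy as np

# Toy input mimicking the assumed sampler output.
with h5py.File("stack_demo.h5", "w") as f:
    g = f.create_group("scalar_fields")
    g.create_dataset("0.001", data=np.full((4, 3), 1.0))
    g.create_dataset("0.002", data=np.full((4, 3), 2.0))

# Stack the time-keyed datasets into one array, sorted by time key,
# roughly what a single-NumPy-array consumer would expect.
with h5py.File("stack_demo.h5", "r") as f:
    keys = sorted(f["scalar_fields"], key=float)
    stacked = np.concatenate([f["scalar_fields"][k][...] for k in keys], axis=0)

np.save("stack_demo.npy", stacked)
print(stacked.shape)  # two 4x3 snapshots stacked row-wise: (8, 3)
```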

### 3. `augment`
Input:
- sampled HDF5 file
- mechanism file

Output:
- augmented NumPy dataset

Example:

```bash
dfode-kit augment \
--mech /path/to/gri30.yaml \
--h5_file /path/to/run/oneD_flame_CH4_phi1/ch4_phi1_sample.h5 \
--output_file /path/to/data/ch4_phi1_aug.npy \
--dataset_num 20000
```

Current optional controls:
- `--heat_limit`
- `--element_limit`
- `--perturb_factor`

## Current note on `augment`

The current CLI surface exposes `--perturb_factor`, but the present command implementation does not yet thread that value through to the underlying augmentation routine. Treat the command as functional, but the public option surface here is not yet fully normalized.
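
Conceptually, perturbation-based augmentation resamples states and applies small multiplicative noise. The NumPy sketch below is a guess at that general idea, not the tool's exact algorithm; in particular, the real routine's constraints (such as the heat and element limits above) are not reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_states(states: np.ndarray, dataset_num: int,
                   perturb_factor: float = 0.05) -> np.ndarray:
    """Resample rows with replacement and apply small multiplicative noise.

    Illustrative stand-in for perturbation-based augmentation; the actual
    sampling scheme and physical constraints are not reproduced here.
    """
    idx = rng.integers(0, len(states), size=dataset_num)
    noise = 1.0 + perturb_factor * rng.uniform(-1.0, 1.0, size=(dataset_num, states.shape[1]))
    return states[idx] * noise

base = rng.random((100, 7))  # e.g. T plus species mass fractions
augmented = perturb_states(base, dataset_num=20000)
print(augmented.shape)  # (20000, 7)
```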

### 4. `label`
Input:
- mechanism file
- NumPy state dataset
- reactor advancement time step

Output:
- labeled NumPy dataset suitable for supervised learning

Example:

```bash
dfode-kit label \
--mech /path/to/gri30.yaml \
--time 1e-6 \
--source /path/to/data/ch4_phi1_aug.npy \
--save /path/to/data/ch4_phi1_labeled.npy
```

Conceptually, this stage advances each sampled state with Cantera/CVODE and writes paired source/target state data.
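
The pairing structure can be sketched without Cantera. In the toy below, a linear relaxation stands in for the CVODE reactor advancement so the source/target layout is runnable anywhere; the relaxation rate and the `[source | target]` column layout are illustrative assumptions:

```python
import numpy as np

def label_states(states: np.ndarray, dt: float) -> np.ndarray:
    """Pair each source state with its time-advanced target state.

    A real implementation would advance each state with a Cantera
    reactor integrated by CVODE; a toy exponential relaxation stands
    in here so the pairing is runnable without Cantera.
    """
    rate = 1.0e5  # fake relaxation rate, 1/s (illustrative only)
    targets = states * np.exp(-rate * dt)
    # Each labeled row carries [source | target].
    return np.hstack([states, targets])

src = np.random.rand(10, 5)
labeled = label_states(src, dt=1e-6)
print(labeled.shape)  # (10, 10)
```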

### 5. `train`
Input:
- mechanism file
- labeled NumPy dataset

Output:
- trained model artifact written to the requested output path

Example:

```bash
dfode-kit train \
--mech /path/to/gri30.yaml \
--source_file /path/to/data/ch4_phi1_labeled.npy \
--output_path /path/to/models/ch4_phi1_model.pt
```
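
The artifact flow of this stage (labeled source/target pairs in, serialized model out) can be sketched with a plain least-squares stand-in. The real trainer uses neural-network architectures and config-driven registries; only the dataset-to-artifact shape is shown, and all names here are illustrative:

```python
import numpy as np

# Synthetic "labeled" data: targets are a linear map of sources.
rng = np.random.default_rng(1)
X = rng.random((256, 5))      # source states
true_W = rng.random((5, 5))
Y = X @ true_W                # labeled targets

# Fit targets = X @ W by gradient descent on the mean-squared error.
W = np.zeros((5, 5))
for _ in range(2000):
    grad = X.T @ (X @ W - Y) / len(X)
    W -= 0.5 * grad

np.save("surrogate_demo.npy", W)  # model artifact written to disk
print(float(np.abs(X @ W - Y).max()))
```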

## Recommended artifact layout

A practical directory layout is:

```text
<project-root>/
runs/
oneD_flame_CH4_phi1/
ch4_phi1_sample.h5
data/
ch4_phi1_sample.npy
ch4_phi1_aug.npy
ch4_phi1_labeled.npy
models/
ch4_phi1_model.pt
```

This keeps:
- case-run artifacts near the case directory
- derived training datasets under a separate `data/` area
- trained models under a separate `models/` area

## Current limitations and documentation gaps

The CLI surface for the data pipeline is usable, but not yet as normalized as `init` and `run-case`.

Current gaps include:
- limited machine-readable JSON output for `sample`, `augment`, `label`, and `train`
- older option naming conventions such as `--h5_file` and `--source_file`
- thinner published documentation for training outputs and configuration details than for case init/run

These are good future cleanup targets, but the commands above describe the current behavior on `main`.

## Validated minimal sequence

For a validated 1D flame workflow, the current practical sequence is:

```bash
dfode-kit init oneD-flame ... --apply
dfode-kit run-case --case /path/to/case --apply --json
dfode-kit sample --mech /path/to/gri30.yaml --case /path/to/case --save /path/to/sample.h5 --include_mesh
```

After sampling, continue with either:

```bash
dfode-kit h52npy --source /path/to/sample.h5 --save_to /path/to/sample.npy
```

or directly with augmentation/labeling:

```bash
dfode-kit augment ...
dfode-kit label ...
dfode-kit train ...
```
30 changes: 30 additions & 0 deletions docs/getting-started.md
@@ -39,6 +39,36 @@ uv venv .venv
uv pip install --python .venv/bin/python -e '.[dev]'
```

## CLI entrypoint

If the console script is installed, use:

```bash
dfode-kit --help
```

A reliable fallback inside the repository is:

```bash
.venv/bin/python -m dfode_kit.cli.main --help
```

## Runtime environment split

Different stages of the workflow may require different dependencies:

- lightweight repository verification: local `.venv`
- canonical case initialization: Python environment with `cantera`
- case execution: configured OpenFOAM + Conda + DeepFlame runtime via `dfode-kit config` and `dfode-kit run-case`
- sampling / labeling: Python environment with `cantera`, `numpy`, and `h5py`

If you are starting with the case workflow, continue to:

1. [CLI](cli.md)
2. [Canonical Case Initialization](init.md)
3. [Runtime Configuration and Case Execution](run-case.md)
4. [Data Preparation and Training Workflow](data-workflow.md)

## Current focus

The project is being refactored toward:
1 change: 1 addition & 0 deletions docs/index.md
@@ -13,6 +13,7 @@ DFODE-kit is a Python toolkit for accelerating combustion chemistry integration
- **CLI**: current `dfode-kit` commands and their purpose
- **Canonical Case Initialization**: preset-based case setup with preview/apply/config workflows
- **Runtime Configuration and Case Execution**: persistent machine-local environment config plus reproducible case launching
- **Data Preparation and Training Workflow**: the current artifact flow from sampled HDF5 to labeled datasets and models
- **Architecture**: repo layout and current refactor direction
- **Tutorials and Workflow**: how to think about the DFODE pipeline
- **Agent Docs**: operational guidance for coding agents and maintainers
12 changes: 12 additions & 0 deletions docs/tutorials.md
@@ -24,3 +24,15 @@ A future docs iteration can bring notebook tutorials into the published site, bu
- repository architecture,
- CLI guidance,
- agent and maintainer workflow documentation.

## Practical workflow entry points

For reproducible command-line usage, use the published Markdown docs in this order:

1. [Getting Started](getting-started.md)
2. [CLI](cli.md)
3. [Canonical Case Initialization](init.md)
4. [Runtime Configuration and Case Execution](run-case.md)
5. [Data Preparation and Training Workflow](data-workflow.md)

That sequence reflects the currently validated path from case creation to sampled/training-ready datasets.