This document covers running Continued Pre-Training (CPT) on a base LLM using the temporally-filtered pre-1969 corpus. The training script (scripts/train_deepred_model.py) supports two profiles:
| Profile | Model | Parameters | Purpose | Default Data |
|---|---|---|---|---|
| dev | SmolLM2-360M | 360M | Fast iteration & validation | 5% of corpus |
| prod | TinyLlama-1.1B | 1.1B | Final production model | 100% of corpus |
Dev mode produces a quick checkpoint for validating that CPT suppresses post-1969 knowledge without destroying language quality. Prod mode runs the full multi-week training.
The script uses run orchestration to manage long-running training:
- Each run gets a human-readable name (e.g., `dev-2026-03-07`)
- Re-launching with the same parameters automatically resumes from the last checkpoint
- Changing parameters requires a new run name to prevent accidental overwriting
- Completed runs stop cleanly and inform the user
- GGUF models are exported at each epoch boundary for testing in LM Studio
See ModelTraining.md for background on the CPT approach, base model selection, and the overall training roadmap.
- Strix Halo setup complete — ROCm, containers, models. See StrixHalo-Fedora-Setup.md.
- Fine-tuning container created — `setup_strixhalo.py` stage `training_toolbox` creates the `strix-halo-finetuning` container (image: `kyuz0/amd-strix-halo-llm-finetuning`). This container ships with gfx1151-compiled PyTorch from AMD's TheRock nightly builds.
- Base models downloaded — `setup_strixhalo.py` downloads both SmolLM2-360M and TinyLlama-1.1B to `/mnt/data/models/`.
- Training corpus tokenized — `create_training_corpus.py` produces `train.bin` and `val.bin`. See TrainingCorpus-Setup.md.
- llama.cpp GGUF tools installed — `setup_strixhalo.py` stage `training_gguf_tools` clones llama.cpp to `/mnt/data/llama.cpp` and installs its Python requirements in the fine-tuning container. This enables automatic GGUF export at epoch boundaries.
All training commands must run inside the container. Start it and enter an interactive shell:
```bash
podman start strix-halo-finetuning
podman exec -it strix-halo-finetuning bash
```

Once inside (`bash-5.3$` prompt), activate the venv:

```bash
source /opt/venv/bin/activate
cd /mnt/data/DeepRedAI
```

Verify that PyTorch has ROCm/HIP support:

```bash
python3 -c "import torch; print('PyTorch', torch.__version__, '| HIP', torch.version.hip)"
# Expected: PyTorch 2.12.0a0+rocm7.12.0a20260307 | HIP 7.12.60610 (or similar)
```

One-liner alternative (no interactive shell):

```bash
podman exec strix-halo-finetuning bash -c \
  'source /opt/venv/bin/activate && cd /mnt/data/DeepRedAI && python3 scripts/train_deepred_model.py'
```
Important: The host venv (`/mnt/data/venv`) does NOT have gfx1151-compatible PyTorch and will segfault on GPU operations.
The training script reads pre-tokenized binary data (train.bin / val.bin). Each model requires its own tokenized corpus because tokenizer vocabularies differ (SmolLM2: 49,152 tokens, TinyLlama: 32,000 tokens).
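Because both vocabularies fit in 16 bits, the corpus can be stored as a flat `uint16` array and memory-mapped at training time. A minimal sketch of such a reader (illustrative only — the actual loader in `train_deepred_model.py` may differ, and `seq_len`/batching here are assumptions):

```python
import numpy as np

def load_token_batches(bin_path, seq_len=2048, batch_size=8):
    """Memory-map a flat uint16 token file and yield (input, target) batches.

    uint16 works because both vocabularies stay under 65,536 entries
    (SmolLM2: 49,152; TinyLlama: 32,000).
    """
    data = np.memmap(bin_path, dtype=np.uint16, mode="r")
    n_seqs = (len(data) - 1) // seq_len  # -1 so targets can shift by one
    for start in range(0, n_seqs - batch_size + 1, batch_size):
        idx = np.arange(start, start + batch_size) * seq_len
        x = np.stack([data[i : i + seq_len] for i in idx]).astype(np.int64)
        y = np.stack([data[i + 1 : i + 1 + seq_len] for i in idx]).astype(np.int64)
        yield x, y  # next-token prediction: y is x shifted left by one
```

The memmap keeps the multi-gigabyte `train.bin` out of RAM; only the slices actually touched are paged in.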
```bash
python3 scripts/create_training_corpus.py --tokenizer SmolLM2-360M --percent 5
python3 scripts/create_training_corpus.py --tokenizer SmolLM2-360M --finalize
```

For a longer dev run, increase `--percent` (or tokenize 100% of the corpus and limit at training time with `--data-percent`).
The full corpus should already exist from corpus preparation (see TrainingCorpus-Setup.md):
```
/mnt/data/training_corpus/TinyLlama-1.1B/
├── train.bin       (3.6 GB, 1.93B tokens)
├── val.bin         (38 MB, 19.4M tokens)
└── manifest.json
```
Launch the default dev profile — SmolLM2-360M, 5% data, 3 epochs:
```bash
python3 scripts/train_deepred_model.py
```

This creates a run named `dev-2026-03-07` (using today's date) and writes output to `/mnt/data/training_output/dev-2026-03-07/`. The script prints an estimated completion date and time at startup.
Verify the GPU and pipeline work before committing to a long run:
```bash
python3 scripts/train_deepred_model.py --data-percent 1 --max-steps 100
```

| Data % | ~Tokens (3 epochs) | ~Time at 1.2K tok/s |
|---|---|---|
| 1% | ~60M | ~14 hours |
| 5% | ~300M | ~69 hours |
| 10% | ~600M | ~139 hours |
| 100% | ~6B | ~58 days |
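The table's figures follow directly from total tokens seen divided by sustained throughput. A back-of-the-envelope sketch (assuming the 1.93B-token full corpus and a steady 1.2K tok/s — real throughput will fluctuate):

```python
def estimate_hours(data_percent, epochs=3, corpus_tokens=1.93e9, tok_per_sec=1200):
    """Rough wall-clock estimate: tokens seen / sustained throughput."""
    total_tokens = corpus_tokens * (data_percent / 100) * epochs
    return total_tokens / tok_per_sec / 3600

for pct in (1, 5, 10, 100):
    h = estimate_hours(pct)
    print(f"{pct:>3}% -> ~{h:,.0f} h (~{h / 24:.1f} days)")
```

Small differences from the table come from rounding the per-percent token count.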
Simply re-run the same command:
```bash
python3 scripts/train_deepred_model.py
```

The script detects the existing `dev-2026-03-07` run, verifies that the parameters match, and automatically resumes from the last checkpoint. No `--resume` flag is needed.
You can also interrupt gracefully with Ctrl+C — the script saves a checkpoint before exiting.
At each epoch boundary, the script exports a GGUF model for testing in LM Studio:
```
/mnt/data/training_output/dev-2026-03-07/
├── gguf/
│   ├── dev-2026-03-07-epoch1.gguf   ← after epoch 1
│   ├── dev-2026-03-07-epoch2.gguf   ← after epoch 2
│   ├── dev-2026-03-07-epoch3.gguf   ← after epoch 3
│   └── dev-2026-03-07-final.gguf    ← after final evaluation
```
Copy a GGUF to LM Studio and test temporal compliance:
```bash
# Copy to LM Studio models directory
cp /mnt/data/training_output/dev-2026-03-07/gguf/dev-2026-03-07-epoch1.gguf \
   /mnt/data/lmstudio/models/
```

GGUF export defaults to `q8_0` quantization. Change it with `--gguf-quant q4_k_m` for smaller files. Disable it entirely with `--no-gguf`.
Note: GGUF export uses `convert_hf_to_gguf.py` from llama.cpp (installed by `setup_strixhalo.py` stage `training_gguf_tools`). If you need to override the path, use `--llama-cpp-path`.
Once dev results look good, launch production training:
```bash
python3 scripts/train_deepred_model.py --profile prod
```

This creates a run named `prod-2026-03-07` with production defaults:
- Model: TinyLlama-1.1B
- Data: 100% of corpus (1.93B tokens)
- Epochs: 5 (~9.6B tokens total)
- Effective batch: 128 sequences (262K tokens)
- LR: 3e-4 → 3e-5 cosine decay
- Estimated time: ~3-5 weeks
Resume after interruption works the same way — re-run the same command.
Training progress is logged to both console and files:
```bash
# Live training log
tail -f /mnt/data/training_output/dev-2026-03-07/train.log

# Metrics (JSON lines — step, loss, lr, tokens/sec, MFU)
python3 -m json.tool --json-lines /mnt/data/training_output/dev-2026-03-07/metrics.jsonl

# Generated text samples (temporal compliance checks)
less /mnt/data/training_output/dev-2026-03-07/samples.log
```

What to watch for:
- Loss decreasing — Steady decline indicates learning. Expect faster drops during warmup.
- Val loss — Should track train loss. If val loss rises while train loss falls, you are overfitting.
- Text samples — Check that generated text is coherent and era-appropriate (pre-1969 content).
- MFU — Model FLOPS Utilization. 25-35% is expected for RDNA 3.5.
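Since `metrics.jsonl` is one JSON object per line, the overfitting check above is easy to automate. A small sketch (the field names `train_loss` and `val_loss` are assumptions about the log schema — adjust to what the script actually emits):

```python
import json

def detect_overfitting(jsonl_path, window=5):
    """Return True if val loss is rising while train loss keeps falling.

    Compares the mean of the last `window` eval records against the
    previous `window` — a crude but serviceable early-warning check.
    """
    recs = [json.loads(line) for line in open(jsonl_path) if line.strip()]
    evals = [r for r in recs if "val_loss" in r]
    if len(evals) < 2 * window:
        return False  # not enough eval points to compare yet

    def mean(rs, key):
        return sum(r[key] for r in rs) / len(rs)

    prev, last = evals[-2 * window : -window], evals[-window:]
    train_falling = mean(last, "train_loss") < mean(prev, "train_loss")
    val_rising = mean(last, "val_loss") > mean(prev, "val_loss")
    return train_falling and val_rising
```

Averaging over a window rather than comparing single points smooths out the step-to-step noise in both losses.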
Each training launch is tracked by a run name and a parameter fingerprint:
| Scenario | What happens |
|---|---|
| First launch | Creates run directory, saves run_meta.json, starts training |
| Re-launch, same parameters | Auto-resumes from last checkpoint |
| Re-launch, different parameters | Blocks with a diff of changed parameters |
| Re-launch, training completed | Shows completion message and exits |
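A parameter fingerprint of this kind is typically a stable hash over the run's hyperparameters. A hedged sketch of the idea (not the script's actual implementation or the `run_meta.json` schema):

```python
import hashlib
import json

def fingerprint(params: dict) -> str:
    """Stable hash over sorted params: same settings -> same fingerprint."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def check_run(existing_params: dict, new_params: dict) -> str:
    """Decide the relaunch scenario, mirroring the table above."""
    if fingerprint(existing_params) == fingerprint(new_params):
        return "resume"
    diff = {k: (existing_params.get(k), new_params.get(k))
            for k in set(existing_params) | set(new_params)
            if existing_params.get(k) != new_params.get(k)}
    return f"blocked: {diff}"
```

Sorting the keys before hashing is what makes the fingerprint order-independent: the same flags in any order map to the same run.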
By default, runs are named `{profile}-{YYYY-MM-DD}` (e.g., `dev-2026-03-07`). Override with `--run-name`:

```bash
python3 scripts/train_deepred_model.py --run-name dev-experiment-cosine
```

When a run finishes, re-launching the same command shows:
```
Run 'dev-2026-03-07' is COMPLETED
Finished: 2026-03-10T14:32:00
Output: /mnt/data/training_output/dev-2026-03-07
```
To start a fresh run:
```bash
# Auto-increment name (dev-2026-03-07-2, dev-2026-03-07-3, ...)
python3 scripts/train_deepred_model.py --new-run

# Or use a custom name
python3 scripts/train_deepred_model.py --run-name dev-round2
```

If you change parameters (e.g., `--lr 1e-4`) for an existing run, the script blocks:
```
ERROR: Run 'dev-2026-03-07' exists with different parameters.
Changed parameters:
  lr: 0.0003 -> 0.0001
To start a new run with these parameters, use:
  --run-name <custom-name>
```
Provide a new name to proceed:
```bash
python3 scripts/train_deepred_model.py --lr 1e-4 --run-name dev-lr1e4
```

The explicit `--resume` flag bypasses run orchestration entirely:

```bash
python3 scripts/train_deepred_model.py --resume /mnt/data/training_output/dev-2026-03-07/latest
```

All parameters have profile-specific defaults. CLI flags override any profile default.
| Parameter | Dev Default | Prod Default | CLI Flag |
|---|---|---|---|
| Model | SmolLM2-360M | TinyLlama-1.1B | --profile |
| Epochs | 3 | 5 | --epochs |
| Learning rate | 3e-4 | 3e-4 | --lr |
| Min LR | 3e-5 | 3e-5 | --min-lr |
| Warmup steps | 500 | 2000 | --warmup-steps |
| Micro-batch | 8 | 4 | --micro-batch-size |
| Grad accumulation | 16 | 32 | --gradient-accumulation-steps |
| Effective batch | 128 seqs | 128 seqs | (computed) |
| Weight decay | 0.1 | 0.1 | --weight-decay |
| Max grad norm | 1.0 | 1.0 | --max-grad-norm |
| Data % | 5% | 100% | --data-percent |
| Eval interval | 250 steps | 500 steps | --eval-interval |
| Save interval | 1000 steps | 2000 steps | --save-interval |
| Log interval | 10 steps | 10 steps | --log-interval |
| Sample interval | 500 steps | 1000 steps | --sample-interval |
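The schedule implied by the table (linear warmup to `--lr`, then cosine decay down to `--min-lr`) can be sketched as follows. This is the standard warmup-plus-cosine shape, assumed from the defaults above — verify against the script before relying on it:

```python
import math

def lr_at(step, total_steps, warmup=500, lr=3e-4, min_lr=3e-5):
    """Linear warmup to `lr`, then cosine decay down to `min_lr`."""
    if step < warmup:
        return lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the rate is exactly `lr`; at the final step the cosine term reaches -1 and the rate bottoms out at `min_lr`.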
| Flag | Default | Effect |
|---|---|---|
| `--run-name NAME` | `{profile}-YYYY-MM-DD` | Custom run name |
| `--new-run` | off | Start new run even if previous is completed (auto-increments name) |
| Flag | Default | Effect |
|---|---|---|
| `--no-gguf` | off | Disable GGUF export at epoch boundaries |
| `--gguf-quant TYPE` | `q8_0` | Quantization type (q8_0, q4_k_m, f16, etc.) |
| `--llama-cpp-path PATH` | `$DEEPRED_ROOT/llama.cpp` | Path to llama.cpp directory |
| Flag | Default | Effect |
|---|---|---|
| `--compile` | off | Use `torch.compile()` for potential speedup (experimental on ROCm) |
| `--no-gradient-checkpointing` | off | Disable gradient checkpointing (faster but uses more memory) |
| `--num-workers N` | auto | DataLoader workers (auto = min(cpu_count, 8); 0 = single-process) |
| `--attn-implementation` | auto | Attention implementation (auto/sdpa/eager/flash_attention_2) |
| Flag | Default |
|---|---|
| `--model-path` | `/mnt/data/models/{model_name}/` |
| `--corpus-dir` | `/mnt/data/training_corpus/{model_name}/` |
| `--output-dir` | `/mnt/data/training_output/{run_name}/` |
```
/mnt/data/training_output/dev-2026-03-07/
├── run_meta.json          # Run orchestration state (name, fingerprint, status)
├── config.json            # Full training configuration
├── train.log              # Human-readable training log
├── metrics.jsonl          # Per-step JSON metrics (loss, lr, tok/s, MFU)
├── samples.log            # Generated text samples at each sample_interval
├── latest/                # Most recent checkpoint (for resume)
│   ├── config.json            # Model config
│   ├── model.safetensors      # Model weights
│   ├── tokenizer files...
│   └── training_state.pt      # Optimizer + step/epoch state
├── best/                  # Best checkpoint (lowest validation loss)
├── epoch-1/               # Model-only checkpoint after epoch 1
├── epoch-2/               # Model-only checkpoint after epoch 2
├── checkpoint-1000/       # Named checkpoint at step 1000
├── final/                 # Final model after training completes
└── gguf/                  # GGUF exports for LM Studio testing
    ├── dev-2026-03-07-epoch1.gguf
    ├── dev-2026-03-07-epoch2.gguf
    └── dev-2026-03-07-final.gguf
```
- `latest/` contains the optimizer state for resume capability (~3× model size due to Adam state).
- Named checkpoints (`epoch-*`, `checkpoint-*`, `best/`, `final/`) contain only model weights and tokenizer — lightweight and directly usable for inference or GGUF conversion.
- `run_meta.json` tracks run status (running/completed), the parameter fingerprint, and timestamps.
The training script runs entirely on the Strix Halo integrated GPU via ROCm. Key optimizations:
- BF16 mixed precision — Forward pass in BF16 via `torch.autocast`, FP32 master weights for optimizer stability.
- Gradient checkpointing — Trades ~30% extra compute time for up to 60% memory savings on activations.
- Fused AdamW — Entire Adam update in a single GPU kernel.
- CPU data loading — Multiple DataLoader workers with `persistent_workers` and `prefetch_factor=4`.
The fine-tuning container sets ROCm environment variables automatically:
```bash
export ROCM_PATH=/opt/rocm-7.0
export HSA_OVERRIDE_GFX_VERSION=11.0.0     # Required for gfx1151
export ROCBLAS_USE_HIPBLASLT=1             # Optimized matrix math
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
```

With 128 GB unified memory, training comfortably fits both models:
| Model | FP32 Weights | Optimizer (Adam) | Gradients | Activations* | Total |
|---|---|---|---|---|---|
| SmolLM2-360M | 1.4 GB | 2.8 GB | 1.4 GB | ~1-3 GB | ~7-9 GB |
| TinyLlama-1.1B | 4.4 GB | 8.8 GB | 4.4 GB | ~3-8 GB | ~21-26 GB |
*With gradient checkpointing enabled. Without: multiply activations by ~3-4×.
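The static part of the table can be reproduced from first principles: FP32 weights cost 4 bytes/param, Adam keeps two FP32 moments (8 bytes/param), and gradients add another 4. A quick sketch (activation memory is workload-dependent and excluded here):

```python
def training_memory_gb(params_billion):
    """Static training memory: FP32 master weights + Adam moments + grads."""
    bytes_per_param = 4 + 8 + 4  # weights + Adam m,v moments + gradients
    return params_billion * 1e9 * bytes_per_param / 1e9

for name, p in [("SmolLM2-360M", 0.36), ("TinyLlama-1.1B", 1.1)]:
    print(f"{name}: ~{training_memory_gb(p):.1f} GB + activations")
```

This matches the table's weights + optimizer + gradients columns (5.6 GB and 17.6 GB); the activation column is what gradient checkpointing compresses.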
The A4000 cannot be used for distributed training with the Strix Halo due to cross-vendor incompatibility. It continues its role as the inference server for SFT dataset generation, embedding computation, and interactive testing via LM Studio.
You are likely running outside the fine-tuning container. Training must run inside strix-halo-finetuning:
```bash
podman start strix-halo-finetuning
podman exec -it strix-halo-finetuning bash
source /opt/venv/bin/activate
cd /mnt/data/DeepRedAI
python3 scripts/train_deepred_model.py
```

Common mistake: Pasting all lines at once. The `podman exec -it ... bash` opens a sub-shell — wait for the `bash-5.3$` prompt before typing further commands.
Run create_training_corpus.py for the correct model:
```bash
python3 scripts/create_training_corpus.py --tokenizer SmolLM2-360M --percent 5
python3 scripts/create_training_corpus.py --tokenizer SmolLM2-360M --finalize
```

Reduce `--micro-batch-size` (try 4, 2, or 1). Ensure `--no-gradient-checkpointing` is NOT set. Reduce `--num-workers` if CPU memory is the bottleneck.
- Verify ROCm: `python3 -c "import torch; print(torch.version.hip)"`
- Check that training is on the GPU (look for "GPU 0:" in the log, not "CPU")
- Try `--compile` for a potential speedup
- Increase `--micro-batch-size` to improve GPU utilization (if memory allows)
- Validate the corpus data: `python3 -c "import numpy as np; d = np.fromfile('/mnt/data/training_corpus/SmolLM2-360M/train.bin', dtype=np.uint16); print(d.shape, d.min(), d.max())"`
- Verify that the tokenizer matches the corpus (token IDs should be within the vocab range)
- Try reducing the learning rate (`--lr 1e-4`)
The script logs a warning if llama.cpp is not found. Re-run the setup stage:
```bash
sudo -E python3 scripts/setup_strixhalo.py --stage training_gguf_tools
```

This clones llama.cpp to `/mnt/data/llama.cpp` and installs Python requirements in the container. To override the path at runtime: `--llama-cpp-path /path/to/llama.cpp`
After CPT completes:
- Evaluate — Check `samples.log` and test the `final.gguf` in LM Studio for temporal compliance
- SFT for Deep Red persona — LoRA fine-tuning on ChatML data (see ModelTraining.md, Prod Phase 5)
- Deploy — The GGUF models in `gguf/` are ready for LM Studio; the `final/` directory is standard HuggingFace format for `transformers` or PEFT