Commit 7904601

add docs
1 parent 75aa3bc commit 7904601

File tree

1 file changed, +41 -0 lines changed


README.md

Lines changed: 41 additions & 0 deletions
@@ -25,6 +25,7 @@ The library now supports reasoning traces through the `reasoning_content` field
- [Using the library](#using-the-library)
- [Data format](#data-format)
- [Reasoning content support](#reasoning-content-support-1)
- [Continual pretraining mode](#continual-pretraining-mode)
- [Documentation](#documentation)
- [Learning about the training arguments](#learning-about-training-arguments)
- [`TrainingArgs`](#trainingargs)
@@ -122,6 +123,46 @@ The library now supports an optional `reasoning_content` field in addition to th
}
```

## Continual pretraining mode

In addition to instruction tuning, the library can run document-style continual pretraining on raw text corpora.
Enable this by supplying a block size when invoking `main_ds.py`:

```bash
torchrun main_ds.py \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --data_path /data/documents.jsonl \
    --ckpt_output_dir ./checkpoints \
    --effective_batch_size 128 \
    --max_batch_len 60000 \
    --block-size 8192 \
    --document-column-name text  # optional, defaults to "document"
```

- `--block-size` (required) toggles continual pretraining and controls how many tokens are packed into each block.
- `--document-column-name` (optional) specifies which JSONL field contains the raw document text (see the example input below).

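For illustration, the `documents.jsonl` consumed by the invocation above would hold one raw document per line, keyed by the field named in `--document-column-name`; the document text here is invented:

```json
{"text": "Continual pretraining adapts a base model to a new domain by training it on raw, unlabeled text."}
{"text": "Each line of the corpus file is a standalone JSON object containing one document."}
```
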
The same options are available programmatically via `TrainingArgs.pretraining_config`:

```python
from instructlab.training import TrainingArgs, PretrainingConfig

train_args = TrainingArgs(
    model_name_or_path="mistralai/Mistral-7B-v0.1",
    data_path="documents.jsonl",
    ckpt_output_dir="./checkpoints",
    max_seq_len=4096,
    max_batch_len=40000,
    effective_batch_size=128,
    pretraining_config=PretrainingConfig(
        block_size=2048,
        document_column_name="text",  # optional
    ),
)
```

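As a usage sketch, these args would then be handed to the library's `run_training` entry point; the `TorchrunArgs` values below are placeholders for a single-node, 8-GPU run:

```python
from instructlab.training import TorchrunArgs, run_training

# distributed launch settings; adjust to your hardware
torch_args = TorchrunArgs(
    nnodes=1,
    nproc_per_node=8,
    node_rank=0,
    rdzv_id=123,
    rdzv_endpoint="127.0.0.1:12345",
)

run_training(torch_args=torch_args, train_args=train_args)
```
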
When a pretraining config is provided, `process_documents_for_pretraining()` is invoked under the hood to tokenize raw documents before training.
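For intuition, here is a minimal sketch of the fixed-size block packing that `--block-size` controls: tokenize each document, concatenate the token streams, and slice the result into `block_size`-token blocks. The helper name, the EOS separator between documents, and the drop-the-remainder policy are illustrative assumptions, not the library's documented behavior:

```python
from transformers import AutoTokenizer

def pack_documents_into_blocks(documents: list[str], block_size: int) -> list[list[int]]:
    """Illustrative only: pack raw documents into fixed-size token blocks."""
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    token_stream: list[int] = []
    for doc in documents:
        token_stream.extend(tokenizer(doc)["input_ids"])
        # mark the document boundary so the model sees where one text ends
        token_stream.append(tokenizer.eos_token_id)
    # keep only full blocks; a trailing partial block is dropped here
    n_full = len(token_stream) // block_size
    return [token_stream[i * block_size : (i + 1) * block_size] for i in range(n_full)]
```
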
**Standard message structure:**

```json
