@@ -25,6 +25,7 @@ The library now supports reasoning traces through the `reasoning_content` field
 - [Using the library](#using-the-library)
 - [Data format](#data-format)
   - [Reasoning content support](#reasoning-content-support-1)
+  - [Continual pretraining mode](#continual-pretraining-mode)
 - [Documentation](#documentation)
 - [Learning about training arguments](#learning-about-training-arguments)
   - [`TrainingArgs`](#trainingargs)
@@ -122,6 +123,46 @@ The library now supports an optional `reasoning_content` field in addition to the
 }
 ```
 
+## Continual pretraining mode
+
+In addition to instruction tuning, the library can run document-style continual pretraining on raw text corpora.
+Enable this by supplying a block size when invoking `main_ds.py`:
+
+```bash
+torchrun main_ds.py \
+  --model_name_or_path mistralai/Mistral-7B-v0.1 \
+  --data_path /data/documents.jsonl \
+  --ckpt_output_dir ./checkpoints \
+  --effective_batch_size 128 \
+  --max_batch_len 60000 \
+  --block-size 8192 \
+  --document-column-name text  # optional, defaults to "document"
+```
+
+- `--block-size` (required) enables continual pretraining mode and controls how many tokens are packed into each training block.
+- `--document-column-name` (optional) specifies which JSONL field contains the raw document text; a sample corpus line is shown below.
+
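+A minimal corpus line for the example above (the document text itself is illustrative):
+
+```json
+{"text": "Raw document text goes here; each JSONL line holds one complete document."}
+```
+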
+The same options are available programmatically via `TrainingArgs.pretraining_config`:
+
+```python
+from instructlab.training import TrainingArgs, PretrainingConfig
+
+train_args = TrainingArgs(
+    model_name_or_path="mistralai/Mistral-7B-v0.1",
+    data_path="documents.jsonl",
+    ckpt_output_dir="./checkpoints",
+    max_seq_len=4096,
+    max_batch_len=40000,
+    effective_batch_size=128,
+    pretraining_config=PretrainingConfig(
+        block_size=2048,
+        document_column_name="text",  # optional
+    ),
+)
+```
+
+When a pretraining config is provided, `process_documents_for_pretraining()` is invoked under the hood to tokenize raw documents before training.
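+
+Conceptually, the packing step resembles the sketch below. This is an illustrative simplification rather than the library's actual code; the Hugging Face tokenizer and the EOS token used as a document separator are assumptions.
+
+```python
+from transformers import AutoTokenizer
+
+def pack_into_blocks(documents, tokenizer, block_size):
+    """Concatenate tokenized documents and split the stream into fixed-size blocks."""
+    ids = []
+    for doc in documents:
+        ids.extend(tokenizer(doc)["input_ids"])
+        ids.append(tokenizer.eos_token_id)  # assumed document separator
+    # Chunk the token stream; a trailing remainder shorter than block_size is dropped.
+    return [ids[i:i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]
+
+tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
+# Tiny block size purely for demonstration; real runs would use e.g. 2048 or 8192.
+blocks = pack_into_blocks(["First raw document.", "Second raw document."], tokenizer, block_size=16)
+```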
+
 **Standard message structure:**
 
 ```json