This repository was archived by the owner on Jun 3, 2025. It is now read-only.

Commit 7bd1dbe: More recipes for BERT 3-, 6-layer models (#338)
1 parent 2a42c43

File tree: 5 files changed, +507 -0 lines

File: bert-base-3layers_prune70.md (101 additions, 0 deletions)

<!--
Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

---
# General Variables
num_epochs: &num_epochs 30

# Pruning Hyperparameters
init_sparsity: &init_sparsity 0.00
final_sparsity: &final_sparsity 0.70
pruning_start_epoch: &pruning_start_epoch 2
pruning_end_epoch: &pruning_end_epoch 20
update_frequency: &pruning_update_frequency 0.01

# Modifiers
training_modifiers:
  - !EpochRangeModifier
    end_epoch: 30
    start_epoch: 0.0

pruning_modifiers:
  - !LayerPruningModifier
    end_epoch: -1.0
    layers: ['bert.encoder.layer.1', 'bert.encoder.layer.2', 'bert.encoder.layer.3', 'bert.encoder.layer.4', 'bert.encoder.layer.5', 'bert.encoder.layer.7', 'bert.encoder.layer.8', 'bert.encoder.layer.9', 'bert.encoder.layer.10']
    start_epoch: -1.0
    update_frequency: -1.0

  - !GMPruningModifier
    params:
      - re:bert.encoder.layer.*.attention.self.query.weight
      - re:bert.encoder.layer.*.attention.self.key.weight
      - re:bert.encoder.layer.*.attention.self.value.weight
      - re:bert.encoder.layer.*.attention.output.dense.weight
      - re:bert.encoder.layer.*.intermediate.dense.weight
      - re:bert.encoder.layer.*.output.dense.weight
    start_epoch: *pruning_start_epoch
    end_epoch: *pruning_end_epoch
    init_sparsity: *init_sparsity
    final_sparsity: *final_sparsity
    inter_func: cubic
    update_frequency: *pruning_update_frequency
    leave_enabled: True
    mask_type: unstructured
    log_types: __ALL__
---
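
The `GMPruningModifier` above ramps sparsity from 0% to 70% between epochs 2 and 20 with `inter_func: cubic`, updating masks every 0.01 epochs, and `leave_enabled: True` keeps the masks applied after epoch 20. As a rough illustration of that schedule, here is a minimal sketch assuming the Zhu & Gupta-style cubic ramp; the exact interpolation SparseML uses may differ:

```python
# Illustrative only: a cubic sparsity ramp with this recipe's hyperparameters
# (init 0.00, final 0.70, epochs 2 through 20). Not SparseML internals.
def cubic_sparsity(epoch, init=0.00, final=0.70, start=2.0, end=20.0):
    """Zhu & Gupta-style cubic interpolation of the target sparsity."""
    if epoch <= start:
        return init
    if epoch >= end:
        return final
    frac = (epoch - start) / (end - start)
    return final + (init - final) * (1.0 - frac) ** 3

for epoch in range(0, 24, 4):
    print(f"epoch {epoch:2d} -> target sparsity {cubic_sparsity(epoch):.3f}")
```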

# BERT Model with Dropped and Pruned Encoder Layers

This recipe defines a layer-dropping and pruning strategy that sparsifies the three remaining encoder layers of a BERT model to 70% sparsity. It was used together with knowledge distillation to create a sparse model that recovers 90.5% of the baseline model's F1 on the SQuAD dataset. (We use the teacher model fine-tuned for 2 epochs as the baseline for comparison.)
Training was done on a single V100 GPU at half precision with a training batch size of 16, using the
[SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).

## Weights and Biases

- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/osq61nzi?workspace=user-neuralmagic)

## Training

To set up the training environment, follow the instructions in the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
The following command launches this recipe with distillation through the `run_qa.py` script from the question-answering examples.
Adjust it to your setup for GPU device, checkpoint saving frequency, and logging options.

*training command*

    python transformers/examples/pytorch/question-answering/run_qa.py \
      --model_name_or_path bert-base-uncased \
      --distill_teacher $MODEL_DIR/bert-base-12layers \
      --distill_hardness 1.0 \
      --distill_temperature 2.0 \
      --dataset_name squad \
      --do_train \
      --do_eval \
      --fp16 \
      --evaluation_strategy epoch \
      --per_device_train_batch_size 16 \
      --learning_rate 5e-5 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir $MODEL_DIR/sparse70_3layers \
      --cache_dir cache \
      --preprocessing_num_workers 6 \
      --seed 42 \
      --num_train_epochs 30 \
      --recipe ../recipes/bert-base-3layers_prune70.md \
      --onnx_export_path $MODEL_DIR/sparse70_3layers/onnx \
      --save_strategy epoch \
      --save_total_limit 2
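
The `--recipe` flag hands this markdown file to the SparseML integration, which applies the modifiers during training. Outside the integration, a recipe like this is typically wired into a plain PyTorch loop roughly as in the sketch below; this is a minimal sketch assuming SparseML's `ScheduledModifierManager` API, and the model, optimizer, step count, and paths are placeholders rather than part of this recipe:

```python
# Minimal sketch: applying a SparseML recipe to a PyTorch training setup.
import torch
from sparseml.pytorch.optim import ScheduledModifierManager
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Load the recipe and wrap the optimizer so modifiers run on schedule.
manager = ScheduledModifierManager.from_yaml("recipes/bert-base-3layers_prune70.md")
optimizer = manager.modify(model, optimizer, steps_per_epoch=1000)  # placeholder step count

# ... standard training loop: forward, loss, backward, optimizer.step() ...

manager.finalize(model)  # remove pruning hooks once training completes
```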

File: bert-base-3layers_prune80.md (101 additions, 0 deletions)

<!--
Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

---
# General Variables
num_epochs: &num_epochs 30

# Pruning Hyperparameters
init_sparsity: &init_sparsity 0.00
final_sparsity: &final_sparsity 0.80
pruning_start_epoch: &pruning_start_epoch 2
pruning_end_epoch: &pruning_end_epoch 20
update_frequency: &pruning_update_frequency 0.01

# Modifiers
training_modifiers:
  - !EpochRangeModifier
    end_epoch: 30
    start_epoch: 0.0

pruning_modifiers:
  - !LayerPruningModifier
    end_epoch: -1.0
    layers: ['bert.encoder.layer.1', 'bert.encoder.layer.2', 'bert.encoder.layer.3', 'bert.encoder.layer.4', 'bert.encoder.layer.5', 'bert.encoder.layer.7', 'bert.encoder.layer.8', 'bert.encoder.layer.9', 'bert.encoder.layer.10']
    start_epoch: -1.0
    update_frequency: -1.0

  - !GMPruningModifier
    params:
      - re:bert.encoder.layer.*.attention.self.query.weight
      - re:bert.encoder.layer.*.attention.self.key.weight
      - re:bert.encoder.layer.*.attention.self.value.weight
      - re:bert.encoder.layer.*.attention.output.dense.weight
      - re:bert.encoder.layer.*.intermediate.dense.weight
      - re:bert.encoder.layer.*.output.dense.weight
    start_epoch: *pruning_start_epoch
    end_epoch: *pruning_end_epoch
    init_sparsity: *init_sparsity
    final_sparsity: *final_sparsity
    inter_func: cubic
    update_frequency: *pruning_update_frequency
    leave_enabled: True
    mask_type: unstructured
    log_types: __ALL__
---
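
The `LayerPruningModifier` above removes encoder layers 1-5 and 7-10, leaving layers 0, 6, and 11. Purely as an illustration of the resulting architecture (not how SparseML performs the drop), a Hugging Face BERT encoder can be truncated to those three layers like this:

```python
# Illustration only: manually keep the three encoder layers this recipe
# retains (0, 6, 11). SparseML's LayerPruningModifier handles this itself.
import torch.nn as nn
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
kept = [0, 6, 11]
model.encoder.layer = nn.ModuleList(model.encoder.layer[i] for i in kept)
model.config.num_hidden_layers = len(kept)
print(sum(p.numel() for p in model.parameters()))  # far fewer parameters than the 12-layer model
```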

# BERT Model with Dropped and Pruned Encoder Layers

This recipe defines a layer-dropping and pruning strategy that sparsifies the three remaining encoder layers of a BERT model to 80% sparsity. It was used together with knowledge distillation to create a sparse model that recovers 89% of the baseline model's F1 on the SQuAD dataset. (We use the teacher model fine-tuned for 2 epochs as the baseline for comparison.)
Training was done on a single V100 GPU at half precision with a training batch size of 16, using the
[SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).

## Weights and Biases

- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/izq3uyq9?workspace=user-neuralmagic)

## Training

To set up the training environment, follow the instructions in the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
The following command launches this recipe with distillation through the `run_qa.py` script from the question-answering examples.
Adjust it to your setup for GPU device, checkpoint saving frequency, and logging options.

*training command*

    python transformers/examples/pytorch/question-answering/run_qa.py \
      --model_name_or_path bert-base-uncased \
      --distill_teacher $MODEL_DIR/bert-base-12layers \
      --distill_hardness 1.0 \
      --distill_temperature 2.0 \
      --dataset_name squad \
      --do_train \
      --do_eval \
      --fp16 \
      --evaluation_strategy epoch \
      --per_device_train_batch_size 16 \
      --learning_rate 5e-5 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir $MODEL_DIR/sparse80_3layers \
      --cache_dir cache \
      --preprocessing_num_workers 6 \
      --seed 42 \
      --num_train_epochs 30 \
      --recipe ../recipes/bert-base-3layers_prune80.md \
      --onnx_export_path $MODEL_DIR/sparse80_3layers/onnx \
      --save_strategy epoch \
      --save_total_limit 2
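
For a sense of scale, the `GMPruningModifier` targets the six dense weight matrices in each of the three remaining encoder layers. A back-of-envelope count assuming standard BERT-base dimensions (hidden size 768, intermediate size 3072) shows roughly how many weights 80% sparsity zeroes out; the numbers below are an illustration, not measured from a checkpoint:

```python
# Back-of-envelope estimate with BERT-base dimensions; illustrative only.
hidden, intermediate = 768, 3072
per_layer = (
    4 * hidden * hidden              # attention query/key/value/output dense
    + hidden * intermediate          # intermediate dense
    + intermediate * hidden          # output dense
)
layers_kept = 3                      # encoder layers 0, 6, 11
prunable = layers_kept * per_layer
print(f"prunable weights: {prunable:,}")                   # 21,233,664
print(f"zeroed at 80% sparsity: {int(prunable * 0.8):,}")  # 16,986,931
```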

File: bert-base-3layers_prune90.md (101 additions, 0 deletions)

<!--
Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

---
# General Variables
num_epochs: &num_epochs 30

# Pruning Hyperparameters
init_sparsity: &init_sparsity 0.00
final_sparsity: &final_sparsity 0.90
pruning_start_epoch: &pruning_start_epoch 2
pruning_end_epoch: &pruning_end_epoch 20
update_frequency: &pruning_update_frequency 0.01

# Modifiers
training_modifiers:
  - !EpochRangeModifier
    end_epoch: 30
    start_epoch: 0.0

pruning_modifiers:
  - !LayerPruningModifier
    end_epoch: -1.0
    layers: ['bert.encoder.layer.1', 'bert.encoder.layer.2', 'bert.encoder.layer.3', 'bert.encoder.layer.4', 'bert.encoder.layer.5', 'bert.encoder.layer.7', 'bert.encoder.layer.8', 'bert.encoder.layer.9', 'bert.encoder.layer.10']
    start_epoch: -1.0
    update_frequency: -1.0

  - !GMPruningModifier
    params:
      - re:bert.encoder.layer.*.attention.self.query.weight
      - re:bert.encoder.layer.*.attention.self.key.weight
      - re:bert.encoder.layer.*.attention.self.value.weight
      - re:bert.encoder.layer.*.attention.output.dense.weight
      - re:bert.encoder.layer.*.intermediate.dense.weight
      - re:bert.encoder.layer.*.output.dense.weight
    start_epoch: *pruning_start_epoch
    end_epoch: *pruning_end_epoch
    init_sparsity: *init_sparsity
    final_sparsity: *final_sparsity
    inter_func: cubic
    update_frequency: *pruning_update_frequency
    leave_enabled: True
    mask_type: unstructured
    log_types: __ALL__
---
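
At 90% `final_sparsity` with `mask_type: unstructured`, nine out of ten individual weights in each targeted matrix end up zeroed by magnitude. Below is a minimal sketch of what such a mask looks like on a single weight tensor; it is illustrative only and not SparseML's implementation:

```python
# Illustrative unstructured magnitude mask at 90% sparsity on one tensor.
import torch

torch.manual_seed(0)
w = torch.randn(768, 768)                        # stand-in for one dense weight
k = int(w.numel() * 0.90)                        # number of weights to zero
threshold = w.abs().flatten().kthvalue(k).values
mask = (w.abs() > threshold).float()
w_pruned = w * mask
print(f"sparsity: {1 - mask.mean().item():.3f}")  # ~0.900
```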

# BERT Model with Dropped and Pruned Encoder Layers

This recipe defines a layer-dropping and pruning strategy that sparsifies the three remaining encoder layers of a BERT model to 90% sparsity. It was used together with knowledge distillation to create a sparse model that recovers 86% of the baseline model's F1 on the SQuAD dataset. (We use the teacher model fine-tuned for 2 epochs as the baseline for comparison.)
Training was done on a single V100 GPU at half precision with a training batch size of 16, using the
[SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).

## Weights and Biases

- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/2xb5dree?workspace=user-neuralmagic)

## Training

To set up the training environment, follow the instructions in the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md).
The following command launches this recipe with distillation through the `run_qa.py` script from the question-answering examples.
Adjust it to your setup for GPU device, checkpoint saving frequency, and logging options.

*training command*

    python transformers/examples/pytorch/question-answering/run_qa.py \
      --model_name_or_path bert-base-uncased \
      --distill_teacher $MODEL_DIR/bert-base-12layers \
      --distill_hardness 1.0 \
      --distill_temperature 2.0 \
      --dataset_name squad \
      --do_train \
      --do_eval \
      --fp16 \
      --evaluation_strategy epoch \
      --per_device_train_batch_size 16 \
      --learning_rate 5e-5 \
      --max_seq_length 384 \
      --doc_stride 128 \
      --output_dir $MODEL_DIR/sparse90_3layers \
      --cache_dir cache \
      --preprocessing_num_workers 6 \
      --seed 42 \
      --num_train_epochs 30 \
      --recipe ../recipes/bert-base-3layers_prune90.md \
      --onnx_export_path $MODEL_DIR/sparse90_3layers/onnx \
      --save_strategy epoch \
      --save_total_limit 2
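
Across the three recipes, "recovery" is reported relative to the F1 of the 2-epoch teacher baseline. As a quick worked example of what the ratios mean in absolute terms, using a hypothetical baseline score (the value below is a placeholder, not a reported number):

```python
# Hypothetical baseline F1; only the recovery ratios come from the recipes above.
baseline_f1 = 88.0
for recipe, recovery in [("prune70", 0.905), ("prune80", 0.89), ("prune90", 0.86)]:
    print(f"{recipe}: ~{recovery * baseline_f1:.1f} F1")
```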
