neuralmagic
diff --git a/integrations/transformers/README.md b/integrations/transformers/README.md
Lines changed: 94 additions & 4 deletions
diff --git a/integrations/transformers/distill_trainer_qa.py b/integrations/transformers/distill_trainer_qa.py
Lines changed: 85 additions & 0 deletions
@@ -33,7 +33,7 @@ python run_qa.py  \
  --model_name_or_path bert-base-uncased \
  --dataset_name squad \
  --do_train \
- --per_device_train_batch_size 12 \
+ --per_device_train_batch_size 16 \
  --learning_rate 3e-5 \
  --max_seq_length 384 \
  --doc_stride 128 \
@@ -44,6 +44,7 @@ python run_qa.py  \
  --seed 42 \
  --num_train_epochs 2 \
- --nm_prune_config recipes/90sparsity1shot.yaml
+ --nm_prune_config recipes/90sparsity1shot.yaml \
+ --fp16
 ```
 
 #### Evaluation
@@ -52,7 +53,7 @@ python run_qa.py  \
  --model_name_or_path bert-base-uncased-99sparsity-10total8gmp/ \
  --dataset_name squad \
  --do_eval \
- --per_device_eval_batch_size 12 \
+ --per_device_eval_batch_size 16 \
  --output_dir bert-base-uncased-99sparsity-10total8gmp/ \
  --overwrite_output_dir \
  --cache_dir cache \
@@ -75,7 +76,7 @@ To demostrate the effect that various pruning regimes and techniques can have we
 
 | base model name       | sparsity 	| total train epochs    | prunned | one shot |pruning epochs| F1 Score 	| EM Score  |
 |-----------------------|----------	|-----------------------|---------|----------|--------------|----------	|-----------|
-| bert-base-uncased 	|0        	|1                  	|no       |no        |0            	|09.685     |3.614      |
+| bert-base-uncased 	|0        	|1                  	|no       |no        |0            	|09.685     |03.614      |
 | bert-base-uncased 	|0        	|2                  	|no       |no        |0            	|88.002     |80.634     |
 | bert-base-uncased 	|0        	|10                 	|no       |no        |0            	|87.603     |79.130     |
 | bert-base-uncased 	|80       	|1                  	|yes      |yes       |0          	|25.141     |15.998     |
@@ -84,14 +85,103 @@ To demostrate the effect that various pruning regimes and techniques can have we
 | bert-base-uncased 	|90       	|1                  	|yes      |yes       |0           	|16.064     |07.786     |
 | bert-base-uncased 	|90       	|2                   	|yes      |no        |0            	|64.185     |50.946     |
 | bert-base-uncased 	|90       	|10                 	|yes      |no        |8            	|79.091     |68.184     |
-| bert-base-uncased 	|95       	|1                  	|yes      |yes       |0           	|10.501     |4.929      |
+| bert-base-uncased 	|95       	|1                  	|yes      |yes       |0           	|10.501     |04.929     |
 | bert-base-uncased 	|95       	|2                   	|yes      |no        |0            	|24.445     |14.437     |
 | bert-base-uncased 	|95       	|10                 	|yes      |no        |8            	|72.761  	|60.407     |
+| bert-base-uncased 	|97       	|10                 	|yes      |no        |6            	|70.260  	|57.021     |
 | bert-base-uncased 	|99         |1                   	|yes      |yes       |0             |09.685     |03.614     |
 | bert-base-uncased 	|99       	|2                   	|yes      |no        |0            	|17.433     |07.871     |
 | bert-base-uncased 	|99         |10                    	|yes      |no        |8             |47.306    	|32.564     |
 
+## Training With Distillation
+In addition to the plain QA model, we provide an implementation that uses teacher-student distillation. Usage of the distillation code is virtually identical to the non-distilled model; the commands are as follows.
 
+#### Training 
+```bash
+python run_distill_qa.py  \
+ --teacher_model_name_or_path spacemanidol/neuralmagic-bert-squad-12layer-0sparse \
+ --student_model_name_or_path bert-base-uncased \
+ --dataset_name squad \
+ --do_train \
+ --per_device_train_batch_size 16 \
+ --learning_rate 3e-5 \
+ --max_seq_length 384 \
+ --doc_stride 128 \
+ --output_dir distill_2epoch/ \
+ --overwrite_output_dir \
+ --cache_dir cache \
+ --preprocessing_num_workers 4 \
+ --seed 42 \
+ --num_train_epochs 2 \
+ --nm_prune_config recipes/noprune2epoch.yaml \
+ --fp16
+```
+
+#### Evaluation
+```bash
+python run_qa.py  \
+ --model_name_or_path bert-base-uncased-99sparsity-10total8gmp/ \
+ --dataset_name squad \
+ --do_eval \
+ --per_device_eval_batch_size 16 \
+ --output_dir bert-base-uncased-99sparsity-10total8gmp/ \
+ --overwrite_output_dir \
+ --cache_dir cache \
+ --preprocessing_num_workers 4
+```
+#### ONNX Export
+```bash
+python run_qa.py  \
+ --model_name_or_path bert-base-uncased-99sparsity-10total8gmp/ \
+ --do_eval \
+ --dataset_name squad \
+ --do_onnx_export \
+ --onnx_export_path bert-base-uncased-99sparsity-10total8gmp/ \
+ --cache_dir cache \
+ --preprocessing_num_workers 4
+```
+### Distillation Results
+Sparsity 80, 90, 97
+| base model name       | sparsity 	|Distilled| pruned  |train epochs|pruning epochs| F1 Score | EM Score |
+|-----------------------|----------	|---------|---------|------------|--------------|----------|----------|
+| bert-base-uncased 	|0        	|no       |no       |2           |0             |88.32442  |81.10690  |
+| bert-base-uncased 	|80        	|no       |no       |30          |18            |84.06276  |74.63576  |
+| bert-base-uncased 	|90        	|no       |no       |30          |18            |79.64549  |68.50520  |
+| bert-base-uncased 	|97       	|no       |no       |30          |18            |70.42570  |57.29423  |
+| bert-base-uncased 	|0        	|yes      |no       |2           |0             |89.02277  |82.03406  |
+| bert-base-uncased 	|80        	|yes      |yes      |30          |18            |88.03192  |80.81362  |
+| bert-base-uncased 	|90        	|yes      |yes      |30          |18            |85.63751  |77.41721  |
+| bert-base-uncased 	|97       	|yes      |yes      |30          |18            |  |  |
+
+### Distillation, Pruning, Layer Dropping
+To compare model pruning with layer dropping, we prune models to sparsity levels that match the number of parameters in models with layers dropped. Results are reported both with and without distillation. For distillation we use hard distillation and a teacher model trained on SQuAD for 2 epochs, which achieves 88.32442/81.10690 F1/EM. A 9-layer model is roughly equivalent to 20% sparsity, a 6-layer model to 40%, a 3-layer model to 60%, and a 1-layer model to 72%.
+
+| base model name       | sparsity 	| params                |Distilled| pruned  | layers   |pruning epochs| F1 Score | EM Score  |
+|-----------------------|----------	|-----------------------|---------|---------|----------|--------------|----------|-----------|
+| bert-base-uncased 	|0        	|108,893,186         	|no       |no       |12        |0             |88.32442  |81.10690   |
+| bert-base-uncased 	|0        	|87,629,570         	|no       |no       |9         |0             |86.70732  |78.81740   |
+| bert-base-uncased 	|0        	|66,365,954             |no       |no       |6         |0             |81.63629  |72.66793   |
+| bert-base-uncased 	|0        	|45,102,338            	|no       |no       |3         |0             |51.75267  |39.11069   |
+| bert-base-uncased 	|0        	|30,926,594            	|no       |no       |1         |0             |26.22600  |17.32261   |
+| bert-base-uncased 	|20        	|108,893,186         	|no       |yes      |12        |8             |87.19622  |79.16746   |
+| bert-base-uncased 	|40       	|108,893,186         	|no       |yes      |12        |8             |86.27294  |78.07947   |
+| bert-base-uncased 	|60        	|108,893,186         	|no       |yes      |12        |8             |86.4412   |77.94702   |
+| bert-base-uncased 	|72        	|108,893,186         	|no       |yes      |12        |8             |85.49873  |76.43330   |
+| bert-base-uncased 	|80        	|66,365,954         	|no       |yes      |6         |8             |77.86777  |67.07663   |
+| bert-base-uncased 	|90        	|66,365,954         	|no       |yes      |6         |8             |73.51963  |61.22044   |
+| bert-base-uncased 	|97        	|66,365,954         	|no       |yes      |6         |8             |67.27468  |53.85998   |
+| bert-base-uncased 	|0        	|108,893,186         	|yes      |no       |12        |0             |89.02277  |82.03406   |
+| bert-base-uncased 	|0        	|87,629,570         	|yes      |no       |9         |0             |87.94176  |80.46358   |
+| bert-base-uncased 	|0        	|66,365,954             |yes      |no       |6         |0             |83.4553   |75.03311   |
+| bert-base-uncased 	|0        	|45,102,338            	|yes      |no       |3         |0             |43.82823  |33.05581   |
+| bert-base-uncased 	|0        	|30,926,594           	|yes      |no       |1         |0             |28.10105  |18.5052    |
+| bert-base-uncased 	|20        	|108,893,186         	|yes      |yes      |12        |18            |  |     |
+| bert-base-uncased 	|40       	|108,893,186         	|yes      |yes      |12        |18            |  |     |
+| bert-base-uncased 	|60        	|108,893,186         	|yes      |yes      |12        |18            |  |     |
+| bert-base-uncased 	|72        	|108,893,186         	|yes      |yes      |12        |18            |  |     |
+| bert-base-uncased 	|80        	|66,365,954         	|yes      |yes      |6         |8             |  |     |
+| bert-base-uncased 	|90        	|66,365,954         	|yes      |yes      |6         |8             |  |     |
+| bert-base-uncased 	|97        	|66,365,954         	|yes      |yes      |6         |8             |  |     |
 
 ## Script origin and how to integrate sparseml with other Transformers projects
 This script is based on the example BERT-QA implementation in transformers found [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_qa.py). 
 
@@ -0,0 +1,85 @@
+# neuralmagic: no copyright
+# flake8: noqa
+# fmt: off
+# isort: skip_file
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Union
+
+import torch
+from torch import nn
+import torch.nn.functional as F
+from torch import Tensor
+
+from transformers import Trainer, is_datasets_available, is_torch_tpu_available
+from transformers.trainer_utils import PredictionOutput
+
+from trainer_qa import QuestionAnsweringTrainer
+
+class DistillQuestionAnsweringTrainer(QuestionAnsweringTrainer):
+    def __init__(self, *args, eval_examples=None, post_process_function=None, teacher=None, loss=None, batch_size=8, max_sequence_length=384, distill_hardness=0.5, temperature=2.0, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.eval_examples = eval_examples
+        self.post_process_function = post_process_function
+        self.loss = loss
+        self.teacher = teacher
+        self.batch_size = batch_size
+        self.temperature = temperature
+        self.distill_hardness = distill_hardness
+        self.criterion = nn.CrossEntropyLoss()
+        self.max_sequence_length = max_sequence_length
+
+    def compute_loss(self, model, inputs, return_outputs=False):
+        """
+        Compute the training loss. Modified for distillation: blends the label loss with a teacher-student KL-divergence loss.
+        """
+        input_device = inputs["input_ids"].device
+        outputs = model(**inputs)
+        start_logits_student = outputs["start_logits"]
+        end_logits_student = outputs["end_logits"]
+        start_logits_label = inputs["start_positions"]
+        end_logits_label = inputs["end_positions"]
+        self.teacher = self.teacher.to(input_device)
+        with torch.no_grad():
+            teacher_output = self.teacher(
+                            input_ids=inputs["input_ids"],
+                            token_type_ids=inputs["token_type_ids"],
+                            attention_mask=inputs["attention_mask"],
+                        )
+        start_logits_teacher = teacher_output["start_logits"]
+        end_logits_teacher = teacher_output["end_logits"]
+        loss_start = (
+            F.kl_div(
+                input=F.log_softmax(start_logits_student / self.temperature, dim=-1),
+                target=F.softmax(start_logits_teacher / self.temperature, dim=-1),
+                reduction="batchmean",
+            )
+            * (self.temperature ** 2)
+        )
+        loss_end = (
+            F.kl_div(
+                input=F.log_softmax(end_logits_student / self.temperature, dim=-1),
+                target=F.softmax(end_logits_teacher / self.temperature, dim=-1),
+                reduction="batchmean",
+            )
+            * (self.temperature ** 2)
+        )
+        teacher_loss = (loss_start + loss_end) / 2.0
+        loss_start = self.criterion(start_logits_student, start_logits_label)
+        loss_end = self.criterion(end_logits_student, end_logits_label)
+        label_loss = (loss_start + loss_end) / 2.0
+        loss = ((1-self.distill_hardness) * label_loss) + (self.distill_hardness * teacher_loss)
+        return (loss, outputs) if return_outputs else loss
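
Review note: the blended loss in `compute_loss` can be checked by hand on a single example. The sketch below is not part of this PR; it is a pure-Python illustration (no `torch`) that assumes one example per batch, so that `reduction="batchmean"` reduces to a plain sum over positions. The helper names (`log_softmax`, `kl_div`, `cross_entropy`, `distill_loss`) are hypothetical and only mirror the arithmetic in the trainer.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_z for x in scaled]


def kl_div(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_student = log_softmax(student_logits, temperature)
    log_teacher = log_softmax(teacher_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_teacher, log_student))
    return kl * temperature ** 2


def cross_entropy(logits, label):
    """Negative log-probability of the gold position (no temperature)."""
    return -log_softmax(logits)[label]


def distill_loss(student_start, student_end, teacher_start, teacher_end,
                 start_label, end_label, distill_hardness=0.5, temperature=2.0):
    """Blend of label loss and teacher loss, as in the trainer's compute_loss."""
    teacher_loss = (kl_div(student_start, teacher_start, temperature)
                    + kl_div(student_end, teacher_end, temperature)) / 2.0
    label_loss = (cross_entropy(student_start, start_label)
                  + cross_entropy(student_end, end_label)) / 2.0
    return (1 - distill_hardness) * label_loss + distill_hardness * teacher_loss


if __name__ == "__main__":
    # Toy 4-token context: gold span starts at position 1 and ends at position 2.
    student_start, student_end = [0.2, 1.0, -0.5, 0.1], [0.0, 0.3, 0.9, -0.2]
    teacher_start, teacher_end = [0.1, 2.5, -1.0, 0.0], [-0.3, 0.2, 2.2, -0.5]
    print(distill_loss(student_start, student_end, teacher_start, teacher_end, 1, 2))
```

With `distill_hardness=0.0` the result is the plain cross-entropy against the labels, and with `distill_hardness=1.0` it is the teacher KL term alone. Note that the start and end labels are passed separately here; feeding the start position as the end label silently trains the end head against the wrong target.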