Check out the scripts folder for bash run scripts. Standard fine-tuning can be run under three different settings. The first setting keeps the Mistral 7B weights frozen and fine-tunes only the MHA regression head.
python -u /path/CL_MISTRAL7B_REACT/CL_LLM_REACT/fine_tune_MHA_head.py -p /path/CL_MISTRAL7B_REACT/model_files \
-xl /path/CL_MISTRAL7B_REACT/data/Suzuki-Miyaura/aap9112_Data_File_S1.xlsx \
-N 2 \
-rs 1
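To make the frozen-backbone idea concrete, here is a minimal NumPy sketch (not the repository's code): fixed features stand in for Mistral 7B hidden states, and a toy linear read-out — standing in for the MHA regression head — is the only thing updated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "backbone" features: a stand-in for Mistral 7B hidden states.
feats = rng.standard_normal((32, 16))
y = rng.standard_normal((32, 1))             # regression targets (e.g. reaction yields)

W_head = rng.standard_normal((1, 16)) * 0.1  # the only trainable parameters

def mse(W):
    return float(np.mean((feats @ W.T - y) ** 2))

initial_mse = mse(W_head)
lr = 1e-2
for _ in range(200):                          # plain gradient descent on the head only
    err = feats @ W_head.T - y
    W_head -= lr * (2.0 / len(feats)) * err.T @ feats

assert mse(W_head) < initial_mse              # the head fits; the backbone never moves
```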
The second setting jointly fine-tunes Mistral 7B and the MHA regression head. This is the standard run without LoRA; submit.sh takes care of it. The batch size has been increased to 128 inside MISTRAL7B_MHA_LOADER.py.
python -u /path/CL_MISTRAL7B_REACT/CL_LLM_REACT/Jointly_fine_tune_Mistral7B_and_MHA_head.py -p /path/CL_MISTRAL7B_REACT/model_files \
-xl /path/CL_MISTRAL7B_REACT/data/Suzuki-Miyaura/aap9112_Data_File_S1.xlsx \
-N 2 \
-rs 1
The third setting jointly fine-tunes Mistral 7B and the MHA regression head with Low-Rank Adaptation (LoRA).
python -u /path/CL_MISTRAL7B_REACT/CL_LLM_REACT/Jointly_fine_tune_Mistral7B_and_MHA_head_with_LORA.py -p /path/CL_MISTRAL7B_REACT/model_files \
-xl /path/CL_MISTRAL7B_REACT/data/Suzuki-Miyaura/aap9112_Data_File_S1.xlsx \
-N 27 \
-rs 7193 \
-r 16 \
-s 4 \
-lr 2e-4 \
-nh 4
Flag meanings: -N number of epochs; -rs random seed; -r LoRA rank; -s LoRA scale (the optimal value for our dataset is 4 x rank and provides results comparable to full fine-tuning); -lr learning rate; -nh number of attention heads for the read-out regression MHA (optimized). Inline comments cannot follow the trailing backslashes, as they break bash line continuation.
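As a reminder of what -r and -s control, here is a minimal LoRA sketch in NumPy (dimensions are illustrative, not Mistral 7B's): the frozen weight W is augmented with a scaled low-rank product B @ A, and only A and B would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, scale = 64, 16, 4                 # hidden size (toy), LoRA rank (-r), LoRA scale (-s)

W = rng.standard_normal((d, d))         # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection; zero init => no-op at start

def lora_forward(x):
    # Effective weight is W + scale * (B @ A); only A and B receive gradient updates.
    return x @ (W + scale * (B @ A)).T

x = rng.standard_normal((2, d))
# With B = 0 the adapted model is exactly the base model.
assert np.allclose(lora_forward(x), x @ W.T)
```

The number of trainable parameters is 2 * d * r per adapted matrix instead of d * d, which is why LoRA fine-tuning is so much cheaper than the full joint fine-tuning above.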
The task-aware variant without continual learning uses the same optimized hyperparameters:
python -u /path/CL_MISTRAL7B_REACT/CL_LLM_REACT/task_aware_fine_tune_Mistral7B_and_MHA_head_with_LORA_no_CL.py -p /path/CL_MISTRAL7B_REACT/model_files \
-xl /path/CL_MISTRAL7B_REACT/data/Suzuki-Miyaura/aap9112_Data_File_S1.xlsx \
-N 27 \
-rs 7193 \
-r 16 \
-s 4 \
-lr 2e-4 \
-nh 4
Flag meanings: -N number of epochs; -rs random seed; -r LoRA rank (optimized); -s LoRA scale (optimized); -lr learning rate; -nh number of attention heads for the read-out regression MHA (optimized).
We implement different variations of task-aware experience replay in our workflow. The variations differ in how the gradients are computed; the two implementations give similar results.
python -u /path/CL_MISTRAL7B_REACT/CL_LLM_REACT/task_aware_fine_tune_Mistral7B_and_MHA_head_with_LORA_combined_loss_grad_Experience_Replay.py -p /path/CL_MISTRAL7B_REACT/model_files -xl /path/CL_MISTRAL7B_REACT/data/Suzuki-Miyaura/aap9112_Data_File_S1.xlsx -N 15 -rs 7193 -r 16 -s 4 -lr 2e-4 -nh 4
python -u /path/CL_MISTRAL7B_REACT/CL_LLM_REACT/task_aware_fine_tune_Mistral7B_and_MHA_head_with_LORA_separate_grad_tree_Experience_Replay.py -p /path/CL_MISTRAL7B_REACT/model_files -xl /path/CL_MISTRAL7B_REACT/data/Suzuki-Miyaura/aap9112_Data_File_S1.xlsx -N 15 -rs 7193 -r 16 -s 4 -lr 2e-4 -nh 4
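The two variants should agree because the gradient of a summed loss equals the sum of the per-task gradients. A small NumPy check with a toy linear model (not the repository's code) makes this concrete: the numerical gradient of the combined loss matches the accumulated per-task gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1, 4))                                          # toy linear model
cur_x, cur_y = rng.standard_normal((8, 4)), rng.standard_normal((8, 1))  # current task
mem_x, mem_y = rng.standard_normal((8, 4)), rng.standard_normal((8, 1))  # replayed batch

def mse_grad(x, y, Wf):
    # Analytic gradient of mean squared error for y_hat = x @ Wf.T
    err = x @ Wf.T - y
    return (2.0 / len(x)) * err.T @ x

def combined_loss(flat_w):
    Wf = flat_w.reshape(W.shape)
    return float(np.mean((cur_x @ Wf.T - cur_y) ** 2)
                 + np.mean((mem_x @ Wf.T - mem_y) ** 2))

# Variant 1: one gradient of the combined loss (here via central differences).
eps = 1e-6
g_combined = np.array([
    (combined_loss(W.ravel() + eps * e) - combined_loss(W.ravel() - eps * e)) / (2 * eps)
    for e in np.eye(W.size)
]).reshape(W.shape)

# Variant 2: separate per-task gradients, accumulated afterwards.
g_separate = mse_grad(cur_x, cur_y, W) + mse_grad(mem_x, mem_y, W)

assert np.allclose(g_combined, g_separate, atol=1e-5)
```

Any remaining differences between the two scripts come from floating-point accumulation order, which is consistent with the similar (not bit-identical) results noted above.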
Continual learning (CL) with a memory buffer:
python -u /path/CL_MISTRAL7B_REACT/CL_LLM_REACT/task_aware_fine_tune_Mistral7B_and_MHA_head_with_LORA_combined_loss_grad_Experience_Replay_With_Buffer.py -p $sourcepath/model_files -xl $sourcepath/data/Suzuki-Miyaura/aap9112_Data_File_S1.xlsx -N 17 -rs 7193 -r 16 -s 4 -lr 2e-4 -rf 0.25 -nh 4
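A memory buffer for replay can be as simple as reservoir sampling over the stream of training examples. The sketch below is illustrative only, not the repository's implementation — the actual buffer policy and the meaning of the -rf flag are defined in the script itself.

```python
import random

class ReplayBuffer:
    """Fixed-capacity memory buffer filled by reservoir sampling (illustrative sketch)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        # Reservoir sampling: every item seen so far stays with equal probability.
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        # Draw a replay batch to mix into the current task's training step.
        return self.rng.sample(self.items, min(k, len(self.items)))

buf = ReplayBuffer(capacity=16)
for i in range(1000):
    buf.add(i)
assert len(buf.items) == 16       # buffer never exceeds its capacity
assert len(buf.sample(4)) == 4
```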