SakerElias/LLM_Classification_Finetuning

🧠 LLM_Classification_Finetuning

This repository contains our work for the Machine Learning Project, in which we tackle a text classification challenge from a Kaggle competition.

The goal is to explore how pretrained transformer-based language models can be efficiently adapted for human preference prediction in LLM-generated responses, using modern fine-tuning and lightweight modeling strategies.


⚙️ Installation

Clone the repository:

git clone https://github.com/SakerElias/LLM_Classification_Finetuning.git
cd LLM_Classification_Finetuning

Create a virtual environment (recommended: Python 3.10):

python -m venv .venv

Activate the environment:

Windows (PowerShell):

.venv\Scripts\Activate.ps1

macOS / Linux:

source .venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Download the training set (train.csv) from the Kaggle competition and place it in the data/ folder as train.csv.
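Before opening the notebooks, it can help to confirm that the data folder is populated. The snippet below is a minimal sketch, not part of the repository: the helper name `check_data_dir` is hypothetical, and the file names are taken from the project structure listed in this README, with only train.csv treated as required.

```python
from pathlib import Path

# File names come from the project structure in this README.
# Only train.csv is named in the setup steps, so the others are treated as optional.
REQUIRED_FILES = ("train.csv",)
OPTIONAL_FILES = ("test.csv", "sample_submission.csv")


def check_data_dir(data_dir="data"):
    """Return (missing_required, missing_optional) file names for data_dir."""
    root = Path(data_dir)
    missing_required = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing_optional = [name for name in OPTIONAL_FILES if not (root / name).is_file()]
    return missing_required, missing_optional
```

Calling `check_data_dir()` from the repository root returns two lists; an empty first list means train.csv is in place.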

🧩 Project Structure

LLM_Classification_Finetuning/
│
├── artifacts/                # contains saved embeddings for models
├── data/                     # train.csv, test.csv, sample_submission.csv
├── notebooks/                # EDA and modeling notebooks
├── outputs/                  # contains submissions for each step (csv files)
├── requirements.txt
├── README.md
└── report.pdf

📊 Notebooks & Modeling Summary

EDA: In this notebook we perform a quick exploratory data analysis to understand the dataset and identify which features could be useful for modeling, especially in step 1.

step1.ipynb: The baseline model uses simple lexical and structural features (e.g., length, paragraph count, list usage, quotes) identified during the exploratory data analysis (EDA).
We train a multinomial Logistic Regression model and evaluate it with Stratified K-Fold Cross-Validation, so that every fold preserves the class balance.
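As an illustration of the kind of features described for step 1, here is a minimal sketch using only the standard library. The exact feature definitions in the notebook may differ: the function names, the regular expressions, and the choice to add A-minus-B differences are all assumptions made for this example.

```python
import re


def lexical_features(text):
    """Compute a few simple structural features for one response string."""
    # Paragraphs are blocks separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Lines starting with "-", "*", "•", "1." or "1)" are counted as list items.
    list_items = re.findall(r"^[ \t]*(?:[-*•]|\d+[.)])[ \t]+", text, flags=re.MULTILINE)
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "n_paragraphs": len(paragraphs),
        "n_list_items": len(list_items),
        "n_quotes": text.count('"'),
    }


def pair_features(response_a, response_b):
    """Build one feature row for an (A, B) pair, including A-minus-B differences."""
    feats_a = lexical_features(response_a)
    feats_b = lexical_features(response_b)
    row = {f"a_{name}": value for name, value in feats_a.items()}
    row.update({f"b_{name}": value for name, value in feats_b.items()})
    row.update({f"diff_{name}": feats_a[name] - feats_b[name] for name in feats_a})
    return row
```

Rows built this way could then be passed to scikit-learn's `LogisticRegression`, with `StratifiedKFold` providing the cross-validation splits.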

step2.ipynb: all-MiniLM-L6-v2 is used to generate embeddings. Prompt-response pairs are created and embedded, and the embeddings for the responses are concatenated into a single feature vector. A Logistic Regression classifier is trained on these embeddings. The embedding model performs similarly to the lexical baseline.
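The pairing and concatenation described for step 2 can be sketched as follows. This is an assumption-laden illustration, not the notebook's code: the helper name `build_pair_vector`, the `" [SEP] "` separator, and the way prompt and response are joined are all hypothetical. The embedding function is passed in as a parameter so the sketch runs without downloading a model.

```python
def build_pair_vector(embed, prompt, response_a, response_b, sep=" [SEP] "):
    """Embed (prompt, response_a) and (prompt, response_b), then concatenate the vectors.

    `embed` is any callable that maps a list of strings to a sequence of
    equal-length vectors, one per input string.
    """
    texts = [prompt + sep + response_a, prompt + sep + response_b]
    vec_a, vec_b = embed(texts)
    return list(vec_a) + list(vec_b)


# With sentence-transformers installed, `embed` could be the model's encode method:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   features = build_pair_vector(model.encode, prompt, response_a, response_b)
# all-MiniLM-L6-v2 produces 384-dimensional vectors, so `features` would have 768 entries.
```

Keeping the A and B embeddings in fixed positions lets a linear classifier learn separate weights for each side of the comparison.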

step3.ipynb: This notebook lets you choose between four modeling options:
- A calibrated Logistic Regression model with more lexical features than in step 1.
- An upgraded version of all-MiniLM-L6-v2 with a calibrated classifier on top.
- A larger embedding model, DeBERTa, fine-tuned with LoRA and used with a classifier.
- An ensemble that combines the first and second models.
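One simple way to combine two models is a weighted average of their predicted class probabilities. The sketch below shows that idea only; whether the notebook blends probabilities in exactly this way is an assumption, and `weighted_ensemble` and `weight_a` are hypothetical names.

```python
def weighted_ensemble(probs_a, probs_b, weight_a=0.5):
    """Blend two class-probability vectors: weight_a * probs_a + (1 - weight_a) * probs_b."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be in [0, 1]")
    if len(probs_a) != len(probs_b):
        raise ValueError("probability vectors must have the same length")
    blended = [weight_a * pa + (1.0 - weight_a) * pb for pa, pb in zip(probs_a, probs_b)]
    # Renormalize so the result sums to 1 even if the inputs were slightly off.
    total = sum(blended)
    return [p / total for p in blended]
```

In practice the weight would be chosen on held-out data, for example by trying a small grid of values and keeping the one with the lowest validation loss.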

step3 (kaggle version): A version of step 3 adapted to Kaggle's submission environment and its restrictions (typically, no internet access).

step4.ipynb: In this notebook we perform a deeper error and bias analysis of three of our step 3 models: Lexical (Isotonic), Embeddings (Isotonic), and Ensemble (Weighted + Temperature Scaling).
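For readers unfamiliar with the temperature scaling mentioned for the ensemble, here is a minimal standard-library sketch of the general technique. It is applied to log-probabilities here, which is an assumption: the notebook may scale logits directly or fit the temperature differently. The function name `temperature_scale` is hypothetical.

```python
import math


def temperature_scale(probs, temperature):
    """Rescale a probability vector as softmax(log(p) / T).

    T > 1 softens the distribution, T < 1 sharpens it, and T == 1 leaves it unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    eps = 1e-12  # guards against log(0)
    scaled = [math.log(max(p, eps)) / temperature for p in probs]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

The temperature is usually fitted on a validation set by minimizing the log loss, which adjusts confidence without changing which class is ranked first.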
