This repository contains two Python scripts that fine-tune a `roberta-base` model for Named Entity Recognition (NER) on the PLOD-CW-25 dataset, augmented with an additional 25% or 50% of samples from the PLODv2-filtered dataset. The goal is to measure the token-level classification gains from partial data augmentation.
| Filename | Description |
|---|---|
| `RoBERTa+25%data.py` | Fine-tunes `roberta-base` on PLOD-CW-25 with 25% of PLODv2-filtered samples added to the training and validation sets. |
| `RoBERTa+50%data.py` | Fine-tunes `roberta-base` on PLOD-CW-25 with 50% of PLODv2-filtered samples added to the training and validation sets. |
- PLOD-CW-25: Legal NER dataset with annotated tokens from case law.
- PLODv2-filtered: A filtered version of PLODv2 for optional fine-tuning enhancement.
Each script:
- Loads both datasets.
- Randomly samples a fraction of PLODv2-filtered.
- Merges the sample with the original training and validation splits.
- Converts the data into Hugging Face `Datasets` format.
- Model: `roberta-base`
- Task: Token classification
- Tags: `O`, `B-AC`, `B-LF`, `I-LF`
- Tokenizer: `RobertaTokenizerFast` with `add_prefix_space=True`
- Optimizer: Adafactor
- Epochs: 3
- Batch size: 16
- Scheduler: Constant LR
- Evaluation metric: `seqeval`
- Reports:
  - Overall: precision, recall, F1, accuracy
  - Entity-wise: per-class F1
  - Visuals: confusion matrix & bar plot for metrics
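Assuming the Hugging Face `Trainer` API (the scripts may wire this together differently), the hyperparameters above map onto `TrainingArguments` roughly as follows; `output_dir` is an illustrative path, not one used by the scripts:

```python
from transformers import TrainingArguments

# Configuration sketch mirroring the hyperparameter list above.
args = TrainingArguments(
    output_dir="roberta-plod",     # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    optim="adafactor",             # Adafactor instead of the default AdamW
    lr_scheduler_type="constant",  # constant learning rate, no warmup/decay
)
```

The tokenizer would be created with `RobertaTokenizerFast.from_pretrained("roberta-base", add_prefix_space=True)`; RoBERTa's byte-level BPE requires `add_prefix_space=True` when the input is already pre-split into words, as it is for token classification.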
Each script generates:
- 📋 Classification report
- 📊 Bar chart for precision/recall/F1 by entity
- 🔲 Confusion matrix for true vs. predicted labels
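`seqeval` scores at the entity level rather than the token level: a prediction counts as correct only if the whole span and its type match the gold annotation. The scripts use the `seqeval` package itself; the following is a minimal, dependency-free illustration of that exact-span-matching idea:

```python
def extract_spans(tags):
    """Collect (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, ent = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes last span
        inside = tag.startswith("I-") and tag[2:] == ent and start is not None
        if not inside:
            if start is not None:
                spans.append((start, i, ent))
                start, ent = None, None
            if tag.startswith(("B-", "I-")):      # lenient: stray I- opens a span
                start, ent = i, tag[2:]
    return spans

def prf(true_seqs, pred_seqs):
    """Entity-level precision, recall, and F1 over exact span matches."""
    tp = fp = fn = 0
    for t_tags, p_tags in zip(true_seqs, pred_seqs):
        gold, pred = set(extract_spans(t_tags)), set(extract_spans(p_tags))
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true = [["O", "B-AC", "O", "B-LF", "I-LF"]]
pred = [["O", "B-AC", "O", "B-LF", "O"]]   # long form truncated by one token
print(prf(true, pred))  # (0.5, 0.5, 0.5): the truncated LF span scores zero
```

Note how the truncated `B-LF I-LF` span earns no partial credit, which is why entity-level F1 is usually lower than token accuracy.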
| Metric | 25% Additional Data | 50% Additional Data |
|---|---|---|
| Precision | ~0.88 | ~0.90 |
| Recall | ~0.89 | ~0.91 |
| F1 Score | ~0.88–0.89 | ~0.90+ |
| Accuracy | ~0.89 | ~0.90 |
```
pip install datasets transformers huggingface_hub evaluate seqeval nbconvert
```

- Scripts are optimized for execution in Google Colab.
- Designed for experimentation with data augmentation in NER tasks.
- Results can help in assessing trade-offs between more data vs. training time.
```json
{
  "precision": 0.89,
  "recall": 0.91,
  "f1": 0.90,
  "accuracy": 0.90
}
```

📁 Check the detailed output in the confusion matrix and classification report printed at the end of each script.
Aaditya Singh – MSc Data Science
For academic use and performance benchmarking of NLP models in the legal domain.