Project Highlights:
- Specialized AI model for GDPR compliance guidance
- Fine-tuned Google's Gemma 2B using a 3-Stage Training Pipeline (SFT -> Dynamic Rejection -> DPO)
- Implemented QLoRA for efficient, resource-friendly training
- Designed to provide accurate, relevant responses to GDPR-related inquiries
- Qualitative evaluation using LLM-as-a-judge (GPT-4o)
This project develops an advanced AI model specialized in providing guidance on GDPR (General Data Protection Regulation) compliance.
By fine-tuning Google's Gemma 2B model on a GDPR-specific dataset through a three-stage pipeline (SFT, Dynamic Rejection, DPO),
we've created a powerful tool to assist organizations with data protection queries and regulatory compliance.
Hugging Face Model: cycloevan/gdpr_gemma-2-2b
GitHub Repository: seok-hee97/gdpr-gemma2
- GDPR Expertise: Specialized in GDPR compliance and data protection regulations.
- DPO Alignment: Utilizes Direct Preference Optimization (DPO) with Dynamic Rejection for precise alignment with GDPR principles.
- Resource Efficient: Implements 4-bit quantization using QLoRA for efficient training on standard hardware.
- Comprehensive Evaluation: Combines ROUGE/BLEU scores with qualitative assessment via GPT-4o.
- Base Model: google/gemma-2-2b-it
- Fine-tuning Method: 3-Stage Pipeline (SFT -> Dynamic Rejection -> DPO)
- Training Dataset: sims2k/GDPR_QA_instruct_dataset
- Quantization: 4-bit (QLoRA); a configuration sketch follows this list
- Judge Model: gpt-4o (OpenAI API) for qualitative evaluation
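For orientation, here is a minimal sketch of the kind of 4-bit QLoRA setup described above, using transformers and peft. The LoRA rank, alpha, dropout, and target modules shown are illustrative assumptions, not the exact values in src/config.py.

```python
# Minimal QLoRA sketch: 4-bit NF4 quantization + LoRA adapters on gemma-2-2b-it.
# Rank/alpha/dropout/target_modules are assumptions, not this repo's exact config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # gradient-checkpoint-friendly setup

lora_config = LoraConfig(
    r=16,                                   # assumed rank
    lora_alpha=32,                          # assumed scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    lora_dropout=0.05,                      # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters train
```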
```
gdpr-gemma2/
├── src/                       # Source Modules
│   ├── config.py              # Hyperparameters & Local Paths
│   ├── data_loader.py         # Multi-stage data processing
│   ├── sft_train.py           # [Stage 1] Knowledge injection
│   ├── generate_rejections.py # [Stage 2] Dynamic data prep (Dynamic Rejection)
│   ├── dpo_train.py           # [Stage 3] Preference alignment
│   ├── inference.py           # Hybrid inference engine
│   ├── eval.py                # ROUGE/BLEU/BERTScore evaluation
│   ├── judge.py               # LLM-as-a-judge (GPT-4o) assessment
│   ├── filter_rejections.py   # DPO pair quality filter (length/citation bias)
│   ├── diagnose_prompt.py     # Prompt-format A/B diagnostic
│   └── push_to_hub.py         # Merge LoRA adapter & push to HF Hub
├── data/                      # Dataset storage (.cache included)
├── models/                    # Model artifacts (Base/SFT/DPO)
├── eval/                      # Evaluation results and LLM-judge reports
├── app.py                     # Streamlit Web Interface
├── Dockerfile                 # Containerized Deployment
└── requirements.txt           # Python dependencies
```
```
conda create -n gdpr-env python=3.11 -y
conda activate gdpr-env
pip install -r requirements.txt
```

To train and evaluate the model end to end, follow these steps:
- Stage 1 (SFT): teach the model GDPR facts with `python -m src.sft_train`.
- Stage 2 (Data Prep): generate real-world rejections from the SFT model (Dynamic Rejection) with `python -m src.generate_rejections`; a pair-construction sketch follows this list.
- Stage 3 (DPO): align the model to prefer expert answers over SFT errors with `python -m src.dpo_train`.
- Benchmark: `python -m src.eval`
- Qualitative Judge: `python -m src.judge`
- Web Assistant: `streamlit run app.py`
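To make Stage 2 concrete, the sketch below shows how Dynamic Rejection preference pairs can be assembled: the SFT checkpoint's own generation becomes the "rejected" answer and the dataset's expert answer the "chosen" one. The helper `generate_answer` and the question/answer column names are assumptions about this repo's internals, not its actual API.

```python
# Sketch of Dynamic Rejection (Stage 2): build DPO preference pairs where
# "chosen" = expert reference answer, "rejected" = the SFT model's own output.
# `generate_answer` and the question/answer column names are assumptions.
from datasets import Dataset, load_dataset

def generate_answer(model, tokenizer, question: str) -> str:
    """Greedy generation from the SFT checkpoint (placeholder implementation)."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def build_dpo_pairs(model, tokenizer) -> Dataset:
    qa = load_dataset("sims2k/GDPR_QA_instruct_dataset", split="train")
    pairs = []
    for row in qa:
        rejected = generate_answer(model, tokenizer, row["question"])
        pairs.append({
            "prompt": row["question"],
            "chosen": row["answer"],   # expert reference answer
            "rejected": rejected,      # the SFT model's real mistakes
        })
    return Dataset.from_list(pairs)
```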
An honest assessment of what this model does not do well, and the concrete next steps that would address each gap. All limitations are grounded in the evaluation results above.
- Base model at ceiling: gemma-2-2b-it already handles GDPR questions at a level that 316 Q&A samples cannot meaningfully surpass. All fine-tuned variants (SFT, DPO v1, DPO v2) match or underperform the base model on qualitative metrics at n=50.
- Dynamic Rejection signal quality: data inspection revealed that auto-generated rejections have systematic length asymmetry (rejected ~56% of chosen length) and a citation density gap (1.41 vs. 3.75 articles/sample), with only 9% Jaccard overlap on cited articles. In ~2/3 of inspected pairs, the "rejected" answer was not clearly worse than the "chosen" one, so DPO learns spurious "longer + more citations" signals rather than genuine accuracy (see the diagnostics sketch after this list).
- Article citation accuracy (~2.5-2.6/5): the model occasionally hallucinates GDPR article numbers or misapplies references. This is a retrieval problem; fine-tuning cannot reliably encode 100+ GDPR articles into a 2B model from 316 samples.
- English only: trained on sims2k/GDPR_QA_instruct_dataset (English only). Although gemma-2-2b-it is multilingual, this fine-tune is not aligned for non-English GDPR queries.
- Static knowledge snapshot: reflects the regulation text only; does not incorporate post-training EDPB guidelines, CJEU rulings, or national supervisory authority decisions.
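The bias statistics cited in the Dynamic Rejection bullet above can be measured with simple heuristics. Below is a hedged sketch of the kind of per-pair diagnostics src/filter_rejections.py could compute; the citation regex is an assumption.

```python
# Sketch of the DPO-pair diagnostics described above: length asymmetry,
# article-citation density, and Jaccard overlap of cited articles.
# The citation regex is an illustrative assumption.
import re

ARTICLE_RE = re.compile(r"[Aa]rticle\s+(\d+)")

def cited_articles(text: str) -> set[str]:
    """Extract the set of GDPR article numbers mentioned in an answer."""
    return set(ARTICLE_RE.findall(text))

def pair_stats(chosen: str, rejected: str) -> dict:
    a, b = cited_articles(chosen), cited_articles(rejected)
    union = a | b
    return {
        "length_ratio": len(rejected) / max(len(chosen), 1),  # ~0.56 reported above
        "chosen_citations": len(a),                           # ~3.75/sample reported
        "rejected_citations": len(b),                         # ~1.41/sample reported
        "citation_jaccard": len(a & b) / len(union) if union else 1.0,  # ~0.09 reported
    }
```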
| Experiment | Hypothesis | Result |
|---|---|---|
| n=10 → n=50 re-evaluation | Initial n=10 results showed DPO outperforming Base; suspected noise | Confirmed: n=10 was a false positive. Base ≥ all variants at n=50 |
| Prompt mismatch diagnosis | inference.py adds a system prompt absent from training → harms fine-tuned models | Rejected: A/B test showed no meaningful difference between formats |
| Tier 1 fix (filter + IPO + β=0.3) | Remove length/citation bias from DPO pairs; use a robust loss | DPO v2 regressed further: the data reduction (316 → 103) outweighed the noise removal |
- RAG (Retrieval-Augmented Generation): index GDPR article text in a vector DB and retrieve it at inference time. This directly addresses article hallucination, which fine-tuning alone cannot solve. Most promising next step (see the sketch after this list).
- Targeted negative generation: use GPT-4o to create controlled rejections with intentionally wrong article citations, producing a cleaner preference signal than self-generated negatives.
- Dataset expansion: augment beyond 316 samples using GPT-4o synthesis or additional legal Q&A datasets to provide more headroom for fine-tuning.
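As a sketch of the RAG direction flagged in the first bullet: embed the GDPR article texts and retrieve the closest ones at query time, then prepend them to the prompt. The embedding model and the in-memory numpy index are stand-in assumptions for a real vector DB.

```python
# Minimal RAG retrieval sketch: embed GDPR article texts and fetch the
# top-k most similar ones at inference time. The embedding model and the
# in-memory numpy "index" are assumptions; a production setup would use a
# proper vector DB (e.g. FAISS or Chroma).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

articles = {
    "Article 6": "Processing shall be lawful only if ...",    # placeholder text
    "Article 17": "The data subject shall have the right ...",
}
names = list(articles.keys())
index = embedder.encode(list(articles.values()), normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                       # cosine similarity (normalized vectors)
    top = np.argsort(scores)[::-1][:k]
    return [names[i] for i in top]

# Retrieved article texts would then be prepended to the model prompt,
# grounding citations instead of relying on memorized article numbers.
print(retrieve("When must a controller erase personal data?"))
```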
Benchmarked on DGX Spark: quantitative metrics on 100 samples, qualitative LLM-as-a-Judge (GPT-4o) on 50 samples (re-evaluated from the initial n=10 to ensure statistical reliability).
| Metric | Base | SFT | DPO v1 | DPO v2 (Tier 1) |
|---|---|---|---|---|
| ROUGE-L | 0.2072 | 0.2331 | 0.2252 | 0.2165 |
| BLEU | 0.0838 | 0.1146 | 0.1034 | 0.1045 |
| BERTScore F1 | 0.8432 | 0.8541 | 0.8527 | 0.8486 |
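For reproducibility, here is a minimal sketch of how these three metrics can be computed with the Hugging Face evaluate library; whether src/eval.py uses exactly these calls is an assumption.

```python
# Sketch of the quantitative evaluation: ROUGE-L, BLEU, and BERTScore F1
# over model predictions vs. reference answers (toy inputs shown).
import evaluate

predictions = ["The controller must report a breach within 72 hours."]
references = ["A personal data breach must be notified within 72 hours."]

rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print(f"ROUGE-L: {rouge['rougeL']:.4f}")
print(f"BLEU: {bleu['bleu']:.4f}")
print(f"BERTScore F1: {sum(bertscore['f1']) / len(bertscore['f1']):.4f}")
```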
| Criterion | Base | SFT | DPO v1 | DPO v2 (Tier 1) |
|---|---|---|---|---|
| Legal Correctness | 3.18 | 3.18 | 3.06 | 2.86 |
| Article Accuracy | 2.64 | 2.52 | 2.50 | 2.50 |
| Compliance Alignment | 3.62 | 3.62 | 3.40 | 3.18 |
| Clarity | 4.10 | 4.12 | 3.74 | 3.32 |
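A hedged sketch of the LLM-as-a-judge call behind these scores: GPT-4o grades each answer from 1 to 5 on the four criteria above and returns JSON. The rubric wording and response schema are assumptions about src/judge.py.

```python
# LLM-as-a-judge sketch: ask GPT-4o to score an answer 1-5 on the four
# criteria from the table above. The rubric wording and JSON schema are
# assumptions about what src/judge.py actually sends.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the candidate answer to the GDPR question on a 1-5 scale for each "
    "criterion: legal_correctness, article_accuracy, compliance_alignment, "
    "clarity. Respond with JSON only."
)

def judge(question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force a parseable JSON reply
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)
```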
The base gemma-2-2b-it model achieves the highest (or tied-highest) scores across all qualitative criteria. Neither SFT nor DPO produces statistically significant improvements at n=50.
- SFT improves surface-level text overlap (ROUGE/BLEU) because it directly maximizes reference-text likelihood, but adds no measurable gain on legal quality metrics.
- DPO v1 (standard sigmoid loss, β=0.1) slightly degrades all judge scores vs. Base.
- DPO v2 (Tier 1 fix: IPO loss, β=0.3, data filtered from 316 to 103 pairs) regresses further; the 66% data reduction outweighed the noise-reduction benefit (see the configuration sketch below).
This pattern is consistent with the known failure conditions of self-generated preferences (SPIN; Chen et al., 2024) on small datasets with a strong instruction-tuned base model. See Limitations below for the full analysis.
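For concreteness, the Tier 1 change maps onto TRL's DPO trainer roughly as below. Only loss_type="ipo" and beta=0.3 come from the experiments above; the remaining hyperparameters are assumptions.

```python
# Sketch of the DPO v2 (Tier 1) setup in TRL: IPO loss with beta=0.3 on the
# filtered preference pairs. Unreported hyperparameters are assumptions.
from trl import DPOConfig, DPOTrainer

def train_dpo_v2(sft_model, tokenizer, filtered_pairs):
    """Run the Tier 1 DPO configuration on the filtered preference pairs."""
    config = DPOConfig(
        output_dir="models/dpo_v2",
        loss_type="ipo",                # IPO loss instead of DPO v1's standard sigmoid
        beta=0.3,                       # stronger regularization than v1's beta=0.1
        per_device_train_batch_size=2,  # assumed; not reported above
        num_train_epochs=1,             # assumed
    )
    trainer = DPOTrainer(
        model=sft_model,                # Stage 1 SFT checkpoint
        args=config,
        train_dataset=filtered_pairs,   # 103 pairs after bias filtering
        processing_class=tokenizer,     # `tokenizer=` in older TRL versions
    )
    trainer.train()
```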
- Gemma (official site)
- Gemma 2 model card
- Gemma 2 announcement
- Gemma docs
- Gemma Cookbook (GitHub)
- Aligning DPO Gemma 2B-it (Cookbook notebook)
- sims2k/GDPR_QA_instruct_dataset (HF): primary training set
- sims2k/GDPR_QA_instruct_eval_dataset (HF): evaluation split
- tamjidrahat/gdpr-dataset (GitHub)
- Is Your Policy Compliant? (ACM paper, 2022)
- Fine-tuning Gemma with QLoRA (Google Developer Experts)
- Fine-Tune Gemma Using QLoRA (Samvardhan)
- Low-Rank Adapter (LoRA) Explained
- Fine-tuning LLMs w/ Example Code (YouTube)
- Getting Started with Gemma using HuggingFace Libraries
- Sherlock Holmes Q&A with Gemma fine-tuning (Kaggle)