This project implements a Retrieval Augmented Generation (RAG) index to efficiently query articles and retrieve the most relevant content while filtering out harmful or toxic queries.

filippopellizzari/RAG-ArticleQueryFilter

RAG Article Query Filter

A hybrid retrieval system for news articles that combines dense vector search and BM25 keyword matching, with a BERT-based toxicity gate on incoming queries.

Built as a self-contained ML engineering project to demonstrate systematic retrieval experimentation, modular Python packaging, and production-ready API design.


Architecture

Raw corpus (CSV)
      │
      ▼
┌──────────────────────┐   SentenceSplitter (256 tok)  ┌────────────────┐
│      src/data        │ ────────────────────────────► │  src/indexing  │
│  minimal_clean_text  │                               │  ChromaDB      │
└──────────────────────┘                               │  bge-large-v1.5│
                                                       └────┬───────────┘
                                                            │
                        ┌───────────────────────────────────┤
                        │                                   │
               ┌────────▼────────┐                 ┌────────▼────────┐
               │  Vector Search  │                 │  BM25 Retriever │
               │  bge-large-v1.5 │                 │  word tokeniser │
               └────────┬────────┘                 └────────┬────────┘
                        │                                   │
                        └──────────────┬────────────────────┘
                                       │ Reciprocal Rank Fusion
                                       ▼
                              Ranked document UUIDs
                                       │
                              ┌────────▼────────┐
                              │ src/query_filter│
                              │  BERT toxicity  │
                              └────────┬────────┘
                                       │
                              ┌────────▼────────┐
                              │    src/api      │
                              │    FastAPI      │
                              └─────────────────┘
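
The fusion step in the diagram can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the code in src/retrieval.py: the function name `reciprocal_rank_fusion` and the example document IDs are hypothetical, and `k=60` is the constant suggested by Cormack et al. (2009), which may differ from the value this project uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) in every list it appears in
    (rank starts at 1); the scores are summed across lists.
    k=60 follows Cormack et al. (2009) and is an assumption here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Because RRF uses only ranks, the cosine similarities from the vector search and the BM25 scores never need to be put on a common scale.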

Results

Five retrieval strategies were evaluated on 500 queries sampled (seed 42) from a labelled set of 2 330. Full methodology and findings are in results/retrieval_experiments.md.

| Experiment | F1 | Recall | Precision |
|---|---|---|---|
| Baseline: bge-small, top_k=2 | 0.323 | 0.271 | 0.440 |
| BM25 word-tokenised, top_k=5 | 0.396 | 0.460 | 0.373 |
| bge-base vector, top_k=5 | 0.349 | 0.396 | 0.337 |
| bge-base + BM25 hybrid (RRF) | 0.403 | 0.440 | 0.401 |
| bge-large + BM25 hybrid (RRF) | 0.405 | 0.441 | 0.408 |

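The best configuration (bge-large + BM25 hybrid) improves F1 by 25.5% relative to the baseline. Key insight: on this entity-heavy news corpus, BM25 alone beats standalone vector search at the same top_k (F1 0.396 vs 0.349), and fusing the two with RRF raises precision over BM25 alone (0.373 to 0.408) at a small cost in recall (0.460 to 0.441).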
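
For readers unfamiliar with the metrics, here is a minimal sketch of set-based precision, recall and F1 for a single query, plus a per-query average. The function names are hypothetical and this is not the code in src/evaluation.py. Averaging F1 per query is an assumption, although it would be consistent with the reported F1 values not being the harmonic mean of the reported precision and recall.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based metrics for one query, given retrieved and ground-truth UUIDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_over_queries(per_query_metrics):
    """Average each metric over queries (assumed aggregation, see lead-in)."""
    n = len(per_query_metrics)
    return tuple(sum(m[i] for m in per_query_metrics) / n for i in range(3))


# Hypothetical query: 5 retrieved UUIDs, 2 relevant ones, 1 overlap.
p, r, f1 = precision_recall_f1(["u1", "u2", "u3", "u4", "u5"], ["u2", "u9"])
print(p, r, round(f1, 4))
```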


Setup

# Install dependencies (uv recommended)
uv sync

# Or with pip
pip install -e .

Requirements: Python 3.11–3.12, ~5 GB disk (models + vector DB).


Usage

1. Prepare the corpus

prepare-data

Produces dataset/corpus_minimal_clean.csv (whitespace-normalised, casing preserved).
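
As an illustration of what "whitespace-normalised, casing preserved" means, a cleaning function could look like the sketch below. The name `minimal_clean_text` appears in the architecture diagram, but the body shown here is an assumption and the real implementation in src/data.py may do more.

```python
import re


def minimal_clean_text(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single
    spaces and trim the ends. Casing and punctuation are left untouched."""
    return re.sub(r"\s+", " ", text).strip()


cleaned = minimal_clean_text("  Apple  unveils\n\tnew iPhone ")
print(cleaned)
```

Keeping the original casing preserves entity names such as "Apple" exactly as they appear in the articles.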

2. Run retrieval experiments

run-experiments

Builds ChromaDB indices for each embedding model (cached on re-runs) and prints the full results table. See results/retrieval_experiments.md for the pre-computed results.

3. Run the tests

# Install dev extras first (pytest + httpx)
uv sync --extra dev

pytest tests/

All tests run without a pre-built index or GPU — heavy dependencies (ChromaDB, BERT, embedding models) are mocked at the test boundary.
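
The mocking approach can be sketched like this. It is a hypothetical example rather than a test from this repository: `is_toxic`, its `threshold` parameter and the `"toxic"` label name are assumptions. The only thing taken as given is that a HuggingFace text-classification pipeline returns a list of dicts with `label` and `score` keys.

```python
from unittest.mock import Mock


def is_toxic(query, classifier, threshold=0.5):
    """Return True if the classifier labels the query as toxic.

    `classifier` is any callable with the shape of a HuggingFace
    text-classification pipeline: classifier(text) -> [{"label", "score"}].
    The label name "toxic" is an assumption for this sketch.
    """
    result = classifier(query)[0]
    return result["label"] == "toxic" and result["score"] >= threshold


def test_toxic_query_is_flagged():
    # The mock stands in for the BERT model, so nothing is downloaded.
    fake = Mock(return_value=[{"label": "toxic", "score": 0.97}])
    assert is_toxic("some query", fake)
    fake.assert_called_once_with("some query")


def test_benign_query_passes():
    fake = Mock(return_value=[{"label": "non-toxic", "score": 0.99}])
    assert not is_toxic("latest election results", fake)
```

Passing the classifier in as an argument is one way to make the boundary mockable; patching a module-level loader with `unittest.mock.patch` would work as well.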

4. Start the API

uvicorn src.api:app --reload

Interactive docs at http://127.0.0.1:8000/docs.

Note: The API loads the v2_bge_large index at startup. Run run-experiments at least once before serving to ensure the index is built.

Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness check |
| POST | /query?query=… | Return ranked article UUIDs |

The toxicity filter rejects harmful queries with HTTP 422 before retrieval runs.
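
A client call against a locally running server might look like the sketch below, using only the standard library. The helper names `build_query_request` and `run_query` are hypothetical, and the response body is returned as raw text because its exact schema is not documented here.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:8000"  # default uvicorn address from the Usage section


def build_query_request(query: str, base_url: str = BASE_URL) -> Request:
    """Build a POST request with the query passed as a URL parameter."""
    url = f"{base_url}/query?{urlencode({'query': query})}"
    return Request(url, method="POST")


def run_query(query: str):
    """Send the query; return (status_code, body_text)."""
    try:
        with urlopen(build_query_request(query)) as resp:
            return resp.status, resp.read().decode()
    except HTTPError as err:
        # A 422 here may mean the toxicity filter rejected the query.
        return err.code, err.read().decode()


request = build_query_request("election results")
print(request.get_method(), request.full_url)
```

FastAPI also returns 422 for request validation errors by default (for example, a missing `query` parameter), so a client that needs to tell the two cases apart would have to inspect the response body.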


Dataset

| File | Description |
|---|---|
| dataset/corpus.csv | 609 news articles (title, body, uuid, category, …) |
| dataset/corpus_minimal_clean.csv | Whitespace-normalised corpus, produced by prepare-data |
| dataset/queries.csv | 2 330 labelled queries with ground-truth article UUIDs |

Tech stack

| Component | Library |
|---|---|
| Vector indexing | LlamaIndex + ChromaDB |
| Dense embeddings | BAAI/bge-large-en-v1.5 via HuggingFace |
| Keyword retrieval | BM25 (rank_bm25) with word-level tokeniser |
| Rank fusion | Reciprocal Rank Fusion (Cormack et al., 2009) |
| Toxicity filter | JungleLee/bert-toxic-comment-classification |
| API | FastAPI |
| Linting | Ruff |

Project layout

src/
├── config.py        centralised constants (paths, model names, tuning params)
├── data.py          corpus text cleaning + prepare-data entry point
├── indexing.py      ChromaDB index construction and loading
├── retrieval.py     RRF, hybrid retrieval, reranking utilities
├── evaluation.py    retrieval metrics (recall, precision, F1, hit rate, MRR)
├── query_filter.py  lazy-loaded BERT toxicity classifier
├── experiments.py   5-experiment runner + run-experiments entry point
└── api.py           FastAPI application

results/
└── retrieval_experiments.md   experiment report (methodology + results + findings)

tests/                         pytest suite (heavy dependencies mocked)
