RAG-PDF-Query

This project implements a Retrieval-Augmented Generation (RAG) pipeline to answer questions about a collection of PDF documents. The pipeline combines text extraction, vectorization, and semantic search with ChromaDB to retrieve relevant passages, followed by a GPT-4-powered answer generation step.
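The retrieval step can be illustrated with a minimal, dependency-free sketch. In this project ChromaDB performs the actual search, and its distance metric may differ; the function names and the hand-made 3-dimensional vectors below are assumptions chosen only to show how ranking by semantic similarity works.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunks, top_k=2):
    """Return the top_k chunk texts ranked by similarity to query_vec.

    chunks is a list of (text, vector) pairs.
    """
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]

# Toy "embeddings", hand-made for illustration only (not real model output)
chunks = [
    ("Invoices are due within 30 days.", [0.9, 0.1, 0.0]),
    ("The office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Late payments incur a 2% fee.", [0.8, 0.3, 0.1]),
]
context = retrieve([1.0, 0.2, 0.0], chunks, top_k=2)
print(context)
```

The retrieved texts are what would be passed to the language model as context for the answer.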

1. Features

Extracts text from a collection of PDF files.
Processes and chunks the text into paragraphs for efficient vectorization.
Stores the vectorized data into a vector database using ChromaDB.
Retrieves relevant context for a given query using semantic similarity.
Generates answers grounded in the retrieved context using GPT-4 for three predefined test questions.
Provides an interactive CLI for asking additional questions beyond the predefined ones.
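As a rough illustration of the chunking step, the sketch below splits extracted text into paragraphs on blank lines. It is not the code in extract.py; the function name and the `min_chars` threshold are assumptions.

```python
import re

def chunk_into_paragraphs(text, min_chars=20):
    """Split text on blank lines, normalize whitespace, drop tiny fragments."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Collapse line breaks and repeated spaces inside a paragraph
        cleaned = " ".join(block.split())
        # Skip fragments such as stray page numbers (threshold is an assumption)
        if len(cleaned) >= min_chars:
            paragraphs.append(cleaned)
    return paragraphs

sample = "First paragraph spans\ntwo lines of text.\n\n12\n\nSecond paragraph is here too."
result = chunk_into_paragraphs(sample)
print(result)
```

Each resulting paragraph would then be embedded and stored as one entry in the vector database.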

2. Requirements

Python Environment

Ensure you have Python 3.8 or later installed.

Install Dependencies

Install the required libraries via pip:

pip install -r requirements.txt

Additional Setup

Add your OpenAI API key to a .env file in the project root directory:

OPENAI_API_KEY=your_openai_api_key
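For reference, here is a minimal standard-library sketch of how a .env file like this can be read into the environment. The project may well use a package such as python-dotenv instead; the helper names below are hypothetical.

```python
import os
from pathlib import Path

def parse_env_text(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip("\"'")
    return values

def load_env_file(path=".env"):
    """Load variables from path into os.environ; return what was loaded."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    os.environ.update(values)
    return values

load_env_file()
if not os.environ.get("OPENAI_API_KEY"):
    print("OPENAI_API_KEY is not set; add it to .env")
```

Checking for the key at startup gives a clear message instead of a failed API call later in the pipeline.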

3. Usage

Prepare Your PDF Files

Place all relevant PDF files in the ../data/ folder (the data/ folder at the project root, one level above src/).
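Because ../data/ is a relative path, it resolves against the directory the script is started from. A small sketch for checking which PDF files would be picked up, assuming the files sit directly in the folder (the function name is hypothetical):

```python
from pathlib import Path

def list_pdfs(data_dir="../data"):
    """Return the PDF files directly inside data_dir, sorted by name."""
    folder = Path(data_dir)
    if not folder.is_dir():
        return []
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

pdfs = list_pdfs()
print(f"Found {len(pdfs)} PDF file(s)")
```

If this reports zero files, the working directory is probably not src/.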

Run the Script

Run the script from the src/ folder:

python3 main.py

Enter Query

The script first answers the predefined questions from the test_questions module and displays the results. It then enters interactive mode and prompts you for additional questions.

Type your question, and the RAG pipeline will answer it based on the document context.

You can exit interactive mode by typing:

exit
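The interactive mode described above can be sketched as a simple read-answer loop. This is an illustration, not the code in main.py: `interactive_loop` and its parameters are hypothetical, and `answer_fn` stands in for the full retrieve-and-generate step.

```python
def interactive_loop(answer_fn, read=input, write=print):
    """Prompt for questions until the user types 'exit'.

    Returns the number of questions answered.
    """
    answered = 0
    while True:
        try:
            question = read("Question: ").strip()
        except EOFError:
            break  # input stream closed (e.g. Ctrl-D)
        if question.lower() == "exit":
            break
        if not question:
            continue  # ignore empty input
        write(answer_fn(question))
        answered += 1
    return answered

# Demo with scripted input and a stand-in answer function
scripted = iter(["What is the notice period?", "", "exit"])
outputs = []
count = interactive_loop(
    lambda q: f"(answer to: {q})",
    read=lambda prompt: next(scripted),
    write=outputs.append,
)
print(count, outputs)
```

Passing `read` and `write` as parameters keeps the loop testable without a terminal; in normal use the defaults `input` and `print` apply.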

4. Configuration

The predefined questions are located in test_questions.py. You can edit this file to add or modify questions.

The embedding model is defined in utils.py.

IMPORTANT NOTE: Make sure that the selected embedding model supports the language of the data files and questions!

5. Folder Structure

.
├── data/                 # Folder for PDF files
├── src/                  # Source code folder
│   ├── main.py           # Control file - entry point of the project
│   ├── extract.py        # PDF text extraction and chunking functions
│   ├── vectorize.py      # Vectorization of paragraphs
│   ├── database.py       # ChromaDB collection and storage functions
│   ├── query.py          # Context retrieval functions
│   ├── utils.py          # Utilities and global settings
│   ├── prompts.py        # System and user prompt templates
│   └── test_questions.py # Predefined questions
├── .env                  # Environment variables (e.g., OpenAI API key)
└── requirements.txt      # Python dependencies

emirkocer/RAG-PDF-Query

Folders and files

Latest commit

History

Repository files navigation

RAG-PDF-Query

1. Features

2. Requirements

Python Environment

Install Dependencies

Additional Setup

3. Usage

Prepare Your PDF Files

Run the Script

Enter Query

4. Configuration

5. Folder Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages