🔬 IndustrialDocSearch — RAG-Powered Technical Document Q&A

Semantic Search & Retrieval-Augmented Generation (RAG) over Industrial Safety and Gas Detection Documentation, powered by the Endee Vector Database.



🎯 Project Overview

Industrial organizations — particularly in sectors like oil & gas, chemical plants, and manufacturing — maintain large volumes of technical documentation: safety procedures, calibration manuals, compliance guidelines, and emergency protocols.

IndustrialDocSearch is an AI-powered application that allows engineers and safety personnel to query this documentation in plain English and receive precise, contextually grounded answers — without having to manually search through hundreds of PDFs.


🔍 Problem Statement

Challenge: Industrial safety teams face a critical knowledge-retrieval problem. Technical documents are voluminous, domain-specific, and time-sensitive. Traditional keyword search fails because:

  • Engineers phrase questions differently from how documents are written
  • Critical safety thresholds are scattered across multiple documents
  • There's no way to cross-reference related procedures quickly

Solution: A vector-search-powered RAG pipeline that:

  1. Understands the semantic meaning of questions (not just keywords)
  2. Retrieves the most relevant documentation snippets using Endee
  3. Generates precise, citation-grounded answers using an LLM

🏗️ System Design & Architecture

┌─────────────────────────────────────────────────────────────┐
│                        INGESTION PIPELINE                    │
│                                                             │
│  JSON Documents ──► Sentence Transformer ──► Endee Index   │
│  (title + content)    (384-dim vectors)       (HNSW / INT8) │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                        QUERY PIPELINE                        │
│                                                             │
│  User Question                                              │
│       │                                                     │
│       ▼                                                     │
│  Sentence Transformer  ──►  384-dim query vector            │
│       │                                                     │
│       ▼                                                     │
│  Endee k-NN Search     ──►  Top-K similar documents         │
│  (cosine similarity)         (with metadata)                │
│       │                                                     │
│       ▼                                                     │
│  RAG Prompt Assembly   ──►  context + question              │
│       │                                                     │
│       ▼                                                     │
│  Gemini 1.5 Flash      ──►  Grounded natural language answer │
│  (optional, free tier)                                      │
│       │                                                     │
│       ▼                                                     │
│  Streamlit UI          ──►  Answer + Retrieved Docs         │
└─────────────────────────────────────────────────────────────┘
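The "RAG Prompt Assembly" step in the diagram can be sketched as below. This is a minimal illustration rather than the code in src/retriever.py: the function name `build_rag_prompt`, the prompt wording, and the numbered-source layout are assumptions. Only the hit shape (`id`, `similarity`, and `meta` with `title` and `content`) follows the examples in this README.

```python
def build_rag_prompt(question, results, max_chars_per_doc=1200):
    """Assemble a grounded prompt from retrieved hits (hypothetical helper)."""
    context_blocks = []
    for rank, hit in enumerate(results, start=1):
        meta = hit.get("meta", {})
        title = meta.get("title", "Untitled")
        # Truncate long documents so the prompt stays within a predictable size.
        content = meta.get("content", "")[:max_chars_per_doc]
        context_blocks.append(f"[{rank}] {title}\n{content}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the sources below. "
        "Cite sources by their bracketed number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sources gives the model something concrete to cite, which makes it easier to check an answer against the retrieved documents.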

Key Design Decisions

| Decision | Rationale |
|---|---|
| Endee as vector store | High-performance HNSW indexing, easy Docker deployment, Python SDK |
| all-MiniLM-L6-v2 embeddings | 384-dim, fast inference, good general-purpose semantic search quality on technical text |
| Cosine similarity + INT8 precision | INT8 stores one byte per dimension instead of four for 32-bit floats; on unit-normalized vectors, cosine similarity reduces to a dot product |
| RAG over fine-tuning | Documents can be updated without retraining; grounded answers reduce hallucination |
| Gemini 1.5 Flash | Free tier, low latency, follows RAG prompts reliably |
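To make the cosine and INT8 rationale concrete, the sketch below shows that cosine similarity on unit-length vectors equals a plain dot product, and that 8-bit quantization changes the score only slightly. The quantizer here is a generic symmetric scalar scheme chosen for illustration. This README does not describe how Endee quantizes internally, so treat it as an assumption, not as Endee's method.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def quantize_int8(v):
    """Symmetric scalar quantization onto the integer range [-127, 127]."""
    peak = max(abs(x) for x in v)
    scale = peak / 127.0 if peak else 1.0   # guard against an all-zero vector
    return [round(x / scale) for x in v], scale

def dequantize(q, scale):
    return [x * scale for x in q]

# Toy 4-dim vectors stand in for the 384-dim embeddings.
a = normalize([0.2, -0.5, 0.8, 0.1])
b = normalize([0.3, -0.4, 0.7, 0.0])

exact = cosine(a, b)
dot_only = sum(x * y for x, y in zip(a, b))   # same value, no norms needed
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
approx = cosine(dequantize(qa, sa), dequantize(qb, sb))
```

Comparing `exact` with `approx` on real embeddings is a quick way to judge whether the memory saving costs any ranking quality for a given corpus.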

🗄️ How Endee is Used

Endee serves as the core vector database for this application. Here's exactly how:

1. Index Creation

from endee import Endee, Precision

client = Endee()
client.create_index(
    name="industrial_docs",
    dimension=384,          # matches all-MiniLM-L6-v2 output
    space_type="cosine",    # cosine similarity for normalized vectors
    precision=Precision.INT8  # quantized for memory efficiency
)

2. Document Ingestion (Upsert)

index = client.get_index(name="industrial_docs")
index.upsert([
    {
        "id": "doc_001",
        "vector": [0.023, -0.112, ...],   # 384-dim embedding
        "meta": {
            "title": "H2S Gas Detection – Safety Guidelines",
            "category": "Safety",
            "content": "Hydrogen sulfide (H2S) is a colorless..."
        }
    },
    # ... more documents
])
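A small helper can pair each document with its embedding and produce records in the shape shown above. This is a sketch under assumptions: the function name `build_upsert_records` is hypothetical, and the input documents are assumed to carry `id`, `title`, `category`, and `content` keys (the README confirms only the record layout, not the JSON schema).

```python
def build_upsert_records(docs, vectors, dimension=384):
    """Pair documents with embeddings in the upsert record shape (hypothetical helper)."""
    if len(docs) != len(vectors):
        raise ValueError("docs and vectors must have the same length")
    records = []
    for doc, vector in zip(docs, vectors):
        # Catch a model/index mismatch before anything reaches the database.
        if len(vector) != dimension:
            raise ValueError(
                f"{doc['id']}: expected {dimension} dims, got {len(vector)}"
            )
        records.append({
            "id": doc["id"],
            "vector": list(vector),
            "meta": {
                "title": doc["title"],
                "category": doc["category"],
                "content": doc["content"],
            },
        })
    return records
```

Checking the dimension up front matters because the index was created with `dimension=384`, so a vector from a different model would not fit it.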

3. Semantic Search (k-NN Query)

# embedder is the project's embedding helper (see src/embedder.py)
query_vector = embedder.embed_single("What is the IDLH for H2S?")

results = index.query(
    vector=query_vector,
    top_k=5,
)
# Returns: [{ id, similarity, meta }]
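Given hits in the `{ id, similarity, meta }` shape, the application can drop weak matches before showing them or passing them to the LLM. The helpers below are an illustrative sketch: the names `select_context` and `format_hit` and the default threshold of 0.3 are assumptions, not values taken from this project.

```python
def select_context(results, min_similarity=0.3, max_docs=3):
    """Keep the best hits above a similarity floor, highest score first."""
    kept = [hit for hit in results if hit["similarity"] >= min_similarity]
    kept.sort(key=lambda hit: hit["similarity"], reverse=True)
    return kept[:max_docs]

def format_hit(rank, hit):
    """Render one hit as a single display line."""
    title = hit["meta"].get("title", hit["id"])
    return f"{rank}. {title} (similarity {hit['similarity']:.3f})"
```

A similarity floor keeps unrelated documents out of the prompt when a question falls outside the indexed corpus.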

4. Metadata Filtering

# Filter to only Safety category documents
results = index.query(
    vector=query_vector,
    top_k=5,
    filters={"category": {"$eq": "Safety"}}
)
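When the category comes from a UI control, the filter is usually optional. The sketch below builds the extra keyword arguments only when a specific category is chosen. The helper name `build_category_filter` and the "All" sentinel are assumptions; the filter dictionary itself copies the shape used in the example above.

```python
def build_category_filter(selected_category, all_label="All"):
    """Return extra query keyword arguments for an optional category filter."""
    if not selected_category or selected_category == all_label:
        return {}   # no filter: search across every category
    return {"filters": {"category": {"$eq": selected_category}}}

# Usage sketch (index and query_vector as in the examples above):
# results = index.query(vector=query_vector, top_k=5,
#                       **build_category_filter("Safety"))
```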

Endee handles all vector storage, HNSW indexing, and approximate nearest-neighbor search, allowing the application layer to focus purely on embedding quality and answer generation.
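Because HNSW search is approximate, it can help to compare its output against an exact search on a small corpus. The sketch below is a brute-force baseline in pure Python with hypothetical function names. It is not part of Endee or this repository, and it is only practical for small collections such as the 15 documents used here.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors_by_id, k=5):
    """Score every stored vector and return the k most similar (id, score) pairs."""
    scored = (
        (cosine_similarity(query, vec), doc_id)
        for doc_id, vec in vectors_by_id.items()
    )
    best = heapq.nlargest(k, scored)
    return [(doc_id, score) for score, doc_id in best]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact) if exact else 1.0
```

Feeding the ids returned by `index.query` into `recall_at_k` alongside the exact ids gives a simple recall figure for the index settings in use.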


✨ Features

  • 🔍 Semantic Search — Find relevant documents by meaning, not just keywords
  • 🤖 RAG Q&A — AI-generated answers grounded in retrieved documents (requires a Gemini API key)
  • 📂 Category Filtering — Filter results by document category
  • 🎚️ Adjustable Top-K — Control how many documents to retrieve
  • 💬 Example Queries — One-click example questions to get started
  • 🔄 One-click Re-indexing — Rebuild the Endee index from the sidebar

🛠️ Tech Stack

| Component | Technology |
|---|---|
| Vector Database | Endee (self-hosted via Docker) |
| Embeddings | sentence-transformers all-MiniLM-L6-v2 |
| LLM (optional) | Google Gemini 1.5 Flash (free tier) |
| UI | Streamlit |
| Language | Python 3.10+ |

🚀 Setup & Installation

Prerequisites

  • Python 3.10+
  • Docker + Docker Compose
  • (Optional) Free Google AI Studio API key for Gemini

Step 1 — Clone this repository

git clone https://github.com/M1325-source/endee
cd endee

⭐ Please also star the original endee-io/endee repository!


Step 2 — Start the Endee Vector Database

docker compose up -d

This pulls the official endeeio/endee-server:latest image and starts it on port 8080.

Verify it's running:

docker ps
# You should see: endee-server   Up

You can also open the Endee dashboard at: http://localhost:8080
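A script can also confirm that something is listening on port 8080 before ingestion starts. The sketch below uses only the standard library; the function name `is_port_open` is hypothetical. It checks TCP reachability only, so it does not prove that the listener is Endee or that the server is healthy.

```python
import socket

def is_port_open(host="localhost", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling `is_port_open()` at the top of an ingestion script lets it exit with a clear message instead of a connection error from deeper in the stack.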


Step 3 — Install Python dependencies

pip install -r requirements.txt

Step 4 — Index the documents into Endee

python ingest.py

This will:

  1. Load 15 industrial safety & gas detection documents from data/documents.json
  2. Generate 384-dim embeddings for each document using all-MiniLM-L6-v2
  3. Upsert all vectors + metadata into Endee's industrial_docs index
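The loading half of these steps can be sketched with the standard library alone. This is an illustration, not the contents of ingest.py: it assumes data/documents.json is a JSON array of objects with at least `title` and `content`, that the two fields are joined with a newline before embedding, and the helper names are hypothetical.

```python
import json

def load_documents(path):
    """Read the document list and check the fields the embedder needs."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    for i, doc in enumerate(docs):
        missing = {"title", "content"} - doc.keys()
        if missing:
            raise ValueError(f"document {i} is missing fields: {sorted(missing)}")
    return docs

def embedding_text(doc):
    """Combine title and content into the string that gets embedded."""
    return f"{doc['title']}\n{doc['content']}"

def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With 15 documents a single upsert is enough; batching becomes useful once the corpus grows to thousands of entries.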

Step 5 — Run the application

streamlit run app.py

Open http://localhost:8501 in your browser.


(Optional) Enable AI-generated answers

Get a free API key from Google AI Studio and paste it into the Gemini API Key field in the sidebar. This activates RAG mode where answers are generated by Gemini 1.5 Flash using the retrieved documents as context.
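The switch between search-only mode and RAG mode can be sketched as below. This is a hypothetical outline, not the code in app.py: `generate` stands in for whatever callable wraps the Gemini request, and the returned dictionary layout is an assumption made for illustration.

```python
def answer_question(question, results, generate=None):
    """Return an LLM answer when a generator is supplied, else the sources only."""
    if not results:
        return {"mode": "empty", "answer": None, "sources": []}
    sources = [hit["meta"].get("title", hit["id"]) for hit in results]
    if generate is None:
        # No API key configured: fall back to plain semantic search.
        return {"mode": "search", "answer": None, "sources": sources}
    context = "\n\n".join(hit["meta"].get("content", "") for hit in results)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return {"mode": "rag", "answer": generate(prompt), "sources": sources}
```

Passing the generator in as an argument keeps the retrieval path testable without an API key.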


📁 Project Structure

endee/                          ← This repository (forked from endee-io/endee)
├── app.py                      ← Streamlit application (main entry point)
├── ingest.py                   ← CLI script to index documents into Endee
├── requirements.txt
├── docker-compose.yml          ← Runs Endee vector database
│
├── src/
│   ├── embedder.py             ← Sentence embedding generation
│   ├── indexer.py              ← Endee index creation & document upsert
│   └── retriever.py            ← k-NN search + RAG answer generation
│
├── data/
│   └── documents.json          ← 15 industrial safety documents (dataset)
│
└── [original Endee source]     ← C++ vector DB source (endee-io/endee)
    ├── src/
    ├── infra/
    ├── CMakeLists.txt
    └── ...

💬 Demo Queries

Try these in the app:

| Query | Expected Result |
|---|---|
| What are the alarm thresholds for oxygen deficiency? | Oxygen Deficiency Monitoring doc |
| How often should a dew point analyzer be calibrated? | Dew Point Analyzer Maintenance |
| Where should H2S sensors be placed? | Fixed Gas Detection Installation |
| What to do in a gas leak emergency? | Emergency Response Plan |
| Catalytic bead vs infrared sensor – which is better for methane? | Sensor Technology Comparison |
| OSHA confined space requirements | OSHA 1910.146 Compliance doc |
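The demo queries can double as a small regression check. The sketch below computes the fraction of queries whose expected document appears among the top results. It assumes a hypothetical `search` callable that returns a ranked list of document titles, which this repository may or may not expose in that form.

```python
def hit_rate(cases, search, k=3):
    """Fraction of queries whose expected title appears in the top-k titles."""
    hits = 0
    for query, expected in cases:
        titles = search(query)[:k]
        # Substring match tolerates small differences in the stored titles.
        if any(expected.lower() in title.lower() for title in titles):
            hits += 1
    return hits / len(cases)

DEMO_CASES = [
    ("Where should H2S sensors be placed?", "Fixed Gas Detection Installation"),
    ("What to do in a gas leak emergency?", "Emergency Response Plan"),
]
```

Running this after re-indexing or changing the embedding model shows quickly whether retrieval quality has moved.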

📄 License

This project is built on top of Endee, which is licensed under the Apache License 2.0.

The application code in app.py, src/, data/, and ingest.py is original work created for the ENDEE Campus Placement Drive project evaluation.

