This project is a simple search engine built on a database of Wikipedia articles. It demonstrates how to index, process, and search a large document collection using information retrieval techniques.
- Document Indexing: Wikipedia articles are fetched and stored in a SQLite database.
- Text Preprocessing: Includes tokenization, stemming, and stopword removal to normalize queries and documents.
- TF-IDF Representation: Documents are represented as sparse vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) scheme.
- Latent Semantic Indexing (LSI/SVD): Optionally, queries and documents can be projected into a lower-dimensional semantic space using Singular Value Decomposition (SVD) for improved relevance (Latent Semantic Indexing).
- Fast Nearest Neighbor Search: Uses the HNSW (Hierarchical Navigable Small World) algorithm for efficient similarity search.
- Frontend: A React-based frontend allows users to search, select the number of results, and toggle LSI on or off.
- Data Collection: Wikipedia pages are fetched using the MediaWiki API and stored in a SQLite database.
- Preprocessing: Each document is tokenized, stemmed (using NLTK), and filtered for stopwords.
- Vectorization: Documents are converted into TF-IDF vectors.
- Dimensionality Reduction: Optionally, SVD is applied to reduce noise and capture latent semantic relationships.
- Similarity Search: When a user submits a query, it is processed in the same way as the documents and compared using cosine similarity (or in the reduced SVD space).
- Results: The most relevant documents are returned and displayed in the frontend.
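The pipeline above can be sketched with scikit-learn and NLTK's Porter stemmer. The tiny corpus and hardcoded stopword set are illustrative only (the project uses NLTK's stopword corpus and the Wikipedia documents stored in SQLite):

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()
STOPWORDS = {"the", "a", "an", "and", "on", "are", "is", "of"}  # stand-in for NLTK's list

def preprocess(text: str) -> str:
    # Tokenize, lowercase, drop stopwords, stem -- same steps as the documents.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in STOPWORDS)

docs = [
    "The cat sat on the mat.",
    "Dogs are loyal animals.",
    "Cats and dogs are common household pets.",
]
corpus = [preprocess(d) for d in docs]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse TF-IDF document vectors

svd = TruncatedSVD(n_components=2)         # optional LSI projection
docs_lsi = svd.fit_transform(tfidf)

# The query goes through the same preprocessing, then is compared
# to all documents in the reduced SVD space via cosine similarity.
query_vec = vectorizer.transform([preprocess("a cat and a dog")])
scores = cosine_similarity(svd.transform(query_vec), docs_lsi)[0]
best = int(scores.argmax())
```

Running the query through the identical `preprocess` step is what makes the comparison meaningful: a raw query like "cats" would otherwise never match the stemmed term "cat" in the index.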
- Backend: Python, Flask, SQLite, NLTK, scikit-learn, hnswlib
- Frontend: React, TypeScript, Material-UI (MUI), Vite
Backend:
- Install dependencies:
pip install -r requirements.txt
- Follow ALL the steps in the Jupyter notebook. This is kinda the worst part about running this project: recreating the database from Jupyter takes a long time, and you also need to create the pickle files.
- Run the backend:
python app.py
Frontend:
- Install dependencies:
npm install
- Run the frontend:
npm run dev
I've added skeleton screens for improved perceived performance.

- The first query may take a while as the index loads into memory.
- LSI (Latent Semantic Indexing) can improve result quality, especially for ambiguous or broad queries.
- The project is for educational purposes and demonstrates core IR concepts like stemming, TF-IDF, and SVD.
PS. I also wanted to deploy this project, but after creating an Azure Container Instance for the backend, the costs were way too high for this toy project, so to experience it in all of its glory you have to set it up yourself 😊

