This project is a simple search engine built on a database of Wikipedia articles. It demonstrates how to index, process, and search a large document collection using information retrieval techniques.
- Document Indexing: Wikipedia articles are fetched and stored in a SQLite database.
- Text Preprocessing: Includes tokenization, stemming, and stopword removal to normalize queries and documents.
- TF-IDF Representation: Documents are represented as sparse vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) scheme.
- Latent Semantic Indexing (LSI/SVD): Optionally, queries and documents can be projected into a lower-dimensional semantic space using Singular Value Decomposition (SVD) for improved relevance (Latent Semantic Indexing).
- Fast Nearest Neighbor Search: Uses the HNSW (Hierarchical Navigable Small World) algorithm for efficient similarity search.
- Frontend: A React-based frontend allows users to search, select the number of results, and toggle LSI on or off.
- Data Collection: Wikipedia pages are fetched using the MediaWiki API and stored in a SQLite database.
- Preprocessing: Each document is tokenized, stemmed (using NLTK), and filtered for stopwords.
- Vectorization: Documents are converted into TF-IDF vectors.
- Dimensionality Reduction: Optionally, SVD is applied to reduce noise and capture latent semantic relationships.
- Similarity Search: When a user submits a query, it is processed in the same way as the documents and compared using cosine similarity (or in the reduced SVD space).
- Results: The most relevant documents are returned and displayed in the frontend.
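The pipeline above can be sketched with scikit-learn and NLTK's Porter stemmer. The tiny corpus and hardcoded stopword set are illustrative only (the project uses NLTK's stopword corpus and the Wikipedia documents stored in SQLite):

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()
STOPWORDS = {"the", "a", "an", "and", "on", "are", "is", "of"}  # stand-in for NLTK's list

def preprocess(text: str) -> str:
    # Tokenize, lowercase, drop stopwords, stem -- same steps as the documents.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in STOPWORDS)

docs = [
    "The cat sat on the mat.",
    "Dogs are loyal animals.",
    "Cats and dogs are common household pets.",
]
corpus = [preprocess(d) for d in docs]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse TF-IDF document vectors

svd = TruncatedSVD(n_components=2)         # optional LSI projection
docs_lsi = svd.fit_transform(tfidf)

# The query goes through the same preprocessing, then is compared
# to all documents in the reduced SVD space via cosine similarity.
query_vec = vectorizer.transform([preprocess("a cat and a dog")])
scores = cosine_similarity(svd.transform(query_vec), docs_lsi)[0]
best = int(scores.argmax())
```

Running the query through the identical `preprocess` step is what makes the comparison meaningful: a raw query like "cats" would otherwise never match the stemmed term "cat" in the index.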
- Backend: Python, Flask, SQLite, NLTK, scikit-learn, hnswlib
- Frontend: React, TypeScript, Material-UI (MUI), Vite
Backend:
- Install dependencies:
pip install -r requirements.txt
- Follow ALL the steps in the Jupyter notebook. This is kinda the worst part about running this project: recreating the database from Jupyter takes a long time, and you also need to create the pickle files.
- Run the backend:
python app.py
Frontend:
- Install dependencies:
npm install
- Run the frontend:
npm run dev
I've added skeleton screens for improved perceived performance.

- The first query may take a while as the index loads into memory.
- LSI (Latent Semantic Indexing) can improve result quality, especially for ambiguous or broad queries.
- The project is for educational purposes and demonstrates core IR concepts like stemming, TF-IDF, and SVD.
PS. I also wanted to deploy this project, but after creating an Azure Container Instance for the backend, the costs were way too high for this toy project, so to experience it in all of its glory you have to set it up yourself 😊

