🧠 NLP pipeline to extract mining project names & locations from PDFs using spaCy NER + GeoNames geolocation.

basedavishkar/GeoMine-NER-Geolocation


🛠️ GeoMine: Mining Project Extraction from PDF Reports (NER + GeoLocation)

Built with spaCy | Gemini API | License: MIT | Python

Extract named entities and locations from geological PDFs using spaCy NER and Gemini LLMs. Built for mining industry document parsing.

GeoMine is a modular information extraction pipeline built to process unstructured mining documents, identify project names using Named Entity Recognition (NER), and estimate their geographic locations using large language models. Designed with simplicity, modularity, and clarity in mind, it's suitable for both production and research use cases.


📌 Overview

Given a collection of multi-page geological PDF reports, this pipeline:

  1. Extracts readable text from each page
  2. Identifies mining project names using a custom-trained spaCy NER model
  3. Infers coordinates using LLM prompting (Gemini API)
  4. Outputs a structured JSONL record for each project mention
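The four steps above can be sketched as a small driver loop. This is only an illustration, not the repository's actual code: `run_pipeline`, `extract_pages`, `find_projects`, `locate`, and `write_jsonl` are hypothetical names, and the three stages are passed in as plain callables so the flow can be read (and run) without pdfplumber, spaCy, or a Gemini API key.

```python
import json
import os
from typing import Callable, Iterable, Iterator, Optional, Tuple

# Hypothetical stage signatures (assumptions, not the repository's API):
#   extract_pages(pdf_path)      -> iterable of page texts, in page order
#   find_projects(page_text)     -> iterable of (project_name, context_sentence)
#   locate(project_name, sentence) -> (lat, lon), or None if unknown
Coordinates = Optional[Tuple[float, float]]


def run_pipeline(
    pdf_paths: Iterable[str],
    extract_pages: Callable[[str], Iterable[str]],
    find_projects: Callable[[str], Iterable[Tuple[str, str]]],
    locate: Callable[[str, str], Coordinates],
) -> Iterator[dict]:
    """Yield one record per project mention, mirroring steps 1 to 4."""
    for pdf_path in pdf_paths:
        # Page numbers are assumed to be 1-based, as in the sample output.
        for page_number, text in enumerate(extract_pages(pdf_path), start=1):
            if not text or not text.strip():
                continue  # skip empty pages instead of failing
            for name, sentence in find_projects(text):
                coords = locate(name, sentence)
                yield {
                    "pdf_file": os.path.basename(pdf_path),
                    "page_number": page_number,
                    "project_name": name,
                    "context_sentence": sentence,
                    "coordinates": list(coords) if coords is not None else None,
                }


def write_jsonl(records: Iterable[dict], out_path: str) -> int:
    """Write one JSON object per line and return the number written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the stages as injected callables is a design choice made for this sketch: each stage can be swapped for a stub in tests, while the real modules would plug in the PDF reader, the NER model, and the LLM call.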

✨ Features

  • 📄 Accurate text extraction from noisy PDFs with pdfplumber
  • 🧠 Custom NER model for project detection (trained with Label Studio annotations)
  • 🌍 LLM-powered geolocation from contextual clues
  • 💬 Clear logging, error handling, and testable modules
  • 🔁 Reproducible and modular design for future extension
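To make the Label Studio link concrete, the sketch below converts a Label Studio JSON export into the `(text, {"entities": [...]})` tuples commonly used when preparing spaCy NER training data. It is an assumption-laden illustration rather than the contents of `ner_trainer.py`: the function name `label_studio_to_spacy` is hypothetical, and the task text is assumed to live under `data["text"]`, which depends on the labeling configuration.

```python
from typing import List, Tuple

Example = Tuple[str, dict]


def label_studio_to_spacy(tasks: List[dict], text_key: str = "text") -> List[Example]:
    """Turn Label Studio tasks into (text, {"entities": [(start, end, label)]}).

    Assumes each task looks like:
      {"data": {"text": ...},
       "annotations": [{"result": [{"value": {"start", "end", "labels"}}]}]}
    """
    examples: List[Example] = []
    for task in tasks:
        text = task.get("data", {}).get(text_key)
        if not text:
            continue  # nothing to train on for this task
        spans = set()
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                value = result.get("value", {})
                labels = value.get("labels") or []
                if "start" in value and "end" in value and labels:
                    spans.add((value["start"], value["end"], labels[0]))
        # Duplicate spans are merged here; overlapping spans are NOT resolved
        # and would need handling before being fed to a trainer.
        examples.append((text, {"entities": sorted(spans)}))
    return examples
```

A typical call would load `data/input/annotations.json` with `json.load` and pass the resulting list to this function.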

🚀 Getting Started

Installation

git clone https://github.com/ashkaaar/GeoMine-NER-Geolocation.git
cd GeoMine-NER-Geolocation

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

πŸ“ Project Structure

.
├── data/
│   ├── input/
│   │   ├── annotations.json           # NER labels from Label Studio
│   │   └── pdf_reports/               # Input geological PDFs
│   ├── output/                        # Final extracted results
│   └── temp/                          # Intermediate files (e.g., text dump)
├── models/                            # Trained spaCy NER model
├── src/
│   ├── config.py                      # Path setup and logging
│   ├── pdf_processor.py               # PDF → text
│   ├── ner_trainer.py                 # Train custom NER
│   ├── entity_extractor.py            # Detect entities from text
│   ├── geo_locator.py                 # Geolocation logic (LLM)
│   └── utils.py                       # Helpers and error handling
├── tests/                             # Unit tests (pytest)
├── run_pipeline.sh                    # One-command runner
├── requirements.txt
└── README.md
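As a rough idea of what "path setup and logging" for this layout could look like, here is a minimal sketch using only the standard library. The directory names come from the tree above; the function names `build_paths` and `get_logger`, the dictionary keys, and the log format are assumptions and may differ from the real `config.py`.

```python
import logging
from pathlib import Path
from typing import Dict


def build_paths(root) -> Dict[str, Path]:
    """Map logical names to the locations shown in the project tree."""
    root = Path(root)
    data = root / "data"
    return {
        "annotations": data / "input" / "annotations.json",
        "pdf_reports": data / "input" / "pdf_reports",
        "output": data / "output",
        "temp": data / "temp",
        "models": root / "models",
    }


def get_logger(name: str = "geomine") -> logging.Logger:
    """Return a logger with a single stream handler (safe to call repeatedly)."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```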

Run the Pipeline

bash run_pipeline.sh

Outputs will be saved to:

data/output/final_results.jsonl
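Because the output is JSONL (one JSON object per line), it can be consumed with the standard library alone. The sketch below counts mentions per project; the helper names are hypothetical, and the only field it relies on is `project_name`, which appears in the sample record in the Output Format section.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank line; works on an open file or a list."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def mentions_per_project(records: Iterable[dict]) -> Counter:
    """Count how many mentions were recorded for each project name."""
    return Counter(record["project_name"] for record in records)


def summarize(path: str = "data/output/final_results.jsonl") -> Counter:
    """Read the pipeline output file and return mention counts per project."""
    with open(path, encoding="utf-8") as fh:
        return mentions_per_project(parse_jsonl(fh))
```

After a pipeline run, `summarize().most_common(5)` would list the five most frequently mentioned projects.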

🐳 Docker Setup (Optional, Recommended)

Build Docker Image

Make sure Docker Desktop is installed and running.

From the project root directory, run:

docker build -t geomine-pipeline:latest .

Run the Pipeline in Docker

Run the entire pipeline inside Docker, mounting your local data folder for input/output persistence:

docker run --rm -v "$PWD/data:/app/data" geomine-pipeline:latest
  • --rm removes the container after it finishes.
  • -v "$PWD/data:/app/data" mounts your local data folder into the container.

Debug / Interactive Mode

Open an interactive shell inside the container:

docker run -it --rm -v "$PWD/data:/app/data" geomine-pipeline:latest /bin/bash

Inside the container shell, run the pipeline manually:

./run_pipeline.sh

βš™οΈ Local Setup (Without Docker)

1. Create and activate Python virtual environment

python3 -m venv venv
source venv/bin/activate

2. Install dependencies

pip install -r requirements.txt

3. Run the pipeline

bash run_pipeline.sh

πŸ” Output Format

{
  "pdf_file": "Report_4.pdf",
  "page_number": 3,
  "project_name": "Minyari Dome Project",
  "context_sentence": "Minyari Dome Project is located in the Paterson region of WA.",
  "coordinates": [-24.7393, 133.8807]
}
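A record of this shape can be checked before downstream use. The following sketch parses one record into a small dataclass and validates it. Several points are assumptions rather than documented behavior: that `coordinates` is ordered `[latitude, longitude]`, that it may be `null` when no location was inferred, and that `page_number` is 1-based. `ProjectMention` and `parse_record` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProjectMention:
    pdf_file: str
    page_number: int
    project_name: str
    context_sentence: str
    coordinates: Optional[Tuple[float, float]]  # assumed (latitude, longitude)


def parse_record(raw: dict) -> ProjectMention:
    """Validate one output record and return it as a ProjectMention."""
    page_number = int(raw["page_number"])
    if page_number < 1:
        raise ValueError(f"page_number must be >= 1, got {page_number}")

    coords = raw.get("coordinates")
    if coords is not None:
        if len(coords) != 2:
            raise ValueError(f"expected [lat, lon], got {coords!r}")
        lat, lon = float(coords[0]), float(coords[1])
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            raise ValueError(f"coordinates out of range: {coords!r}")
        coords = (lat, lon)

    return ProjectMention(
        pdf_file=str(raw["pdf_file"]),
        page_number=page_number,
        project_name=str(raw["project_name"]),
        context_sentence=str(raw["context_sentence"]),
        coordinates=coords,
    )
```

A range check like this only catches malformed values (for example, swapped or out-of-range numbers); it cannot tell whether an in-range coordinate is actually the right place.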

🧰 Built With

| Purpose | Tool/Library |
| --- | --- |
| Text extraction | pdfplumber |
| NER & NLP | spaCy |
| Annotation format | Label Studio JSON |
| Geolocation (LLM) | Gemini API (Google AI Studio) |
| Output structure | JSONL |

🧪 Testing

pytest

Tests live in the tests/ directory.


βš™οΈ Implementation Notes

| Area | Details |
| --- | --- |
| NER | Trained using project-level annotations to detect custom "PROJECT" entities |
| Geolocation | Context-aware location prediction via Gemini or fallback rules |
| Data Format | JSONL for line-by-line structured records |
| Fault Tolerance | Graceful handling of empty pages, missing labels, and broken models |
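The "Gemini or fallback rules" behavior can be pictured as follows. This is a sketch under stated assumptions, not the logic in `geo_locator.py`: the LLM call is represented by an injected callable `ask_llm` that returns free text (no Gemini SDK call is shown), the reply is assumed to contain a `lat, lon` pair somewhere in it, and `fallback` stands in for whatever rule-based lookup is used.

```python
import re
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]

# Matches two decimal numbers separated by a comma, e.g. "-21.5, 122.3".
_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def parse_coordinates(reply: Optional[str]) -> Optional[Coordinates]:
    """Return the first plausible (lat, lon) pair found in the reply text."""
    for match in _PAIR.finditer(reply or ""):
        lat, lon = float(match.group(1)), float(match.group(2))
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            return lat, lon
    return None


def locate(
    context: str,
    ask_llm: Callable[[str], str],
    fallback: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Try the LLM first; use the fallback if it fails or is unparsable."""
    try:
        coords = parse_coordinates(ask_llm(context))
    except Exception:
        coords = None  # network errors, quota errors, and so on
    return coords if coords is not None else fallback(context)
```

Catching a broad `Exception` keeps one failed request from aborting a whole batch; a real implementation would probably log the error and catch narrower exception types.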

📦 Deliverables

  • ✅ End-to-end runnable pipeline
  • ✅ Clean, structured JSONL outputs
  • ✅ Modular, testable Python code

💡 Ideas for Improvement

  • Confidence scoring on both project detection and coordinate inference
  • Interactive annotation review tool
  • Gazetteer-backed geolocation fallback
  • Dockerized deployment

👤 Author

Avishkar Dandge
GitHub

πŸ“ License

This project is licensed under the MIT License.

This project extracts mining project names from unstructured PDF reports and estimates their coordinates using NER and geolocation. It is aimed at geological data analysis, natural resource intelligence, and environmental reporting.
