Extract named entities and locations from geological PDFs using spaCy NER and Gemini LLMs. Built for mining industry document parsing.
GeoMine is a modular information extraction pipeline built to process unstructured mining documents, identify project names using Named Entity Recognition (NER), and estimate their geographic locations using large language models. Designed with simplicity, modularity, and clarity in mind, it's suitable for both production and research use cases.
Given a collection of multi-page geological PDF reports, this pipeline:
- Extracts readable text from each page
- Identifies mining project names using a custom-trained spaCy NER model
- Infers coordinates using LLM prompting (Gemini API)
- Outputs a structured JSONL record for each project mention
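The steps above can be sketched as a single loop. This is an illustrative outline only, not the code in `src/`: `run_pipeline`, `find_projects`, `locate`, and `to_jsonl` are hypothetical names, and the NER and LLM steps are passed in as plain callables so the flow can be read (and tested) without spaCy or the Gemini API. The record keys follow the sample output shown in this README.

```python
import json
from typing import Callable, Iterable, Iterator, Optional, Tuple

Coordinates = Optional[Tuple[float, float]]


def run_pipeline(
    pages: Iterable[Tuple[str, int, str]],
    find_projects: Callable[[str], Iterable[Tuple[str, str]]],
    locate: Callable[[str, str], Coordinates],
) -> Iterator[dict]:
    """Yield one record per project mention.

    pages:         (pdf_file, page_number, page_text) triples
    find_projects: hypothetical NER step, text -> (project_name, sentence) pairs
    locate:        hypothetical LLM step, (name, sentence) -> (lat, lon) or None
    """
    for pdf_file, page_number, text in pages:
        if not text.strip():
            continue  # nothing to extract from an empty page
        for name, sentence in find_projects(text):
            coords = locate(name, sentence)
            yield {
                "pdf_file": pdf_file,
                "page_number": page_number,
                "project_name": name,
                "context_sentence": sentence,
                "coordinates": list(coords) if coords else None,
            }


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Passing the two model-backed steps as arguments keeps the loop independent of any particular model, which is one way to get the testable modules this README aims for.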
- Accurate text extraction from noisy PDFs with pdfplumber
- Custom NER model for project detection (trained with Label Studio annotations)
- LLM-powered geolocation from contextual clues
- Clear logging, error handling, and testable modules
- Reproducible and modular design for future extension
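As a rough illustration of the text extraction feature, the sketch below reads a PDF page by page with pdfplumber and normalizes the result. It is not the contents of `src/pdf_processor.py`; `clean_page_text` and `extract_pages` are hypothetical names, and rejoining words hyphenated across line breaks is an assumed cleanup step that can also merge genuinely hyphenated compounds.

```python
import re
from typing import Iterator, Optional, Tuple


def clean_page_text(raw: Optional[str]) -> str:
    """Normalize extracted text; pages with no text layer become ""."""
    if not raw:
        return ""
    # Rejoin words hyphenated across line breaks, then collapse whitespace.
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def extract_pages(pdf_path: str) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) for every page, numbering from 1."""
    import pdfplumber  # third-party; imported lazily so clean_page_text has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            yield number, clean_page_text(page.extract_text())
```

Scanned pages without a text layer yield an empty string here, so downstream steps can skip them rather than fail.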
git clone https://github.com/ashkaaar/GeoMine-NER-Geolocation.git
cd GeoMine-NER-Geolocation
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
├── data/
│   ├── input/
│   │   ├── annotations.json     # NER labels from Label Studio
│   │   └── pdf_reports/         # Input geological PDFs
│   ├── output/                  # Final extracted results
│   └── temp/                    # Intermediate files (e.g., text dump)
├── models/                      # Trained spaCy NER model
├── src/
│   ├── config.py                # Path setup and logging
│   ├── pdf_processor.py         # PDF → text
│   ├── ner_trainer.py           # Train custom NER
│   ├── entity_extractor.py      # Detect entities from text
│   ├── geo_locator.py           # Geolocation logic (LLM)
│   └── utils.py                 # Helpers and error handling
├── tests/                       # Unit tests (pytest)
├── run_pipeline.sh              # One-command runner
├── requirements.txt
└── README.md
bash run_pipeline.sh

Outputs will be saved to:
data/output/final_results.jsonl
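
Because the output is JSONL, it can be consumed one line at a time. The following is a minimal sketch for loading the results, assuming each line is a JSON object with a `coordinates` key as in the sample record in this README; `read_results` and `located` are hypothetical helper names, not part of the repository.

```python
import json
from pathlib import Path
from typing import Iterator


def read_results(path: str) -> Iterator[dict]:
    """Yield one dict per non-blank line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def located(records):
    """Keep only records that carry a [lat, lon] pair."""
    return [r for r in records if r.get("coordinates")]
```

For example, `located(read_results("data/output/final_results.jsonl"))` would return only the mentions for which a location was inferred.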
Make sure Docker Desktop is installed and running.
From the project root directory, run:
docker build -t geomine-pipeline:latest .

Run the entire pipeline inside Docker, mounting your local data folder for input/output persistence:
docker run --rm -v "$PWD/data:/app/data" geomine-pipeline:latest

- `--rm` removes the container after it finishes.
- `-v "$PWD/data:/app/data"` mounts your local `data` folder into the container.
Open an interactive shell inside the container:
docker run -it --rm -v "$PWD/data:/app/data" geomine-pipeline:latest /bin/bash

Inside the container shell, run the pipeline manually:
./run_pipeline.sh

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
bash run_pipeline.sh

{
  "pdf_file": "Report_4.pdf",
  "page_number": 3,
  "project_name": "Minyari Dome Project",
  "context_sentence": "Minyari Dome Project is located in the Paterson region of WA.",
  "coordinates": [-24.7393, 133.8807]
}

| Purpose | Tool/Library |
|---|---|
| Text extraction | pdfplumber |
| NER & NLP | spaCy |
| Annotation format | Label Studio JSON |
| Geolocation (LLM) | Gemini API (Google AI Studio) |
| Output structure | JSONL |
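
To show how the Label Studio JSON annotations could feed spaCy training, here is a hedged sketch that turns exported tasks into character-offset spans. It assumes the common Label Studio export layout (a list of tasks, each with `data`, `annotations`, `result`, and a `value` holding `start`, `end`, and `labels`) and that the source text is stored under the `text` key; both depend on the labeling configuration. `label_studio_to_spans` is a hypothetical name and not the logic in `src/ner_trainer.py`.

```python
from typing import List, Tuple

Example = Tuple[str, List[Tuple[int, int, str]]]


def label_studio_to_spans(tasks: list, text_key: str = "text") -> List[Example]:
    """Convert Label Studio tasks to (text, [(start, end, label), ...]) pairs."""
    examples = []
    for task in tasks:
        text = task.get("data", {}).get(text_key)
        if not text:
            continue  # skip tasks without source text
        spans = []
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                value = result.get("value", {})
                labels = value.get("labels") or []
                if "start" in value and "end" in value and labels:
                    spans.append((value["start"], value["end"], labels[0]))
        # Deduplicate and order spans so repeated annotations do not double count.
        examples.append((text, sorted(set(spans))))
    return examples
```

The resulting pairs use the character-offset form that spaCy accepts when building training examples, so they could be passed on after loading `data/input/annotations.json` with `json.load`.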
pytest

Tests live in the tests/ directory.
| Area | Details |
|---|---|
| NER | Trained using project-level annotations to detect custom "PROJECT" entities |
| Geolocation | Context-aware location prediction via Gemini or fallback rules |
| Data Format | JSONL for line-by-line structured records |
| Fault Tolerance | Graceful handling of empty pages, missing labels, and broken models |
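
One way the geolocation step could tolerate malformed LLM replies is to parse and range-check the coordinates before accepting them. The sketch below is an assumption about how that might look, not the code in `src/geo_locator.py`: `parse_coordinates` and `locate_with_fallback` are hypothetical names, and the regex takes the first comma-separated number pair in valid latitude/longitude range, so unrelated number pairs in a reply can produce false positives.

```python
import re
from typing import Optional, Tuple

_NUMBER = r"[-+]?\d+(?:\.\d+)?"
_PAIR = re.compile(rf"({_NUMBER})\s*,\s*({_NUMBER})")


def parse_coordinates(reply: str) -> Optional[Tuple[float, float]]:
    """Return the first valid (lat, lon) pair found in free-form text, else None."""
    for match in _PAIR.finditer(reply or ""):
        lat, lon = float(match.group(1)), float(match.group(2))
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            return lat, lon
    return None


def locate_with_fallback(reply, fallback=None):
    """Use the parsed pair if there is one, otherwise the caller's fallback."""
    return parse_coordinates(reply) or fallback
```

Returning `None` instead of raising lets the caller decide whether to apply a fallback rule or write a record without coordinates.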
- ✅ End-to-end runnable pipeline
- ✅ Clean, structured JSONL outputs
- ✅ Modular, testable Python code
- Confidence scoring on both project detection and coordinate inference
- Interactive annotation review tool
- Gazetteer-backed geolocation fallback
- Dockerized deployment
Avishkar Dandge
GitHub
This project is licensed under the MIT License.
This project extracts mining project names and estimates their coordinates from unstructured PDF reports using NER and LLM-based geolocation. It is suited to geological data analysis, natural resource intelligence, and environmental reporting.