Google Maps Scraper

A professional-grade Google Maps scraper that extracts business metadata from list-view results. Features bandwidth optimization for free proxy tiers, job resume capability, deduplication, and a real-time monitoring dashboard.

Features

  • Category 1 Extraction - Scrapes list-view data without clicking into individual listings
  • Bandwidth Optimized - Blocks images, fonts, CSS, and media for minimal data usage (perfect for WebShare.io's 1GB free tier)
  • Job Resume - Interrupt and resume scrapes at any time
  • Multi-Query Batching - Run multiple search queries in a single session
  • Cross-Scrape Deduplication - Exclude businesses from previous scrapes
  • Rich Terminal UI - Real-time progress with live business feed monitor
  • SQLite Storage - WAL mode for concurrent read/write with MD5 deduplication
  • Parallel Scraping - Multi-process scraping for faster results
  • Website Analysis - Analyze business websites for quality scoring
  • Health Scoring - Composite scoring system for lead qualification
  • Report Generation - Auto-generate HTML/JSON opportunity reports

Quickstart

# 1. Clone the repository
git clone https://github.com/MichaelTheMay/gMapsFullPurposeScraper.git
cd gMapsFullPurposeScraper

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt
playwright install chromium

# 4. Seed the cities database
python setup.py

# 5. Run the interactive wizard (recommended)
python main.py --wizard

# Or run directly:
python main.py --query "Pet Cremation" --preset top_10 --no-proxy

Installation

Prerequisites

  • Python 3.9+
  • pip

Steps

  1. Install Python dependencies:

    pip install -r requirements.txt
  2. Install Playwright browsers:

    playwright install chromium
  3. Seed city database:

    python setup.py
  4. Configure proxy (optional): Edit config.yaml and set your proxy credentials:

    proxy:
      enabled: true
      server: "http://user:pass@proxy.webshare.io:8080"
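
The scraper hands these credentials to Playwright at launch. As a rough illustration of that wiring (the config parsing below is a sketch, not the actual code in scraper.py):

# Sketch: split a config.yaml proxy URL into the pieces Playwright expects.
# Illustrative only; scraper.py may structure this differently.
from urllib.parse import urlparse

import yaml
from playwright.sync_api import sync_playwright

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

url = urlparse(cfg["proxy"]["server"])  # http://user:pass@proxy.webshare.io:8080

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        "server": f"{url.scheme}://{url.hostname}:{url.port}",
        "username": url.username or "",
        "password": url.password or "",
    })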

Usage

Interactive Wizard

The wizard guides you through configuration:

python main.py --wizard

Basic Scraping

# Single query with city preset
python main.py --query "Pet Cremation" --preset top_100

# Multiple queries (semicolon-separated)
python main.py --queries "Pet Cremation; Pet Cemetery; Animal Hospital" --preset top_10

# Custom cities
python main.py --query "Veterinarian" --custom "Dallas, TX; Miami, FL; Austin, TX"

# With live monitor dashboard
python main.py --query "Pet Cremation" --preset top_10 --live

# Without proxy (for testing)
python main.py --query "Pet Cremation" --preset top_10 --no-proxy

Job Management

# List resumable jobs
python main.py --list-jobs

# Resume an interrupted job
python main.py --resume <JOB_ID>
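
Resuming works because a job only needs to know which cities are already done. A minimal sketch of that idea (the job_progress table and its columns are assumptions, not the actual schema in job_manager.py):

import sqlite3

def remaining_cities(db_path, job_id, all_cities):
    # Cities already completed for this job; everything else is still pending.
    con = sqlite3.connect(db_path)
    done = {row[0] for row in con.execute(
        "SELECT city FROM job_progress WHERE job_id = ?", (job_id,))}
    con.close()
    return [c for c in all_cities if c not in done]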

Export Data

# List all result tables
python main.py --list-tables

# Export specific table to CSV
python main.py --export results_PetCremation_top100_20251215

# Export all tables as JSON
python main.py --export-all --format json

Search Businesses

# Search all fields
python main.py --search "cremation"

# Search by specific field
python main.py --search-name "Pet Heaven"
python main.py --search-phone "555-1234"
python main.py --search-city "Dallas"

# Export search results
python main.py --search "dallas" --export-search results

Cross-Scrape Deduplication

Exclude businesses already found in previous scrapes:

python main.py --query "Pet Cremation" --preset top_100 --exclude-tables "results_Pet*"
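
Conceptually, the glob is expanded against existing result tables, and their dedup hashes are preloaded as a skip list for the new scrape. A rough sketch of that matching (the hash column name is an assumption):

# Illustrative: expand an --exclude-tables glob and collect known hashes.
import fnmatch
import sqlite3

con = sqlite3.connect("maps_data.db")
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

seen = set()
for table in fnmatch.filter(tables, "results_Pet*"):
    seen.update(row[0] for row in con.execute(f"SELECT hash FROM {table}"))
con.close()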

City Presets

Preset     Cities   Description
top_10     10       Largest US metros
top_100    100      Major cities
top_1000   1,000    Medium+ cities
top_2500   2,500    All tracked cities

Architecture

main.py          CLI entry point and Rich UI orchestration
wizard.py        Interactive CLI wizard for guided setup
scraper.py       Playwright automation and listing extraction
db.py            SQLite database with WAL mode
job_manager.py   Job persistence and resume capability
monitor.py       Live dashboard with sparklines and business feed
setup.py         City database seeding
config.yaml      Configuration file

Data Flow

  1. CLI parses arguments and loads config
  2. DatabaseManager creates dynamic result table
  3. Scraper iterates cities and extracts listings
  4. Deduplication via an MD5 hash of (name + phone); see the sketch below
  5. Job state persisted for resume capability
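
The hash in step 4 is cheap to reproduce, which is what makes both in-run and cross-scrape deduplication possible. A minimal sketch (any normalization of name and phone before hashing is an assumption):

import hashlib

def business_hash(name, phone):
    # Same name + phone always yields the same key, so repeat listings collide.
    return hashlib.md5(f"{name}{phone}".encode("utf-8")).hexdigest()

business_hash("Pet Heaven", "555-1234")  # stable across runs and queries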

Configuration

Key settings in config.yaml:

Setting                Description
proxy.server           Proxy URL with embedded credentials
scrape.scroll_limit    Scroll passes per city (each pass loads roughly 10-20 results)
bandwidth.budget_mb    Bandwidth budget in MB; scraping auto-pauses at this threshold
optimization.block_*   Per-resource-type toggles for blocking (images, fonts, CSS, media)
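
Put together, a config.yaml covering these settings might look like this (values, and any key names beyond the table above, are illustrative rather than the shipped defaults):

proxy:
  enabled: true
  server: "http://user:pass@proxy.webshare.io:8080"

scrape:
  scroll_limit: 15        # each scroll pass loads roughly 10-20 results

bandwidth:
  budget_mb: 900          # auto-pause before exhausting a 1GB proxy tier

optimization:
  block_images: true
  block_fonts: true
  block_css: true
  block_media: true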

Bandwidth Optimization

For WebShare.io's 1GB free tier, the default settings block images, fonts, CSS, and media to minimize bandwidth usage. The scraper auto-pauses when approaching the budget limit.
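
In Playwright terms, this kind of blocking is a request-routing filter. A minimal sketch of the technique (the exact filter in scraper.py may differ):

from playwright.sync_api import sync_playwright

BLOCKED = {"image", "font", "stylesheet", "media"}

def block_heavy_resources(route):
    # Abort requests for heavy assets; let HTML and XHR through untouched.
    if route.request.resource_type in BLOCKED:
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", block_heavy_resources)
    page.goto("https://www.google.com/maps")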

Output

Results are stored in SQLite (maps_data.db) with tables named:

results_{query}_{preset}_{timestamp}

Export formats: CSV (default), JSON, Excel

License

MIT
