
Universal Web Scraper

Intelligent • High-Performance • Production-Ready

A high-performance, production-ready Python web scraping framework. It supports static and dynamic websites, intelligent content extraction with fallback, async scraping, anti-bot handling, multi-format export (CSV, JSON, Excel, XML, SQLite), and resumable sessions.


📋 Table of Contents

  • Features
  • Quick Start
  • Usage Examples
  • Command Options
  • Configuration
  • How to Find Selectors Easily
  • Alternative: Automatic Extraction
  • Testing
  • Architecture
  • Troubleshooting
  • Version History
  • License & Support


Features

  • Smart Extraction: Automatically detects content structure with intelligent fallback
  • Async Performance: High-speed concurrent scraping with configurable concurrency
  • Multi-Format Export: CSV, JSON, Excel, XML, SQLite support
  • State Management: Resume interrupted scraping sessions
  • Middleware System: Request/response processing pipeline
  • Pagination Support: Automatic page traversal
  • Anti-Bot Handling: Stealth mechanisms and user-agent rotation
  • Comprehensive Logging: Debug extraction issues easily

Quick Start

Installation

  1. Install Python dependencies:

    pip install -r requirements.txt
  2. Install Playwright browsers:

    playwright install

Basic Usage

# Simple scraping
python main.py "https://quotes.toscrape.com"

# With configuration
python main.py "https://quotes.toscrape.com" --config quotes_toscrape_com

# Async mode with concurrency
python main.py "https://example.com" --async --concurrency 5

# Export to specific formats
python main.py "https://example.com" --export-format csv,json,excel

Usage Examples

E-commerce Scraping

# Extract products from a test site
python main.py "http://books.toscrape.com" --config books_toscrape --verbose

News & Blogs

# Extract quotes and authors
python main.py "https://quotes.toscrape.com" --config quotes_toscrape_com --export-format json

Data Tables

# Extract global COVID statistics
python main.py "https://www.worldometers.info/coronavirus/" --config covid_worldometers --export-format csv

Dynamic Content (JavaScript Sites)

# Scrape JavaScript-rendered content
python main.py "https://quotes.toscrape.com/js/" --headful --verbose

Pagination

# Scrape multiple pages with async
python main.py "http://books.toscrape.com" --config books_toscrape --async --concurrency 3

Command Options

Required

Option   Description
url      Target URL to scrape

Scraper Options

Option     Description
--config   Configuration key from selectors.json
--async    Enable high-speed async scraping

Extraction Options

Option        Description
--container   CSS selector for the item container
--fields      Field mapping (e.g. title:h1,price:.price)
--max-items   Maximum number of items to scrape
--max-pages   Maximum number of pages to scrape

Performance Options

Option          Description
--concurrency   Number of parallel requests (default: 5)
--rate-limit    Requests per second (default: 1.0)
--no-state      Disable state management

Output Options

Option            Description
--output-name     Output filename (default: results)
--export-format   Export formats: csv,json,excel,xml,sqlite

Debug Options

Option         Description
--headful      Run the browser in headful (visible) mode
--screenshot   Take a screenshot of the page
--verbose      Enable verbose logging

Configuration

Using Selectors Config

Create or edit config/selectors.json:

{
    "sites": [
        {
            "domain": "example.com",
            "name": "example",
            "container": ".product",
            "selectors": {
                "title": "h2.title",
                "price": ".price",
                "description": ".desc"
            },
            "pagination": {
                "type": "next_button",
                "selector": "a.next"
            }
        }
    ]
}
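
To make the pagination block concrete, here is a minimal sketch of how a "next_button" rule like the one above could be followed with Playwright's sync API. The scrape_with_pagination function and the inline CONFIG dict are illustrative assumptions, not the project's internal API.

# Hypothetical sketch (not the project's code): follow "next_button"
# pagination as configured in selectors.json.
from playwright.sync_api import sync_playwright

CONFIG = {
    "container": ".product",
    "selectors": {"title": "h2.title", "price": ".price"},
    "pagination": {"type": "next_button", "selector": "a.next"},
}

def scrape_with_pagination(url, config, max_pages=3):
    items = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        for _ in range(max_pages):
            # One record per container element on the current page.
            for container in page.locator(config["container"]).all():
                record = {}
                for field, selector in config["selectors"].items():
                    node = container.locator(selector)
                    record[field] = node.first.inner_text() if node.count() else None
                items.append(record)
            # Follow the configured "next" link; stop when it disappears.
            next_link = page.locator(config["pagination"]["selector"])
            if not next_link.count():
                break
            next_link.first.click()
            page.wait_for_load_state()
        browser.close()
    return items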

CLI Override

python main.py "https://example.com" --container ".product" --fields "title:h2,price:.price"
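
The --fields syntax maps field names to CSS selectors. A minimal sketch of how such a string can be parsed (parse_fields is a hypothetical helper, not the project's actual code):

# Hypothetical sketch: turn a --fields string such as "title:h2,price:.price"
# into a {field: selector} dict, mirroring the documented CLI syntax.
from typing import Dict

def parse_fields(spec: str) -> Dict[str, str]:
    fields = {}
    for pair in spec.split(","):
        # Split on the first ":" only, so selectors like ".price:nth-child(2)" survive.
        name, _, selector = pair.partition(":")
        if not name or not selector:
            raise ValueError(f"Invalid field mapping: {pair!r}")
        fields[name.strip()] = selector.strip()
    return fields

print(parse_fields("title:h2,price:.price"))
# {'title': 'h2', 'price': '.price'}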

πŸ› οΈ How to Find Selectors Easily

Don't guess CSS selectors! Use our built-in helper tool to generate them automatically.

1. Run the Selector Finder

python tools/find_selectors.py "https://example.com"

2. Follow Instructions

The tool will open two windows (Browser and Inspector). Ignore the big block of Python code (imports, def run, etc.). Look inside the run() function for lines like:

# Look for these lines!
page.get_by_text("Quotes to Scrape").click()
page.locator(".author").first.click()

3. Copy Selectors

Copy the selector part from those lines. Playwright often gives you "smart" chains.

  • Simple: page.locator(".author") -> Copy .author
  • Smart: page.get_by_role("button", name="Login") -> Copy button >> text=Login or use the specific smart selector syntax if your config supports it.

Tip: Usually you just want the string inside the quotes if it's a CSS selector.

💡 Tip for Lists

The code generator gives you a selector for the single specific item you clicked (e.g., "The world as we created..."). To scrape all items (e.g., all quotes):

  1. Don't use the text-based selector.
  2. Use the Pick Locator tool (cursor icon) in the inspector.
  3. Look for a shared class name (e.g., .quote or .product_pod).
  4. Use that class as your container selector.

Copy the generated code into your config/selectors.json.
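
Putting the tip into practice, a standalone Playwright sketch (not the project's code) that uses the shared .quote container class from quotes.toscrape.com to extract every item on the page:

# The shared ".quote" class found with Pick Locator becomes the container;
# fields are then read relative to each container element.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    for quote in page.locator(".quote").all():
        print({
            "text": quote.locator(".text").inner_text(),
            "author": quote.locator(".author").inner_text(),
        })
    browser.close()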

⚡ Alternative: Automatic Extraction

Don't want to deal with selectors? You can skip the configuration entirely!

Simply run:

python main.py "https://example.com" --verbose

The scraper's Smart Extraction engine will automatically detect and extract content. Using --verbose is recommended to see exactly what data is being found.
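
For intuition, here is a toy illustration of the kind of heuristic a smart-extraction engine can use: pick the element whose direct children most often repeat a class, and treat that class as the item container. This is an assumption for illustration only, not the project's actual algorithm.

# Toy heuristic (NOT the project's algorithm): guess the repeating item
# container by finding the parent whose children most often share a class.
from collections import Counter
from bs4 import BeautifulSoup
import requests

html = requests.get("https://quotes.toscrape.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

best_selector, best_count = None, 0
for parent in soup.find_all(True):
    # Count the most common class among this element's direct children.
    classes = Counter(
        cls for child in parent.find_all(True, recursive=False)
        for cls in child.get("class", [])
    )
    if classes:
        cls, count = classes.most_common(1)[0]
        if count > best_count:
            best_selector, best_count = f".{cls}", count

print(f"Guessed container: {best_selector} ({best_count} repeats)")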


Testing

Quick Tests

Test 1: Container-based extraction

python main.py "https://quotes.toscrape.com" --config quotes_toscrape_com --verbose

Expected: Multiple quotes with text, author, and tags

Test 2: Static content

python main.py "http://books.toscrape.com" --config books_toscrape --verbose

Expected: Multiple books with title, price, availability

Test 3: Fallback mechanism

python main.py "https://quotes.toscrape.com" --verbose

Expected: Falls back to smart extraction and extracts page content

Test 4: WebScraper.io test site

python main.py "https://webscraper.io/test-sites/e-commerce/allinone" --config webscraper_io --verbose

Expected: Product listings with name, price, description

Debugging Tips

  • Add --verbose to see detailed extraction logs
  • Check logs for:
    • "Found X containers matching..."
    • "Successfully extracted X items using configured selectors"
    • "Falling back to smart extraction" (if selectors fail)
  • Use --headful to see the browser in action
  • Use --screenshot to capture page state

Architecture

Project Structure

web-scraper/
├── config/
│   ├── default_config.json
│   ├── selectors.json
│   └── selectors_example.json
├── scraper/
│   ├── universal_scraper.py  # Main scraper with smart extraction
│   ├── async_scraper.py      # High-performance async scraper
│   ├── config_loader.py      # Configuration management
│   ├── exporters.py          # Multi-format export
│   ├── middleware.py         # Request/response processing
│   ├── state_manager.py      # Resumable scraping
│   ├── pipelines.py          # Data processing
│   └── utils.py              # Utility functions
├── tests/                    # Comprehensive test suite
├── main.py                   # Main entry point
├── requirements.txt          # Python dependencies
└── README.md                 # This file
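
The state_manager.py module above is what makes sessions resumable. A minimal sketch of the underlying idea, persisting processed URLs so a restarted run can skip them (the StateManager class and file name here are assumed, not taken from the code):

# Hypothetical sketch of resumable scraping: persist the set of processed
# URLs so an interrupted run can pick up where it left off.
import json
from pathlib import Path

class StateManager:
    def __init__(self, path="scrape_state.json"):
        self.path = Path(path)
        self.done = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    def is_done(self, url):
        return url in self.done

    def mark_done(self, url):
        self.done.add(url)
        self.path.write_text(json.dumps(sorted(self.done)))

state = StateManager()
for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    if state.is_done(url):
        continue  # already scraped in a previous session
    # ... scrape url ...
    state.mark_done(url)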

Key Components

  • Middlewares: User-Agent rotation, proxy management, request retries
  • Pipelines: Data cleaning, deduplication, schema validation
  • State Manager: Tracks processed URLs for resumable scraping
  • Exporters: Multi-format output (CSV, JSON, Excel, XML, SQLite)

Troubleshooting

Browser doesn't open

Ensure Playwright browsers are installed:

playwright install

No items extracted

  1. Check selectors in config file
  2. Use --verbose to see extraction logs
  3. Try without config to test fallback extraction
  4. Use --headful to visually inspect the page

Import errors

Reinstall dependencies:

pip install -r requirements.txt --force-reinstall

Slow performance

  1. Use --async mode for faster scraping
  2. Increase --concurrency (default: 5)
  3. Adjust --rate-limit if needed (see the sketch below)
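
For intuition on how --async, --concurrency, and --rate-limit interact, here is a minimal sketch of concurrency-limited async fetching (assuming httpx; the real async_scraper.py may differ):

# Hypothetical sketch: bounded concurrency plus crude per-task pacing.
import asyncio
import httpx

async def fetch_all(urls, concurrency=5, rate_limit=1.0):
    semaphore = asyncio.Semaphore(concurrency)   # --concurrency
    delay = 1.0 / rate_limit                     # --rate-limit (requests/second)

    async with httpx.AsyncClient(timeout=10) as client:
        async def fetch(url):
            async with semaphore:
                response = await client.get(url)
                await asyncio.sleep(delay)  # simple pacing between requests
                return response.text

        return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(fetch_all([f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 4)]))
print(len(pages), "pages fetched")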

Version History

v1.0.0 (2026-01-12) - Production Release

Status: ✅ Production Ready

Key Features:

  • ✅ Universal extraction engine with smart fallback
  • ✅ Playwright-based dynamic content handling
  • ✅ Async scraping with concurrency control
  • ✅ Multiple export formats (CSV, JSON, Excel, XML, SQLite)
  • ✅ State management for resumable scraping
  • ✅ Middleware system for request/response processing
  • ✅ Configurable selectors via JSON
  • ✅ Pagination support
  • ✅ Anti-bot detection handling
  • ✅ Comprehensive logging

Improvements:

  • ✅ Fixed extraction quality - proper fallback when selectors fail
  • ✅ Enhanced logging for debugging extraction issues
  • ✅ Comprehensive documentation
  • ✅ Production-ready codebase

Requirements:

  • Python 3.8+
  • See requirements.txt for dependencies

License & Support

Current Version: 1.0.0
Release Date: 2026-01-12
Status: Production Ready

For issues or questions, refer to the documentation sections above.


Happy Scraping!
