Intelligent • High-Performance • Production-Ready
A powerful web scraping tool with smart extraction, async performance, and multi-format export capabilities.
## Table of Contents

- Features
- Quick Start
- Usage Examples
- Command Options
- Configuration
- Testing
- Architecture
- Troubleshooting
- Version History
## Features

- **Smart Extraction**: Automatically detects content structure, with an intelligent fallback
- **Async Performance**: High-speed concurrent scraping with configurable concurrency
- **Multi-Format Export**: CSV, JSON, Excel, XML, and SQLite support
- **State Management**: Resume interrupted scraping sessions
- **Middleware System**: Request/response processing pipeline
- **Pagination Support**: Automatic page traversal
- **Anti-Bot Handling**: Stealth mechanisms and user-agent rotation
- **Comprehensive Logging**: Debug extraction issues easily
## Quick Start

1. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Install Playwright browsers:

   ```bash
   playwright install
   ```
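To confirm the installation works end to end, here is a quick standalone check (a minimal sketch, separate from the scraper itself):

```python
# Standalone sanity check: launches Chromium via Playwright and loads a page.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    print(page.title())  # should print "Quotes to Scrape"
    browser.close()
```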
```bash
# Simple scraping
python main.py "https://quotes.toscrape.com"

# With configuration
python main.py "https://quotes.toscrape.com" --config quotes_toscrape_com

# Async mode with concurrency
python main.py "https://example.com" --async --concurrency 5

# Export to specific formats
python main.py "https://example.com" --export-format csv,json,excel
```

## Usage Examples

```bash
# Extract products from a test site
python main.py "http://books.toscrape.com" --config books_toscrape --verbose

# Extract quotes and authors
python main.py "https://quotes.toscrape.com" --config quotes_toscrape_com --export-format json

# Extract global COVID statistics
python main.py "https://www.worldometers.info/coronavirus/" --config covid_worldometers --export-format csv

# Scrape JavaScript-rendered content
python main.py "https://quotes.toscrape.com/js/" --headful --verbose

# Scrape multiple pages with async
python main.py "http://books.toscrape.com" --config books_toscrape --async --concurrency 3
```

## Command Options

| Option | Description |
|---|---|
| `url` | Target URL to scrape |
| `--config` | Configuration key from `selectors.json` |
| `--async` | Enable high-speed async scraping |
| `--container` | CSS selector for the item container |
| `--fields` | Field mapping (e.g. `title:h1,price:.price`) |
| `--max-items` | Maximum number of items to scrape |
| `--max-pages` | Maximum number of pages to scrape |
| `--concurrency` | Number of parallel requests (default: 5) |
| `--rate-limit` | Requests per second (default: 1.0) |
| `--no-state` | Disable state management |
| `--output-name` | Output filename (default: `results`) |
| `--export-format` | Export formats: `csv,json,excel,xml,sqlite` |
| `--headful` | Run the browser in headful (visible) mode |
| `--screenshot` | Take a screenshot of the page |
| `--verbose` | Verbose logging |
## Configuration

Create or edit `config/selectors.json`:

```json
{
  "sites": [
    {
      "domain": "example.com",
      "name": "example",
      "container": ".product",
      "selectors": {
        "title": "h2.title",
        "price": ".price",
        "description": ".desc"
      },
      "pagination": {
        "type": "next_button",
        "selector": "a.next"
      }
    }
  ]
}
```

Alternatively, pass selectors directly on the command line without a config file:

```bash
python main.py "https://example.com" --container ".product" --fields "title:h2,price:.price"
```
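To see how such an entry maps to extraction, here is a minimal, self-contained sketch against quotes.toscrape.com (the `.quote`/`.text`/`.author`/`li.next a` selectors are real for that site; the loop structure is illustrative, not the scraper's actual code in `scraper/universal_scraper.py`):

```python
# Illustrative sketch only: how a container/selectors/pagination entry
# drives extraction. The real logic (fallbacks, middleware, exporters)
# lives in scraper/; the keys below mirror the config format.
from playwright.sync_api import sync_playwright

site = {
    "container": ".quote",
    "selectors": {"text": ".text", "author": ".author"},
    "pagination": {"type": "next_button", "selector": "li.next a"},
}

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    items = []
    for _ in range(2):  # cf. --max-pages
        for el in page.locator(site["container"]).all():
            items.append({field: el.locator(sel).first.inner_text()
                          for field, sel in site["selectors"].items()})
        next_link = page.locator(site["pagination"]["selector"])
        if next_link.count() == 0:
            break  # no next button: last page reached
        next_link.first.click()
        page.wait_for_load_state()
    browser.close()

print(f"{len(items)} items, e.g. {items[0]['author']}")
```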
Don't guess CSS selectors! Use our built-in helper tool to generate them automatically:

```bash
python tools/find_selectors.py "https://example.com"
```

The tool will open two windows (Browser and Inspector).
Ignore the big block of Python code (imports, `def run`, etc.). Look inside the `run()` function for lines like:

```python
# Look for these lines!
page.get_by_text("Quotes to Scrape").click()
page.locator(".author").first.click()
```

Copy the selector part from those lines. Playwright often gives you "smart" chains.
- Simple: `page.locator(".author")` -> copy `.author`
- Smart: `page.get_by_role("button", name="Login")` -> copy `button >> text=Login`, or use the specific smart-selector syntax if your config supports it

Tip: Usually you just want the string inside the quotes if it's a CSS selector.
The code generator gives you a selector for the single specific item you clicked (e.g., "The world as we created..."). To scrape all items (e.g., all quotes):

- Don't use the text-based selector.
- Use the Pick Locator tool (the cursor icon) in the Inspector.
- Look for a shared class name (e.g., `.quote` or `.product_pod`).
- Use that class as your container selector.

Copy the resulting selectors into your `config/selectors.json`.
Don't want to deal with selectors? You can skip the configuration entirely! Simply run:

```bash
python main.py "https://example.com" --verbose
```

The scraper's Smart Extraction engine will automatically detect and extract content. Using `--verbose` is recommended so you can see exactly what data is being found.
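The engine's actual heuristics live in `scraper/universal_scraper.py`. As a rough, self-contained illustration of the general idea (repeated elements sharing a class usually mark the records on a page), not the scraper's real algorithm:

```python
# Rough illustration of a smart-extraction heuristic: find the CSS class
# that repeats most often on the page and treat it as the item container.
# This is NOT the scraper's actual algorithm, just the general idea.
from collections import Counter
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://quotes.toscrape.com")
    # Collect every class carried by every element on the page.
    classes = page.eval_on_selector_all(
        "[class]", "els => els.flatMap(e => [...e.classList])")
    best, n = Counter(classes).most_common(1)[0]
    print(f"Most repeated class: .{best} ({n} occurrences)")
    for el in page.locator(f".{best}").all()[:3]:
        print(el.inner_text()[:60])  # preview the first few candidates
```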
## Testing

Test 1: Container-based extraction

```bash
python main.py "https://quotes.toscrape.com" --config quotes_toscrape_com --verbose
```

Expected: Multiple quotes with text, author, and tags

Test 2: Static content

```bash
python main.py "http://books.toscrape.com" --config books_toscrape --verbose
```

Expected: Multiple books with title, price, availability

Test 3: Fallback mechanism

```bash
python main.py "https://quotes.toscrape.com" --verbose
```

Expected: Falls back to smart extraction and extracts page content

Test 4: WebScraper.io test site

```bash
python main.py "https://webscraper.io/test-sites/e-commerce/allinone" --config webscraper_io --verbose
```

Expected: Product listings with name, price, description
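The repository's own suite lives in `tests/`. If you want to automate the checks above, a hypothetical pytest smoke test along these lines can work; it assumes `--output-name results` with JSON export writes `results.json` to the repository root, which you should verify before relying on it:

```python
# Hypothetical smoke test; adjust paths and flags to match the real CLI behavior.
import json
import subprocess
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent  # assumes this file sits in tests/

def test_quotes_smoke():
    subprocess.run(
        [sys.executable, "main.py", "https://quotes.toscrape.com",
         "--config", "quotes_toscrape_com",
         "--export-format", "json", "--output-name", "results"],
        cwd=REPO_ROOT,
        check=True,
    )
    # Assumed output location; confirm where your exporter actually writes.
    data = json.loads((REPO_ROOT / "results.json").read_text())
    assert data, "expected at least one extracted quote"
```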
Debugging tips:

- Add `--verbose` to see detailed extraction logs
- Check the logs for:
  - "Found X containers matching..."
  - "Successfully extracted X items using configured selectors"
  - "Falling back to smart extraction" (if selectors fail)
- Use `--headful` to see the browser in action
- Use `--screenshot` to capture page state
## Architecture

```
web-scraper/
├── config/
│   ├── default_config.json
│   ├── selectors.json
│   └── selectors_example.json
├── scraper/
│   ├── universal_scraper.py   # Main scraper with smart extraction
│   ├── async_scraper.py       # High-performance async scraper
│   ├── config_loader.py       # Configuration management
│   ├── exporters.py           # Multi-format export
│   ├── middleware.py          # Request/response processing
│   ├── state_manager.py       # Resumable scraping
│   ├── pipelines.py           # Data processing
│   └── utils.py               # Utility functions
├── tests/                     # Comprehensive test suite
├── main.py                    # Main entry point
├── requirements.txt           # Python dependencies
└── README.md                  # This file
```
- **Middlewares**: User-Agent rotation, proxy management, request retries
- **Pipelines**: Data cleaning, deduplication, schema validation
- **State Manager**: Tracks processed URLs for resumable scraping
- **Exporters**: Multi-format output (CSV, JSON, Excel, XML, SQLite)
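As a hedged sketch of the general shape these components take (class names and method signatures below are illustrative, not the project's actual API in `scraper/middleware.py` and `scraper/pipelines.py`):

```python
# Illustrative shapes only; the project's real interfaces live in
# scraper/middleware.py and scraper/pipelines.py and may differ.
import random
from typing import Optional, Set

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

class UserAgentMiddleware:
    """Middleware: rotate the User-Agent header on each outgoing request."""
    def process_request(self, request: dict) -> dict:
        request.setdefault("headers", {})["User-Agent"] = random.choice(USER_AGENTS)
        return request

class DeduplicationPipeline:
    """Pipeline: drop items whose key field has already been seen."""
    def __init__(self, key: str = "title") -> None:
        self.key = key
        self.seen: Set[str] = set()

    def process_item(self, item: dict) -> Optional[dict]:
        value = item.get(self.key)
        if value in self.seen:
            return None  # duplicate: discard
        self.seen.add(value)
        return item
```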
## Troubleshooting

Browser not launching? Ensure Playwright browsers are installed:

```bash
playwright install
```

No data extracted?

- Check the selectors in your config file
- Use `--verbose` to see extraction logs
- Try without a config to test fallback extraction
- Use `--headful` to visually inspect the page

Import errors? Reinstall dependencies:

```bash
pip install -r requirements.txt --force-reinstall
```

Scraping too slow?

- Use `--async` mode for faster scraping
- Increase `--concurrency` (default: 5)
- Adjust `--rate-limit` if needed
## Version History

Status: ✅ Production Ready

Key Features:

- ✅ Universal extraction engine with smart fallback
- ✅ Playwright-based dynamic content handling
- ✅ Async scraping with concurrency control
- ✅ Multiple export formats (CSV, JSON, Excel, XML, SQLite)
- ✅ State management for resumable scraping
- ✅ Middleware system for request/response processing
- ✅ Configurable selectors via JSON
- ✅ Pagination support
- ✅ Anti-bot detection handling
- ✅ Comprehensive logging

Improvements:

- ✅ Fixed extraction quality: proper fallback when selectors fail
- ✅ Enhanced logging for debugging extraction issues
- ✅ Comprehensive documentation
- ✅ Production-ready codebase

Requirements:

- Python 3.8+
- See `requirements.txt` for dependencies
Current Version: 1.0.0
Release Date: 2026-01-12
Status: Production Ready
For issues or questions, refer to the documentation sections above.
Happy Scraping!