Generic Web Scraper

A flexible and configurable web scraper built with Python that extracts data from static HTML websites. Targeting a new site requires only changes to the configuration file, not to the code.

About This Project

This is my first complete web scraping project. I built it while learning about HTTP requests, HTML parsing, and Python best practices. The goal was to create a reusable scraper that doesn't require code changes when targeting different websites - just configuration updates.

What I Learned

  • How to make HTTP requests and handle errors/retries
  • HTML parsing with BeautifulSoup and CSS selectors
  • Object-oriented programming patterns in Python
  • File I/O and data serialization (CSV, JSON, Excel)
  • Project structure and modular code organization
  • Configuration-driven development

Features

  • Configuration-based: Works with static HTML websites - just change the config
  • Multiple output formats: CSV, JSON, and Excel support
  • Error handling: Automatic retries with configurable delays
  • Clean output: Organized data directory
  • Simple setup: No code changes needed for different static sites
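The retry behavior listed above could be sketched roughly as follows. This is an illustration only, not the project's actual `core/requester.py`; the function name and parameters are assumptions. Keeping the fetch function as a parameter (in practice something wrapping `requests.get`) makes the retry logic easy to test in isolation:

```python
import time


def fetch_with_retries(fetch, url, max_retries=3, delay_seconds=2.0):
    """Call fetch(url), retrying on any exception with a fixed delay.

    In the real scraper, `fetch` would wrap requests.get(url) and return
    the response body; here it is a parameter so the retry logic is
    self-contained and testable.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < max_retries:
                time.sleep(delay_seconds)  # configurable pause between attempts
    raise last_error
```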

Limitations

This scraper works with static HTML websites. It does NOT work with:

  • Sites with JavaScript-rendered content (requires Selenium)
  • Sites with anti-bot protection (Cloudflare, Datadome, etc.)
  • Sites requiring authentication/login
  • Sites with CAPTCHAs
  • Dynamic single-page applications (SPAs)

Best for: Educational scraping, practice sites, simple HTML-based directories, and public catalogs.

Quick Start

Prerequisites

pip3 install requests beautifulsoup4

Optional (for Excel export):

pip3 install pandas openpyxl

Basic Usage

  1. Clone the repository:
git clone https://github.com/natfalcon7/generic-web-scraper.git
cd generic-web-scraper
  2. Run with default configuration:
python3 main.py
  3. Check the output in the data/ directory


Configuration

All scraping behavior is controlled through config/config_settings.py:

Example: Scraping Quotes

TARGET = {
    "url": "http://quotes.toscrape.com/",
    "name": "Quotes to Scrape"
}

EXTRACTION_RULES = {
    "item_container": "div.quote",  # Main container for each item
    "fields": [
        {
            "name": "quote",
            "selector": "span.text",
            "extract": "text"
        },
        {
            "name": "author",
            "selector": "small.author",
            "extract": "text"
        }
    ]
}
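A parser driven by these rules can be quite small. The following is a minimal sketch of how such a config might be applied with BeautifulSoup; the real `core/parser.py` may differ in names and details:

```python
from bs4 import BeautifulSoup


def extract_items(html, rules):
    """Apply an EXTRACTION_RULES-style dict to an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for container in soup.select(rules["item_container"]):
        item = {}
        for field in rules["fields"]:
            element = container.select_one(field["selector"])
            if element is None:
                item[field["name"]] = None  # field missing in this item
            elif field["extract"] == "text":
                item[field["name"]] = element.get_text(strip=True)
            else:
                # anything other than "text" is treated as an attribute name
                item[field["name"]] = element.get(field["extract"])
        items.append(item)
    return items
```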

How to Scrape a Different Website

Note: This scraper works best with static HTML sites. For JavaScript-heavy sites, you'll need Selenium.

  1. Open config/config_settings.py
  2. Update TARGET["url"] with your target URL
  3. Inspect the website's HTML (F12 in browser)
  4. Check if the data is visible in "View Source" (if not, it's JavaScript-rendered)
  5. Update EXTRACTION_RULES:
    • item_container: CSS selector for the main container
    • fields: List of data points to extract
  6. Run python3 main.py
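For example, adapting the configuration to the book catalog mentioned under Use Cases might look like this. The selectors are based on books.toscrape.com's markup as I recall it; verify them with your own browser inspection before relying on them:

```python
TARGET = {
    "url": "http://books.toscrape.com/",
    "name": "Books to Scrape"
}

EXTRACTION_RULES = {
    "item_container": "article.product_pod",  # one container per book
    "fields": [
        {"name": "title", "selector": "h3 a", "extract": "text"},
        {"name": "price", "selector": "p.price_color", "extract": "text"}
    ]
}
```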

Project Structure

generic-web-scraper/
├── config/
│   ├── __init__.py
│   └── config_settings.py    # Configuration file
├── core/
│   ├── __init__.py
│   ├── requester.py          # HTTP request handling
│   ├── parser.py             # HTML parsing logic
│   └── saver.py              # Data export logic
├── data/                      # Output directory (created automatically)
├── main.py                    # Main entry point
└── requirements.txt           # Python dependencies
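The export step in core/saver.py might, for instance, write extracted rows like this. This is a hypothetical sketch using only the standard library (Excel export would additionally go through pandas); the function name and signature are assumptions:

```python
import csv
import json
from pathlib import Path


def save_rows(rows, out_dir="data", basename="results"):
    """Write a list of dicts to CSV and JSON files in the output directory."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)  # data/ is created automatically, as noted above
    if rows:
        with open(out / f"{basename}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    with open(out / f"{basename}.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
```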

Technical Details

Built With

  • Python 3.12.3
  • Requests: HTTP library for making web requests
  • BeautifulSoup4: HTML parsing and data extraction
  • Pandas (optional): Excel export functionality

Design Patterns

  • Separation of Concerns: Each module has a single responsibility
  • Configuration-Driven: Behavior controlled through config files
  • Class-Based Architecture: Encapsulation of related functionality

Development Notes

Learning Resources: This project was built while following various web scraping tutorials and reading the official documentation for BeautifulSoup and Requests. I also studied several open-source scraping projects to understand common patterns.

Use Cases

This scraper has been successfully tested on:

  • Quote websites (quotes.toscrape.com)
  • Book catalogs (books.toscrape.com)
  • Simple directory listings with static HTML
  • Educational scraping practice sites

Note: Some commercial sites (like Justia.com) have anti-bot protection and will return errors. This is a beginner-friendly scraper designed for learning and simple use cases.

Important Notes

  • Always check a website's robots.txt and Terms of Service before scraping
  • Respect rate limits and use appropriate delays between requests
  • This tool is for educational purposes and personal use
  • Be ethical and responsible with web scraping
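The standard library can help with the robots.txt check. In practice you would point `RobotFileParser` at the site's robots.txt URL and call `read()`; the rules below are made up for illustration and parsed inline:

```python
from urllib.robotparser import RobotFileParser

# Real usage: rp.set_url("http://example.com/robots.txt"); rp.read()
# Here we parse an example robots.txt inline to show the check itself.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://example.com/private/page"))  # False
print(rp.can_fetch("*", "http://example.com/catalog"))       # True
```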

Future Improvements

  • Add support for pagination
  • Implement JavaScript rendering (Selenium integration)
  • Add database storage options
  • Create a simple GUI for configuration
  • Add unit tests
  • Implement proxy rotation

Contributing

This is a learning project. Suggestions and improvements are welcome! Feel free to open an issue or reach out.

Author

NATANAEL EMANUEL FLORES FALCON
