Generic Web Scraper

A flexible and configurable web scraper built with Python that extracts data from static HTML websites. Targeting a new site requires only changes to the configuration file, not to the code.

About This Project

This is my first complete web scraping project. I built it while learning about HTTP requests, HTML parsing, and Python best practices. The goal was to create a reusable scraper that doesn't require code changes when targeting different websites - just configuration updates.

What I Learned

  • How to make HTTP requests and handle errors/retries
  • HTML parsing with BeautifulSoup and CSS selectors
  • Object-oriented programming patterns in Python
  • File I/O and data serialization (CSV, JSON, Excel)
  • Project structure and modular code organization
  • Configuration-driven development

Features

  • Configuration-based: Works with static HTML websites - just change the config
  • Multiple output formats: CSV, JSON, and Excel support
  • Error handling: Automatic retries with configurable delays
  • Clean output: Organized data directory
  • Simple setup: No code changes needed for different static sites
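The retry behavior listed above could be sketched roughly as follows. This is an illustration only, not the project's actual `core/requester.py`; the function name and parameters are assumptions. Keeping the fetch function as a parameter (in practice something wrapping `requests.get`) makes the retry logic easy to test in isolation:

```python
import time


def fetch_with_retries(fetch, url, max_retries=3, delay_seconds=2.0):
    """Call fetch(url), retrying on any exception with a fixed delay.

    In the real scraper, `fetch` would wrap requests.get(url) and return
    the response body; here it is a parameter so the retry logic is
    self-contained and testable.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < max_retries:
                time.sleep(delay_seconds)  # configurable pause between attempts
    raise last_error
```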

Limitations

This scraper works with static HTML websites. It does NOT work with:

  • Sites with JavaScript-rendered content (requires Selenium)
  • Sites with anti-bot protection (Cloudflare, Datadome, etc.)
  • Sites requiring authentication/login
  • Sites with CAPTCHAs
  • Dynamic single-page applications (SPAs)

Best for: Educational scraping, practice sites, simple HTML-based directories, and public catalogs.

Quick Start

Prerequisites

pip3 install requests beautifulsoup4

Optional (for Excel export):

pip3 install pandas openpyxl

Basic Usage

  1. Clone the repository:
git clone https://github.com/natfalcon7/generic-web-scraper.git
cd generic-web-scraper
  2. Run with default configuration:
python3 main.py
  3. Check the output in the data/ directory


Configuration

All scraping behavior is controlled through config/config_settings.py:

Example: Scraping Quotes

TARGET = {
    "url": "http://quotes.toscrape.com/",
    "name": "Quotes to Scrape"
}

EXTRACTION_RULES = {
    "item_container": "div.quote",  # Main container for each item
    "fields": [
        {
            "name": "quote",
            "selector": "span.text",
            "extract": "text"
        },
        {
            "name": "author",
            "selector": "small.author",
            "extract": "text"
        }
    ]
}
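A parser driven by these rules can be quite small. The following is a minimal sketch of how such a config might be applied with BeautifulSoup; the real `core/parser.py` may differ in names and details:

```python
from bs4 import BeautifulSoup


def extract_items(html, rules):
    """Apply an EXTRACTION_RULES-style dict to an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for container in soup.select(rules["item_container"]):
        item = {}
        for field in rules["fields"]:
            element = container.select_one(field["selector"])
            if element is None:
                item[field["name"]] = None  # field missing in this item
            elif field["extract"] == "text":
                item[field["name"]] = element.get_text(strip=True)
            else:
                # anything other than "text" is treated as an attribute name
                item[field["name"]] = element.get(field["extract"])
        items.append(item)
    return items
```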

How to Scrape a Different Website

Note: This scraper works best with static HTML sites. For JavaScript-heavy sites, you'll need Selenium.

  1. Open config/config_settings.py
  2. Update TARGET["url"] with your target URL
  3. Inspect the website's HTML (F12 in browser)
  4. Check if the data is visible in "View Source" (if not, it's JavaScript-rendered)
  5. Update EXTRACTION_RULES:
    • item_container: CSS selector for the main container
    • fields: List of data points to extract
  6. Run python3 main.py
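For example, adapting the configuration to the book catalog mentioned under Use Cases might look like this. The selectors are based on books.toscrape.com's markup as I recall it; verify them with your own browser inspection before relying on them:

```python
TARGET = {
    "url": "http://books.toscrape.com/",
    "name": "Books to Scrape"
}

EXTRACTION_RULES = {
    "item_container": "article.product_pod",  # one container per book
    "fields": [
        {"name": "title", "selector": "h3 a", "extract": "text"},
        {"name": "price", "selector": "p.price_color", "extract": "text"}
    ]
}
```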

Project Structure

generic-web-scraper/
├── config/
│   ├── __init__.py
│   └── config_settings.py    # Configuration file
├── core/
│   ├── __init__.py
│   ├── requester.py          # HTTP request handling
│   ├── parser.py             # HTML parsing logic
│   └── saver.py              # Data export logic
├── data/                      # Output directory (created automatically)
├── main.py                    # Main entry point
└── requirements.txt           # Python dependencies
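The export step in core/saver.py might, for instance, write extracted rows like this. This is a hypothetical sketch using only the standard library (Excel export would additionally go through pandas); the function name and signature are assumptions:

```python
import csv
import json
from pathlib import Path


def save_rows(rows, out_dir="data", basename="results"):
    """Write a list of dicts to CSV and JSON files in the output directory."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)  # data/ is created automatically, as noted above
    if rows:
        with open(out / f"{basename}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    with open(out / f"{basename}.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
```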

Technical Details

Built With

  • Python 3.12.3
  • Requests: HTTP library for making web requests
  • BeautifulSoup4: HTML parsing and data extraction
  • Pandas (optional): Excel export functionality

Design Patterns

  • Separation of Concerns: Each module has a single responsibility
  • Configuration-Driven: Behavior controlled through config files
  • Class-Based Architecture: Encapsulation of related functionality

Development Notes

Learning Resources: This project was built while following various web scraping tutorials and reading the official documentation for BeautifulSoup and Requests. I also studied several open-source scraping projects to understand common patterns.

Use Cases

This scraper has been successfully tested on:

  • Quote websites (quotes.toscrape.com)
  • Book catalogs (books.toscrape.com)
  • Simple directory listings with static HTML
  • Educational scraping practice sites

Note: Some commercial sites (like Justia.com) have anti-bot protection and will return errors. This is a beginner-friendly scraper designed for learning and simple use cases.

Important Notes

  • Always check a website's robots.txt and Terms of Service before scraping
  • Respect rate limits and use appropriate delays between requests
  • This tool is for educational purposes and personal use
  • Be ethical and responsible with web scraping
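The standard library can help with the robots.txt check. In practice you would point `RobotFileParser` at the site's robots.txt URL and call `read()`; the rules below are made up for illustration and parsed inline:

```python
from urllib.robotparser import RobotFileParser

# Real usage: rp.set_url("http://example.com/robots.txt"); rp.read()
# Here we parse an example robots.txt inline to show the check itself.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://example.com/private/page"))  # False
print(rp.can_fetch("*", "http://example.com/catalog"))       # True
```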

Future Improvements

  • Add support for pagination
  • Implement JavaScript rendering (Selenium integration)
  • Add database storage options
  • Create a simple GUI for configuration
  • Add unit tests
  • Implement proxy rotation

Contributing

This is a learning project. Suggestions and improvements are welcome! Feel free to open an issue or reach out.

Author

NATANAEL EMANUEL FLORES FALCON
