A flexible and configurable web scraper built with Python that extracts data from static HTML websites by modifying the configuration file.
This is my first complete web scraping project. I built it while learning about HTTP requests, HTML parsing, and Python best practices. The goal was to create a reusable scraper that doesn't require code changes when targeting different websites - just configuration updates.
- How to make HTTP requests and handle errors/retries
- HTML parsing with BeautifulSoup and CSS selectors
- Object-oriented programming patterns in Python
- File I/O and data serialization (CSV, JSON, Excel)
- Project structure and modular code organization
- Configuration-driven development
- Configuration-based: Works with static HTML websites - just change the config
- Multiple output formats: CSV, JSON, and Excel support
- Error handling: Automatic retries with configurable delays
- Clean output: Organized data directory
- Simple setup: No code changes needed for different static sites
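The retry-with-delay behavior could be sketched roughly as follows. The function name `fetch_with_retries` and its parameters are illustrative, not the project's actual API; the real logic lives in `core/requester.py`. The HTTP call is injected as a callable so the retry logic itself stays testable without a network.

```python
import time

def fetch_with_retries(get, url, retries=3, delay=2.0):
    """Call get(url) up to `retries` times, sleeping `delay` seconds
    between attempts. `get` is any callable returning an object with a
    .status_code attribute (e.g. requests.get)."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url)
            if response.status_code == 200:
                return response
            last_error = RuntimeError(f"HTTP {response.status_code}")
        except Exception as exc:      # network errors, timeouts, etc.
            last_error = exc
        if attempt < retries:
            time.sleep(delay)
    raise last_error

# In a real run you might pass requests.get, e.g.:
# response = fetch_with_retries(requests.get, TARGET["url"])
```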
This scraper works with static HTML websites. It does NOT work with:
- Sites with JavaScript-rendered content (requires Selenium)
- Sites with anti-bot protection (Cloudflare, Datadome, etc.)
- Sites requiring authentication/login
- Sites with CAPTCHAs
- Dynamic single-page applications (SPAs)
Best for: Educational scraping, practice sites, simple HTML-based directories, and public catalogs.
```bash
pip3 install requests beautifulsoup4
```

Optional (for Excel export):

```bash
pip3 install pandas openpyxl
```

- Clone the repository:

```bash
git clone https://github.com/natfalcon7/generic-web-scraper.git
cd generic-web-scraper
```

- Run with default configuration:

```bash
python3 main.py
```

- Check the output in the `data/` directory
All scraping behavior is controlled through `config/config_settings.py`:
```python
TARGET = {
    "url": "http://quotes.toscrape.com/",
    "name": "Quotes to Scrape"
}

EXTRACTION_RULES = {
    "item_container": "div.quote",  # Main container for each item
    "fields": [
        {
            "name": "quote",
            "selector": "span.text",
            "extract": "text"
        },
        {
            "name": "author",
            "selector": "small.author",
            "extract": "text"
        }
    ]
}
```

Note: This scraper works best with static HTML sites. For JavaScript-heavy sites, you'll need Selenium.
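To illustrate how rules like these can drive extraction, here is a minimal sketch of the parsing step. The function name `parse_items` and the fallback for non-`"text"` extract modes are assumptions for illustration; the project's actual logic lives in `core/parser.py` and may differ.

```python
from bs4 import BeautifulSoup

EXTRACTION_RULES = {
    "item_container": "div.quote",
    "fields": [
        {"name": "quote", "selector": "span.text", "extract": "text"},
        {"name": "author", "selector": "small.author", "extract": "text"},
    ],
}

def parse_items(html, rules):
    """Apply configuration-driven CSS selectors to an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for container in soup.select(rules["item_container"]):
        row = {}
        for field in rules["fields"]:
            node = container.select_one(field["selector"])
            if node is None:
                row[field["name"]] = None       # selector matched nothing
            elif field["extract"] == "text":
                row[field["name"]] = node.get_text(strip=True)
            else:
                # Hypothetical: treat any other value as an attribute name
                row[field["name"]] = node.get(field["extract"])
        items.append(row)
    return items

# Offline demo on a small HTML fragment shaped like quotes.toscrape.com:
sample = """
<div class="quote">
  <span class="text">Simplicity is the soul of efficiency.</span>
  <small class="author">Austin Freeman</small>
</div>
"""
rows = parse_items(sample, EXTRACTION_RULES)
```

Because the selectors come from the config rather than the code, pointing the scraper at a new static site means editing only this dictionary.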
- Open `config/config_settings.py`
- Update `TARGET["url"]` with your target URL
- Inspect the website's HTML (F12 in browser)
- Check if the data is visible in "View Source" (if not, it's JavaScript-rendered)
- Update `EXTRACTION_RULES`:
  - `item_container`: CSS selector for the main container
  - `fields`: list of data points to extract
- Run `python3 main.py`
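As a concrete example of those steps, here is what the config might look like when targeting books.toscrape.com instead. The selectors below come from inspecting that site's HTML and are illustrative; verify them in your browser before running, and note that only the `"text"` extract mode shown earlier is assumed here.

```python
# Hypothetical config for books.toscrape.com (selectors may change if the site does)
TARGET = {
    "url": "http://books.toscrape.com/",
    "name": "Books to Scrape"
}

EXTRACTION_RULES = {
    "item_container": "article.product_pod",  # one <article> per book card
    "fields": [
        {"name": "title", "selector": "h3 a", "extract": "text"},
        {"name": "price", "selector": "p.price_color", "extract": "text"},
    ],
}
```

One caveat: on that site the visible link text for long titles is truncated, so the extracted `title` may be shortened compared to the full title stored in the link's attributes.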
```
generic-web-scraper/
├── config/
│   ├── __init__.py
│   └── config_settings.py   # Configuration file
├── core/
│   ├── __init__.py
│   ├── requester.py         # HTTP request handling
│   ├── parser.py            # HTML parsing logic
│   └── saver.py             # Data export logic
├── data/                    # Output directory (created automatically)
├── main.py                  # Main entry point
└── requirements.txt         # Python dependencies
```
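To show the kind of export logic `saver.py` is responsible for, here is a standard-library sketch of writing scraped rows to CSV and JSON. The function name `save_rows` and its signature are assumptions; the real module may also handle the optional pandas/Excel path.

```python
import csv
import json
from pathlib import Path

def save_rows(rows, out_dir="data", basename="output", formats=("csv", "json")):
    """Write a list of dicts to <out_dir>/<basename>.csv and/or .json,
    creating the output directory if needed. Returns the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    if "csv" in formats and rows:
        path = out / f"{basename}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    if "json" in formats:
        path = out / f"{basename}.json"
        path.write_text(json.dumps(rows, indent=2, ensure_ascii=False),
                        encoding="utf-8")
        written.append(path)
    return written
```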
- Python 3.12.3
- Requests: HTTP library for making web requests
- BeautifulSoup4: HTML parsing and data extraction
- Pandas (optional): Excel export functionality
- Separation of Concerns: Each module has a single responsibility
- Configuration-Driven: Behavior controlled through config files
- Class-Based Architecture: Encapsulation of related functionality
Learning Resources: This project was built while following various web scraping tutorials and reading the official documentation for BeautifulSoup and Requests. I also studied several open-source scraping projects to understand common patterns.
This scraper has been successfully tested on:
- Quote websites (quotes.toscrape.com)
- Book catalogs (books.toscrape.com)
- Simple directory listings with static HTML
- Educational scraping practice sites
Note: Some commercial sites (like Justia.com) have anti-bot protection and will return errors. This is a beginner-friendly scraper designed for learning and simple use cases.
- Always check a website's `robots.txt` and Terms of Service before scraping
- Respect rate limits and use appropriate delays between requests
- This tool is for educational purposes and personal use
- Be ethical and responsible with web scraping
- Add support for pagination
- Implement JavaScript rendering (Selenium integration)
- Add database storage options
- Create a simple GUI for configuration
- Add unit tests
- Implement proxy rotation
This is a learning project. Suggestions and improvements are welcome! Feel free to open an issue or reach out.
NATANAEL EMANUEL FLORES FALCON
- GitHub: @natfalcon7
- LinkedIn: Natanael Falcon