A Python CLI application for scraping Czech election results.
This project automates the extraction of 2017 Czech Parliamentary Election results from the official volby.cz website. It scrapes data from district pages, processes individual municipality results, and saves them to CSV format for analysis.
The focus is on robust web scraping, input validation, error handling, and comprehensive testing to ensure reliability.
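As a rough illustration of the final "save to CSV" step, the sketch below writes per-municipality rows with the standard-library `csv` module. The function name `write_results`, the column names in `BASE_FIELDS`, and the sample rows are assumptions made for this example, not the names or data used in `scraper/logic.py`. Party columns are collected from the rows themselves, since different municipalities may list different parties.

```python
import csv
import io

# Hypothetical fixed columns; the real scraper may name these differently.
BASE_FIELDS = ["code", "location", "registered", "envelopes", "valid"]


def write_results(rows, stream):
    """Write a list of result dicts to an open text stream as CSV."""
    # Collect party columns in first-seen order so the header is stable.
    party_fields = []
    for row in rows:
        for key in row:
            if key not in BASE_FIELDS and key not in party_fields:
                party_fields.append(key)

    # restval=0 fills in parties that a given municipality did not list.
    writer = csv.DictWriter(stream, fieldnames=BASE_FIELDS + party_fields, restval=0)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample rows, only to show the shape of the data.
sample = [
    {"code": "000001", "location": "Example A", "registered": 100,
     "envelopes": 60, "valid": 59, "Party X": 30, "Party Y": 29},
    {"code": "000002", "location": "Example B", "registered": 50,
     "envelopes": 20, "valid": 20, "Party X": 20},
]

buffer = io.StringIO()
write_results(sample, buffer)
csv_text = buffer.getvalue()
```

When writing to a real file instead of a `StringIO`, open it with `newline=""` and `encoding="utf-8"` so the `csv` module controls line endings and Czech characters survive intact.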
- Python 3.x
- BeautifulSoup4 (HTML parsing)
- Requests (HTTP requests)
- Pytest (unit testing)
- CSV (data export)
election-scraper/
│
├── scraper/
│   ├── __init__.py
│   ├── main.py          # CLI logic and orchestration
│   ├── logic.py         # Web scraping and data parsing
│   └── utils.py         # URL validation utilities
│
├── tests/
│   ├── test_logic.py    # Unit tests for scraping logic
│   └── test_utils.py    # Unit tests for validation
│
├── main.py              # CLI entry point
├── requirements.txt     # Dependencies
├── .gitignore           # Git ignore rules
└── README.md            # This file
- Create and activate a virtual environment:
python -m venv venv
.\venv\Scripts\Activate.ps1
- Install dependencies:
pip install -r requirements.txt
Run the scraper with a district URL and destination CSV file:
python main.py "https://www.volby.cz/pls/ps2017nss/ps32?xjazyk=CZ&xkraj=2&xnumnuts=2101" output.csv
The project includes comprehensive unit tests to ensure code quality and prevent regressions.
Run all tests using:
python -m pytest -v
Or for quiet output:
python -m pytest -q
- test_get_municipality_links: Tests the extraction of municipality links from a district page HTML. Uses mocked HTTP responses to verify that links and names are correctly parsed without making real network calls.
- test_parse_obec: Tests parsing of election data from a single municipality page. Mocks the HTTP response and checks that voter counts, ballot counts, valid votes, and party results are accurately extracted.
- test_validate_url_accepts_volby_url: Verifies that the URL validation function accepts valid volby.cz URLs, ensuring the scraper only processes authorized sources.
- test_validate_url_rejects_other_url: Ensures the validation rejects invalid or external URLs, protecting against misuse and potential security issues.
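To make the two validation tests concrete, here is a minimal sketch of what a host check and its tests could look like. The body of `validate_url`, its boolean return value, and the `ALLOWED_HOSTS` set are assumptions for illustration; the actual function in `scraper/utils.py` may behave differently (for example, by raising an exception).

```python
from urllib.parse import urlparse

# Assumption: only these hosts count as the official results site.
ALLOWED_HOSTS = {"volby.cz", "www.volby.cz"}


def validate_url(url: str) -> bool:
    """Return True if the URL is an http(s) link to an allowed host."""
    parsed = urlparse(url)
    # parsed.hostname is lower-cased and excludes any port or credentials.
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS


def test_validate_url_accepts_volby_url():
    assert validate_url(
        "https://www.volby.cz/pls/ps2017nss/ps32?xjazyk=CZ&xkraj=2&xnumnuts=2101"
    )


def test_validate_url_rejects_other_url():
    assert not validate_url("https://example.com/pls/ps2017nss/ps32")
    # A look-alike host must not slip through a naive substring check.
    assert not validate_url("https://www.volby.cz.example.com/")
```

Comparing the parsed hostname against an explicit set, rather than searching the raw string for "volby.cz", is what lets the second test reject look-alike domains.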
These tests use mocking to isolate logic from external dependencies, making them fast and reliable. They cover happy paths and basic validation scenarios.
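The mocking approach can be sketched as follows. To keep the example runnable without third-party packages, it swaps in the standard-library `html.parser` for BeautifulSoup and passes the HTTP call in as a `fetch` argument instead of patching Requests. The names `get_links` and `_LinkCollector`, the fixture HTML, and the URL are all hypothetical and are not taken from the project's code.

```python
from html.parser import HTMLParser
from unittest.mock import Mock


class _LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def get_links(url, fetch):
    """Fetch a page through the injected `fetch` callable and return its links."""
    collector = _LinkCollector()
    collector.feed(fetch(url))
    collector.close()
    return collector.links


def test_get_links_uses_mocked_response():
    # Hand-written fixture; not copied from any real page.
    html = '<table><tr><td><a href="municipality-1.html">Example</a></td></tr></table>'
    fetch = Mock(return_value=html)

    links = get_links("https://example.test/district", fetch)

    assert links == [("municipality-1.html", "Example")]
    # The mock also records how it was called, so no network is involved.
    fetch.assert_called_once_with("https://example.test/district")
```

Injecting the fetch function is one design option; patching the HTTP call with `unittest.mock.patch` achieves the same isolation when the function signature cannot change.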
- Add retry/backoff for network requests
- Add argument to adjust request pacing
- Add JSON or SQLite export option
- Add GitHub Actions workflow for CI
- Add integration tests using a sandboxed HTML fixture
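For the first roadmap item, a retry wrapper with exponential backoff might look roughly like the sketch below. The function name `fetch_with_retry`, its parameters, and its defaults are assumptions; nothing like it exists in the project yet. The `fetch` and `sleep` callables are injected so the behaviour can be tested without real network calls or real waiting.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on failure.

    All names and defaults here are hypothetical.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:
            # requests.RequestException subclasses IOError (an alias of OSError),
            # so this also covers network errors raised by Requests.
            if attempt == retries:
                raise
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

With Requests, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`. Exposing `base_delay` as a command-line argument would be one way to approach the request-pacing item as well.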