Parliament Scraper

Web scraper for scraping Hansard Reports from the parliament website. Still a work progress but I'll look at implementing a robust scraper for scraping parliament notes and neatly formatting the output in JSON. Already there is a helpful structure to the report, providing us with question and answer pairs, debates and general talk. This can then be used for NLP projects to obtain pre/semi-labelled data. As it's parliament reports we can use party alignment to visualise and validate different aspects of natual language in a very controlled talking space with clean data.

Usage

Install conda environment with conda env create -f environment.yml, and then activate it with conda activate parliament_scraper.

OUTDATED Call the scraper with: python scraper.py <DATE1> <optional: DATE2>

Date format is YYYYMMDD, a possible 2 dates are passed, check the URL of the report you are looking to scrape, they are formatted as: https://www.parliament.nz/en/pb/hansard-debates/rhr/combined/HansD_<DATE1>_<DATE2>. If DATE1 and DATE2 are the same, only one date needs to be passed to the scraper.

Output

The scraper will write to a file called debate_DATE1_DATE2.json in the current directory.

Check example_out.json for an example of the output format.

State

~~Needs work, can ony scrape 20210708 currently~~
~~Needs to be more robust to tag structure~~
Update: Scraper code working more robust to tag structure, see notebooks/parser.ipynb, scraper.py needs to be updated with changes
BillDebate -> Debate -> Subdebate structure doesn't nest when parent is BillDebate
Code needs better formatting and structure
- parliament module and submodules need better naming and documentation
All report links can be crawled, see notebooks/crawler.ipynb
Vote parsing needs to be properly re-implemented
Needs to be more robust to tag pairs being broken up (i.e. SubsQuestion and SubsAnswer)
- need to setdefault on expected keys + log the error

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
notebooks		notebooks
parliament		parliament
scripts		scripts
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
example_out.json		example_out.json
model_env.yml		model_env.yml
scraper.py		scraper.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parliament Scraper

Usage

Output

State

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Parliament Scraper

Usage

Output

State

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages