
Congressional Data Scrapers

The CREC (Daily Edition) and CRECB (Bound Edition) scrapers retrieve all raw Congressional Record content through the GovInfo API. The Daily Edition and Bound Edition of the Congressional Record are published in different formats and organized in different ways.

About the Records

Daily Edition of the Congressional Record (CREC)

A daily edition of the Congressional Record is published for every day that Congress is in session. The daily records, as made available through the API, go back to 1994.

Within the CREC collection, each day is generally stored in one "package," and each package has many small "granules" that make up the content of that day. Each of these granules is stored as an individual HTML file with an associated XML file containing metadata such as speaker information.

Our CREC scraper iterates through each day and grabs all HTML and corresponding XML content for the House and Senate. It does not grab the Daily Digest (a high level summary of the day's proceedings) or the Extensions of Remarks (statements or documents appended after the fact), though these can be accessed through the CREC docClasses 'extensions' and 'dailydigest' if desired.
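
To make the package/granule structure concrete, here is a minimal sketch of fetching one day's House and Senate granules through the GovInfo API. The package ID is an example, the HOUSE/SENATE class filter mirrors the behavior described above, and real code would paginate past the first 100 granules and handle rate limits:

import requests

API = "https://api.govinfo.gov"
KEY = "DEMO_KEY"            # your GovInfo API key
PKG = "CREC-2018-01-04"     # example daily package ID (CREC-YYYY-MM-DD)

# List the package's granules (first page only) and keep House/Senate content
resp = requests.get(f"{API}/packages/{PKG}/granules",
                    params={"pageSize": 100, "api_key": KEY})
resp.raise_for_status()
for g in resp.json()["granules"]:
    if g.get("granuleClass") in ("HOUSE", "SENATE"):
        gid = g["granuleId"]
        html = requests.get(f"{API}/packages/{PKG}/granules/{gid}/htm",
                            params={"api_key": KEY}).text
        mods = requests.get(f"{API}/packages/{PKG}/granules/{gid}/mods",
                            params={"api_key": KEY}).text  # XML metadata
        with open(f"{gid}.htm", "w") as f:
            f.write(html)
        with open(f"{gid}.xml", "w") as f:
            f.write(mods)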

Bound Edition of the Congressional Record (CRECB)

After each session of Congress, all daily editions are collected and reindexed into a permanent bound edition, a process that often takes years. The bound records go back to 1873.

Within the CRECB collection, each year is broken up into parts and stored in "packages," while "granules" usually represent an entire day's proceedings, or the daily proceedings for one body (House or Senate). Each granule is stored as a PDF (a scan of the physical records), and has a corresponding XML file containing metadata.

Our CRECB scraper iterates through the packages for each year to grab all PDFs and corresponding XML content. Note that it grabs data from the House, Senate, and all other indexing sections, as the sections are not broken up into granules the way they are in the CREC collection. There is also some duplicate data: for a given day there will often be separate files for the House and Senate content, as well as a combined file containing both.

GovInfo CREC(B) Structure

For more information on the structure and content of the CREC and CRECB collections, please explore the GovInfo API documentation.

Scraping the Congressional Records

CREC Scraper Usage

Required Parameters:

  • output_folder name of folder to store output
  • --api-keys comma-separated list of GovInfo API keys

Optional Parameters:

  • --years number of years to go back (default 2, starting from today or from the year specified in --year-start)
  • --year-start year to start at (scrapes backwards from there)
  • --per-key-hourly-limit optional hourly request limit per key (otherwise script automatically throttles key usage)
  • --parallel include to parallelize code, default not parallelized
  • --workers worker threads when --parallel is on, default 8

Example: Scrape all Daily Congressional Records going back from today

python CREC_scraper.py "output_dir" --years 35 --api-keys "DEMO_KEY1,DEMO_KEY2"
# set --years > 30 to scrape all records (daily edition goes back to 1994)
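
One way to picture the throttling behind --per-key-hourly-limit: rotate requests across the supplied keys and sleep once every key has hit its hourly cap. This is a minimal sketch of the idea, not CREC_scraper.py's actual implementation; the 1,000 requests/hour default is an assumption based on the usual api.data.gov limit.

import itertools, time

class KeyRotator:
    """Rotate API keys, sleeping once every key hits its hourly cap.

    A sketch of the throttling idea only -- not the scraper's actual code.
    """
    def __init__(self, keys, per_key_hourly_limit=1000):
        self.keys = list(keys)
        self.limit = per_key_hourly_limit
        self.window_start = time.time()
        self.counts = {k: 0 for k in self.keys}
        self.cycle = itertools.cycle(self.keys)

    def next_key(self):
        if time.time() - self.window_start >= 3600:  # new hour: reset counts
            self.window_start = time.time()
            self.counts = {k: 0 for k in self.keys}
        for _ in self.keys:
            key = next(self.cycle)
            if self.counts[key] < self.limit:
                self.counts[key] += 1
                return key
        # every key is exhausted: wait out the remainder of the hour
        time.sleep(max(0.0, 3600 - (time.time() - self.window_start)))
        return self.next_key()

Each request would then call next_key() to pick the key passed as its api_key parameter.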

CREC Scraper Output

Scraped CREC content is stored in a YYYY > MM > DD folder structure, dividing content up by day. Each day's folder contains:

  • a raw_html_YYYY-MM-DD folder with all House and Senate content stored as HTML, one file per granule. To re-combine granules in the correct order, each filename begins with an index (e.g., "1-", "2-", ..., "97-"); the rest of the filename after the index is the granule ID.
  • a raw_xml_YYYY-MM-DD folder with all corresponding XML metadata for the House and Senate that day. It is indexed in the same way, and the granule IDs in the filenames should match those in the raw_html directory.
  • a .csv file with the parsed speeches and metadata, generated by running speaker_scraper over all HTML content for the day and matching it to the XML metadata (a loading sketch follows this list). Each of these files has the following columns:
    • date
    • granule_id
    • title (title of debate)
    • speaker
    • text (content of speech)
    • bioGuideId (see bioguide.congress.gov)
    • full_name (speaker first name, middle initial, last name)
    • party (D, R, and possibly I)
    • state (speaker state)
    • gender (for speakers with title "Mr.," "Ms.," or "Mrs.")
    • chamber (chamber the speech came from - house or senate)
    • speaker_chamber (chamber the speaker is a member of - house or senate)
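
Since each of these files is a flat CSV, a day's speeches can be inspected directly with pandas. The path and filename below are placeholders; substitute a real file from your output tree:

import pandas as pd

# Placeholder path into the YYYY > MM > DD output tree described above
df = pd.read_csv("output_dir/2018/01/04/speeches_2018-01-04.csv")

# For example, count speeches by party and chamber
print(df.groupby(["party", "chamber"]).size())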

CRECB Scraper Usage

Required Parameters:

  • output name of folder to store output
  • --api-keys comma-separated list of GovInfo API keys

Optional Parameters:

  • --start-year year to start at, default 1873 (scrapes forward from there)
  • --parallel include to parallelize code, default not parallelized
  • --workers worker threads when --parallel is on, default 4

Example: Scrape all Bound Congressional Records starting from 1873

python CRECB_scraper.py "output_dir" --api-keys "DEMO_KEY1,DEMO_KEY2"
# --start-year defaults to 1873, so this scrapes the full bound edition

CRECB Scraper Output

Scraped CRECB content is stored in YYYY folders (we will likely want to extract dates from the granule titles to split content further, or split it by session; a date-extraction sketch follows this list). Within each YYYY folder, there are two folders:

  • a raw_pdf_YYYY folder containing all PDF data for packages dated to that year. Note that in rare cases, a few days may be stored in a package titled with the wrong year (for example, a few days in December might land in the succeeding year's package). The PDF files are indexed in the same way as described above, so the bound edition can be re-compiled in order. After the index, each filename contains the granule title (which includes the date), the package ID, and finally the granule ID.
  • a raw_xml_YYYY folder containing corresponding metadata for each file saved in the PDF directory.
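
Pulling the date back out of such a filename might look like the sketch below. The example filename and its delimiters are hypothetical; only the index / title / package ID / granule ID ordering comes from the description above.

import re
from datetime import datetime

# Hypothetical filename following the index / granule title / package ID /
# granule ID layout described above; the real format may differ.
fname = "3-Wednesday, January 4, 1950-CRECB-1950-pt1-v96-2.pdf"

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
m = re.search(rf"(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}", fname)
if m:
    day = datetime.strptime(m.group(0), "%B %d, %Y").date()
    print(day)  # 1950-01-04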

There are currently no parsed csv files, as the data must be converted to text before the speaker scraper can be run.
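
Getting that text out depends on whether a given PDF carries an embedded OCR text layer. If it does, a library like pypdf can read it; pure image scans would need OCR (e.g., Tesseract) first. A minimal sketch, with a placeholder path:

from pypdf import PdfReader  # pip install pypdf

# Only works if the scanned PDF has an embedded text layer
reader = PdfReader("output_dir/1950/raw_pdf_1950/example.pdf")  # placeholder
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])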

Speaker Parser

To identify speakers, we wrote a regular expression that matches and extracts speaker names from indented lines in the transcripts: lines that begin with two or more spaces and contain a speaker "identifier." These identifiers cover a wide range of formats:

  • honorifics such as "Mr.", "Mrs.", "Ms.", "Dr.", "Chairman", and "Chairwoman" followed by a name (e.g., "Mr. SMITH", "Chairwoman JACKSON LEE");
  • multi-word names and names with initials or prefixes (e.g., "Mr. VAN HOLLEN", "Mr. A. B. SMITH");
  • names that include diacritics or lowercase particles (e.g., "Mr. José Martínez", "Mr. de la Cruz");
  • lines that include the label "(continuing)" to indicate the continuation of a previous speaker, as in "Mr. SMITH (continuing).";
  • institutional speaker roles introduced with "The", such as "The CLERK", "The SPEAKER", "The SPEAKER pro tempore", "The PRESIDENT", "The PRESIDENT pro tempore", "The ACTING PRESIDENT pro tempore", and "The VICE PRESIDENT"; these role-based identifiers may appear with or without modifiers like "pro tempore".

The pattern allows for either a colon or a period at the end of the identifier.
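
A condensed sketch of such a pattern is below. It illustrates the matching rules just described but is not the repository's exact expression, and it simplifies some cases (e.g., it does not enforce precise diacritic ranges):

import re

# A sketch of the matching rules described above -- not the exact
# expression used in the repository.
SPEAKER_RE = re.compile(
    r"""^\ {2,}                                          # indented two+ spaces
    (?:
        (?:Mr\.|Mrs\.|Ms\.|Dr\.|Chairman|Chairwoman)\s+  # honorific
        (?P<name>[A-Za-zÀ-þ][\w'\-]*                     # first word of name
            (?:\s+(?:[A-Za-zÀ-þ][\w'\-]*|[A-Z]\.))*)     # more words/initials
      |
        The\s+(?P<role>(?:ACTING\ )?(?:VICE\ )?          # institutional role
            (?:CLERK|SPEAKER|PRESIDENT)(?:\ pro\ tempore)?)
    )
    (?:\s*\(continuing\))?                               # continuation label
    \s*[.:]                                              # ends with . or :
    """,
    re.VERBOSE,
)

line = "  Mr. VAN HOLLEN (continuing). I yield back."
m = SPEAKER_RE.match(line)
if m:
    print(m.group("name") or m.group("role"))  # VAN HOLLEN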

Usage

Required Parameters:

  • output_folder name of folder to store output

Optional Parameters:

  • --years number of years to go back (default 2, starting from today or from the year specified in --year-start)
  • --year-start year to start at (scrapes backwards from there)

For More Information ...

... see our wiki.
