Save data.json to file system
-----------------------------

A copy of the data.json file at the beginning of the download process is saved alongside the actual downloaded data. Since `export.socrata()` uses data.json as the index to download data, this allows users to cross-reference the downloaded data with other metadata associated with it available through [Project Open Data](https://project-open-data.cio.gov).

Handle non-data files
---------------------

Socrata lists non-data files, such as Socrata Stories (HTML websites that contain text but no machine-readable data), in the data.json file. These caused errors when trying to download those sites because they do not have a "distribution URL". While it's arguable that these "sites" should not be included in the first place, the script now simply skips those files. Since a copy of the data.json file is downloaded (see above), users have transparency into which URLs were not downloaded.
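The skip behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actual RSocrata implementation; the toy `catalog` structure mimics the parsed `dataset` array of a data.json file:

```r
# Toy catalog mimicking two entries of a data.json "dataset" array:
# one with a downloadable distribution, one (a Socrata Story) without.
catalog <- list(dataset = list(
  list(title = "Crime Reports",
       distribution = list(downloadURL = "https://example.org/crime.csv")),
  list(title = "About This Portal",
       distribution = NULL)  # HTML page, no distribution URL
))

# An entry is downloadable only if it carries a non-empty downloadURL.
has_download_url <- function(d) {
  !is.null(d$distribution) && !is.null(d$distribution$downloadURL) &&
    nzchar(d$distribution$downloadURL)
}

downloadable <- Filter(has_download_url, catalog$dataset)
length(downloadable)  # 1 -- the HTML page is skipped, not an error
```

Because the original data.json is saved alongside the downloads, any entry filtered out this way can still be identified after the fact.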
Socrata supports external links that direct to web pages (e.g., HTML). These caused an error when `export.socrata()` attempted to download them. This fix simply skips those files and proceeds to the next one.
* Ignores HTML files (e.g., Socrata Pages)
* Ignores occasions when there isn't any data
* Will download (uncompressed) PDF, Word, Excel, PowerPoint, and plain text attachments
Rebased branch with most recent `dev` branch and generated documentation.

Merge branch 'dev' into issue126 (conflicts: `DESCRIPTION`, `R/RSocrata.R`)
* Removed user-defined option for file output (not available yet)
* Clarified documentation on where `export.socrata()` files will be located
* Fixed incorrect date in `DESCRIPTION` file
* Incremented build number
This pull request introduces a stable version of an `export.socrata()` function as outlined in #126. This allows users to download the contents of a data portal to a local directory. The function will download CSVs (compressed) and PDF, Word, Excel, PowerPoint, GeoJSON, Shapefile, and plain text documents (uncompressed), etc. It will not download HTML pages. As part of the process, the function also copies the `data.json` file to act as an index for the other downloaded files.

I've proposed the version as 1.8.0.
Testing portal export
To test this function, I used the City of Norfolk, VA, to export all of its data sets. Looking at their data.json file, I counted 32 data sets that were neither HTML pages nor missing a downloadable file. Executing `export.socrata("https://data.norfolk.gov")` resulted in 32 downloaded files plus the copy of the `data.json` file. Thus, the expected number of files matches the actual number of downloaded files.

Testing non-CSV documents
All of the testing for Norfolk resulted in compressed CSV files; however, I also needed to test the ability to download non-CSV files. Kansas City, Missouri's data portal has an unusually large number of non-CSV data sets, such as PDFs, Word documents, Excel documents, etc.

I tested the function by downloading files from their data portal. The function downloaded PDF, Word, Excel, and other non-CSV files along with CSV files.
However, I did encounter frequent network timeouts after approximately 80 items were downloaded. I believe this is a network limitation rather than an issue with the function itself. While this may not be a bug, it may limit the ability to export large portals from Socrata.
Unit Testing
I have not written a unit test. I think any such test would take too much time and space for typical unit testing. The smallest portal download, Norfolk, took over 30 minutes to complete all downloads.
In general, a recommended method for testing is to choose a reasonably small portal and make sure it includes data sets where:

* `distribution/mediaType` is blank
* `distribution/mediaType` is `text/html`
* `distribution/downloadURL` is blank

Ideally, the portal being used to test contains CSV files as well as non-CSV files.
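One rough way to check a candidate portal for those cases is to classify each entry of its data.json `dataset` array. The sketch below is an illustration under that assumption, not part of RSocrata; a toy `datasets` list stands in for the parsed array (in practice one might parse the real file with, e.g., `jsonlite`):

```r
# Toy stand-in for the parsed "dataset" array of a portal's data.json,
# covering each of the test cases listed above.
datasets <- list(
  list(distribution = list(mediaType = "text/csv",
                           downloadURL = "https://example.org/a.csv")),
  list(distribution = list(mediaType = "text/html",
                           downloadURL = "https://example.org/page")),
  list(distribution = list(mediaType = "", downloadURL = "")),
  list(distribution = NULL)  # no distribution block at all
)

# Assign each entry to one of the test cases (or "downloadable").
classify <- function(d) {
  dist <- d$distribution
  if (is.null(dist) || is.null(dist$mediaType) || !nzchar(dist$mediaType)) {
    "blank mediaType"
  } else if (dist$mediaType == "text/html") {
    "text/html"
  } else if (is.null(dist$downloadURL) || !nzchar(dist$downloadURL)) {
    "blank downloadURL"
  } else {
    "downloadable"
  }
}

counts <- table(vapply(datasets, classify, character(1)))
print(counts)
```

A portal whose tally shows all three problem cases, plus a healthy number of "downloadable" entries in mixed formats, would exercise both the skip logic and the download paths.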