Conversation

@jzemmels (Collaborator) commented Dec 29, 2025

Still to-do:

  • unpack the "values" and "percentiles" columns when computation_type includes "percentile"
  • function argument descriptions -- do we want to generalize the get_params, get_description, and base_url functions to pull the docs for /statistics too?
  • arg type checking (is that done in other waterdata functions?)
  • pull apart get_statistics_data into separate functions to match other read_waterdata patterns
  • Figure out if other read_ogc_data helpers are needed:
    • rejigger_cols
    • deal_with_empty
    • switch_arg_id
    • switch_properties_id
  • resolve merge conflicts
  • (Nice to have) determine if there's a more efficient way to join everything together than the final return_list <- do.call(rbind, return_list_tmp), since it's a slight bottleneck (see the sketch at the end of this comment).

And there's Laura's list of to-dos here: https://code.usgs.gov/water/dataRetrieval/-/issues/450
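
For that last to-do item, here's a minimal sketch of the kind of swap under consideration; return_list_tmp below is just a stand-in list of data frames, not the PR's actual object, and the data.table call assumes that package ends up as a dependency:

# Stand-in for the list of per-request data frames built up before the final join.
return_list_tmp <- replicate(1000, data.frame(value = rnorm(10)), simplify = FALSE)

# Current pattern: base R rbind over the whole list.
return_list <- do.call(rbind, return_list_tmp)

# Possible alternative: data.table::rbindlist() does the same row-binding in one
# pass and is typically much faster on long lists; convert back to a data frame
# so the return class doesn't change.
return_list_dt <- as.data.frame(data.table::rbindlist(return_list_tmp))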

@jzemmels jzemmels self-assigned this Dec 29, 2025

@jzemmels (Collaborator Author) commented:

To the discussion of whether base R or data.table is faster: I've implemented a data.table version of the get_statistics_data function and temporarily put it in the read_waterdata_stats_datatable.R script until we decide whether we want to add data.table as a dependency in this PR.

I've profiled the rbind-y base R version of the function against the data.table version, and the difference is noteworthy: the data.table version is much faster and uses far less memory.

[image: bench::mark results comparing the base R and data.table versions]

Here's the benchmark code I ran:

bench::mark(
  "data.table" = {
    dat1 <- get_statistics_data_data_table(
      args = list(approval_status = "approved",
                  # state_code = "US:42",
                  county_code = "US:42:103",
                  parameter_code = "00095",
                  mime_type = "application/json",
                  computation_type = c("arithmetic_mean", "percentile")),
      service = "Intervals")

    list(
      non_missing_ave = mean(dat1$value, na.rm = TRUE),
      missing_obs = sum(is.na(dat1$value))
    )
  },
  "base R" = {
    dat2 <- get_statistics_data(
      args = list(approval_status = "approved",
                  # state_code = "US:42",
                  county_code = "US:42:103",
                  parameter_code = "00095",
                  mime_type = "application/json",
                  computation_type = c("arithmetic_mean", "percentile")),
      service = "Intervals")

    list(
      non_missing_ave = mean(dat2$value, na.rm = TRUE),
      missing_obs = sum(is.na(dat2$value))
    )
  },
  min_iterations = 5
)

@ldecicco-USGS (Collaborator) commented:

No surprises there; if it wasn't for RDB I would have switched many years ago. Let's go for it. We can start swapping out other code later (off the top of my head I can think of several places where I'd been tempted in the past). Let's leave those types of edits for a dedicated PR later, though.

"site_type", "site_type_code",
"country_code", "state_code", "county_code",
"geometry")
data.table::setcolorder(combined, col_order)

Collaborator:

Let's make sure combined gets returned as a data frame. There are some subtle differences that users won't expect.

@jzemmels (Collaborator Author) Dec 31, 2025

My intention here was to return something similar to what's returned by the other read_waterdata functions, hence the wrapping in sf::st_as_sf. Do we want the result to be a "pure" data.frame instead? Oh, you probably mean converting from a data.table. Nvm, I'll change that.

Collaborator:

Oh, I just meant not a data.table; it can still be sf.
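
A minimal sketch of that change, assuming combined is the joined data.table and that its geometry column already holds sfc geometries (as the sf::st_as_sf wrapping mentioned above implies):

# Drop the data.table class before wrapping in sf so the function returns a
# plain sf data frame rather than a data.table.
combined <- sf::st_as_sf(as.data.frame(combined))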

#'
#' @export
#'
#' @param approval_status asdf

Collaborator:

When you get to this part, don't re-invent the wheel, just copy/paste the text from the API documentation.

Collaborator Author:

I was leaving this in case we wanted to expand the get_params and related functions to pull from the /statistics Swagger docs. I've not scraped Swagger documentation before, so I'm not sure how easy that would be to generalize. I can just copy + paste the API docs for now.

Collaborator:

The OGC service has a schema service that provides descriptions and whatnot via an API (I swooned 🤩 when I saw it). I wouldn't scrape Swagger though, more technical debt than what's saved.


base_request <- construct_statistics_request(service = service, version = 0)

# TODO?: arg type checking here

Collaborator:

For many arguments in dataRetrieval, the user can enter a number, a character, a vector, or a date, and the code will convert it all to character. So you can check whether the input makes sense for a particular argument, but (a) don't let that code get ridiculously big trying to handle everything, and (b) see if the natural errors make sense as-is. If not, maybe try to clean up the error message. Usually (I don't know about here specifically), if a user puts in something really out of left field, the query will fail with a message that does imply their request was bad. That being said, the "waterdata" functions all have similar inputs; at some point we can explore whether an argument-checker function could be written in a way that's more helpful.
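
A minimal sketch of that kind of lightweight coercion, under the assumption that most arguments just need to end up as character before going into the query; coerce_arg is a hypothetical helper, not part of dataRetrieval:

coerce_arg <- function(x, arg_name) {
  # NULLs pass through so optional arguments stay optional.
  if (is.null(x)) return(NULL)
  # Numbers, dates, and character vectors all flatten to character.
  if (is.numeric(x) || inherits(x, "Date") || is.character(x)) {
    return(as.character(x))
  }
  # Anything else fails with a message that names the offending argument.
  stop("Unexpected type for '", arg_name, "': ", class(x)[1], call. = FALSE)
}

coerce_arg(42103, "county_code")           # "42103"
coerce_arg(as.Date("2024-01-01"), "start") # "2024-01-01"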

combined <- data.table::rbindlist(combined_list)
combined <- combined[return_list, on = "rid"]

col_order <- c("parent_time_series_id", "monitoring_location_id", "monitoring_location_name", "parameter_code",

Collaborator:

Ideally we can figure out a way to not write these out by hand. On the off chance they add or change some columns, the function will start to fail (and yes, similar hard-coding has caused some headaches in the past with dataRetrieval). We added some logic (I think just in the current PR that removes max_results) to move all "id" columns to the far right (because users don't want to see those big ol' hashes first) UNLESS they are special, like monitoring_location_id.
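
A minimal sketch of that kind of rule in place of a hand-written column order, assuming a plain data frame; the helper name and the "special" column list are assumptions for illustration, not the PR's actual logic:

# Push "..._id" columns to the far right, except the special ones users expect
# up front (e.g. monitoring_location_id).
reorder_id_cols <- function(df, keep = "monitoring_location_id") {
  id_cols <- setdiff(grep("_id$", names(df), value = TRUE), keep)
  df[, c(setdiff(names(df), id_cols), id_cols), drop = FALSE]
}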
