-
Notifications
You must be signed in to change notification settings - Fork 80
Description
First of all thanks for making such a useful package!
However I have been trying to skim a dataset with lots of columns (thousands) and noticed very poor performance. The full dataset with >20,000 columns ran overnight without completing, so it seems the performance gets exponentially worse. I found similar issues being discussed in #370 and elsewhere.
I had a look to see where the bottleneck was and the skim_by_type methods and the build_results function called are both very slow when there are lots of columns. This looks to me like an issue with I refactored dplyr::across when running summarize with lots of column / function pairs. Notwithstanding thatskim_by_type to improved performance by a factor of 25 for a 100,000 x 1,500 dataset. The larger dataset I am working with (which previously did not complete overnight) runs in ~1 minute. I may be missing something here so apologies in advance if so.
I am happy to make a branch to demonstrate this properly/ open for improvement but for now see below for a reproducible example showing the relatively performance for the refactored skim_by_type function, which should be able to replace all 3 of the current methods:
library(tidyverse)
library(skimr)
library(stringi)
library(microbenchmark)
#### Make some wide test data
number_rows <- 100000
coltypes <- c(rep("numeric",500), rep("date",500), rep("char", 500))
large_test_df <- bind_cols(
map(seq_len(length(coltypes)), function(col){
if(coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0,1,2,3), 1))
if(coltypes[col] == "date") content <- sample(seq(as.Date('1950-01-01'), Sys.Date(), by="day"), number_rows, replace = T)
if(coltypes[col] == "char") content <- sample(stri_rand_strings(500, sample(c(1,3,5), 1)), number_rows, replace = T)
tibble(!!sym(str_c("col",col)) := content)
}
))
#### save the original function and define refactored function
skim_by_type_original <- skimr:::skim_by_type
skim_by_type_refactor <- function(mangled_skimmers, variable_names, data){
group_columns <- groups(data)
data <- as_tibble(data)
map_df(variable_names, function(skim_var){
pivot_longer(select(data, !! skim_var, !!! group_columns), cols = all_of(skim_var), names_to = "skim_variable", values_to = "values") %>%
group_by(skim_variable, !!! group_columns) %>%
summarize(across(values, mangled_skimmers$funs), .groups = "drop")
}) %>%
rename_with(~str_remove(.x,"values_~!@#\\$%\\^\\&\\*\\(\\)-\\+"), everything()) %>%
arrange(match(skim_variable, variable_names))
}
#### run test
microbenchmark(
{
assignInNamespace("skim_by_type", skim_by_type_original, ns="skimr")
full_version <- skim(large_test_df)
},
{
assignInNamespace("skim_by_type", skim_by_type_refactor, ns="skimr")
full_version <- skim(large_test_df)
},
times = 5
)Unit: seconds
expr
{ assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr") full_version <- skimr::skim(large_test_df) }
{ assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr") full_version <- skimr::skim(large_test_df) }
min lq mean median uq max neval
251.1866 256.76540 256.68652 256.94793 258.26105 260.27161 5
10.1157 10.18126 10.30293 10.28105 10.35256 10.58411 5