Skip to content

Improve performance (for big datasets) #673

@hms1

Description

@hms1

First of all thanks for making such a useful package!

However I have been trying to skim a dataset with lots of columns (thousands) and noticed very poor performance. The full dataset with >20,000 columns ran overnight without completing, so it seems the performance gets exponentially worse. I found similar issues being discussed in #370 and elsewhere.

I had a look to see where the bottleneck was and the skim_by_type methods and the build_results function called are both very slow when there are lots of columns. This looks to me like an issue with dplyr::across when running summarize with lots of column / function pairs. Notwithstanding that I refactored skim_by_type to improved performance by a factor of 25 for a 100,000 x 1,500 dataset. The larger dataset I am working with (which previously did not complete overnight) runs in ~1 minute. I may be missing something here so apologies in advance if so.

I am happy to make a branch to demonstrate this properly/ open for improvement but for now see below for a reproducible example showing the relatively performance for the refactored skim_by_type function, which should be able to replace all 3 of the current methods:

library(tidyverse)
library(skimr)
library(stringi)
library(microbenchmark)

#### Make some wide test data
number_rows <- 100000
coltypes <- c(rep("numeric",500), rep("date",500), rep("char", 500))

large_test_df <- bind_cols(
	map(seq_len(length(coltypes)), function(col){
		if(coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0,1,2,3), 1))
		if(coltypes[col] == "date")    content <- sample(seq(as.Date('1950-01-01'), Sys.Date(), by="day"), number_rows, replace = T)
		if(coltypes[col] == "char")    content <- sample(stri_rand_strings(500, sample(c(1,3,5), 1)), number_rows, replace = T)
		tibble(!!sym(str_c("col",col)) := content)
	}
	))

#### save the original function and define refactored function
skim_by_type_original <- skimr:::skim_by_type

skim_by_type_refactor <- function(mangled_skimmers, variable_names, data){
	group_columns <- groups(data)
	data <- as_tibble(data)
	map_df(variable_names, function(skim_var){
		pivot_longer(select(data, !! skim_var, !!! group_columns), cols = all_of(skim_var), names_to = "skim_variable", values_to = "values") %>%
			group_by(skim_variable, !!! group_columns) %>%
			summarize(across(values, mangled_skimmers$funs), .groups = "drop")
	})	%>%
		rename_with(~str_remove(.x,"values_~!@#\\$%\\^\\&\\*\\(\\)-\\+"), everything()) %>%
		arrange(match(skim_variable, variable_names))
}

#### run test
microbenchmark(
	{
		assignInNamespace("skim_by_type", skim_by_type_original, ns="skimr")
		full_version <- skim(large_test_df)
	},
	{
		assignInNamespace("skim_by_type", skim_by_type_refactor, ns="skimr")
		full_version <- skim(large_test_df)
	},
	times = 5
)
Unit: seconds
                                                                                                                                   expr
 {     assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr")     full_version <- skimr::skim(large_test_df) }
 {     assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr")     full_version <- skimr::skim(large_test_df) }
      min        lq      mean    median        uq       max neval
 251.1866 256.76540 256.68652 256.94793 258.26105 260.27161     5
  10.1157  10.18126  10.30293  10.28105  10.35256  10.58411     5

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions