Improve performance (for big datasets)

First of all thanks for making such a useful package!

However I have been trying to `skim` a dataset with lots of columns (thousands) and noticed very poor performance. The full dataset with >20,000 columns ran overnight without completing, so it seems the performance gets exponentially worse. I found similar issues being discussed in #370 and elsewhere.

I had a look to see where the bottleneck was and the `skim_by_type` methods and the `build_results` function called are both very slow when there are lots of columns. ~~This looks to me like an issue with `dplyr::across` when running `summarize` with lots of column / function pairs. Notwithstanding that~~ I refactored `skim_by_type` to improved performance by a factor of 25 for a 100,000 x 1,500 dataset. The larger dataset I am working with (which previously did not complete overnight) runs in ~1 minute.  I may be missing something here so apologies in advance if so.

I am happy to make a branch to demonstrate this properly/ open for improvement but for now see below for a reproducible example showing the relatively performance for the refactored `skim_by_type` function, which should be able to replace all 3 of the current methods:

```R
library(tidyverse)
library(skimr)
library(stringi)
library(microbenchmark)

#### Make some wide test data
number_rows <- 100000
coltypes <- c(rep("numeric",500), rep("date",500), rep("char", 500))

large_test_df <- bind_cols(
	map(seq_len(length(coltypes)), function(col){
		if(coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0,1,2,3), 1))
		if(coltypes[col] == "date")    content <- sample(seq(as.Date('1950-01-01'), Sys.Date(), by="day"), number_rows, replace = T)
		if(coltypes[col] == "char")    content <- sample(stri_rand_strings(500, sample(c(1,3,5), 1)), number_rows, replace = T)
		tibble(!!sym(str_c("col",col)) := content)
	}
	))

#### save the original function and define refactored function
skim_by_type_original <- skimr:::skim_by_type

skim_by_type_refactor <- function(mangled_skimmers, variable_names, data){
	group_columns <- groups(data)
	data <- as_tibble(data)
	map_df(variable_names, function(skim_var){
		pivot_longer(select(data, !! skim_var, !!! group_columns), cols = all_of(skim_var), names_to = "skim_variable", values_to = "values") %>%
			group_by(skim_variable, !!! group_columns) %>%
			summarize(across(values, mangled_skimmers$funs), .groups = "drop")
	})	%>%
		rename_with(~str_remove(.x,"values_~!@#\\$%\\^\\&\\*\$\$-\\+"), everything()) %>%
		arrange(match(skim_variable, variable_names))
}

#### run test
microbenchmark(
	{
		assignInNamespace("skim_by_type", skim_by_type_original, ns="skimr")
		full_version <- skim(large_test_df)
	},
	{
		assignInNamespace("skim_by_type", skim_by_type_refactor, ns="skimr")
		full_version <- skim(large_test_df)
	},
	times = 5
)
```

```
Unit: seconds
                                                                                                                                   expr
 {     assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr")     full_version <- skimr::skim(large_test_df) }
 {     assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr")     full_version <- skimr::skim(large_test_df) }
      min        lq      mean    median        uq       max neval
 251.1866 256.76540 256.68652 256.94793 258.26105 260.27161     5
  10.1157  10.18126  10.30293  10.28105  10.35256  10.58411     5
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve performance (for big datasets) #673

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improve performance (for big datasets) #673

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions