Gracefully handle NAs in predictors #8

@holub008

Currently, if the response contains an NA, a clear error message is thrown:

data <- data.frame(x = rnorm(50), y = c(rnorm(49), NA))
m <- xrf(y ~x, data, family = 'gaussian', xgb_control = list(nrounds=1, max_depth=2))

Error in xrf_preconditions(family, xgb_control, glm_control, data, response_var,  : 
  Response variable contains missing values which is not allowed

However, if any predictor contains an NA, the *model.matrix implementation silently drops the row. The design matrix then has fewer rows than the response vector, which produces a confusing error from xgboost:

data <- data.frame(y = rnorm(50), x = c(rnorm(49), NA))
m <- xrf(y ~x, data, family = 'gaussian', xgb_control = list(nrounds=1, max_depth=2))

Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) : 
  The length of labels must equal to the number of rows in the input data
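
The row mismatch can be reproduced with base R alone. This is a minimal sketch that assumes R's default `options(na.action = "na.omit")`; it does not show xrf's internals, only the `model.matrix` behavior described above.

```r
# Assumes the default na.action ("na.omit"); a different global setting
# would change the behavior shown here.
set.seed(1)
data <- data.frame(y = rnorm(50), x = c(rnorm(49), NA))

# model.matrix builds a model frame first, which omits rows with NA predictors.
design <- model.matrix(y ~ x, data)

n_design_rows <- nrow(design)   # rows that survive into the design matrix
n_labels <- length(data$y)      # labels taken from the untouched data frame

# Passing a model frame built with na.pass keeps every row, NAs included.
design_all <- model.matrix(y ~ x, model.frame(y ~ x, data, na.action = na.pass))
n_design_all_rows <- nrow(design_all)
```

With one NA in `x`, `n_design_rows` is one less than `n_labels`, which matches the length complaint in the error above.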

Several fixes may make sense:

  • Fail fast & clearly with a preconditions check
  • Offer several (configurable) remediation methods, like dropping offending rows or mean/mode imputation.
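
Both options could be sketched as a single helper that runs before the design matrix is built. The function name `handle_predictor_nas` and its `method` argument are hypothetical and are not part of xrf; the error wording only mirrors the existing response check.

```r
# Hypothetical helper, not part of xrf: deals with NAs in predictor columns
# before the data reaches model.matrix.
handle_predictor_nas <- function(data, predictors,
                                 method = c("fail", "drop", "impute")) {
  method <- match.arg(method)
  has_na <- vapply(data[predictors], anyNA, logical(1))
  if (!any(has_na)) {
    return(data)
  }
  offending <- predictors[has_na]

  if (method == "fail") {
    # Fail fast with a message that names the offending columns.
    stop("Predictor(s) contain missing values which is not allowed: ",
         paste(offending, collapse = ", "), call. = FALSE)
  }
  if (method == "drop") {
    # Drop every row with an NA in any offending predictor.
    return(data[complete.cases(data[offending]), , drop = FALSE])
  }

  # method == "impute": mean for numeric columns, mode for factor/character.
  for (col in offending) {
    values <- data[[col]]
    if (is.numeric(values)) {
      fill <- mean(values, na.rm = TRUE)
    } else {
      counts <- table(values)  # table() excludes NA by default
      fill <- names(counts)[which.max(counts)]
    }
    values[is.na(values)] <- fill
    data[[col]] <- values
  }
  data
}

# Usage on the data frame from the report above.
set.seed(1)
data <- data.frame(y = rnorm(50), x = c(rnorm(49), NA))
dropped <- handle_predictor_nas(data, "x", method = "drop")
imputed <- handle_predictor_nas(data, "x", method = "impute")
```

Defaulting to `"fail"` keeps the behavior consistent with the response check, so dropping or imputing rows only happens when the caller asks for it.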
