ldiversity() underestimates distinct l-diversity when quasi-identifiers contain missing values #363

@MuellerRoman

Dear sdcMicro maintainers,

When computing distinct l-diversity with ldiversity() in sdcMicro, equivalence classes in which one or more quasi-identifiers contain missing values appear to receive an l-diversity value that is too low.

Minimal reproducible example:

library(sdcMicro)

# create test data (EC = equivalence class)
data <- data.frame(
    sex = c(
        "female","female","female",   # EC1 (problematic)
        "male","male",                # EC2 (ok)
        "female","female"             # EC3 (ok)
    ),
    occupation = c(
        NA, NA, NA,                   # EC1: missing QI
        "teacher","teacher",          # EC2
        "nurse","nurse"               # EC3
    ),
    ethnicity = c(
        "other","other","other",       # EC1
        "other","other",               # EC2
        "majority","majority"          # EC3
    ),
    sensitive = c(
        1, 1, 0,                       # EC1 → two distinct values
        1, 0,                          # EC2 → two distinct values
        0, 1                           # EC3 → two distinct values
    )
)

# quasi-identifier variables
qi_vars   <- c("sex", "occupation", "ethnicity")

# create sdc object
sdcObj <- createSdcObj(data,
                       keyVars = qi_vars,
                       sensibleVar = "sensitive")

# compute l-diversity
ldiv_res <- ldiversity(sdcObj)

# extract l-diversity values
ldiv_res <- head(ldiv_res@risk$ldiversity, nrow(data))

# join quasi-identifier information
ldiv_res <- cbind(data, ldiv_res)
print(ldiv_res[, 1:5])

Output:

     sex occupation ethnicity sensitive sensitive_Distinct_Ldiversity
1 female       <NA>     other         1                             1
2 female       <NA>     other         1                             1
3 female       <NA>     other         0                             1
4   male    teacher     other         1                             2
5   male    teacher     other         0                             2
6 female      nurse  majority         0                             2
7 female      nurse  majority         1                             2

The three records with sex = "female", occupation = NA, ethnicity = "other" have the sensitive values 1, 1 and 0, i.e., two distinct values, so the expected distinct l-diversity for this group is $l = 2$. However, the ldiversity() output reports $l = 1$ for all three records.
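
For comparison, here is a minimal base-R sketch of the values I would expect. It does not call sdcMicro at all. It assumes that records with the same missing quasi-identifier are grouped together, i.e., NA is treated as its own category when forming equivalence classes. That grouping rule is my assumption for illustration and may not match how sdcMicro handles missing key values internally. The names ec_key and expected_l are only used in this sketch.

```r
# Same test data as in the example above
data <- data.frame(
    sex        = c(rep("female", 3), rep("male", 2), rep("female", 2)),
    occupation = c(NA, NA, NA, "teacher", "teacher", "nurse", "nurse"),
    ethnicity  = c(rep("other", 5), rep("majority", 2)),
    sensitive  = c(1, 1, 0, 1, 0, 0, 1)
)
qi_vars <- c("sex", "occupation", "ethnicity")

# One key string per record. paste() renders NA as the string "NA",
# so records with a missing occupation share a key (assumption: NA is
# treated as its own category).
ec_key <- do.call(paste, c(data[qi_vars], sep = "|"))

# Number of distinct sensitive values within each equivalence class,
# repeated for every record of that class
expected_l <- ave(data$sensitive, ec_key,
                  FUN = function(x) length(unique(x)))

print(cbind(data, expected_l))
```

Under this grouping rule, expected_l is 2 for every record, including the three records with the missing occupation.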

Suspected location of the problem: Measure_Risk.h, line 577 and following.

Thanks!
