diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml index b9dde14c..5cae051e 100644 --- a/.github/workflows/R-CMD-check.yaml +++ b/.github/workflows/R-CMD-check.yaml @@ -30,6 +30,10 @@ jobs: env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + # Run skip_on_cran() tests in CI (incl. the varPro grow tests): CI is + # not a sanitizer build, so the upstream UBSAN path is harmless here; + # only CRAN's own check machines (NOT_CRAN unset) skip them. + NOT_CRAN: "true" R_KEEP_PKG_SOURCE: yes # Skip vdiffr visual-regression tests in CI until reference SVGs are # committed. To regenerate: run testthat::snapshot_accept() locally, diff --git a/.github/workflows/check-release.yaml b/.github/workflows/check-release.yaml index b2217c66..c7944d19 100644 --- a/.github/workflows/check-release.yaml +++ b/.github/workflows/check-release.yaml @@ -13,6 +13,10 @@ jobs: runs-on: ubuntu-latest env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + # Run skip_on_cran() tests in CI (incl. the varPro grow tests): CI is + # not a sanitizer build, so the upstream UBSAN path is harmless here; + # only CRAN's own check machines (NOT_CRAN unset) skip them. + NOT_CRAN: "true" R_KEEP_PKG_SOURCE: yes VDIFFR_RUN_TESTS: "false" steps: diff --git a/.github/workflows/check-standard.yaml b/.github/workflows/check-standard.yaml index d6ea388d..a44f74c9 100644 --- a/.github/workflows/check-standard.yaml +++ b/.github/workflows/check-standard.yaml @@ -29,6 +29,10 @@ jobs: env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + # Run skip_on_cran() tests in CI (incl. the varPro grow tests): CI is + # not a sanitizer build, so the upstream UBSAN path is harmless here; + # only CRAN's own check machines (NOT_CRAN unset) skip them. + NOT_CRAN: "true" R_KEEP_PKG_SOURCE: yes VDIFFR_RUN_TESTS: "false" diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml index 592acc2e..d8507ba7 100644 --- a/.github/workflows/test-coverage.yaml +++ b/.github/workflows/test-coverage.yaml @@ -14,6 +14,10 @@ jobs: runs-on: ubuntu-latest env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + # Run skip_on_cran() tests in CI (incl. the varPro grow tests): CI is + # not a sanitizer build, so the upstream UBSAN path is harmless here; + # only CRAN's own check machines (NOT_CRAN unset) skip them. + NOT_CRAN: "true" VDIFFR_RUN_TESTS: "false" steps: diff --git a/DESCRIPTION b/DESCRIPTION index 4db36866..65152ae1 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,8 +1,8 @@ Package: ggRandomForests Type: Package Title: Visually Exploring Random Forests -Version: 3.1.0.9000 -Date: 2026-06-11 +Version: 3.1.1 +Date: 2026-06-12 Authors@R: person("John", "Ehrlinger", role = c("aut", "cre"), email = "john.ehrlinger@gmail.com") diff --git a/NEWS.md b/NEWS.md index 987c98ff..0c2366cb 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,9 +1,25 @@ Package: ggRandomForests -Version: 3.1.0.9000 +Version: 3.1.1 -ggRandomForests v3.1.0.9000 (development) -========================================= -* Development version opened after the v3.1.0 CRAN release. +ggRandomForests v3.1.1 +====================== +* CRAN fix: the varPro tests now call `skip_on_cran()` so they do not run + on CRAN's check machines, including the gcc-UBSAN additional check. They + were triggering an upstream `randomForestSRC` sanitizer issue (a 0-length + array access in `rfsrcGrow`, `entry.c:184`) that surfaces when any + `varPro` grow (`varpro()`, `beta.varpro()`, `uvarpro()`, `isopro()`, + `ivarpro()`) builds a forest. ggRandomForests is pure R and its code is + unchanged; the varPro tests still run in our CI (the workflows set + `NOT_CRAN=true`) and locally; they are skipped only on CRAN's check + machines, including the gcc-UBSAN check. The upstream issue has been + reported to the randomForestSRC maintainers. +* The `varpro` vignette now loads every varPro fit from a precomputed + file (`vignettes/varpro_precomputed.rds`, built by + `vignettes/precompute_varpro.R`), so the vignette performs no live + varPro grow during `R CMD check`. This removes the same upstream + sanitizer path from the vignette build and trims check time. Each chunk + falls back to a live fit if the precomputed object is absent, so the + vignette remains reproducible from source. ggRandomForests v3.1.0 ====================== diff --git a/cran-comments.md b/cran-comments.md index 561003d6..5a415c50 100644 --- a/cran-comments.md +++ b/cran-comments.md @@ -1,88 +1,56 @@ -## v3.1.0 — varPro integration + bug fix (major release) +## v3.1.1 — CRAN fix (gcc-UBSAN additional check) -### Resubmission +This patch addresses the gcc-UBSAN "additional issue" flagged for 3.1.0 +(email from Prof Ripley, 2026-06-12; correct before 2026-06-29). -This resubmission addresses the one issue raised on the previous 3.1.0 -submission: the overall check time exceeded 10 minutes on -r-devel-windows-x86_64, driven by the vignette rebuild (~331 s) and the -tests (~209 s). We have reduced both, per the three levers suggested -(small toy data, fewer iterations, precomputed results for the lengthiest -parts): +### The issue -- The regression and survival vignettes use lighter forests (`ntree` - 200 / 150, imputation `ntree` 100) and coarser partial-dependence grids. -- The varpro vignette's three `gg_partial_varpro()` calls and the Boston - `beta.varpro()` fit — together ~34 s, the bulk of that vignette's - rebuild — are now precomputed offline (`vignettes/precompute_varpro.R`) - and loaded from a 167 KB `vignettes/varpro_precomputed.rds`, with an - automatic live-computation fallback if the file is absent. -- The `gg_udependent()` tests previously recomputed the same per-fit - entropy matrix (~1.5 s) once per test; they now memoise it. +The sanitizer reports, once, during the package tests: -Locally the full vignette rebuild drops from ~80 s to ~38 s and the test -suite from ~44 s to ~36 s (R 4.6.0, aarch64-apple-darwin23), with no -change in test coverage or vignette content. +``` +entry.c:184:55: runtime error: pointer index expression with base +0x000000000001 overflowed to 0xfffffffffffffff9 + #0 rfsrcGrow .../randomForestSRC/src/entry.c:184 +``` -### Release context +This is a 0-length array access in the compiled code of the +`randomForestSRC` package (`rfsrcGrow`, `entry.c:184`), reached when +`varPro::varpro()` / `varPro::beta.varpro()` grow a forest in our tests. +ggRandomForests is a pure-R package (`NeedsCompilation: no`); it contains +no C/C++/Fortran of its own, so the overflow is not in our code. It is +surfaced because our test suite exercises that code path. -v2.7.3 is the version currently published on CRAN. A v3.0.0 submission -(2026-05-28) cleared the Windows and Debian pretests (0/0/0) but did not -complete the review cycle; this 3.1.0 release supersedes it, carrying the -full v3.0.0 feature layer plus the v3.1.0 bug fix and documentation work. -The version is in 3.x territory because it adds a substantial new feature -layer and soft-deprecates one user-facing function. +### The fix -### Changes in v3.1.0 +Two changes, neither touching package R code (ggRandomForests remains +`NeedsCompilation: no`): -- **Bug fix.** `gg_vimp()` on single-outcome `rfsrc` forests now correctly - flags variables with non-positive VIMP in the `positive` column (used for - plot colouring). The single-outcome fit names the column `VIMP` - (uppercase) while the flag check read `$vimp` (lowercase), so `positive` - stayed `TRUE` for every variable. Added a regression test. -- **Documentation.** Deepened the varPro-family and rfsrc importance / - partial / survival help pages, made the `gg_vimp()` (permutation - importance) vs `gg_varpro()` (release-rule importance) distinction - explicit and cross-linked, and fixed a stale `@return` in `gg_roc()`. No - user-facing behaviour change beyond the bug fix above. +1. The tests that grow a varPro forest now call `testthat::skip_on_cran()`, + so they do not run on CRAN's check machines (including the gcc-UBSAN + check). They still run in our CI (the workflows set `NOT_CRAN=true`) + and locally for users who run `devtools::test()`; only CRAN's check + machines, where `NOT_CRAN` is unset, skip them. +2. The `varpro` vignette now loads every varPro fit from a precomputed + file (`vignettes/varpro_precomputed.rds`) instead of growing forests + live, so the vignette build performs no varPro grow under `R CMD check` + and cannot surface the same upstream path. Each chunk falls back to a + live fit if the file is absent, so the vignette is still reproducible + from source. -### Major changes carried from v3.0.0 +The upstream issue has been reported to the randomForestSRC maintainers. -- **New varPro wrapper family.** Tidy extractors and `plot()` methods for - the `varPro` package: `gg_partial_varpro()` (partial dependence), - `gg_varpro()` (variable importance), `gg_udependent()` (cross-variable - dependency graph), `gg_isopro()` (isolation-forest anomaly scores), - `gg_beta_varpro()` (per-rule beta importance), and `gg_ivarpro()` - (individual / local importance), each with `print` / `summary` / - `autoplot` companions and a dedicated "varpro" vignette. -- **Soft-deprecation.** `gg_partialpro()` now warns and forwards to - `gg_partial_varpro()`; it will be removed in a future release. -- **randomForest engine fixes.** `gg_variable.randomForest()` - classification, `gg_roc()` / `calc_roc()` for `randomForest` (true - probability-based, macro-averaged ROC), per-class one-vs-rest ROC - (`per_class = TRUE`), and `plot.gg_variable()` now returns a single - `ggplot` / `patchwork` object rather than a bare list. -- **Importance-plot ordering.** All importance plots now place the - most-important variable at the top, matching the `gg_vimp` convention. - -### Dependency change (reverse-dependency safe) - -`randomForestSRC` and `randomForest` move from `Depends:` to `Imports:`, -and `varPro` is added to `Imports:`. `library(ggRandomForests)` no longer -attaches the forest packages to the search path. There are **0 reverse -dependencies** on CRAN, so no downstream packages are affected. - -## Test environments +### Test environments * **Local:** R 4.6.0 on macOS (aarch64-apple-darwin23). - `R CMD check --as-cran` returns 0 errors, 0 warnings, 0 notes. + `R CMD check --as-cran` returns 0 errors, 0 warnings, 0 notes; the + varPro tests skip under the CRAN check environment as intended. * **GitHub Actions matrix:** ubuntu-latest (R-devel / R-release / - R-oldrel-1), windows-latest (R-release), macos-latest (R-release), - all green on the head commit. -* **Reverse-dependency check:** `tools::package_dependencies(reverse = - TRUE)` returns 0. -* **URLs:** `urlchecker::url_check()` clean. + R-oldrel-1), windows-latest (R-release), macos-latest (R-release). +* **Reverse-dependency check:** 0 reverse dependencies on CRAN. -## NOTE disposition +### NOTE disposition -No notes in local `R CMD check --as-cran`. Examples that fit survival -forests are wrapped in `\donttest` and run in a few seconds each. +One NOTE on the incoming feasibility check, "Days since last update: 1". +This is expected: 3.1.0 was published 2026-06-11 and this patch is the +gcc-UBSAN fix the CRAN team requested (2026-06-12 email, correct before +2026-06-29), so the short interval is unavoidable. No other notes. diff --git a/dev/randomForestSRC-ubsan-report.md b/dev/randomForestSRC-ubsan-report.md new file mode 100644 index 00000000..c489f44b --- /dev/null +++ b/dev/randomForestSRC-ubsan-report.md @@ -0,0 +1,90 @@ +# UBSAN: 0-length-array pointer arithmetic in `rfsrcGrow` (entry.c:184) + +**Package:** randomForestSRC (observed in 3.6.2, latent in earlier versions +incl. 3.3.5) +**Reporter:** John Ehrlinger (ggRandomForests maintainer) +**Severity:** Undefined behaviour flagged by gcc-UBSAN; numerically benign in +practice, but CRAN treats it as an "Additional issue." + +## What CRAN reports + +gcc-UBSAN, surfaced via the ggRandomForests test suite (which grows varPro +forests): + +``` +entry.c:184:55: runtime error: pointer index expression with base +0x000000000001 overflowed to 0xfffffffffffffff9 + #0 rfsrcGrow .../randomForestSRC/src/entry.c:184 +``` + +## Root cause + +`rfsrcGrow()` builds 1-offset working pointers from incoming R weight vectors: + +```c +// src/entry.c (3.6.2), lines 183-185 +RF_xWeightStat = REAL(xWeightStat); RF_xWeightStat--; // 183 +RF_yWeight = REAL(yWeight); RF_yWeight--; // 184 <-- UBSAN +RF_xWeight = REAL(xWeight); RF_xWeight--; // 185 +``` + +When the corresponding R vector has **length 0**, `REAL()` returns R's +empty-vector sentinel address `0x1`. The unconditional `ptr--` then forms +`0x1 - sizeof(double) = 0x1 - 8 = 0xfffffffffffffff9`. *Forming* this +out-of-bounds pointer is undefined behaviour (UBSAN, column 55 = the `--`), +even though the package never dereferences `RF_yWeight[i]` when the weight is +empty — so the result is numerically correct and the bug is otherwise silent. + +`yWeight` (R-side `yvar.wt`) is length 0 whenever the y-weight vector is +empty. The R glue produces exactly that for the **unsupervised** family: + +```r +# R/rfsrc.R +if (family == "unspv") { + yvar.wt <- NULL # -> as.double(NULL) -> numeric(0) -> length-0 yWeight +} else { + yvar.wt <- get.weight(yvar.wt, length(yvar.types)) +} +``` + +So every unsupervised forest already forms this pointer; varPro reaches it +because its rule-grow machinery fits forests through that path. + +### Minimal reproduction (no varPro needed) + +```r +library(randomForestSRC) +d <- data.frame(y = rnorm(60), x1 = rnorm(60), x2 = rnorm(60)) +rfsrc(Unsupervised() ~ ., d, ntree = 10) # yvar.wt is numeric(0) -> entry.c:184 +``` + +Confirmed by tracing `randomForestSRC:::get.weight`: the unsupervised path +returns a length-0 `yvar.wt` (`n=0, outlen=0`); regression returns length 1. + +Note the sibling pointers `RF_subjWeight`, `RF_eventType`, etc. are guarded +with `if (VECTOR_ELT(...) != R_NilValue)`, but these three are not — and +`as.double(numeric(0))` is never `R_NilValue`, so nothing catches the empty +case. + +## Suggested fix + +Only decrement when the vector is non-empty (mirrors the existing +NULL-guarded idiom; the empty pointer is never indexed downstream): + +```c +RF_xWeightStat = REAL(xWeightStat); if (LENGTH(xWeightStat) > 0) RF_xWeightStat--; +RF_yWeight = REAL(yWeight); if (LENGTH(yWeight) > 0) RF_yWeight--; +RF_xWeight = REAL(xWeight); if (LENGTH(xWeight) > 0) RF_xWeight--; +``` + +(In this repo's amalgamated `src/randomForestSRC.c` the same three lines are +at ~36965-36967.) The identical guard should be applied anywhere else +`rfsrcGrow`/`rfsrcPredict` does `REAL(w); w--` / `INTEGER(w); w--` on a vector +that can arrive length 0. + +## Downstream impact + +ggRandomForests 3.1.1 works around this by not exercising the path under +`R CMD check` (varPro tests `skip_on_cran()`; the varpro vignette loads +precomputed fits). A guard in randomForestSRC would remove the UB at the +source for all callers. diff --git a/tests/testthat/helper-varpro-fixtures.R b/tests/testthat/helper-varpro-fixtures.R index 2f3a3c8f..87f251da 100644 --- a/tests/testthat/helper-varpro-fixtures.R +++ b/tests/testthat/helper-varpro-fixtures.R @@ -5,6 +5,7 @@ .varpro_cache <- new.env(parent = emptyenv()) .varpro_mtcars <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$v)) { if (!requireNamespace("varPro", quietly = TRUE)) { testthat::skip("varPro not installed") @@ -16,6 +17,7 @@ } .beta_fit_mtcars <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$b)) { v <- .varpro_mtcars() set.seed(20260526L) @@ -25,6 +27,7 @@ } .varpro_iris_binary <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$vb)) { if (!requireNamespace("varPro", quietly = TRUE)) testthat::skip("varPro not installed") set.seed(20260526L) @@ -36,6 +39,7 @@ } .beta_fit_iris_binary <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$bb)) { set.seed(20260526L) .varpro_cache$bb <- varPro::beta.varpro(.varpro_iris_binary()) @@ -44,6 +48,7 @@ } .varpro_iris_multiclass <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$vm)) { if (!requireNamespace("varPro", quietly = TRUE)) testthat::skip("varPro not installed") set.seed(20260526L) @@ -53,6 +58,7 @@ } .beta_fit_iris_multiclass <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$bm)) { set.seed(20260526L) .varpro_cache$bm <- varPro::beta.varpro(.varpro_iris_multiclass()) @@ -61,6 +67,7 @@ } .ivarpro_boston <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$iv_boston)) { if (!requireNamespace("varPro", quietly = TRUE)) testthat::skip("varPro not installed") if (!requireNamespace("MASS", quietly = TRUE)) testthat::skip("MASS not installed") @@ -73,6 +80,7 @@ } .varpro_boston <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$v_boston)) { invisible(.ivarpro_boston()) # populates v_boston as a side-effect } @@ -80,6 +88,7 @@ } .ivarpro_iris_binary <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$iv_iris_binary)) { if (!requireNamespace("varPro", quietly = TRUE)) testthat::skip("varPro not installed") set.seed(20260526L) @@ -93,11 +102,13 @@ } .varpro_iris_binary_for_ivarpro <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$v_iris_binary)) invisible(.ivarpro_iris_binary()) .varpro_cache$v_iris_binary } .ivarpro_iris_multiclass <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$iv_iris_multi)) { if (!requireNamespace("varPro", quietly = TRUE)) testthat::skip("varPro not installed") set.seed(20260526L) @@ -109,6 +120,7 @@ } .varpro_iris_multiclass_for_ivarpro <- function() { + testthat::skip_on_cran() if (is.null(.varpro_cache$v_iris_multi)) invisible(.ivarpro_iris_multiclass()) .varpro_cache$v_iris_multi } diff --git a/tests/testthat/test_gg_udependent.R b/tests/testthat/test_gg_udependent.R index 39a2f4c2..440ca6d6 100644 --- a/tests/testthat/test_gg_udependent.R +++ b/tests/testthat/test_gg_udependent.R @@ -3,6 +3,7 @@ ## ── Helpers ────────────────────────────────────────────────────────────────── make_uvp <- function(ntree = 25L) { + testthat::skip_on_cran() set.seed(42L) varPro::uvarpro(iris[, -5L], ntree = ntree) } diff --git a/tests/testthat/test_gg_varpro.R b/tests/testthat/test_gg_varpro.R index 68cba93a..ef99ab63 100644 --- a/tests/testthat/test_gg_varpro.R +++ b/tests/testthat/test_gg_varpro.R @@ -4,12 +4,14 @@ # Regression fit — fast, always available (mtcars is base R) make_vp_regr <- function(ntree = 25L) { + testthat::skip_on_cran() set.seed(42L) varPro::varpro(mpg ~ ., data = mtcars, ntree = ntree) } # Classification fit — iris, always available make_vp_class <- function(ntree = 25L) { + testthat::skip_on_cran() set.seed(42L) varPro::varpro(Species ~ ., data = iris, ntree = ntree) } diff --git a/vignettes/precompute_varpro.R b/vignettes/precompute_varpro.R index e7b9dfbd..47b83377 100644 --- a/vignettes/precompute_varpro.R +++ b/vignettes/precompute_varpro.R @@ -1,16 +1,23 @@ -# Precompute the expensive varPro objects used by the varpro vignette. +# Precompute the varPro objects used by the varpro vignette. # -# The three gg_partial_varpro() calls (~31s) and the Boston beta.varpro() -# fit (~3s) dominate the vignette rebuild time. Computing them once here and -# loading the result in varpro.qmd keeps the R CMD check vignette rebuild -# fast (CRAN flagged the overall check time). Every other varPro call in the -# vignette is sub-second and stays live. +# Two reasons this file exists: +# +# 1. Speed. The three gg_partial_varpro() calls (~31s) and the beta.varpro() +# fits dominate the vignette rebuild time (CRAN flagged the overall check +# time). +# 2. Safety. Every varPro grow (varpro(), uvarpro(), isopro(), ivarpro(), +# beta.varpro()) reaches randomForestSRC's compiled rule-grow path, which +# trips a gcc-UBSAN "0-length array" report (rfsrcGrow, entry.c:184). By +# loading these objects from disk, the vignette performs NO live varPro +# grow during R CMD check, so it cannot surface that upstream sanitizer +# issue even if a CRAN flavour builds the vignette under UBSAN. # # Run from the package root after changing any of the varpro vignette fits: # Rscript vignettes/precompute_varpro.R # # Produces vignettes/varpro_precomputed.rds. The vignette falls back to live -# computation if that file is absent, so the result is reproducible either way. +# computation for any object missing from that file, so the result is +# reproducible either way. suppressMessages({ library(varPro) @@ -30,30 +37,90 @@ options(mc.cores = 1, rf.cores = 1) # --- Regression: Boston housing ------------------------------------------ data("Boston", package = "MASS") +boston_x <- Boston[, setdiff(names(Boston), "medv")] +set.seed(20260527L) +v_boston <- varPro::varpro(medv ~ ., data = Boston, ntree = 50) +pd_boston <- gg_partial_varpro(object = v_boston) +b_boston <- varPro::beta.varpro(v_boston) +iv_boston <- varPro::ivarpro(v_boston) +set.seed(20260527L) +u_boston <- varPro::uvarpro(boston_x, ntree = 50) set.seed(20260527L) -v_boston <- varPro::varpro(medv ~ ., data = Boston, ntree = 50) -pd_boston <- gg_partial_varpro(object = v_boston) -b_boston <- varPro::beta.varpro(v_boston) +iso_boston <- varPro::isopro(data = boston_x, method = "rnd", + sampsize = 256, ntree = 50) -# --- Classification: iris (multi-class) ---------------------------------- +# --- Classification: iris ------------------------------------------------ +iris_binary <- iris[iris$Species != "setosa", ] +iris_binary$Species <- droplevels(iris_binary$Species) +set.seed(20260527L) +v_iris_binary <- varPro::varpro(Species ~ ., data = iris_binary, ntree = 50) +b_iris_binary <- varPro::beta.varpro(v_iris_binary) set.seed(20260527L) v_iris_multi <- varPro::varpro(Species ~ ., data = iris, ntree = 50) pd_iris_multi <- gg_partial_varpro(object = v_iris_multi) +b_iris_multi <- varPro::beta.varpro(v_iris_multi) # --- Survival: PBC (C-path partial dependence) --------------------------- data(pbc, package = "randomForestSRC") pbc_small <- na.omit(pbc[, c("days", "status", "age", "albumin", "bili", "edema", "platelet")]) set.seed(20260527L) -v_pbc <- varPro::varpro(Surv(days, status) ~ ., data = pbc_small, ntree = 50) -pd_pbc <- gg_partial_varpro(object = v_pbc) +v_pbc <- varPro::varpro(Surv(days, status) ~ ., data = pbc_small, ntree = 50) +pd_pbc <- gg_partial_varpro(object = v_pbc) +set.seed(20260527L) +iso_pbc <- varPro::isopro(data = pbc_small[, c("age", "albumin", "bili", + "platelet")], + method = "rnd", sampsize = 256, ntree = 50) + +# --- Trim to keep the shipped .rds (and the source tarball) small -------- +# The gg_* calls this vignette makes (on its cached path) read only +# importance/rule summaries, not the embedded forests. The survival C-path +# gg_partial_varpro() and gg_isopro(newdata=) DO use the forest, but the +# vignette never invokes those on a stripped object (pd_pbc is cached and +# gg_isopro() is training-path only). Dropping those heavy +# slots takes the file from ~1.6 MB to ~0.4 MB (validated: every vignette +# wrapper call returns output identical to the un-stripped object). Two +# exceptions keep their forest: v_boston is printed in the vignette +# (print.varpro reads $rf), and u_boston feeds gg_udependent(), which uses it. +.strip_varpro <- function(v) { # $rf: unused on the cached path + v$rf <- NULL + v +} +.strip_isopro <- function(o) { # training-path gg_isopro reads only $ntree + o$isoforest <- list(ntree = o$isoforest$ntree) + o +} +.strip_ivarpro <- function(iv) { # drop redundant model/path attrs + for (a in c("ivarpro.path", "model", "data")) attr(iv, a) <- NULL + iv +} +b_boston <- .strip_varpro(b_boston) +v_pbc <- .strip_varpro(v_pbc) +v_iris_binary <- .strip_varpro(v_iris_binary) +v_iris_multi <- .strip_varpro(v_iris_multi) +iso_boston <- .strip_isopro(iso_boston) +iso_pbc <- .strip_isopro(iso_pbc) +iv_boston <- .strip_ivarpro(iv_boston) saveRDS( list( + # Boston (regr) + v_boston = v_boston, pd_boston = pd_boston, b_boston = b_boston, + iv_boston = iv_boston, + u_boston = u_boston, + iso_boston = iso_boston, + # iris (class) + v_iris_binary = v_iris_binary, + b_iris_binary = b_iris_binary, + v_iris_multi = v_iris_multi, pd_iris_multi = pd_iris_multi, - pd_pbc = pd_pbc + b_iris_multi = b_iris_multi, + # PBC (surv) + v_pbc = v_pbc, + pd_pbc = pd_pbc, + iso_pbc = iso_pbc ), file = "vignettes/varpro_precomputed.rds", version = 2, diff --git a/vignettes/varpro.qmd b/vignettes/varpro.qmd index b63f4d48..0bd7bb44 100644 --- a/vignettes/varpro.qmd +++ b/vignettes/varpro.qmd @@ -145,7 +145,12 @@ predictors, median home value as the response. ```{r boston-fit} data("Boston", package = "MASS") set.seed(20260527L) -v_boston <- varPro::varpro(medv ~ ., data = Boston, ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +v_boston <- if (is.null(.vp$v_boston)) { + varPro::varpro(medv ~ ., data = Boston, ntree = 50) +} else { + .vp$v_boston +} v_boston ``` @@ -264,8 +269,12 @@ configurable threshold. The visual is built with `ggraph`, which is in `Suggests` rather than `Imports`; install it if you want this view. ```{r boston-uvarpro, cache=TRUE} -u_boston <- varPro::uvarpro(Boston[, setdiff(names(Boston), "medv")], - ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +u_boston <- if (is.null(.vp$u_boston)) { + varPro::uvarpro(Boston[, setdiff(names(Boston), "medv")], ntree = 50) +} else { + .vp$u_boston +} ``` ```{r boston-gg-udependent, eval = requireNamespace("ggraph", quietly = TRUE)} @@ -294,10 +303,13 @@ on `[0, 1]`; the wrapper's convention is "higher = more anomalous" consistency). ```{r boston-isopro, cache=TRUE} -iso_boston <- varPro::isopro(data = Boston[, setdiff(names(Boston), - "medv")], - method = "rnd", sampsize = 256, - ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +iso_boston <- if (is.null(.vp$iso_boston)) { + varPro::isopro(data = Boston[, setdiff(names(Boston), "medv")], + method = "rnd", sampsize = 256, ntree = 50) +} else { + .vp$iso_boston +} ``` ```{r boston-gg-isopro} @@ -331,7 +343,12 @@ local importances across variables. accepts a pre-computed `ivarpro_fit` for reuse across views. ```{r boston-ivarpro-fit, cache=TRUE} -iv_boston <- varPro::ivarpro(v_boston) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +iv_boston <- if (is.null(.vp$iv_boston)) { + varPro::ivarpro(v_boston) +} else { + .vp$iv_boston +} ``` ```{r boston-gg-ivarpro-distribution} @@ -375,12 +392,22 @@ and the full three-class problem. iris_binary <- iris[iris$Species != "setosa", ] iris_binary$Species <- droplevels(iris_binary$Species) set.seed(20260527L) -v_iris_binary <- varPro::varpro(Species ~ ., data = iris_binary, ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +v_iris_binary <- if (is.null(.vp$v_iris_binary)) { + varPro::varpro(Species ~ ., data = iris_binary, ntree = 50) +} else { + .vp$v_iris_binary +} ``` ```{r iris-fit-multi} set.seed(20260527L) -v_iris_multi <- varPro::varpro(Species ~ ., data = iris, ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +v_iris_multi <- if (is.null(.vp$v_iris_multi)) { + varPro::varpro(Species ~ ., data = iris, ntree = 50) +} else { + .vp$v_iris_multi +} ``` ## Class-conditional importance with `gg_varpro(conditional = TRUE)` @@ -437,7 +464,12 @@ such as 30-day mortality: drop the negative class, set the event class as the last factor level, and read the figure for the event panel. ```{r iris-binary-beta-fit, cache=TRUE} -b_iris_binary <- varPro::beta.varpro(v_iris_binary) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +b_iris_binary <- if (is.null(.vp$b_iris_binary)) { + varPro::beta.varpro(v_iris_binary) +} else { + .vp$b_iris_binary +} ``` ```{r iris-binary-gg-beta-varpro} @@ -449,7 +481,12 @@ class sharing the row order set by the unified ranking, same trick as `gg_varpro(conditional = TRUE)`. ```{r iris-multi-beta-fit, cache=TRUE} -b_iris_multi <- varPro::beta.varpro(v_iris_multi) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +b_iris_multi <- if (is.null(.vp$b_iris_multi)) { + varPro::beta.varpro(v_iris_multi) +} else { + .vp$b_iris_multi +} ``` ```{r iris-multi-gg-beta-varpro} @@ -491,10 +528,12 @@ pbc_small <- pbc[, c("days", "status", "age", "albumin", "bili", "edema", "platelet")] pbc_small <- na.omit(pbc_small) set.seed(20260527L) -v_pbc <- varPro::varpro( - Surv(days, status) ~ ., - data = pbc_small, ntree = 50 -) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +v_pbc <- if (is.null(.vp$v_pbc)) { + varPro::varpro(Surv(days, status) ~ ., data = pbc_small, ntree = 50) +} else { + .vp$v_pbc +} ``` ## Variable importance: `gg_varpro()` @@ -532,9 +571,13 @@ Because `isopro()` only sees the predictor matrix, it doesn't care about the family. The same call from section 3 works here. ```{r pbc-isopro, cache=TRUE} -iso_pbc <- varPro::isopro(data = pbc_small[, c("age", "albumin", "bili", - "platelet")], - method = "rnd", sampsize = 256, ntree = 50) +# Precomputed offline (see precompute_varpro.R); falls back to a live fit. +iso_pbc <- if (is.null(.vp$iso_pbc)) { + varPro::isopro(data = pbc_small[, c("age", "albumin", "bili", "platelet")], + method = "rnd", sampsize = 256, ntree = 50) +} else { + .vp$iso_pbc +} plot(gg_isopro(iso_pbc)) ``` diff --git a/vignettes/varpro_precomputed.rds b/vignettes/varpro_precomputed.rds index d9694604..82eea83b 100644 Binary files a/vignettes/varpro_precomputed.rds and b/vignettes/varpro_precomputed.rds differ