Implementation of the Stochastic Approximation Mean-Shift (SAMS) clustering algorithm proposed by Hyrien and Baran (2016). Mean-shift is an iterative procedure often used as a nonparametric clustering algorithm that defines clusters based on the modal regions of a density function. The algorithm is conceptually appealing and makes assumptions neither about the shape of the clusters nor about their number. However, with a complexity of O(n²) per iteration, it does not scale well to large datasets. SAMS performs density-based clustering much quicker than mean shift, yet delivers virtually identical results. This algorithm combines subsampling and a stochastic approximation procedure to achieve a potential complexity of O(n) at each step. Clusters may be arbitrarily-shaped (e.g., non-Gaussian or nonconvex). The number of clusters is automatically determined by the number of modes of the density estimate built from the data, and the presence of outliers has limited influence over clustering results.
The rsams package provides a complete implementation of the SAMS algorithm with the following features:
- Adaptive bandwidths - Automatic bandwidth selection based on local density
- Parallel processing - Multi-core support for large datasets
- Smart initialization - k-means++ style initialization for faster convergence
- k-NN refinement - Optional post-processing to improve cluster assignments
- Hierarchical clustering - Support for multi-scale analysis
# Install from GitHub
devtools::install_github("HyrienLab/sams-clustering", subdir = "rsams")library(rsams)
# Generate sample data: 2 clusters with 500 points each
set.seed(42)
data1 <- matrix(rnorm(500 * 2, mean = 0, sd = 0.5), ncol = 2)
data2 <- matrix(rnorm(500 * 2, mean = 2, sd = 0.5), ncol = 2)
Y <- rbind(data1, data2)
# Run SAMS clustering
result <- sams_cluster(Y,
rho_k = 0.05,
max_iter = 200,
max_jump = 0.05,
merge_threshold = 0.5,
use_parallel = TRUE)
# View results
cat("Clusters found:", result$n_clusters, "\n")
plot(Y, col = result$clusters + 1, pch = 19)
points(result$cluster_centers, pch = 4, cex = 2, lwd = 2)
# Visualize trajectory convergence
plot_trajectories(result)SAMS can identify arbitrarily-shaped clusters:
library(rsams)
set.seed(123)
# Generate non-Gaussian 2D mixture data
# 1000 points: 35% banana-shaped, 65% skewed
data <- generate_nongaussian_2d(n = 1000, prop_banana = 0.35, seed = 123)
Y <- data$Y
# Step 0: Initial fine-grained clustering
result <- sams_cluster(Y, rho_k = 0.02, max_iter = 200, max_jump = 0.02,
merge_threshold = 0.2, use_parallel = TRUE)
# Step 1: Cluster the cluster centers
result2 <- sams_cluster(Y, X = result$cluster_centers, rho_k = 0.1,
max_iter = 1000, max_jump = 0.1,
merge_threshold = 0.5, use_parallel = TRUE)
# Step 2: Cluster the meta-cluster centers
result3 <- sams_cluster(Y, X = result2$cluster_centers, rho_k = 0.5,
max_iter = 3000, max_jump = 0.2,
merge_threshold = 1.5, use_parallel = TRUE)
# Map original points to final clusters
clusters_step1 <- result2$clusters[result$clusters]
clusters_final <- result3$clusters[clusters_step1]SAMS can further accelerate clustering by iteratively clustering the cluster centers from the previous level (steps). This allows significant reduction of the amount of computation and speed up computing time for large datasets:
library(rsams)
library(MASS)
set.seed(123)
# Generate 5 clusters with 5000 points each
n <- 5000
data1 <- matrix(rnorm(2 * n, mean = 0, sd = 0.25), ncol = 2)
data2 <- matrix(rnorm(2 * n, mean = 1.5, sd = 0.5), ncol = 2)
data3 <- matrix(rnorm(2 * n, mean = -1.5, sd = 0.75), ncol = 2)
data4 <- MASS::mvrnorm(n, mu = c(2, -2), Sigma = matrix(c(0.5, -0.35, -0.35, 0.5), 2))
data5 <- MASS::mvrnorm(n, mu = c(-1.5, 1.5), Sigma = matrix(c(0.5, 0.35, 0.35, 0.5), 2))
Y <- rbind(data1, data2, data3, data4, data5)
# Step 0: Initial fine-grained clustering
result <- sams_cluster(Y, rho_k = 0.001, max_iter = 50,
merge_threshold = 0.1, use_parallel = TRUE)
# Step 1: Cluster the cluster centers
result2 <- sams_cluster(Y, X = result$cluster_centers,
rho_k = 0.05, max_iter = 1000,
merge_threshold = 0.25, use_parallel = TRUE)
# Step 2: Cluster the meta-cluster centers
result3 <- sams_cluster(Y, X = result2$cluster_centers,
rho_k = 0.5, max_iter = 10000,
merge_threshold = 1, use_parallel = TRUE)
# Map original points to final clusters
clusters_step1 <- result2$clusters[result$clusters]
clusters_final <- result3$clusters[clusters_step1]Clustering 5 Gaussian clusters in 3D: 4 at the vertices of a tetrahedron and 1 at the center, each with 2000 points and different correlation structures.
library(rsams)
library(MASS)
set.seed(123)
n_per_cluster <- 2000
sd_cluster <- 0.3
# Vertices of a regular tetrahedron + center
vertices <- matrix(c(
1, 1, 1,
1, -1, -1,
-1, 1, -1,
-1, -1, 1
), nrow = 4, byrow = TRUE)
center <- c(0, 0, 0)
# Covariance matrix with correlation rho
make_cov <- function(sd, rho) {
cov_mat <- matrix(rho * sd^2, nrow = 3, ncol = 3)
diag(cov_mat) <- sd^2
cov_mat
}
# Different correlations for each cluster
correlations <- c(0, 0.3, -0.3, 0.2, -0.2)
# Generate data
Y_list <- list()
for (i in 1:4) {
cov_mat <- make_cov(sd_cluster, correlations[i])
Y_list[[i]] <- MASS::mvrnorm(n = n_per_cluster, mu = vertices[i, ], Sigma = cov_mat)
}
cov_mat <- make_cov(sd_cluster, correlations[5])
Y_list[[5]] <- MASS::mvrnorm(n = n_per_cluster, mu = center, Sigma = cov_mat)
Y <- do.call(rbind, Y_list)
# Step 0: Initial fine-grained clustering
result <- sams_cluster(Y, rho_k = 0.005, max_iter = 100, max_jump = 0.02,
merge_threshold = 0.15, use_parallel = TRUE)
# Step 1: Cluster the cluster centers
result2 <- sams_cluster(Y, X = result$cluster_centers, rho_k = 0.05,
max_iter = 500, max_jump = 0.1,
merge_threshold = 0.3, use_parallel = TRUE)
# Step 2: Cluster the meta-cluster centers
result3 <- sams_cluster(Y, X = result2$cluster_centers, rho_k = 0.1,
max_iter = 1000, max_jump = 0.2,
merge_threshold = 0.5, use_parallel = TRUE)
# Map original points to final clusters
clusters_step1 <- result2$clusters[result$clusters]
clusters_final <- result3$clusters[clusters_step1]| Function | Description |
|---|---|
sams_cluster() |
Main clustering function |
sams_trajectory() |
Run SAMS for a single starting point |
generate_nongaussian_2d() |
Generate non-Gaussian test data |
plot_clusters() |
Visualize clustering results |
plot_trajectories() |
Visualize convergence trajectories |
See the rsams/inst/examples/ directory for detailed examples:
basic_example.R- Simple two-cluster demonstrationnongaussian_example.R- Non-Gaussian clusters (banana-shaped and skewed)hierarchical_example.R- Multi-scale hierarchical clusteringmultidimensional_example.R- Higher-dimensional data
Ollivier Hyrien and Andrea Baran (2016). Fast Nonparametric Density-Based Clustering of Large Data Sets Using a Stochastic Approximation Mean-Shift Algorithm. Journal of Computational and Graphical Statistics, 25(3): 899–916. doi:10.1080/10618600.2015.1051625



