Stochastic Approximation Mean-Shift (SAMS): a density-based nonparametric clustering algorithm

A scalable mean-shift for large datasets

Implementation of the Stochastic Approximation Mean-Shift (SAMS) clustering algorithm proposed by Hyrien and Baran (2016). Mean-shift is an iterative procedure often used as a nonparametric clustering algorithm that defines clusters based on the modal regions of a density function. The algorithm is conceptually appealing and makes assumptions neither about the shape of the clusters nor about their number. However, with a complexity of O(n²) per iteration, it does not scale well to large datasets. SAMS performs density-based clustering much quicker than mean shift, yet delivers virtually identical results. This algorithm combines subsampling and a stochastic approximation procedure to achieve a potential complexity of O(n) at each step. Clusters may be arbitrarily-shaped (e.g., non-Gaussian or nonconvex). The number of clusters is automatically determined by the number of modes of the density estimate built from the data, and the presence of outliers has limited influence over clustering results.

rsams R Package

The rsams package provides a complete implementation of the SAMS algorithm with the following features:

Adaptive bandwidths - Automatic bandwidth selection based on local density
Parallel processing - Multi-core support for large datasets
Smart initialization - k-means++ style initialization for faster convergence
k-NN refinement - Optional post-processing to improve cluster assignments
Hierarchical clustering - Support for multi-scale analysis

Installation

# Install from GitHub
devtools::install_github("HyrienLab/sams-clustering", subdir = "rsams")

Quick Start

library(rsams)

# Generate sample data: 2 clusters with 500 points each
set.seed(42)
data1 <- matrix(rnorm(500 * 2, mean = 0, sd = 0.5), ncol = 2)
data2 <- matrix(rnorm(500 * 2, mean = 2, sd = 0.5), ncol = 2)
Y <- rbind(data1, data2)

# Run SAMS clustering
result <- sams_cluster(Y,
                       rho_k = 0.05,
                       max_iter = 200,
                       max_jump = 0.05,
                       merge_threshold = 0.5,
                       use_parallel = TRUE)

# View results
cat("Clusters found:", result$n_clusters, "\n")
plot(Y, col = result$clusters + 1, pch = 19)
points(result$cluster_centers, pch = 4, cex = 2, lwd = 2)

# Visualize trajectory convergence
plot_trajectories(result)

Non-Gaussian and Non-Convex Clusters

SAMS can identify arbitrarily-shaped clusters:

library(rsams)

set.seed(123)

# Generate non-Gaussian 2D mixture data
# 1000 points: 35% banana-shaped, 65% skewed
data <- generate_nongaussian_2d(n = 1000, prop_banana = 0.35, seed = 123)
Y <- data$Y

# Step 0: Initial fine-grained clustering
result <- sams_cluster(Y, rho_k = 0.02, max_iter = 200, max_jump = 0.02,
                       merge_threshold = 0.2, use_parallel = TRUE)

# Step 1: Cluster the cluster centers
result2 <- sams_cluster(Y, X = result$cluster_centers, rho_k = 0.1,
                        max_iter = 1000, max_jump = 0.1,
                        merge_threshold = 0.5, use_parallel = TRUE)

# Step 2: Cluster the meta-cluster centers
result3 <- sams_cluster(Y, X = result2$cluster_centers, rho_k = 0.5,
                        max_iter = 3000, max_jump = 0.2,
                        merge_threshold = 1.5, use_parallel = TRUE)

# Map original points to final clusters
clusters_step1 <- result2$clusters[result$clusters]
clusters_final <- result3$clusters[clusters_step1]

Acceleration Through Iterative Meta Clustering

SAMS can further accelerate clustering by iteratively clustering the cluster centers from the previous level (steps). This allows significant reduction of the amount of computation and speed up computing time for large datasets:

library(rsams)
library(MASS)

set.seed(123)

# Generate 5 clusters with 5000 points each
n <- 5000
data1 <- matrix(rnorm(2 * n, mean = 0, sd = 0.25), ncol = 2)
data2 <- matrix(rnorm(2 * n, mean = 1.5, sd = 0.5), ncol = 2)
data3 <- matrix(rnorm(2 * n, mean = -1.5, sd = 0.75), ncol = 2)
data4 <- MASS::mvrnorm(n, mu = c(2, -2), Sigma = matrix(c(0.5, -0.35, -0.35, 0.5), 2))
data5 <- MASS::mvrnorm(n, mu = c(-1.5, 1.5), Sigma = matrix(c(0.5, 0.35, 0.35, 0.5), 2))
Y <- rbind(data1, data2, data3, data4, data5)

# Step 0: Initial fine-grained clustering
result <- sams_cluster(Y, rho_k = 0.001, max_iter = 50,
                       merge_threshold = 0.1, use_parallel = TRUE)

# Step 1: Cluster the cluster centers
result2 <- sams_cluster(Y, X = result$cluster_centers,
                        rho_k = 0.05, max_iter = 1000,
                        merge_threshold = 0.25, use_parallel = TRUE)

# Step 2: Cluster the meta-cluster centers
result3 <- sams_cluster(Y, X = result2$cluster_centers,
                        rho_k = 0.5, max_iter = 10000,
                        merge_threshold = 1, use_parallel = TRUE)

# Map original points to final clusters
clusters_step1 <- result2$clusters[result$clusters]
clusters_final <- result3$clusters[clusters_step1]

A 3D Example

Clustering 5 Gaussian clusters in 3D: 4 at the vertices of a tetrahedron and 1 at the center, each with 2000 points and different correlation structures.

library(rsams)
library(MASS)

set.seed(123)

n_per_cluster <- 2000
sd_cluster <- 0.3

# Vertices of a regular tetrahedron + center
vertices <- matrix(c(
  1,  1,  1,
  1, -1, -1,
 -1,  1, -1,
 -1, -1,  1
), nrow = 4, byrow = TRUE)
center <- c(0, 0, 0)

# Covariance matrix with correlation rho
make_cov <- function(sd, rho) {
  cov_mat <- matrix(rho * sd^2, nrow = 3, ncol = 3)
  diag(cov_mat) <- sd^2
  cov_mat
}

# Different correlations for each cluster
correlations <- c(0, 0.3, -0.3, 0.2, -0.2)

# Generate data
Y_list <- list()
for (i in 1:4) {
  cov_mat <- make_cov(sd_cluster, correlations[i])
  Y_list[[i]] <- MASS::mvrnorm(n = n_per_cluster, mu = vertices[i, ], Sigma = cov_mat)
}
cov_mat <- make_cov(sd_cluster, correlations[5])
Y_list[[5]] <- MASS::mvrnorm(n = n_per_cluster, mu = center, Sigma = cov_mat)
Y <- do.call(rbind, Y_list)

# Step 0: Initial fine-grained clustering
result <- sams_cluster(Y, rho_k = 0.005, max_iter = 100, max_jump = 0.02,
                       merge_threshold = 0.15, use_parallel = TRUE)

# Step 1: Cluster the cluster centers
result2 <- sams_cluster(Y, X = result$cluster_centers, rho_k = 0.05,
                        max_iter = 500, max_jump = 0.1,
                        merge_threshold = 0.3, use_parallel = TRUE)

# Step 2: Cluster the meta-cluster centers
result3 <- sams_cluster(Y, X = result2$cluster_centers, rho_k = 0.1,
                        max_iter = 1000, max_jump = 0.2,
                        merge_threshold = 0.5, use_parallel = TRUE)

# Map original points to final clusters
clusters_step1 <- result2$clusters[result$clusters]
clusters_final <- result3$clusters[clusters_step1]

Key Functions

Function	Description
`sams_cluster()`	Main clustering function
`sams_trajectory()`	Run SAMS for a single starting point
`generate_nongaussian_2d()`	Generate non-Gaussian test data
`plot_clusters()`	Visualize clustering results
`plot_trajectories()`	Visualize convergence trajectories

Examples

See the rsams/inst/examples/ directory for detailed examples:

basic_example.R - Simple two-cluster demonstration
nongaussian_example.R - Non-Gaussian clusters (banana-shaped and skewed)
hierarchical_example.R - Multi-scale hierarchical clustering
multidimensional_example.R - Higher-dimensional data

Reference

Ollivier Hyrien and Andrea Baran (2016). Fast Nonparametric Density-Based Clustering of Large Data Sets Using a Stochastic Approximation Mean-Shift Algorithm. Journal of Computational and Graphical Statistics, 25(3): 899–916. doi:10.1080/10618600.2015.1051625

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
.github/workflows		.github/workflows
pysams		pysams
rsams		rsams
Hyrien_Baran_JCGS2016 1.pdf		Hyrien_Baran_JCGS2016 1.pdf
README.md		README.md
nihms-814938.pdf		nihms-814938.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Stochastic Approximation Mean-Shift (SAMS): a density-based nonparametric clustering algorithm

A scalable mean-shift for large datasets

rsams R Package

Installation

Quick Start

Non-Gaussian and Non-Convex Clusters

Acceleration Through Iterative Meta Clustering

A 3D Example

Key Functions

Examples

Reference

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

HyrienLab/sams-clustering

Folders and files

Latest commit

History

Repository files navigation

Stochastic Approximation Mean-Shift (SAMS): a density-based nonparametric clustering algorithm

A scalable mean-shift for large datasets

rsams R Package

Installation

Quick Start

Non-Gaussian and Non-Convex Clusters

Acceleration Through Iterative Meta Clustering

A 3D Example

Key Functions

Examples

Reference

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages