First off, thank you for statrs! This looks like a fantastic library that took a lot of effort.
Regarding the numerical stability of the functions in statrs::statistics, I think it would be useful to reach for Welford's algorithm for reasons outlined here by John D. Cook and implemented here. It could also improve performance; the original algorithm requires only a single pass over the data, and it can also be parallelized.
For example, running the following on my laptop
Example Code
//! Example from <https://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/>
use rand::prelude::*;
use statrs::statistics::Statistics;
/// <https://www.johndcook.com/skewness_kurtosis.html>
#[derive(Default)]
struct RunningStats(f64, f64, f64);
impl RunningStats {
fn push(&mut self, x: f64) {
let n1 = self.0;
self.0 += 1.0;
let delta = x - self.1;
let delta_n = delta / self.0;
let term1 = delta * delta_n * n1;
self.1 += delta_n;
self.2 += term1;
}
fn variance(&self) -> Option<f64> {
if self.0 < 2.0 {
None
} else {
Some(self.2 / (self.0 - 1.0))
}
}
}
impl<T: Iterator<Item = f64>> From<T> for RunningStats {
fn from(xs: T) -> Self {
let mut r = Self::default();
for x in xs {
r.push(x);
}
r
}
}
fn main() {
let mut rng = SmallRng::seed_from_u64(30_613_700); // seed from random.org
let mut xs = vec![0f64; 1_000_000];
// John D. Cook's second example: samples ~ U([0, 1]), shifted by 10^12
for x in &mut xs {
*x = 1e12 + rng.random::<f64>();
}
println!(
"statrs: {}\nRunningStats: {}",
Statistics::variance(&xs),
RunningStats::from(xs.into_iter())
.variance()
.expect("nonempty")
);
}
yields
statrs: 0.31179242499195364
RunningStats: 0.08339059221218162
Thanks again!
First off, thank you for
statrs! This looks like a fantastic library that took a lot of effort.Regarding the numerical stability of the functions in
statrs::statistics, I think it would be useful to reach for Welford's algorithm for reasons outlined here by John D. Cook and implemented here. It could also improve performance; the original algorithm requires only a single pass over the data, and it can also be parallelized.For example, running the following on my laptop
Example Code
yields
Thanks again!