Update the README
Clean up the godoc
Add circle config
Move simple into subpackage
Restore the simple test
Add simple/doc.go
Rephrase and fix typos
Add varopt benchmarks
Check for NaN values; return error instead of panicking
Inline the large-weight heap to avoid interface conversions
Memory optimization support
Pre-allocate main buffers
* Remove a test-only method * update circle Go version * simplify circleci * mod update
@oertl
if value <= digest[0].Mean {
	return digest[0].Weight / (2 * sumw)
}
if value >= digest[len(digest)-1].Mean {
The challenge in this code is to estimate the density of buckets outside the range that was covered in the prior window. Here I make the extreme buckets have half the density of their neighbor, which is a bit arbitrary.
The idea is that in order to use inverse-frequency weighted sampling, you need an estimate for what you haven't seen before. For a numerical distribution, the approach here seems to work but isn't perfect.
For a categorical distribution, I've looked into using a non-parametric estimate based on the theory of species-diversity estimation (see here), which derives academically from Good-Turing frequency estimation. This is a curiosity of mine.
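For reference, the simplest form of the Good-Turing idea estimates the total probability of categories that have not been seen yet as N1/N, where N1 is the number of categories observed exactly once and N is the total number of observations. The sketch below illustrates only that basic estimate; it is not the species-diversity estimator mentioned above, and the name `MissingMass` and the `map[string]int` input are assumptions made for illustration.

```go
package main

// MissingMass returns the basic Good-Turing estimate of the total
// probability of categories that have not been observed yet: N1/N,
// where N1 counts categories seen exactly once and N is the total
// number of observations. (Hypothetical helper, not part of this PR.)
func MissingMass(counts map[string]int) float64 {
	singletons, total := 0, 0
	for _, n := range counts {
		total += n
		if n == 1 {
			singletons++
		}
	}
	if total == 0 {
		// Nothing observed yet: all probability mass is unseen.
		return 1
	}
	return float64(singletons) / float64(total)
}
```

With counts of 3, 1, 1 and 3 for four categories, two of the eight observations are singletons, so the estimated unseen mass is 0.25.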
The main branch has been rebased so this can't be used except for reference. Still useful.
T-digest can compute a digest from a set of weighted input points.
From the digest, we can estimate the weight of an unweighted input point.
Varopt produces a small set of weighted input points from a large set of weighted input points.
Take these properties together, and we have a potential feedback loop:
Using an inverse-weight function leaves a single parameter: how much weight to assign to observations outside the previous digest's range. This code assigns each point that lies outside the previous range a probability equal to half the probability of the adjacent extreme bucket.
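A minimal sketch of that rule follows, assuming the digest is a non-empty slice of centroids sorted by mean. The `Centroid` type, the function names, and the linear interpolation used for interior points are assumptions made for illustration; only the two tail branches mirror the code shown in the review thread.

```go
package main

import "sort"

// Centroid is a hypothetical stand-in for one t-digest bucket.
type Centroid struct {
	Mean   float64
	Weight float64
}

// Probability estimates the probability mass near value from the previous
// window's digest. digest must be non-empty and sorted by Mean, and value
// must not be NaN (the caller is expected to reject NaN beforehand).
func Probability(digest []Centroid, value float64) float64 {
	sumw := 0.0
	for _, c := range digest {
		sumw += c.Weight
	}
	// At or beyond either end of the previous range: half the probability
	// of the adjacent extreme bucket.
	if value <= digest[0].Mean {
		return digest[0].Weight / (2 * sumw)
	}
	last := len(digest) - 1
	if value >= digest[last].Mean {
		return digest[last].Weight / (2 * sumw)
	}
	// Interior point (assumed behavior): interpolate linearly between the
	// weights of the two neighboring centroids.
	i := sort.Search(len(digest), func(j int) bool { return digest[j].Mean > value })
	lo, hi := digest[i-1], digest[i]
	frac := (value - lo.Mean) / (hi.Mean - lo.Mean)
	return (lo.Weight + frac*(hi.Weight-lo.Weight)) / sumw
}

// InverseWeight is the sampling weight handed to the weighted sampler:
// values that were rare in the previous window get larger weights.
func InverseWeight(digest []Centroid, value float64) float64 {
	return 1 / Probability(digest, value)
}
```

For a digest with centroids (mean 1, weight 2), (mean 2, weight 4) and (mean 3, weight 2), a value of 0 falls below the previous range and gets probability 2 / (2 × 8) = 0.125, so its inverse weight is 8, while a value of 2 gets probability 0.5 and inverse weight 2.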