Skip to content

variance_measure_add: skip non-finite input instead of asserting#351

Open
rocketmark wants to merge 1 commit intocollabora:masterfrom
rocketmark:upstream-pr/variance-nan-guard
Open

variance_measure_add: skip non-finite input instead of asserting#351
rocketmark wants to merge 1 commit intocollabora:masterfrom
rocketmark:upstream-pr/variance-nan-guard

Conversation

@rocketmark
Copy link
Contributor

Summary

variance_measure_add contains an assert(isfinite(d[i])) inside the accumulation loop. When corrupt optical angle data reaches this function, the assert fires, crashes the process with SIGABRT, and corrupts the on-disk config.json because the file is left in a partially-written state. The next launch then fails to parse config and crashes again — a boot loop.

Dropping one non-finite sample has negligible effect on the variance estimate. Crashing and corrupting config does not.

Demonstration

Hardware trigger: a USB disturbance (cable flex, port brown-out) causes the FPGA to emit bad timestamps. The driver produces a non-finite optical angle. That angle propagates into variance_tracker_addvariance_measure_add:

BEFORE fix:
d[0] = NaN
assert(isfinite(NaN)) → fires
Process: SIGABRT
config.json: partially written, unparseable on next launch
Result: crash loop until config.json is manually deleted

AFTER fix:
d[0] = NaN
isfinite(NaN) == false → early return, measurement skipped
stderr: "[libsurvive] variance_measure_add: non-finite d[0]=nan, dropping measurement"
Process: continues normally
Result: one sample dropped, variance estimate unaffected

The assert was presumably added as a correctness guard during development. In production, it turns a transient hardware glitch into a persistent failure requiring manual intervention.

Impact

Any caller passing data derived from optical angles is exposed. variance_tracker_add in redist/variance.h is the primary path; it is called from lighthouse calibration and pose confidence tracking. On hardware that experiences any USB instability, this assert is reachable.

Change

One file, redist/variance.h. The assert is replaced with a pre-check that returns early and logs to stderr. The accumulator (meas->n, meas->sum, meas->sumSq) is left unchanged, which is correct — including a non-finite value in any of those fields would corrupt all subsequent variance calculations derived from this accumulator.

Found via

Observed in production: tracker running on embedded hardware (Raspberry Pi, USB bus under load) would enter a crash loop after a cable disturbance. coredumpctl showed SIGABRT inside variance_measure_add. Deleting config.json recovered the system; the underlying assert was the cause.

for (int i = 0; i < meas->size; i++) {
if (!isfinite(d[i])) {
fprintf(stderr, "[libsurvive] variance_measure_add: non-finite d[%d]=%f, dropping measurement\n", i, (double)d[i]);
return;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if all of the other measures should be thrown out when only one of them is finite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shared n counter makes partial updates inconsistent. variance_measure_calc computes:

d[i] = (sumSq[i] - (sum[i] * sum[i]) / n) / n

n is a single value for all dimensions. If we skip the NaN dimension but still increment n, the variance for that dimension uses the wrong denominator (too large). If we update the finite dimensions but don't increment n, those dimensions' variance calculations are also wrong (too small).

A complete solution per-dimension would be per-dimension n counters, which would be a larger struct change outside the scope of this fix.

In practice, when a USB disturbance produces a NaN angle, the other angles in the same d[] bundle are from the same corrupted sensor read and are likely also unreliable. Dropping the whole sample feels like the safer choice here.

If you'd prefer, I can change the guard to only trigger if all elements are non-finite, but that would let through mixed good/NaN samples that would silently corrupt the variance for the NaN dimension. Happy to discuss which behaviour you'd prefer.

An assert(isfinite(d[i])) crash here corrupts the on-disk libsurvive
config because the process dies mid-write. Corrupt optical angles
(e.g. bad FPGA timestamps during USB disturbances) can produce NaN or
Inf values that reach this function; crashing is strictly worse than
dropping one sample, which has negligible effect on the variance estimate.

Replace the assert with an early-return guard that logs to stderr and
leaves the accumulator unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rocketmark rocketmark force-pushed the upstream-pr/variance-nan-guard branch from 00b698c to 1b7da16 Compare March 22, 2026 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants