variance_measure_add: skip non-finite input instead of asserting#351
variance_measure_add: skip non-finite input instead of asserting#351rocketmark wants to merge 1 commit intocollabora:masterfrom
Conversation
| for (int i = 0; i < meas->size; i++) { | ||
| if (!isfinite(d[i])) { | ||
| fprintf(stderr, "[libsurvive] variance_measure_add: non-finite d[%d]=%f, dropping measurement\n", i, (double)d[i]); | ||
| return; |
There was a problem hiding this comment.
I'm not sure if all of the other measures should be thrown out when only one of them is finite.
There was a problem hiding this comment.
The shared n counter makes partial updates inconsistent. variance_measure_calc computes:
d[i] = (sumSq[i] - (sum[i] * sum[i]) / n) / n
n is a single value for all dimensions. If we skip the NaN dimension but still increment n, the variance for that dimension uses the wrong denominator (too large). If we update the finite dimensions but don't increment n, those dimensions' variance calculations are also wrong (too small).
A complete solution per-dimension would be per-dimension n counters, which would be a larger struct change outside the scope of this fix.
In practice, when a USB disturbance produces a NaN angle, the other angles in the same d[] bundle are from the same corrupted sensor read and are likely also unreliable. Dropping the whole sample feels like the safer choice here.
If you'd prefer, I can change the guard to only trigger if all elements are non-finite, but that would let through mixed good/NaN samples that would silently corrupt the variance for the NaN dimension. Happy to discuss which behaviour you'd prefer.
An assert(isfinite(d[i])) crash here corrupts the on-disk libsurvive config because the process dies mid-write. Corrupt optical angles (e.g. bad FPGA timestamps during USB disturbances) can produce NaN or Inf values that reach this function; crashing is strictly worse than dropping one sample, which has negligible effect on the variance estimate. Replace the assert with an early-return guard that logs to stderr and leaves the accumulator unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
00b698c to
1b7da16
Compare
Summary
variance_measure_addcontains anassert(isfinite(d[i]))inside the accumulation loop. When corrupt optical angle data reaches this function, the assert fires, crashes the process withSIGABRT, and corrupts the on-diskconfig.jsonbecause the file is left in a partially-written state. The next launch then fails to parse config and crashes again — a boot loop.Dropping one non-finite sample has negligible effect on the variance estimate. Crashing and corrupting config does not.
Demonstration
Hardware trigger: a USB disturbance (cable flex, port brown-out) causes the FPGA to emit bad timestamps. The driver produces a non-finite optical angle. That angle propagates into
variance_tracker_add→variance_measure_add:BEFORE fix:
d[0] = NaN
assert(isfinite(NaN)) → fires
Process: SIGABRT
config.json: partially written, unparseable on next launch
Result: crash loop until config.json is manually deleted
AFTER fix:
d[0] = NaN
isfinite(NaN) == false → early return, measurement skipped
stderr: "[libsurvive] variance_measure_add: non-finite d[0]=nan, dropping measurement"
Process: continues normally
Result: one sample dropped, variance estimate unaffected
The assert was presumably added as a correctness guard during development. In production, it turns a transient hardware glitch into a persistent failure requiring manual intervention.
Impact
Any caller passing data derived from optical angles is exposed.
variance_tracker_addinredist/variance.his the primary path; it is called from lighthouse calibration and pose confidence tracking. On hardware that experiences any USB instability, this assert is reachable.Change
One file,
redist/variance.h. The assert is replaced with a pre-check that returns early and logs to stderr. The accumulator (meas->n,meas->sum,meas->sumSq) is left unchanged, which is correct — including a non-finite value in any of those fields would corrupt all subsequent variance calculations derived from this accumulator.Found via
Observed in production: tracker running on embedded hardware (Raspberry Pi, USB bus under load) would enter a crash loop after a cable disturbance.
coredumpctlshowedSIGABRTinsidevariance_measure_add. Deletingconfig.jsonrecovered the system; the underlying assert was the cause.