Conversation

@tskisner
Member

@tskisner tskisner commented Mar 30, 2025

  • Add native support in the compiled extension for compressing 64bit
    integers with FLAC. The high and low 32bits are compressed in
    separate channels. On compression (for little-endian systems), the
    input 64bit integers are simply interpreted as interleaved stereo
    channels. On decompression, the separate channels in each frame
    are extracted to the high / low 32bit portions of the 64bit
    integers (see the sketch after this list). Expose the C functions
    in the public header.

  • Add 32bit / 64bit wrappers to the bindings, and support for 64bit
    integers to the mid-level encode / decode functions.

  • Changes to the treatment of floating point data. float32 data is
    converted to int32 as before, but float64 data is now converted to
    int64 to gain additional precision.

  • Changes to int64 encoding: previously this was restricted to either
    a 32bit range around an offset or forced conversion as floating
    point data. Now this data is supported natively. Remove the
    int32/int64 conversion routines, which are no longer needed.

  • Change the HDF5 and Zarr on-disk format (bumping version to "1").
    This change was necessary to support storing int64 data without
    an associated offset.

  • Modify and expand unit tests to cover new int64 cases. For MPI tests,
    ensure random data is created on one process and then distributed.

  • Clean up the benchmarking script.

  • Changes to MPI I/O helper functions: refactor the extract, send,
    and receive steps into subroutines for clarity. Use Send instead
    of Isend for simplicity and more robust behavior across MPI
    implementations.
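
A minimal numpy sketch of the high / low split described in the first bullet (the helper names are illustrative; the real implementation lives in the compiled C extension):

import numpy as np

def split_int64(data):
    # High word keeps the sign; low word is the unsigned remainder.
    high = (data >> 32).astype(np.int32)
    low = (data & 0xFFFFFFFF).astype(np.uint32)
    return high, low

def join_int64(high, low):
    # Shift the high word back up and OR in the low word.
    return (high.astype(np.int64) << 32) | low.astype(np.int64)

x = np.array([-5, 0, 2**40 + 123], dtype=np.int64)
hi, lo = split_int64(x)
assert np.array_equal(join_int64(hi, lo), x)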

@tskisner
Member Author

This branch will be rebased after #8 is merged. I also want to expand the docs to describe the specific on-disk format, and I still need to test the big-endian high / low 32bit handling in a multiarch qemu docker container.

@tskisner tskisner force-pushed the multi_channel branch 2 times, most recently from c43624d to 1895907 Compare May 27, 2025 15:08
@tskisner
Member Author

Although this work is largely done, I still need to update the documentation notebooks and test big-endian portability for 64bit (high / low channel) data.


tskisner added 2 commits May 31, 2025 14:53

- Refine API to be more explicit about when single-stream arrays
  are flattened on decompression / reading.
@tskisner tskisner marked this pull request as ready for review June 2, 2025 18:02
@tskisner tskisner requested a review from mhasself June 2, 2025 18:02
Collaborator

@mhasself mhasself left a comment


I'm not able to review this fully. In a quick test, the only odd thing I noticed was related to the default quantization for float types.

None of the docstrings I could find would explain what is done when quanta=None and precision=None. What seems to happen in practice is that there is effectively no compression. For example, compressing a float64 stream that contains only ~10 distinct values (corresponding to [-5, +5]) still takes an average of 6 bytes per sample.

It is thus very likely that someone who doesn't specify quanta/precision either (a) has forgotten to do so or (b) is accidentally compressing a float array instead of an int array.

So:

  • It should perhaps be an error to compress floats without specifying one of quanta or precision.
  • Whatever happens, the docstring should explain the default behavior.

@tskisner
Member Author

tskisner commented Jun 2, 2025

Thanks for the quick review!

I mention the behavior of quanta=None briefly in the docstring, but I will work to improve that.

There is also a new section in the cookbook discussing quantization.

Your assessment is correct: quanta=None defaults to essentially no compression (it errs on the conservative side, splitting the dynamic range at machine precision). We could certainly force the user to specify either quanta or precision.
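
For reference, a hypothetical sketch of that conservative default (the function name and exact formula are my illustration, not the package's actual code):

import numpy as np

def default_quanta(data):
    # Hypothetical: split the full dynamic range at float64 machine
    # precision. Lossless, but leaves almost nothing for FLAC to remove.
    return (data.max() - data.min()) * np.finfo(np.float64).eps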

tskisner added 2 commits June 3, 2025 11:07
- Require user-specified quanta or precision
- Update docstrings describing the requirement
- Update documentation and example notebooks
- Update benchmark script and test cases.
- When computing floating point offset, round to nearest quanta before
  subtracting to reduce errors
- Expand docstrings discussing floating point conversion and ensure
  that all low-level and direct write I/O functions reference the
  FlacArray class docstring where this discussion is centralized.
- Expand unit tests to include special int32 / int64 values
  (+/- 2^31-1, +/- 2^32, +/- 2^63-1)
@mhasself
Collaborator

mhasself commented Jun 6, 2025

Ok, so tests pass for me, but I get significant errors -- especially with negative numbers.

import os

import h5py
import numpy as np

import flacarray as fa

q = 1e-3
if os.path.exists('test.h5'):
    os.remove('test.h5')

for offset in [0, -.5, .5, -10, +10, -10.51, -10.4]:
    with h5py.File('test.h5', 'w') as h:
        # Data already quantized at q, then shifted by offset.
        td = (np.random.normal(size=100000) + offset) // 1 * q
        fa.hdf5.write_array(td, h, quanta=q)

    with h5py.File('test.h5', 'r') as h:
        x = fa.hdf5.read_array(h)
        max_err = abs(x - td).max() / q
        # Count samples that are off by more than 0.1 quanta.
        bad = abs(x - td) / q > .1
        print(offset, max_err, bad.sum() / len(x))

Output:

0 1.0 0.49757
-0.5 1.0 0.69379
0.5 1.0 0.30946
-10 1.0000000000000009 0.84343
10 1.0000000000000009 0.49778
-10.51 1.0000000000000009 0.69847
-10.4 1.0000000000000009 0.91955

Basically that says there's always an error of one quantum in size. The error rate can be as high as 92% (the bottom-most case).

My guess would be this has to do with using int casting to round... C int casting rounds towards zero, which is rarely what you want.
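
For illustration (not from the PR), truncation versus round-to-nearest on a small array:

import numpy as np

x = np.array([-1.7, -0.4, 0.4, 1.7])
print(x.astype(np.int64))            # truncates toward zero: [-1  0  0  1]
print(np.round(x).astype(np.int64))  # rounds to nearest:     [-2  0  0  2]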

@tskisner
Member Author

tskisner commented Jul 1, 2025

Ok @mhasself, your small test code is now incorporated into a unit test (which passes). You were correct that this was a rounding issue. The residuals (in the case where the quanta is larger than the minimum supported by the data type) are now <= 0.5 of the quanta value.

Collaborator

@mhasself mhasself left a comment


Thanks. Because I'm so good at breaking things, I found this:

>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=1.)
(array([-1,  0,  1]), array([2.]), array([1.]))

>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array(1.))
(array([-1,  0,  1]), array([2.]), array([1.]))

>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array([1.]))
(array([-1,  0,  1]), array([2.]), array([1.]))

>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array([1., 1., 1.]))
(array([-9223372036854775808, -9223372036854775808, -9223372036854775808]), array([-1.01]), array([9.13205152e+18]))

I know that last line is incorrect usage, but it seems like a case that should raise an exception instead of silently producing bizarre output.

residual = np.absolute(output - quantized)
max_resid = np.amax(residual)
max_q_err = max_resid / quant
if max_q_err > 0.5:
Collaborator


I think for these tests we actually want a more stringent requirement: by construction, the quantization error should be almost zero. This is to meet the expectation that if I have numbers with, e.g., 3 digit precision (1.002, 3.125, -1.545) then I should be able to encode them losslessly by specifying quantum=0.001. This has applications in data acquisition, where readings really can be quantized at this level.

Member Author


I split this code into two tests. For the original random data, the requirement is that roundtrip conversion results in errors < 0.5 quanta. If we pre-truncate the data (divide by quanta, convert to int, multiply by quanta), then we expect the residuals to be less than the data range times machine precision (see the sketch below).
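
A minimal sketch of that pre-truncated check, assuming round-to-nearest conversion (illustrative only, not the actual unit-test code):

import numpy as np

q = 0.001
rng = np.random.default_rng(42)
raw = rng.normal(size=1000)

# Snap every sample onto the quantization grid first.
truncated = np.round(raw / q) * q

# Round-trip through the integer representation.
ints = np.round(truncated / q).astype(np.int64)
restored = ints.astype(np.float64) * q

# Residuals should be at the level of machine precision times the range.
resid = np.abs(restored - truncated).max()
assert resid <= np.abs(truncated).max() * np.finfo(np.float64).eps * 10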

…d to float_to_int conversion. Fix quantization checks.
@tskisner
Member Author

tskisner commented Jul 4, 2025

The latest commit checks the shape of the precision and quanta arrays, which must match the leading shape of the data, so the last two examples above will now raise an error.
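
Based on that description, usage would look roughly like this (the shapes are my illustration of the rule, not exact API docs):

import numpy as np
import flacarray as fa

data = np.random.normal(size=(4, 1000))  # 4 streams of 1000 samples

# A scalar quanta applies to every stream.
ints, offs, qs = fa.utils.float_to_int(data, quanta=1e-3)

# Per-stream quanta must match the leading shape (4,).
ints, offs, qs = fa.utils.float_to_int(data, quanta=np.full(4, 1e-3))

# Anything else (e.g. shape (1000,) or (4, 1000)) now raises an error.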

Collaborator

@mhasself mhasself left a comment


Great!

@tskisner tskisner merged commit 57a1323 into main Jul 7, 2025
7 checks passed
@tskisner tskisner deleted the multi_channel branch July 7, 2025 16:23