Conversation

@tskisner
Member

@tskisner tskisner commented Mar 30, 2025

  • Add native support in the compiled extension for compressing 64bit
    integers with FLAC. The high and low 32bits are compressed in
    separate channels. On compression (for little-endian systems), the
    input 64bit integers are simply interpreted as interleaved stereo
    channels. On decompression, the separate channels in each frame
    are extracted to the high / low 32bit portions of the 64bit
    integers (see the sketch after this list). Expose the C functions
    in the public header.

  • Add 32bit / 64bit wrappers to the bindings, and support for 64bit
    integers to the mid-level encode / decode functions.

  • Changes to the treatment of floating point data. float32 data is
    converted to int32 as before, but float64 data is now converted to
    int64 to gain additional precision.

  • Changes to int64 encoding: previously this was restricted to either
    a 32bit range around an offset or forced conversion as floating
    point data. Now this data is supported natively. Remove the
    int32/int64 conversion routines, which are no longer needed.

  • Change the HDF5 and Zarr on-disk format (bumping version to "1").
    This change was necessary to support storing int64 data without
    an associated offset.

  • Modify and expand unit tests to cover new int64 cases. For MPI tests,
    ensure random data is created on one process and then distributed.

  • Clean up the benchmarking script.

  • Changes to MPI I/O helper functions: refactor the extract, send,
    and receive steps into subroutines for clarity. Use Send instead
    of Isend for simplicity and more robust behavior across MPI
    implementations.
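
A minimal numpy sketch of the high / low split described in the first bullet (the helper names are illustrative; the real implementation lives in the compiled C extension):

import numpy as np

def split_int64(data):
    # High word keeps the sign; low word is the unsigned remainder.
    high = (data >> 32).astype(np.int32)
    low = (data & 0xFFFFFFFF).astype(np.uint32)
    return high, low

def join_int64(high, low):
    # Shift the high word back up and OR in the low word.
    return (high.astype(np.int64) << 32) | low.astype(np.int64)

x = np.array([-5, 0, 2**40 + 123], dtype=np.int64)
hi, lo = split_int64(x)
assert np.array_equal(join_int64(hi, lo), x)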

@tskisner
Member Author

This branch will be rebased after #8 is merged. I also want to expand the docs to describe the specific on-disk format, and I still need to test the big-endian high / low 32bit handling in a multiarch qemu docker container.

@tskisner tskisner force-pushed the multi_channel branch 2 times, most recently from c43624d to 1895907 Compare May 27, 2025 15:08
@tskisner
Member Author

Although this work is largely done, I still need to update the documentation notebooks and test big-endian portability for 64bit (high / low channel) data.


tskisner added 2 commits May 31, 2025 14:53

- Refine API to be more explicit about when single-stream arrays
  are flattened on decompression / reading.
@tskisner tskisner marked this pull request as ready for review June 2, 2025 18:02
@tskisner tskisner requested a review from mhasself June 2, 2025 18:02
Collaborator

@mhasself mhasself left a comment


I'm not able to review this fully. In a quick test, the only odd thing I noticed was related to the default quantization for float types.

None of the docstrings I could find would explain what is done when quanta=None and precision=None. What seems to happen in practice is that there is effectively no compression. For example, compressing a float64 stream that contains only ~10 distinct values (corresponding to [-5, +5]) still takes an average of 6 bytes per sample.

It is thus very likely that someone who doesn't specify quanta/precision either (a) has forgotten to do so or (b) is accidentally compressing a float array instead of an int array.

So:

  • It should perhaps be an error to compress floats without specifying one of quanta or precision.
  • Whatever happens, the docstring should explain the default behavior.

@tskisner
Member Author

tskisner commented Jun 2, 2025

Thanks for the quick review!

I mention the behavior of quanta=None briefly in the docstring, but I will work to improve that.

There is also a new section in the cookbook discussing quantization.

Your assessment is correct: quanta=None defaults to essentially no compression (it errs on the conservative side, splitting the dynamic range at machine precision). We could certainly force the user to specify either quanta or precision.
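
For reference, a hypothetical sketch of that conservative default (the function name and exact formula are my illustration, not the package's actual code):

import numpy as np

def default_quanta(data):
    # Hypothetical: split the full dynamic range at float64 machine
    # precision. Lossless, but leaves almost nothing for FLAC to remove.
    return (data.max() - data.min()) * np.finfo(np.float64).eps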

tskisner added 2 commits June 3, 2025 11:07
- Require user-specified quanta or precision
- Update docstrings describing the requirement
- Update documentation and example notebooks
- Update benchmark script and test cases.
- When computing floating point offset, round to nearest quanta before
  subtracting to reduce errors
- Expand docstrings discussing floating point conversion and ensure
  that all low-level and direct write I/O functions reference the
  FlacArray class docstring where this discussion is centralized.
- Expand unit tests to include special int32 / int64 values
  (+/- 2^31-1, +/- 2^32, +/- 2^63-1)
@mhasself
Collaborator

mhasself commented Jun 6, 2025

Ok, so tests pass for me, but I get significant errors -- especially with negative numbers.

import os

import h5py
import numpy as np

import flacarray as fa

q = 1e-3
if os.path.exists('test.h5'):
    os.remove('test.h5')

for offset in [0, -.5, .5, -10, +10, -10.51, -10.4]:
    with h5py.File('test.h5', 'w') as h:
        # Data already quantized at q, then shifted by offset.
        td = (np.random.normal(size=100000) + offset) // 1 * q
        fa.hdf5.write_array(td, h, quanta=q)

    with h5py.File('test.h5', 'r') as h:
        x = fa.hdf5.read_array(h)
        max_err = abs(x - td).max() / q
        # Count samples that are off by more than 0.1 quanta.
        bad = abs(x - td) / q > .1
        print(offset, max_err, bad.sum() / len(x))

Output:

0 1.0 0.49757
-0.5 1.0 0.69379
0.5 1.0 0.30946
-10 1.0000000000000009 0.84343
10 1.0000000000000009 0.49778
-10.51 1.0000000000000009 0.69847
-10.4 1.0000000000000009 0.91955

Basically that says there's always an error of one quantum in size. The error rate can be as high as 92% (the bottom-most case).

My guess would be this has to do with using int casting to round... C int casting rounds towards zero, which is rarely what you want.
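
For illustration (not from the PR), truncation versus round-to-nearest on a small array:

import numpy as np

x = np.array([-1.7, -0.4, 0.4, 1.7])
print(x.astype(np.int64))            # truncates toward zero: [-1  0  0  1]
print(np.round(x).astype(np.int64))  # rounds to nearest:     [-2  0  0  2]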

@tskisner
Member Author

tskisner commented Jul 1, 2025

Ok @mhasself, your small test code is now incorporated into a unit test (which passes). You were correct that this was a rounding issue. The residuals (in the case where the quanta is larger than the minimum supported by the data type) are now <= 0.5 of the quanta value.

Collaborator

@mhasself mhasself left a comment


Thanks. Because I'm so good at breaking things, I found this:

>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=1.)
(array([-1,  0,  1]), array([2.]), array([1.]))

>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array(1.))
(array([-1,  0,  1]), array([2.]), array([1.]))

>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array([1.]))
(array([-1,  0,  1]), array([2.]), array([1.]))

>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array([1., 1., 1.]))
(array([-9223372036854775808, -9223372036854775808, -9223372036854775808]), array([-1.01]), array([9.13205152e+18]))

I know that last line is incorrect usage, but it seems like a case that should raise an exception instead of silently producing bizarre output.

residual = np.absolute(output - quantized)
max_resid = np.amax(residual)
max_q_err = max_resid / quant
if max_q_err > 0.5:
Collaborator


I think for these tests we actually want a more stringent requirement: by construction, the quantization error should be almost zero. This is to meet the expectation that if I have numbers with, e.g., 3 digit precision (1.002, 3.125, -1.545) then I should be able to encode them losslessly by specifying quantum=0.001. This has applications in data acquisition, where readings really can be quantized at this level.

Member Author


I split this code into two tests. For the original random data, the requirement is that roundtrip conversion results in errors < 0.5 quanta. If we pre-truncate the data (divide by quanta, convert to int, multiply by quanta), then we expect the residuals to be less than the data range times machine precision (see the sketch below).
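
A minimal sketch of that pre-truncated check, assuming round-to-nearest conversion (illustrative only, not the actual unit-test code):

import numpy as np

q = 0.001
rng = np.random.default_rng(42)
raw = rng.normal(size=1000)

# Snap every sample onto the quantization grid first.
truncated = np.round(raw / q) * q

# Round-trip through the integer representation.
ints = np.round(truncated / q).astype(np.int64)
restored = ints.astype(np.float64) * q

# Residuals should be at the level of machine precision times the range.
resid = np.abs(restored - truncated).max()
assert resid <= np.abs(truncated).max() * np.finfo(np.float64).eps * 10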

…d to float_to_int conversion. Fix quantization checks.
@tskisner
Member Author

tskisner commented Jul 4, 2025

The latest commit checks the shape of the precision and quanta arrays, which must match the leading shape of the data, so the last two examples above will now raise an error.
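
Based on that description, usage would look roughly like this (the shapes are my illustration of the rule, not exact API docs):

import numpy as np
import flacarray as fa

data = np.random.normal(size=(4, 1000))  # 4 streams of 1000 samples

# A scalar quanta applies to every stream.
ints, offs, qs = fa.utils.float_to_int(data, quanta=1e-3)

# Per-stream quanta must match the leading shape (4,).
ints, offs, qs = fa.utils.float_to_int(data, quanta=np.full(4, 1e-3))

# Anything else (e.g. shape (1000,) or (4, 1000)) now raises an error.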

Collaborator

@mhasself mhasself left a comment


Great!

@tskisner tskisner merged commit 57a1323 into main Jul 7, 2025
7 checks passed
@tskisner tskisner deleted the multi_channel branch July 7, 2025 16:23