Add support for FLAC compression of 64bit integers. #9
Conversation
This branch will be rebased after #8 is merged. I also want to expand the docs to describe the specific on-disk format, and I need to test the big-endian high / low 32bit handling in a multiarch qemu docker container.
Force-pushed from c43624d to 1895907.
Although this work is largely done, I need to update the documentation notebooks, and I still need to test the case of big-endian portability for 64bit (high / low channel) data.
- Add native support in the compiled extension for compressing 64bit integers with FLAC. The high and low 32bits are compressed in separate channels. On compression (for little-endian systems), the input 64bit integers are simply interpreted as interleaved stereo channels. On decompression, the separate channels in each frame are extracted to the high / low 32bit portions of the 64bit integers. Expose C functions in the public header.
- Add 32bit / 64bit wrappers to the bindings, and support for 64bit integers to the mid-level encode / decode functions.
- Changes to the treatment of floating point data. float32 data is converted to int32 as before, but float64 data is now converted to int64 to gain additional precision.
- Changes to int64 encoding: previously this was restricted to either 32bit range around an offset or forced conversion as floating point data. Now this data is supported natively. Remove the int32/int64 conversion routines which are no longer needed.
- Change the HDF5 and Zarr on-disk format (bumping version to "1"). This change was necessary to support storing int64 data without an associated offset.
- Modify and expand unit tests to cover new int64 cases. For MPI tests, ensure random data is created on one process and then distributed.
- Clean up the benchmarking script.
- Changes to MPI I/O helper functions: refactor the extract, send and receive steps into subroutines for more clarity. Use Send instead of Isend for better simplicity and more robustness across MPI implementations.
- Refine the API to be more explicit about when single-stream arrays are flattened on decompression / reading.
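The high / low channel split in the first bullet can be sketched in NumPy. This is an illustration of the idea on a little-endian system, not the actual C extension code:

```python
import numpy as np

data = np.array([-(2**62), -1, 0, 1, 2**62], dtype=np.int64)

# Reinterpreting the int64 buffer as int32 pairs yields exactly the
# interleaved "stereo" layout described above. On a little-endian
# system the low 32bit word of each sample comes first; on big-endian
# the two columns would swap roles.
pairs = data.view(np.int32).reshape(-1, 2)
low, high = pairs[:, 0], pairs[:, 1]

# Decompression side: place the two channels back side by side and
# reinterpret the buffer as int64.
restored = np.empty((len(data), 2), dtype=np.int32)
restored[:, 0] = low
restored[:, 1] = high
assert np.array_equal(restored.reshape(-1).view(np.int64), data)
```

The round trip is byte-order agnostic as long as the same convention is used on both sides, which is why the big-endian case still needs explicit testing.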
mhasself
left a comment
I'm not able to review this fully. In a quick test, the only thing that seemed odd was the default quantization for float types.
None of the docstrings I could find would explain what is done when quanta=None and precision=None. What seems to happen in practice is that there is effectively no compression. For example, compressing a float64 stream that contains only ~10 distinct values (corresponding to [-5, +5]) still takes an average of 6 bytes per sample.
It is thus very likely that someone who doesn't specify quanta/precision has (a) forgotten to do so or (b) is accidentally compressing a float array instead of an int array.
So:
- It should perhaps be an error to compress floats without specifying one of quanta or precision.
- Whatever happens, the docstring should explain the default behavior.
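As a toy model of the quantization under discussion (an assumption for illustration, not the library's actual `float_to_int` implementation), compressing floats amounts to removing a per-stream offset and rounding each sample to the nearest quantum:

```python
import numpy as np

def quantize(data, quanta):
    # Hypothetical sketch: remove a per-stream offset (itself rounded
    # to a whole number of quanta), then round each sample to the
    # nearest quantum. Without a sensible quanta, the resulting
    # integers span the full float mantissa and compress poorly.
    offset = quanta * np.round(np.mean(data) / quanta)
    ints = np.round((data - offset) / quanta).astype(np.int64)
    return ints, offset

data = np.array([1.0, 2.0, 3.0])
quanta = 1.0
ints, offset = quantize(data, quanta)
restored = ints * quanta + offset
# Round-trip error is bounded by half a quantum.
assert np.max(np.abs(restored - data)) <= 0.5 * quanta
```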
Thanks for the quick review! I mention the behavior of … There is also a new section in the … Your assessment is correct that …
- Require user-specified quanta or precision
- Update docstrings describing the requirement
- Update documentation and example notebooks
- Update benchmark script and test cases
- When computing the floating point offset, round to the nearest quanta before subtracting to reduce errors
- Expand docstrings discussing floating point conversion and ensure that all low-level and direct write I/O functions reference the FlacArray class docstring where this discussion is centralized
- Expand unit tests to include special int32 / int64 values (+/- 2^31-1, +/- 2^32, +/- 2^63-1)
Ok, so tests pass for me, but I get significant errors, especially with negative numbers. Output: Basically that says there's always an error of size 1 quantum. The error rate can be as high as 92% (bottom-most case). My guess would be this has to do with using int casting to round... C int casting rounds towards zero, which is rarely what you want.
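The truncation-vs-rounding distinction can be demonstrated with a few values straddling zero; a C-style cast mirrors what the extension was doing:

```python
import numpy as np

vals = np.array([-1.4, -0.6, 0.6, 1.4])

# A C-style cast truncates toward zero, so -0.6 becomes 0 and the
# error can approach a full quantum ...
truncated = vals.astype(np.int64)          # [-1, 0, 0, 1]

# ... while rounding to nearest keeps every error within half a quantum.
rounded = np.round(vals).astype(np.int64)  # [-1, -1, 1, 1]

assert np.max(np.abs(truncated - vals)) > 0.5
assert np.max(np.abs(rounded - vals)) <= 0.5
```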
Ok @mhasself, your small test code is now incorporated into a unit test (which passes). You were correct that this was a rounding issue. The residuals (in the case where the quanta is larger than the minimum supported by the data type) are now <= 0.5 of the quanta value.
mhasself
left a comment
Thanks. Because I'm so good at breaking things, I found this:
```python
>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=1.)
(array([-1, 0, 1]), array([2.]), array([1.]))
>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array(1.))
(array([-1, 0, 1]), array([2.]), array([1.]))
>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array([1.]))
(array([-1, 0, 1]), array([2.]), array([1.]))
>>> fa.utils.float_to_int(np.array([1.,2,3]), quanta=np.array([1., 1., 1.]))
(array([-9223372036854775808, -9223372036854775808, -9223372036854775808]), array([-1.01]), array([9.13205152e+18]))
```
I know that last line is incorrect usage, but it seems like a case that should raise an exception instead of silently producing bizarre output.
src/flacarray/tests/array.py:

```python
residual = np.absolute(output - quantized)
max_resid = np.amax(residual)
max_q_err = max_resid / quant
if max_q_err > 0.5:
```
I think for these tests we actually want a more stringent requirement: by construction, the quantization error should be almost zero. This is to meet the expectation that if I have numbers with, e.g., 3 digit precision (1.002, 3.125, -1.545) then I should be able to encode them losslessly by specifying quantum=0.001. This has applications in data acquisition, where readings really can be quantized at this level.
I split this code into two tests. For the original random data, the requirement should be that roundtrip conversion results in errors < 0.5 of a quanta. If we pre-truncate the data (divide by quanta, convert to int, multiply by quanta), then we expect the residuals to be less than the data range times machine precision.
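The two regimes can be sketched independently of the library (an illustration of the test expectations, not the actual test code):

```python
import numpy as np

rng = np.random.default_rng(42)
quanta = 0.001
raw = rng.uniform(-5.0, 5.0, size=1000)

# Case 1: arbitrary random data. Round-trip error is bounded by half
# a quantum.
restored = np.round(raw / quanta) * quanta
assert np.max(np.abs(restored - raw)) <= 0.5 * quanta

# Case 2: data that is already an exact multiple of the quantum
# (pre-truncated). The residual is at the level of the data range
# times machine precision, i.e. effectively lossless.
truncated = np.round(raw / quanta) * quanta
restored2 = np.round(truncated / quanta) * quanta
data_range = np.max(truncated) - np.min(truncated)
assert np.max(np.abs(restored2 - truncated)) <= data_range * np.finfo(np.float64).eps
```

Case 2 is what makes quantum=0.001 a lossless encoding for data acquired with 3-digit precision.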
…d to float_to_int conversion. Fix quantization checks.
The latest commit checks the shape of the precision and quanta arrays, which should match the leading shape of the data. So your last two examples will raise an error now.
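The shape check described above might look like the following (a hypothetical sketch with an invented helper name, `check_quanta_shape`, not the committed code):

```python
import numpy as np

def check_quanta_shape(data, quanta):
    # Hypothetical validation: quanta may be a scalar, a single-element
    # array, or an array matching the leading (per-stream) shape of the
    # data; anything else raises instead of silently broadcasting.
    q = np.atleast_1d(np.asarray(quanta, dtype=np.float64))
    leading = data.shape[:-1] if data.ndim > 1 else (1,)
    if q.shape != (1,) and q.shape != leading:
        raise ValueError(
            f"quanta shape {q.shape} does not match leading data shape {leading}"
        )
    return q

data = np.array([1.0, 2.0, 3.0])            # a single stream of 3 samples
check_quanta_shape(data, 1.0)               # OK: scalar
check_quanta_shape(data, np.array([1.0]))   # OK: single element
try:
    check_quanta_shape(data, np.array([1.0, 1.0, 1.0]))  # mismatched
except ValueError:
    pass  # the bizarre-output case now fails loudly
```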
mhasself
left a comment
Great!