Describe the issue
A Constant initializer of type UINT2 is mis-loaded by the CPU execution provider in onnxruntime 1.27.0. The packed raw bytes are not unpacked into the runtime tensor; instead the packed bytes are copied verbatim into a buffer sized for unpacked storage (1 byte per element), leaving the trailing bytes as zero. INT2 with the same structure works correctly, which makes this a clear asymmetry rather than an intentional design choice.
This looks like another instance of the sub-byte byte-count formula bug class that #28171 (merged, in 1.27.0) fixed in CopyCpuTensor, and that #29157 is currently fixing in OrtApi::GetValue. We believe a remaining call site uses element_type->Size() * count instead of Tensor::SizeInBytes() for sub-byte initializers (the sizeof(UInt2x4) == 1 static_assert makes the wrong formula yield count bytes instead of ceil(count/4)).
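The two byte-count formulas can be compared with a minimal standalone sketch. The helper names NaiveBytes and PackedBytes are hypothetical and are not onnxruntime APIs; the sketch only assumes a 1-byte storage unit holding 4 two-bit elements, as described above.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; not part of onnxruntime.

// Suspected wrong formula: one storage unit (1 byte) per element.
constexpr std::size_t NaiveBytes(std::size_t count, std::size_t unit_size = 1) {
  return unit_size * count;
}

// Packed formula: 4 two-bit elements share one byte, rounded up.
constexpr std::size_t PackedBytes(std::size_t count, std::size_t per_byte = 4) {
  return (count + per_byte - 1) / per_byte;
}

// Checked at compile time for the 5-element uint2 case in this report.
static_assert(NaiveBytes(5) == 5, "unpacked size, as observed");
static_assert(PackedBytes(5) == 2, "packed size, as expected");
```

For the 4-element int2 tensor the same helpers give 4 and 1 bytes respectively, so the two formulas disagree for every count greater than 1.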
Observed (onnxruntime 1.27.0, Linux x64 CPU EP)
For uint2, the buffer behind GetTensorData<uint8_t>() holds 5 bytes, i.e. the runtime tensor is sized for unpacked storage (1 byte per element). The two packed bytes from the model (0x63 0x03) are memcpy'd into that buffer, leaving the trailing 3 bytes as zero. Decoded as 2-bit values one per byte (byte & 0x03), this gives [3, 3, 0, 0, 0] instead of the expected [3, 0, 2, 1, 3].
For int2, the buffer behind GetTensorData<uint8_t>() correctly has the packed size (1 byte for 4 elements), and the value unpacks correctly to [-2, -1, 0, 1].
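The decoding arithmetic above can be checked with a small sketch. PackUint2, UnpackUint2 and MisreadUint2 are hypothetical helpers written for this illustration, not onnxruntime functions; MisreadUint2 only models the observed behavior (packed bytes copied into a zero-filled one-byte-per-element buffer) and is not taken from the ORT source.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration only; not part of onnxruntime.

// Pack 2-bit values four per byte, lowest-order field first.
std::vector<std::uint8_t> PackUint2(const std::vector<std::uint8_t>& values) {
  std::vector<std::uint8_t> packed((values.size() + 3) / 4, 0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    packed[i / 4] |= static_cast<std::uint8_t>((values[i] & 0x03) << ((i % 4) * 2));
  }
  return packed;
}

// Correct decode: element i lives in byte i/4 at bit offset (i%4)*2.
std::vector<std::uint8_t> UnpackUint2(const std::vector<std::uint8_t>& packed,
                                      std::size_t count) {
  std::vector<std::uint8_t> out(count);
  for (std::size_t i = 0; i < count; ++i) {
    out[i] = static_cast<std::uint8_t>((packed[i / 4] >> ((i % 4) * 2)) & 0x03);
  }
  return out;
}

// Model of the observed behavior: packed bytes land in a zero-filled
// buffer of `count` bytes, which is then read as one element per byte.
std::vector<std::uint8_t> MisreadUint2(const std::vector<std::uint8_t>& packed,
                                       std::size_t count) {
  std::vector<std::uint8_t> buffer(count, 0);
  for (std::size_t i = 0; i < packed.size() && i < count; ++i) {
    buffer[i] = packed[i];
  }
  for (auto& b : buffer) b &= 0x03;
  return buffer;
}
```

Calling PackUint2 on {3, 0, 2, 1, 3} yields the bytes 0x63 0x03; UnpackUint2 on those bytes with count 5 recovers the original values, while MisreadUint2 reproduces the [3, 3, 0, 0, 0] result reported here.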
Expected
uint2 should be loaded with the same packed semantics as int2. The runtime tensor should be the packed size (ceil(count/4) bytes), so the model's packed initializer is consumed verbatim and the high-level unpacking machinery sees the right values. With the repro below, the runtime tensor for uint2 should be 2 bytes 63 03, decoding to [3, 0, 2, 1, 3].
Why we think it's the same bug class as #28171 / #29157
sizeof(UInt2x4) == 1 (per include/onnxruntime/core/framework/int2.h).
Multiple code paths in ORT historically computed sub-byte tensor byte size as element_type->Size() * count. For 5 uint2 elements this gives 1 * 5 = 5 bytes, exactly the unpacked size we observe.
#28171 fixed this in CopyCpuTensor by using Tensor::SizeInBytes(), and #29157 (Fix over-copy of packed sub-byte tensors in OrtApi::GetValue) is doing the same in OrtApi::GetValue. The initializer-load path appears to be another, still-unfixed instance.
The INT2 vs UINT2 asymmetry, given that the registrations in data_types.cc are perfectly symmetric (ORT_REGISTER_PRIM_SUBBYTE_TYPE(Int2x4, 4) / ORT_REGISTER_PRIM_SUBBYTE_TYPE(UInt2x4, 4)), suggests the regression is in one or more dispatch sites whose INT2/UINT2 arms were added at different times, with only one taking the packing-aware path. UINT4 / INT4 initializers behave correctly with the same model construction, so the regression is specific to UINT2 (and possibly only via the initializer/output path).
Suggested fix area
A grep of the codebase for sub-byte byte-size formulas that don't go through Tensor::SizeInBytes(), particularly in the TensorProto → Tensor materialization for initializers (e.g. tensorprotoutils.cc's TensorProtoToTensor / GetSizeInBytesFromTensorProto), is likely to find the remaining site. The fact that INT2 works suggests the dispatch table has a packed path; the UINT2 arm just isn't taking it.
To reproduce
Build a minimal model with a single Constant output of type UINT2, packed per the ONNX spec (4 elements per byte, low field first):
Read the runtime tensor's raw bytes back via the C API (Python's onnxruntime binding has no numpy mapping for UInt2x4, so verification has to go through C++):
// g++ -std=c++17 -I/path/to/onnxruntime/include repro.cpp -L/path/to/onnxruntime/lib -lonnxruntime
#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <cstdio>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");
  Ort::SessionOptions so;
  const char* names_u[] = {"uint2"};
  const char* names_i[] = {"int2"};
  Ort::Session s_u(env, "uint2.onnx", so);
  Ort::Session s_i(env, "int2.onnx", so);

  // Neither model has inputs; each one exposes a single Constant output.
  auto v_u = s_u.Run(Ort::RunOptions{}, nullptr, nullptr, 0, names_u, 1);
  auto v_i = s_i.Run(Ort::RunOptions{}, nullptr, nullptr, 0, names_i, 1);
  auto info_u = v_u[0].GetTensorTypeAndShapeInfo();
  auto info_i = v_i[0].GetTensorTypeAndShapeInfo();

  // View each output's raw storage as bytes.
  auto* du = v_u[0].GetTensorData<uint8_t>();
  auto* di = v_i[0].GetTensorData<uint8_t>();

  // Dump one byte per element. For a packed tensor this reads past the
  // ceil(count/4) packed bytes, so only the leading bytes are tensor data.
  printf("uint2 element_count=%zu bytes=", info_u.GetElementCount());
  for (size_t i = 0; i < info_u.GetElementCount(); i++) printf("%02X ", du[i]);
  printf("\nint2 element_count=%zu bytes=", info_i.GetElementCount());
  for (size_t i = 0; i < info_i.GetElementCount(); i++) printf("%02X ", di[i]);
  printf("\n");
}
Urgency
No response
Platform
Linux
OS Version
Ubuntu 24.04 (WSL2)
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.27.0
ONNX Runtime API
C++
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
CPUExecutionProvider, onnx: 1.22.0