CPU EP mis-loads packed UINT2 Constant initializer (treats UInt2x4 as unpacked storage; INT2 unaffected) #29172

Description

@kibae

Describe the issue

A Constant initializer of type UINT2 is mis-loaded by the CPU execution provider in onnxruntime 1.27.0. The packed raw bytes are not unpacked into the runtime tensor; instead the packed bytes are copied verbatim into a buffer sized for unpacked storage (1 byte per element), leaving the trailing bytes as zero.
INT2 with the same structure works correctly, which makes this a clear asymmetry rather than an intentional design choice.
This looks like another instance of the sub-byte byte-count formula bug class that #28171 (merged, in 1.27.0) fixed in CopyCpuTensor, and that #29157 is currently fixing in OrtApi::GetValue. We believe a remaining call site uses element_type->Size() * count instead of Tensor::SizeInBytes() for sub-byte initializers. Because sizeof(UInt2x4) == 1 (enforced by a static_assert), the wrong formula yields count bytes instead of ceil(count/4).
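The two byte-count formulas contrasted above can be compared side by side. This is only a sketch of the arithmetic in Python: `naive_size_in_bytes` and `packed_size_in_bytes` are hypothetical helper names that stand in for `element_type->Size() * count` and the packed size that `Tensor::SizeInBytes()` is expected to return. They are not ONNX Runtime functions.

```python
def naive_size_in_bytes(count: int, element_size: int = 1) -> int:
    # Models element_type->Size() * count, with sizeof(UInt2x4) == 1.
    return element_size * count


def packed_size_in_bytes(count: int, elems_per_byte: int = 4) -> int:
    # Models the packed size: ceil(count / 4) for 2-bit elements.
    return (count + elems_per_byte - 1) // elems_per_byte


# Print both sizes for a few element counts, including the 5-element repro case.
for count in (1, 4, 5, 8, 9):
    print(count, naive_size_in_bytes(count), packed_size_in_bytes(count))
```

For the 5-element tensor in this report, the naive formula gives 5 bytes and the packed formula gives 2 bytes, which matches the observed and expected buffer sizes respectively.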

Observed (onnxruntime 1.27.0, Linux x64 CPU EP)

uint2 element_count=5  bytes=63 03 00 00 00
int2  element_count=1  bytes=4E

For uint2, GetTensorData<uint8_t>() returns 5 bytes — i.e. the runtime tensor is sized for unpacked storage (1 byte per element). The two packed bytes from the model (0x63 0x03) are memcpy'd into that buffer, leaving the trailing 3 bytes as zero. Decoded as 2-bit values one per byte (byte & 0x03), this gives [3, 3, 0, 0, 0] instead of the expected [3, 0, 2, 1, 3].
For int2, GetTensorData<uint8_t>() correctly returns the packed size (1 byte for 4 elements), and the value unpacks correctly to [-2, -1, 0, 1].
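The two decodings described above can be checked with a few lines of Python. This is a minimal sketch, assuming the low-field-first packing used in the repro; `decode_packed_2bit` and `decode_one_per_byte` are illustrative names, not part of onnx or onnxruntime.

```python
def decode_packed_2bit(raw: bytes, count: int, signed: bool = False) -> list[int]:
    """Decode `count` 2-bit fields, 4 per byte, lowest bits first."""
    out = []
    for i in range(count):
        field = (raw[i // 4] >> (2 * (i % 4))) & 0x03
        if signed and field >= 2:
            field -= 4  # two's-complement: 2 -> -2, 3 -> -1
        out.append(field)
    return out


def decode_one_per_byte(buf: bytes) -> list[int]:
    """Decode as if every byte held exactly one 2-bit element (byte & 0x03)."""
    return [b & 0x03 for b in buf]


# Packed interpretation of the model bytes (the expected values).
print(decode_packed_2bit(bytes([0x63, 0x03]), 5))
print(decode_packed_2bit(bytes([0x4E]), 4, signed=True))
# One-per-byte interpretation of the observed 5-byte buffer.
print(decode_one_per_byte(bytes([0x63, 0x03, 0x00, 0x00, 0x00])))
```

The first two calls reproduce [3, 0, 2, 1, 3] and [-2, -1, 0, 1]; the last reproduces the incorrect [3, 3, 0, 0, 0].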

Expected

uint2 should be loaded with the same packed semantics as int2. The runtime tensor should be the packed size (ceil(count/4) bytes), so the model's packed initializer is consumed verbatim and the high-level unpacking machinery sees the right values. With the repro above, the runtime tensor for uint2 should be 2 bytes 63 03, decoding to [3, 0, 2, 1, 3].

Why we think it's the same bug class as #28171 / #29157

  • sizeof(UInt2x4) == 1 (per include/onnxruntime/core/framework/int2.h).
  • Multiple code paths in ORT historically computed sub-byte tensor byte size as element_type->Size() * count. For 5 uint2 elements this gives 1 * 5 = 5 bytes — exactly the unpacked size we observe.
  • #28171 ("Fix overflow in CopyCpuTensor for sub-byte types") already replaced this anti-pattern with Tensor::SizeInBytes() in CopyCpuTensor, and #29157 ("Fix over-copy of packed sub-byte tensors in OrtApi::GetValue") is doing the same in OrtApi::GetValue. The initializer-load path appears to be another, still-unfixed instance.
  • The INT2 vs UINT2 asymmetry, given the registrations in data_types.cc are perfectly symmetric (ORT_REGISTER_PRIM_SUBBYTE_TYPE(Int2x4, 4) / ORT_REGISTER_PRIM_SUBBYTE_TYPE(UInt2x4, 4)), suggests the regression is in one or more dispatch sites whose INT2/UINT2 arms were added at different times and only one took the packing-aware path.
  • UINT4 / INT4 initializers behave correctly with the same model construction, so the regression is specific to UINT2 (and possibly only via the initializer/output path).

Suggested fix area

A grep of the codebase for sub-byte byte-size formulas that don't go through Tensor::SizeInBytes() — particularly in the TensorProto → Tensor materialization for initializers (e.g. tensorprotoutils.cc's TensorProtoToTensor / GetSizeInBytesFromTensorProto) — is likely to find the remaining site. The fact that INT2 works suggests the dispatch table has a packed path; the UINT2 arm just isn't taking it.

To reproduce

Build a minimal model with a single Constant output of type UINT2, packed per the ONNX spec (4 elements per byte, low field first):

# pip install onnx onnxruntime==1.27.0
import onnx
import onnxruntime as ort
from onnx import TensorProto, helper


def build(name: str, tp: int, raw: bytes, dims: list[int]) -> bytes:
    t = helper.make_tensor(name, tp, dims, vals=raw, raw=True)
    g = helper.make_graph(
        [helper.make_node("Constant", [], [name], value=t)],
        "g", [], [helper.make_tensor_value_info(name, tp, dims)],
    )
    m = helper.make_model(g, opset_imports=[helper.make_opsetid("", 23)])
    m.ir_version = 10
    onnx.checker.check_model(m)
    return m.SerializeToString()


# 5 uint2 values [3, 0, 2, 1, 3] packed as 4-per-byte (low field first):
#   byte 0 = 3 | (0<<2) | (2<<4) | (1<<6) = 0x63
#   byte 1 = 3                            = 0x03
with open("uint2.onnx", "wb") as f:
    f.write(build("uint2", TensorProto.UINT2, bytes([0x63, 0x03]), [5]))

# 4 int2 values [-2, -1, 0, 1] (two's-complement bits 2, 3, 0, 1) packed:
#   byte 0 = 2 | (3<<2) | (0<<4) | (1<<6) = 0x4E
with open("int2.onnx", "wb") as f:
    f.write(build("int2", TensorProto.INT2, bytes([0x4E]), [4]))

# Both pass onnx.checker.check_model and create a session successfully.
ort.InferenceSession("uint2.onnx", providers=["CPUExecutionProvider"])
ort.InferenceSession("int2.onnx",  providers=["CPUExecutionProvider"])
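The hand-computed bytes in the comments above can be cross-checked with a small packing helper. This is a sketch only: `pack_2bit` is a hypothetical name introduced here, and it assumes the same low-field-first layout as the repro, with negative int2 values masked to their two's-complement 2-bit pattern.

```python
def pack_2bit(values: list[int]) -> bytes:
    """Pack 2-bit values 4 per byte, lowest bits first; pad the last byte with zeros."""
    out = bytearray((len(values) + 3) // 4)
    for i, v in enumerate(values):
        # v & 0x03 keeps the low two bits, so -2 -> 0b10 and -1 -> 0b11.
        out[i // 4] |= (v & 0x03) << (2 * (i % 4))
    return bytes(out)


print(pack_2bit([3, 0, 2, 1, 3]).hex())   # uint2 payload used for uint2.onnx
print(pack_2bit([-2, -1, 0, 1]).hex())    # int2 payload used for int2.onnx
```

Both results agree with the raw bytes passed to build() in the repro (63 03 and 4E).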

Read the runtime tensor's raw bytes back via the C++ API (Python's onnxruntime binding has no numpy mapping for UInt2x4, so verification has to go through C++):

// g++ -std=c++17 -I/path/to/onnxruntime/include repro.cpp -L/path/to/onnxruntime/lib -lonnxruntime
#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <cstdio>
int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");
    Ort::SessionOptions so;
    const char* names_u[] = {"uint2"};
    const char* names_i[] = {"int2"};
    Ort::Session s_u(env, "uint2.onnx", so);
    Ort::Session s_i(env, "int2.onnx",  so);
    auto v_u = s_u.Run(Ort::RunOptions{}, nullptr, nullptr, 0, names_u, 1);
    auto v_i = s_i.Run(Ort::RunOptions{}, nullptr, nullptr, 0, names_i, 1);
    auto info_u = v_u[0].GetTensorTypeAndShapeInfo();
    auto info_i = v_i[0].GetTensorTypeAndShapeInfo();
    auto* du = v_u[0].GetTensorData<uint8_t>();
    auto* di = v_i[0].GetTensorData<uint8_t>();
    printf("uint2 element_count=%zu  bytes=", info_u.GetElementCount());
    for (size_t i = 0; i < info_u.GetElementCount(); i++) printf("%02X ", du[i]);
    printf("\nint2  element_count=%zu  bytes=", info_i.GetElementCount());
    for (size_t i = 0; i < info_i.GetElementCount(); i++) printf("%02X ", di[i]);
    printf("\n");
}

Urgency

No response

Platform

Linux

OS Version

Ubuntu 24.04 (WSL2)

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.27.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

CPUExecutionProvider, onnx: 1.22.0
