CPU EP mis-loads packed `UINT2` Constant initializer (treats `UInt2x4` as unpacked storage; `INT2` unaffected)

### Describe the issue
A `Constant` initializer of type `UINT2` is mis-loaded by the CPU execution provider in onnxruntime 1.27.0. The runtime tensor is allocated as if storage were **unpacked** (1 byte per element), and the packed raw bytes are copied verbatim into that oversized buffer, leaving the trailing bytes as zero.
`INT2` with the same structure works correctly, which makes this a clear asymmetry rather than an intentional design choice.
This looks like another instance of the sub-byte byte-count formula bug class that #28171 (merged, in 1.27.0) fixed in `CopyCpuTensor`, and that #29157 is currently fixing in `OrtApi::GetValue`. We believe a remaining call site uses `element_type->Size() * count` instead of `Tensor::SizeInBytes()` for sub-byte initializers (the `sizeof(UInt2x4) == 1` static_assert makes the wrong formula yield `count` bytes instead of `ceil(count/4)`).

### Observed (onnxruntime 1.27.0, Linux x64 CPU EP)
```
uint2 element_count=5  bytes=63 03 00 00 00
int2  element_count=1  bytes=4E
```
For `uint2`, the repro prints 5 bytes from `GetTensorData<uint8_t>()`, one per reported element, which indicates that the runtime tensor is sized for **unpacked** storage (1 byte per element). The two packed bytes from the model (`0x63 0x03`) are memcpy'd into that buffer, leaving the trailing 3 bytes as zero. Decoded as one 2-bit value per byte (`byte & 0x03`), this gives `[3, 3, 0, 0, 0]` instead of the expected `[3, 0, 2, 1, 3]`.
For `int2`, the buffer read through `GetTensorData<uint8_t>()` has the packed size (1 byte for 4 elements), and the value unpacks correctly to `[-2, -1, 0, 1]`.
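
The two decodings can be checked offline with a small Python sketch. The helper names (`unpack_uint2`, `unpack_int2`, `read_one_per_byte`) are made up for illustration and are not onnxruntime APIs; the layout assumed is the one used in this report (4 elements per byte, low field first).

```python
def unpack_uint2(packed: bytes, count: int) -> list[int]:
    """Decode `count` unsigned 2-bit values, 4 per byte, low field first."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0x03 for i in range(count)]


def unpack_int2(packed: bytes, count: int) -> list[int]:
    """Decode signed 2-bit values by sign-extending the unsigned fields."""
    return [v - 4 if v >= 2 else v for v in unpack_uint2(packed, count)]


def read_one_per_byte(buf: bytes, count: int) -> list[int]:
    """Mimic the unpacked interpretation: one element per byte, low 2 bits."""
    return [b & 0x03 for b in buf[:count]]


# Packed bytes from the model, decoded with packed semantics (expected result).
print(unpack_uint2(bytes([0x63, 0x03]), 5))                  # [3, 0, 2, 1, 3]
# The 5 bytes observed at runtime, decoded one element per byte.
print(read_one_per_byte(bytes([0x63, 0x03, 0, 0, 0]), 5))    # [3, 3, 0, 0, 0]
# The int2 case, which already behaves correctly.
print(unpack_int2(bytes([0x4E]), 4))                         # [-2, -1, 0, 1]
```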

### Expected
`uint2` should be loaded with the same packed semantics as `int2`. The runtime tensor should be the packed size (`ceil(count/4)` bytes), so the model's packed initializer is consumed verbatim and the high-level unpacking machinery sees the right values. With the repro from the "To reproduce" section, the runtime tensor for `uint2` should be 2 bytes `63 03`, decoding to `[3, 0, 2, 1, 3]`.

### Why we think it's the same bug class as #28171 / #29157
- `sizeof(UInt2x4) == 1` (per `include/onnxruntime/core/framework/int2.h`).
- Multiple code paths in ORT historically computed sub-byte tensor byte size as `element_type->Size() * count`. For 5 uint2 elements this gives `1 * 5 = 5` bytes — exactly the unpacked size we observe.
- #28171 already replaced this anti-pattern with `Tensor::SizeInBytes()` in `CopyCpuTensor`, and #29157 is doing the same in `OrtApi::GetValue`. The initializer-load path appears to be another, still-unfixed instance.
- The INT2 vs UINT2 asymmetry, given that the registrations in `data_types.cc` are perfectly symmetric (`ORT_REGISTER_PRIM_SUBBYTE_TYPE(Int2x4, 4)` / `ORT_REGISTER_PRIM_SUBBYTE_TYPE(UInt2x4, 4)`), suggests the bug is in one or more dispatch sites whose `INT2`/`UINT2` arms were added at different times, with only one taking the packing-aware path.

`UINT4` / `INT4` initializers behave correctly with the same model construction, so the bug is specific to `UINT2` (and possibly only via the initializer/output path).
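
To make the arithmetic concrete, here is a minimal Python sketch of the two byte-size formulas. Both function names are hypothetical stand-ins: `naive_size_in_bytes` models the suspected `element_type->Size() * count` pattern, and `packed_size_in_bytes` models what a packing-aware computation such as `Tensor::SizeInBytes()` is expected to return. Neither is actual ORT code.

```python
def naive_size_in_bytes(element_size: int, count: int) -> int:
    """Suspected anti-pattern: sizeof(container type) times element count."""
    return element_size * count


def packed_size_in_bytes(count: int, elems_per_byte: int) -> int:
    """Packing-aware size: ceil(count / elems_per_byte) via integer math."""
    return (count + elems_per_byte - 1) // elems_per_byte


SIZEOF_UINT2X4 = 1  # sizeof(UInt2x4) == 1, per the static_assert cited above

# 5 uint2 elements: the naive formula matches the 5 bytes observed at runtime.
print(naive_size_in_bytes(SIZEOF_UINT2X4, 5))  # 5
print(packed_size_in_bytes(5, 4))              # 2

# 4 int2 elements: the packed size is the single byte observed for int2.
print(packed_size_in_bytes(4, 4))              # 1
```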

### Related
- #28171 — Fix overflow in CopyCpuTensor for sub-byte types (merged, in 1.27.0)
- #29157 — Fix over-copy of packed sub-byte tensors in OrtApi::GetValue (open)
- #26824 — Add type definitions, registration, utilities for INT2/UINT2 support
- #27022 — Add INT2 and UINT2 support for QDQ, transpose and cast ops

### Suggested fix area
Grepping the codebase for sub-byte byte-size formulas that don't go through `Tensor::SizeInBytes()` is likely to find the remaining site, particularly in the TensorProto → Tensor materialization for initializers (e.g. `TensorProtoToTensor` / `GetSizeInBytesFromTensorProto` in `tensorprotoutils.cc`). The fact that INT2 works suggests the dispatch table has a packed path; the UINT2 arm just isn't taking it.


### To reproduce

Build a minimal model with a single `Constant` output of type `UINT2`, packed per the ONNX spec (4 elements per byte, low field first):
```python
# pip install onnx onnxruntime==1.27.0
import onnx
import onnxruntime as ort
from onnx import TensorProto, helper


def build(name: str, tp: int, raw: bytes, dims: list[int]) -> bytes:
    t = helper.make_tensor(name, tp, dims, vals=raw, raw=True)
    g = helper.make_graph(
        [helper.make_node("Constant", [], [name], value=t)],
        "g", [], [helper.make_tensor_value_info(name, tp, dims)],
    )
    m = helper.make_model(g, opset_imports=[helper.make_opsetid("", 23)])
    m.ir_version = 10
    onnx.checker.check_model(m)
    return m.SerializeToString()


# 5 uint2 values [3, 0, 2, 1, 3] packed as 4-per-byte (low field first):
#   byte 0 = 3 | (0<<2) | (2<<4) | (1<<6) = 0x63
#   byte 1 = 3                            = 0x03
with open("uint2.onnx", "wb") as f:
    f.write(build("uint2", TensorProto.UINT2, bytes([0x63, 0x03]), [5]))

# 4 int2 values [-2, -1, 0, 1] (two's-complement bits 2, 3, 0, 1) packed:
#   byte 0 = 2 | (3<<2) | (0<<4) | (1<<6) = 0x4E
with open("int2.onnx", "wb") as f:
    f.write(build("int2", TensorProto.INT2, bytes([0x4E]), [4]))

# Both pass onnx.checker.check_model and create a session successfully.
ort.InferenceSession("uint2.onnx", providers=["CPUExecutionProvider"])
ort.InferenceSession("int2.onnx",  providers=["CPUExecutionProvider"])
```
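
The hand-computed bytes in the comments above can be double-checked with a small packing helper. This is only a sketch: `pack_2bit` is a hypothetical name (not part of `onnx` or `onnxruntime`), and it assumes the same layout as the script, 4 elements per byte with the low field first.

```python
def pack_2bit(values: list[int]) -> bytes:
    """Pack 2-bit values 4 per byte, low field first.

    Negative inputs are reduced to their two's-complement 2-bit pattern
    by masking with 0x03 (e.g. -2 -> 0b10, -1 -> 0b11).
    """
    out = bytearray((len(values) + 3) // 4)  # ceil(len / 4) bytes
    for i, v in enumerate(values):
        out[i // 4] |= (v & 0x03) << (2 * (i % 4))
    return bytes(out)


print(pack_2bit([3, 0, 2, 1, 3]).hex())  # 6303
print(pack_2bit([-2, -1, 0, 1]).hex())   # 4e
```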
Read the runtime tensor's raw bytes back via the C++ API (Python's onnxruntime binding has no numpy mapping for `UInt2x4`, so verification has to go through C++):
```cpp
// g++ -std=c++17 -I/path/to/onnxruntime/include repro.cpp -L/path/to/onnxruntime/lib -lonnxruntime
#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <cstdio>
int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "repro");
    Ort::SessionOptions so;
    // Output names match the Constant node outputs in the two models.
    const char* names_u[] = {"uint2"};
    const char* names_i[] = {"int2"};
    Ort::Session s_u(env, "uint2.onnx", so);
    Ort::Session s_i(env, "int2.onnx",  so);
    // The models have no inputs, so run with zero inputs and fetch one output.
    auto v_u = s_u.Run(Ort::RunOptions{}, nullptr, nullptr, 0, names_u, 1);
    auto v_i = s_i.Run(Ort::RunOptions{}, nullptr, nullptr, 0, names_i, 1);
    auto info_u = v_u[0].GetTensorTypeAndShapeInfo();
    auto info_i = v_i[0].GetTensorTypeAndShapeInfo();
    // View each output tensor's buffer as raw bytes.
    auto* du = v_u[0].GetTensorData<uint8_t>();
    auto* di = v_i[0].GetTensorData<uint8_t>();
    // Print one byte per reported element for each tensor.
    printf("uint2 element_count=%zu  bytes=", info_u.GetElementCount());
    for (size_t i = 0; i < info_u.GetElementCount(); i++) printf("%02X ", du[i]);
    printf("\nint2  element_count=%zu  bytes=", info_i.GetElementCount());
    for (size_t i = 0; i < info_i.GetElementCount(); i++) printf("%02X ", di[i]);
    printf("\n");
}
```



### Urgency

_No response_

### Platform

Linux

### OS Version

Ubuntu 24.04 (WSL2)

### ONNX Runtime Installation

Released Package

### ONNX Runtime Version or Commit ID

1.27.0

### ONNX Runtime API

C++

### Architecture

X64

### Execution Provider

Default CPU

### Execution Provider Library Version

CPUExecutionProvider, onnx: 1.22.0
