fix: Prevent silent conversion of float array to int (#96)

jpbrodrick89 · dionhaefner · web-flow · commit 6111b7b23c68 · 2025-03-27T14:16:50.000Z
#### Relevant issue or PR

N/A

#### Description of changes

Tesseract `Arrays` are intended to prevent conversion from float to int due to the following line in `runtime.array_encoding.coerce_shape_dtype`:

```python
if not np.can_cast(arr.dtype, expected_dtype, casting="same_kind"):
```

However, this does not work as intended due to casting happening at an earlier stage in `runtime.array_encoding.python_to_array`

```python
arr = np.asarray(val, dtype=expected_dtype, order="C")
```

This PR removes the initial casting by changing the `dtype` kwarg to `None` (note that this is its default so we could remove entirely if preferred). Unfortunately, `asarray` does not offer a `casting` kwarg so we can't pass `"same_kind"` any earlier.

#### Testing done

- [x] Manually tested validation errors raised when trying to read json with float arrays into an int array:

```
╭─ Error ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Invalid value for ApplyInputSchema: 4 validation errors for ApplyInputSchema │
│ inputs.a.`chain[EncodedArrayModel__any__int32__DIFFERENTIABLE,function-plain[functools.partial(<function decode_array at │
│ 0x10542a980>, expected_shape=(None,), expected_dtype='int32')()]]` │
│ Input should be a valid dictionary or instance of EncodedArrayModel__any__int32__DIFFERENTIABLE [type=model_type, input_value=[1.1, │
│ 2.5, 3.9], input_type=list] │
│ For further information visit https://errors.pydantic.dev/2.10/v/model_type │
│ inputs.a.`function-plain[functools.partial(<function python_to_array at 0x10542aca0>, expected_shape=(None,), │
│ expected_dtype='int32')()]` │
│ Value error, Dtype mismatch: float64 cannot be cast to int32 [type=value_error, input_value=[1.1, 2.5, 3.9], input_type=list] │
│ For further information visit https://errors.pydantic.dev/2.10/v/value_error │
│ inputs.b.`chain[EncodedArrayModel__any__int32__DIFFERENTIABLE,function-plain[functools.partial(<function decode_array at │
│ 0x10542a980>, expected_shape=(None,), expected_dtype='int32')()]]` │
│ Input should be a valid dictionary or instance of EncodedArrayModel__any__int32__DIFFERENTIABLE [type=model_type, input_value=[4.1, │
│ 5.5, 6.9], input_type=list] │
│ For further information visit https://errors.pydantic.dev/2.10/v/model_type │
│ inputs.b.`function-plain[functools.partial(<function python_to_array at 0x10542aca0>, expected_shape=(None,), │
│ expected_dtype='int32')()]` │
│ Value error, Dtype mismatch: float64 cannot be cast to int32 [type=value_error, input_value=[4.1, 5.5, 6.9], input_type=list] │
│ For further information visit https://errors.pydantic.dev/2.10/v/value_error │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

- [x] Manually tested that validation errors are now raised when converting from float to int

```python
>>> import numpy as np
>>> from tesseract_core.runtime import Array, Int32
>>> from pydantic import TypeAdapter
>>> TypeAdapter(Array[..., 'int32']).validate_python(2.5*np.ones((3,3)))
Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    TypeAdapter(Array[..., 'int32']).validate_python(2.5*np.ones((3,3)))
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/Users/jonathanbrodrick/.virtualenvs/ergodic/lib/python3.13/site-packages/pydantic/type_adapter.py", line 412, in validate_python
    return self.validator.validate_python(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        object,
        ^^^^^^^
    ...<3 lines>...
        allow_partial=experimental_allow_partial,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for json-or-python[json=chain[EncodedArrayModel__anyrank__int32__noflags,function-plain[functools.partial(<function decode_array at 0x105d6a340>, expected_shape=Ellipsis, expected_dtype='int32')()]],python=union[chain[EncodedArrayModel__anyrank__int32__noflags,function-plain[functools.partial(<function decode_array at 0x105d6a340>, expected_shape=Ellipsis, expected_dtype='int32')()]],function-plain[functools.partial(<function python_to_array at 0x105d6a660>, expected_shape=Ellipsis, expected_dtype='int32')()]]]
`chain[EncodedArrayModel__anyrank__int32__noflags,function-plain[functools.partial(<function decode_array at 0x105d6a340>, expected_shape=Ellipsis, expected_dtype='int32')()]]`
  Input should be a valid dictionary or instance of EncodedArrayModel__anyrank__int32__noflags [type=model_type, input_value=array([[2.5, 2.5, 2.5], ... [2.5, 2.5, 2.5]]), input_type=ndarray]
    For further information visit https://errors.pydantic.dev/2.10/v/model_type
`function-plain[functools.partial(<function python_to_array at 0x105d6a660>, expected_shape=Ellipsis, expected_dtype='int32')()]`
  Value error, Dtype mismatch: float64 cannot be cast to int32 [type=value_error, input_value=array([[2.5, 2.5, 2.5], ... [2.5, 2.5, 2.5]]), input_type=ndarray]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error
```

- [x] CI passes

#### License

- [x] By submitting this pull request, I confirm that my contribution is made under the terms of the [Apache 2.0 license](https://pasteurlabs.github.io/tesseract/LICENSE).
- [x] I sign the Developer Certificate of Origin below by adding my name and email address to the `Signed-off-by` line.

<details>
<summary><b>Developer Certificate of Origin</b></summary>

```text
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```

</details>

Signed-off-by: Jonathan Brodrick <jonathan.brodrick@simulation.science>

---------

Co-authored-by: Dion Häfner <dion.haefner@simulation.science>
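
The failure mode described above can be reproduced with plain NumPy, independent of Tesseract. The following is a minimal sketch, not the project's code: the variable names are illustrative, and it only contrasts eager casting via `dtype=` with letting NumPy infer the dtype before a `same_kind` check.

```python
import numpy as np

values = [1.1, 2.5, 3.9]

# Passing dtype to asarray casts eagerly: the fractional parts are
# truncated without any error or warning.
eager = np.asarray(values, dtype="int32", order="C")

# Omitting dtype lets NumPy infer float64, so the original dtype is still
# visible to a later casting check.
inferred = np.asarray(values, order="C")

# After eager casting the check sees int32 -> int32 and passes trivially.
castable_after_eager_cast = np.can_cast(eager.dtype, "int32", casting="same_kind")

# With the inferred dtype the check sees float64 -> int32 and rejects it.
castable_from_inferred = np.can_cast(inferred.dtype, "int32", casting="same_kind")

print(eager, castable_after_eager_cast, castable_from_inferred)
```

Because the early cast hides the original dtype, the `same_kind` check can only do its job if the array reaches it uncast.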
diff --git a/tesseract_core/runtime/array_encoding.py b/tesseract_core/runtime/array_encoding.py
@@ -356,11 +356,14 @@ def python_to_array(
     val: Any, expected_shape: ShapeType, expected_dtype: Optional[str]
 ) -> ArrayLike:
     """Convert a Python object to a NumPy array."""
-    try:
-        arr = np.asarray(val, dtype=expected_dtype, order="C")
-    except TypeError as exc:
-        raise ValueError(f"Could not convert {val} to NumPy array") from exc
-    return _coerce_shape_dtype(arr, expected_shape, expected_dtype)
+    val = np.asarray(val, order="C")
+    if not np.issubdtype(val.dtype, np.number) and not np.issubdtype(
+        val.dtype, np.bool_
+    ):
+        raise ValueError(
+            f"Could not convert object to numeric NumPy array (got dtype: {val.dtype})"
+        )
+    return _coerce_shape_dtype(val, expected_shape, expected_dtype)
 
 
 def decode_array(
@@ -381,7 +384,15 @@ def decode_array(
 
         # keep checking for "raw" for backwards compat
         elif val.data.encoding in {"json", "raw"}:
-            data = np.asarray(val.data.buffer, dtype=val.dtype).reshape(val.shape)
+            data = np.asarray(val.data.buffer).reshape(val.shape)
+            if np.issubdtype(data.dtype, np.floating) and np.issubdtype(
+                val.dtype, np.integer
+            ):
+                if np.any(data % 1):
+                    raise ValueError(
+                        f"Expected integer data, but got floating point data: {data}"
+                    )
+            data = data.astype(val.dtype, casting="unsafe")
 
         else:
             # Unreachable
diff --git a/tests/runtime_tests/test_core.py b/tests/runtime_tests/test_core.py
@@ -408,7 +408,7 @@ def test_ad_endpoint_bad_tangent(testmodule, endpoint_name, failure_mode):
             msg = "String should match pattern"
         elif failure_mode == "invalid":
             tangent_vector = {k: "ahoy" for k in ad_inp}
-            msg = "Got invalid dtype"
+            msg = "Could not convert object"
 
         inputs = {
             "inputs": test_input,
@@ -427,7 +427,7 @@ def test_ad_endpoint_bad_tangent(testmodule, endpoint_name, failure_mode):
             msg = "String should match pattern"
         elif failure_mode == "invalid":
             cotangent_vector = {k: "ahoy" for k in ad_out}
-            msg = "Got invalid dtype"
+            msg = "Could not convert object"
 
         inputs = {
             "inputs": test_input,
@@ -436,8 +436,6 @@ def test_ad_endpoint_bad_tangent(testmodule, endpoint_name, failure_mode):
             "cotangent_vector": cotangent_vector,
         }
 
-    with pytest.raises(ValidationError) as excinfo:
+    with pytest.raises(ValidationError, match=msg):
         inputs = EndpointSchema.model_validate(inputs)
         endpoint_func(inputs)
-
-    assert msg in str(excinfo.value)
diff --git a/tests/runtime_tests/test_schema_types.py b/tests/runtime_tests/test_schema_types.py
@@ -447,3 +447,75 @@ class MyLazyModel(BaseModel):
     assert len(allowed_types) == 2
     assert allowed_types[0]["type"] == "array"
     assert allowed_types[1]["type"] == "string"
+
+
+def test_dtype_casting():
+    json_payload_str = MyModel(
+        array_int=arr_int,
+        array_float=arr_float,
+        array_bool=arr_bool,
+        scalar_int=scalar_int,
+    ).model_dump_json()
+
+    # Base case: proper int data (should work fine)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"]["data"] = {
+        "buffer": arr_int.flatten().tolist(),
+        "encoding": "json",
+    }
+    res = MyModel.model_validate(json_payload)
+    assert np.array_equal(res.array_int, arr_int)
+
+    # Case 1: floats in JSON array w/o fractional parts (should work fine)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"]["data"] = {
+        "buffer": arr_int.astype(float).flatten().tolist(),
+        "encoding": "json",
+    }
+    res = MyModel.model_validate(json_payload)
+    assert np.array_equal(res.array_int, arr_int)
+
+    # Case 2: floats in JSON array w/ fractional parts (should raise)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"]["data"] = {
+        "buffer": (arr_int.astype(float) + 1e-6).flatten().tolist(),
+        "encoding": "json",
+    }
+    with pytest.raises(ValidationError, match="Expected integer data"):
+        MyModel.model_validate(json_payload)
+
+    # Case 3: pass NumPy array directly (should work fine)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"] = arr_int
+    res = MyModel.model_validate(json_payload)
+    assert np.array_equal(res.array_int, arr_int)
+
+    # Case 4: pass NumPy array with incompatible dtype (should raise)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"] = arr_int.astype(np.float32)
+    with pytest.raises(ValidationError, match="cannot be cast"):
+        MyModel.model_validate(json_payload)
+
+    # Case 5: pass JSON data directly (should work fine)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"] = arr_int.tolist()
+    res = MyModel.model_validate(json_payload)
+    assert np.array_equal(res.array_int, arr_int)
+
+    # Case 6: pass JSON data with incompatible dtype (should raise)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"] = arr_int.astype(np.float32).tolist()
+    with pytest.raises(ValidationError, match="cannot be cast"):
+        MyModel.model_validate(json_payload)
+
+    # Case 7: Pass non-numeric data (should raise)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"] = ["a", "b", "c"]
+    with pytest.raises(ValidationError, match="Could not convert object"):
+        MyModel.model_validate(json_payload)
+
+    # Case 8: Pass non-numeric Python object (should raise)
+    json_payload = json.loads(json_payload_str)
+    json_payload["array_int"] = object()
+    with pytest.raises(ValidationError, match="Could not convert object"):
+        MyModel.model_validate(json_payload)
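
The two checks introduced in `array_encoding.py` can be exercised in isolation. The sketch below mirrors their logic with plain NumPy; `to_numeric_array` and `cast_json_buffer` are hypothetical helper names chosen for illustration and are not part of the Tesseract API.

```python
import numpy as np


def to_numeric_array(val):
    """Convert val to an array, rejecting non-numeric, non-boolean dtypes."""
    arr = np.asarray(val, order="C")
    is_numeric = np.issubdtype(arr.dtype, np.number)
    is_bool = np.issubdtype(arr.dtype, np.bool_)
    if not is_numeric and not is_bool:
        raise ValueError(
            f"Could not convert object to numeric NumPy array (got dtype: {arr.dtype})"
        )
    return arr


def cast_json_buffer(buffer, shape, dtype):
    """Cast a JSON-style buffer to dtype, refusing floats with fractional parts."""
    data = np.asarray(buffer).reshape(shape)
    wants_int = np.issubdtype(np.dtype(dtype), np.integer)
    if np.issubdtype(data.dtype, np.floating) and wants_int:
        # data % 1 is non-zero wherever a value has a fractional part
        if np.any(data % 1):
            raise ValueError(
                f"Expected integer data, but got floating point data: {data}"
            )
    return data.astype(dtype, casting="unsafe")


# Integer-valued floats (as JSON often delivers them) are accepted.
ok = cast_json_buffer([1.0, 2.0, 3.0, 4.0], (2, 2), "int32")

# Floats with fractional parts are rejected instead of truncated.
try:
    cast_json_buffer([1.0, 2.5], (2,), "int32")
    rejected_fractional = False
except ValueError:
    rejected_fractional = True

# Strings produce a non-numeric dtype and are rejected up front.
try:
    to_numeric_array(["a", "b", "c"])
    rejected_strings = False
except ValueError:
    rejected_strings = True
```

Accepting integer-valued floats matches Case 1 of the test above, while the fractional and string inputs correspond to Cases 2 and 7.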