Skip to content

Conversation

@staticintlucas
Copy link

Fixes #5651.

Fixes the implementation of PyErrArguments for string::FromUtf8Error and ffi::IntoStringError.

Removes the broken implementation of PyErrArguments for str::Utf8Error, string::FromUtf16Error, and char::DecodeUtf16Error.

Adds a convenience trait called IntoPyErrWithBytes for str::Utf8Error to allow easy conversion to PyErr. The implementation of From<str::Utf8Error> for PyErr was being used in a few places internally in PyO3, so this felt like the best solution. And unlike the UTF-16 errors, str::Utf8Error contains everything needed to construct a Python UnicodeDecodeError except the source bytes. I exposed this API publicly since it's probably a pretty common use case (at least compared to UTF-16), but let me know if you want any changes (especially regarding API naming).

src/err/mod.rs Outdated
/// str::from_utf8(bytes).map_err(|e| e.into_with_bytes(bytes))
/// # }
/// ```
pub trait IntoPyErrWithBytes {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather that adding a new trait I would prefer if we just made a constructor on PyUnicodeDecodeError which returns the PyErr. Internally it could make use of PyUnicodeDecodeError::new_utf8 to construct it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the implementation to a new constructor for PyUnicodeDecodeError.

I don't want to call PyUnicodeDecodeError::new_utf8 since it is fallible and I don't want to have to worry about handling errors while already creating an error. I've kept the implementation mostly the same using a private Utf8ErrorWithBytes struct which implements PyErrArguments and calls PyUnicodeDecodeError::new_err under the hood.

Lemme know if this approach sounds reasonable to you, or you want any more changes

@staticintlucas staticintlucas force-pushed the main branch 3 times, most recently from feddff7 to a8b289c Compare December 2, 2025 21:28
@staticintlucas
Copy link
Author

staticintlucas commented Dec 2, 2025

Will add some tests for my changes tomorrow next week to make codecov happy

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is looking great, brilliant to fix a broken corner in our error handling!

let bytes = types::PyBytes::new(py, &bytes).into_any();
let start = types::PyInt::new(py, start).into_any();
let end = types::PyInt::new(py, end).into_any();
let reason = types::PyString::new(py, "invalid utf-8").into_any();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's hard to do better than this with the current API, we'd have to peek the invalid bytes and deduce the reason for failure?

(Probably not worth doing that, just checking.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can check for incomplete byte sequences (i.e. unexpected end of input) versus invalid bytes by checking if err.error_len() is None. But the Rust error type does not differentiate between the different kinds of invalid bytes (invalid start byte, invalid continuation, encoding out of range, etc...)

pub fn new_err_from_utf8(bytes: &[u8], err: std::str::Utf8Error) -> crate::PyErr {
Utf8ErrorWithBytes {
err,
bytes: bytes.to_vec(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid this intermediate allocation? I guess not, because we don't have py: Python here.

... is it possibly better to take 'py: Python as an argument, call new_utf8 above to construct Python bytes immediately, and then use PyErr::from_value inside this function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT I just saw your comment about fallibility.

I think if new_utf8 returns an error, we could just bubble that up with return e. I guess that only happens in weird edge cases like keyboard interrupt, memory allocation error etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or we just return PyResult<PyErr> here and let callers decide whether to flatten the other errors away, but the API does feel a bit awkward then.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I didn't really want to return a PyResult<PyErr>, in part because it would be an awkward API and in part to keep symmetry between new_err_from_utf8 and new_err.

Adding a py: Python argument would also break the symmetry with new_err, but if you think that's worth it to avoid the extra allocation then I can make that change

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection I think that having the return type just be PyErr is good enough, the failure mode is unlikely and users will probably always just flatten the outer error anyway because there's not an obvious alternative.

Taking py: Python I think is fine; the discussion we've had around other parts of the PyO3 error handling story is strongly leaning towards the conclusion that we will want to require py: Python to create a PyErr in the future (because it leads to more efficient implementations, as we find here).

src/err/impls.rs Outdated
Comment on lines 141 to 145
impl std::convert::From<exceptions::Utf8ErrorWithBytes> for PyErr {
fn from(err: exceptions::Utf8ErrorWithBytes) -> PyErr {
exceptions::PyUnicodeDecodeError::new_err(err)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation may not be needed if the changes to new_err_from_utf8 to call from_utf8 as proposed in my other comment.

@staticintlucas
Copy link
Author

I think the failing test should be due to Rust diagnostic messages changing in 1.92, not related to my change

Copy link
Member

@davidhewitt davidhewitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow reply. I see you worked out that merging with main would address the CI issues 👍

pub fn new_err_from_utf8(bytes: &[u8], err: std::str::Utf8Error) -> crate::PyErr {
Utf8ErrorWithBytes {
err,
bytes: bytes.to_vec(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reflection I think that having the return type just be PyErr is good enough, the failure mode is unlikely and users will probably always just flatten the outer error anyway because there's not an obvious alternative.

Taking py: Python I think is fine; the discussion we've had around other parts of the PyO3 error handling story is strongly leaning towards the conclusion that we will want to require py: Python to create a PyErr in the future (because it leads to more efficient implementations, as we find here).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TypeError converting Rust's FromUtf8Error to Python's UnicodeDecodeError

3 participants