-
Notifications
You must be signed in to change notification settings - Fork 31
FIX: Encoding Decoding #265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| size_t copySize = std::min(wstr.size(), info.columnSize); | ||
| #if defined(_WIN32) | ||
| // Windows: direct copy | ||
| wmemcpy(&wcharArray[i * (info.columnSize + 1)], wstr.c_str(), copySize); |
Check warning
Code scanning / devskim
These functions are historically error-prone and have been associated with a significant number of vulnerabilities. Most of these functions have safer alternatives, such as replacing 'strcpy' with 'strlcpy' or 'strcpy_s'. Warning
| } | ||
|
|
||
| size_t copySize = std::min(str.size(), info.columnSize); | ||
| memcpy(&charArray[i * (info.columnSize + 1)], str.c_str(), copySize); |
Check notice
Code scanning / devskim
There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternitives perform additional validation of the source and destination buffer Note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances character encoding and decoding support in the mssql-python library to address issues with non-UTF-8 character sets, particularly East Asian encodings like GBK. The main focus is on making encoding and decoding settings dynamically configurable and properly handling character conversion during query execution and result fetching.
- Added dynamic encoding/decoding configuration retrieval from connection objects
- Enhanced parameter binding to use connection-specific encoding settings for SQL_C_CHAR and SQL_C_WCHAR types
- Updated result fetching to apply proper character decoding based on connection settings
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/test_003_connection.py | Added comprehensive test cases for various encoding scenarios including GBK, UTF-8, East Asian characters, and diagnostic tests |
| mssql_python/pybind/ddbc_bindings.cpp | Enhanced parameter binding and result fetching with encoding-aware string conversion functions and extensive debug logging |
| mssql_python/cursor.py | Added helper methods to retrieve encoding/decoding settings from connection and updated execute/fetch methods to use dynamic settings |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| // Use unix-specific conversion to handle the wchar_t/SQLWCHAR size difference | ||
| SQLWCHAR* wcharData = &buffers.wcharBuffers[col - 1][i * fetchBufferSize]; | ||
| // Use DecodeString directly with the raw data | ||
| py::object decodedStr = DecodeString(wcharData, numCharsInData * sizeof(SQLWCHAR), wcharEncoding, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we reconstruct byte length as numCharsInData * sizeof(SQLWCHAR). If numCharsInData was derived from a buffer that includes a null terminator or unused capacity, or if a surrogate pair boundary was truncated, we could mis-decode. Prefer passing the original ODBC dataLen (in bytes) captured per row/column rather than recomputing. Also add a defensive check: if (bytesLen % 2 != 0) log & fail for wide data.
Please talk to @gargsaumya for these implementations
sumitmsft
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a bunch of comments..
| sqlwchars.size() * sizeof(SQLWCHAR)); | ||
| wcharArray[i * (info.columnSize + 1) + sqlwchars.size()] = 0; |
Check notice
Code scanning / devskim
There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternitives perform additional validation of the source and destination buffer Note
… jahnvi/githubissue_250
|
This comment is not just this PR specific, rather binding.cpp file specific. I would suggest to create a task to track these.
For wide-char parameters, the code sometimes round-trips through Python encode/decode, which is inefficient.
|
gargsaumya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments for improvement.
| // Use PyUnicode_DecodeUTF16 directly for best performance | ||
| // Note: SQLWCHAR is always 2 bytes (UTF-16LE) on all platforms for SQL Server | ||
| int byteorder = -1; | ||
| PyObject* unicode = PyUnicode_DecodeUTF16( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This usage is correct. The py::reinterpret_steal immediately wraps the PyObject*, ensuring automatic cleanup. This is the pattern that should be used throughout.
|
Closing this PR created a new PR#301 |
Work Item / Issue Reference
Summary
This pull request improves the handling of character encoding and decoding in the
mssql_python/cursor.pymodule. The main changes ensure that encoding and decoding settings are dynamically retrieved from the connection, allowing for more robust and flexible support for different character sets when executing queries and fetching results.Encoding and decoding settings improvements:
_get_encoding_settingsand_get_decoding_settingshelper methods to retrieve encoding and decoding configurations from the connection, with sensible fallbacks if unavailable.Query execution enhancements:
Updated the
executeandexecutemanymethods to use dynamic encoding and character type settings when calling the underlying DDBC bindings, ensuring queries are sent with the correct encoding.Result fetching improvements:
Modified
fetchone,fetchmany, andfetchallmethods to use dynamic decoding settings for character and wide character data, improving reliability and compatibility when reading results from the database.