Skip to content

Conversation

@jahnvi480
Copy link
Contributor

@jahnvi480 jahnvi480 commented Sep 30, 2025

Work Item / Issue Reference

AB#39049

GitHub Issue: #250


Summary

This pull request improves the handling of character encoding and decoding in the mssql_python/cursor.py module. The main changes ensure that encoding and decoding settings are dynamically retrieved from the connection, allowing for more robust and flexible support for different character sets when executing queries and fetching results.

Encoding and decoding settings improvements:

  • Added _get_encoding_settings and _get_decoding_settings helper methods to retrieve encoding and decoding configurations from the connection, with sensible fallbacks if unavailable.

Query execution enhancements:

  • Updated the execute and executemany methods to use dynamic encoding and character type settings when calling the underlying DDBC bindings, ensuring queries are sent with the correct encoding.
    Result fetching improvements:

  • Modified fetchone, fetchmany, and fetchall methods to use dynamic decoding settings for character and wide character data, improving reliability and compatibility when reading results from the database.

Copilot AI review requested due to automatic review settings September 30, 2025 06:54
@github-actions github-actions bot added the pr-size: large Substantial code update label Sep 30, 2025
size_t copySize = std::min(wstr.size(), info.columnSize);
#if defined(_WIN32)
// Windows: direct copy
wmemcpy(&wcharArray[i * (info.columnSize + 1)], wstr.c_str(), copySize);

Check warning

Code scanning / devskim

These functions are historically error-prone and have been associated with a significant number of vulnerabilities. Most of these functions have safer alternatives, such as replacing 'strcpy' with 'strlcpy' or 'strcpy_s'. Warning

Banned C function detected
}

size_t copySize = std::min(str.size(), info.columnSize);
memcpy(&charArray[i * (info.columnSize + 1)], str.c_str(), copySize);

Check notice

Code scanning / devskim

There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternitives perform additional validation of the source and destination buffer Note

Problematic C function detected (memcpy)
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances character encoding and decoding support in the mssql-python library to address issues with non-UTF-8 character sets, particularly East Asian encodings like GBK. The main focus is on making encoding and decoding settings dynamically configurable and properly handling character conversion during query execution and result fetching.

  • Added dynamic encoding/decoding configuration retrieval from connection objects
  • Enhanced parameter binding to use connection-specific encoding settings for SQL_C_CHAR and SQL_C_WCHAR types
  • Updated result fetching to apply proper character decoding based on connection settings

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
tests/test_003_connection.py Added comprehensive test cases for various encoding scenarios including GBK, UTF-8, East Asian characters, and diagnostic tests
mssql_python/pybind/ddbc_bindings.cpp Enhanced parameter binding and result fetching with encoding-aware string conversion functions and extensive debug logging
mssql_python/cursor.py Added helper methods to retrieve encoding/decoding settings from connection and updated execute/fetch methods to use dynamic settings

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

// Use unix-specific conversion to handle the wchar_t/SQLWCHAR size difference
SQLWCHAR* wcharData = &buffers.wcharBuffers[col - 1][i * fetchBufferSize];
// Use DecodeString directly with the raw data
py::object decodedStr = DecodeString(wcharData, numCharsInData * sizeof(SQLWCHAR), wcharEncoding, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we reconstruct byte length as numCharsInData * sizeof(SQLWCHAR). If numCharsInData was derived from a buffer that includes a null terminator or unused capacity, or if a surrogate pair boundary was truncated, we could mis-decode. Prefer passing the original ODBC dataLen (in bytes) captured per row/column rather than recomputing. Also add a defensive check: if (bytesLen % 2 != 0) log & fail for wide data.

Please talk to @gargsaumya for these implementations

Copy link
Contributor

@sumitmsft sumitmsft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a bunch of comments..

Comment on lines +2029 to +2030
sqlwchars.size() * sizeof(SQLWCHAR));
wcharArray[i * (info.columnSize + 1) + sqlwchars.size()] = 0;

Check notice

Code scanning / devskim

There are a number of conditions in which memcpy can introduce a vulnerability (mismatched buffer sizes, null pointers, etc.). More secure alternitives perform additional validation of the source and destination buffer Note

Problematic C function detected (memcpy)
@Ramudaykumar
Copy link
Contributor

This comment is not just this PR specific, rather binding.cpp file specific. I would suggest to create a task to track these.

  1. String/Wide String Conversion Overhead

For wide-char parameters, the code sometimes round-trips through Python encode/decode, which is inefficient.
For py::str to std::wstring, use param.caststd::wstring() directly.
Only use encode/decode for bytes/bytearray or when explicit encoding is needed.

  1. LOB/Variable-Length Data Handling
    For columns with unknown or huge size, the code falls back to streaming, but the allocation logic for buffers (e.g., 2GB for LONGVARCHAR) is risky.
    Always prefer streaming for LOBs.
    Avoid allocating large vectors based on column size alone; use a reasonable cap and document the fallback.

Copy link
Contributor

@gargsaumya gargsaumya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments for improvement.

// Use PyUnicode_DecodeUTF16 directly for best performance
// Note: SQLWCHAR is always 2 bytes (UTF-16LE) on all platforms for SQL Server
int byteorder = -1;
PyObject* unicode = PyUnicode_DecodeUTF16(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This usage is correct. The py::reinterpret_steal immediately wraps the PyObject*, ensuring automatic cleanup. This is the pattern that should be used throughout.

@jahnvi480
Copy link
Contributor Author

Closing this PR created a new PR#301

@jahnvi480 jahnvi480 closed this Oct 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: large Substantial code update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants