@reinhardt1053

Problem

Legacy Firebird databases commonly use charset NONE on text columns. In these databases, text is stored as raw bytes in the application's encoding (typically WIN1252) without any charset metadata on the columns.

The driver currently hardcodes utf8 as the connection charset (lc_ctype in the DPB). When connecting to a database with charset NONE columns, Firebird does not transliterate the data: it sends the raw bytes as-is. The driver then incorrectly decodes these WIN1252 bytes as UTF-8, corrupting accented characters:

  • Tournée → Tourn�e
  • Café → Caf�

This affects a large number of production Firebird databases where charset NONE was the default.
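
The corruption can be reproduced without a database. The following is a minimal sketch using only Node.js `Buffer`; the byte values are the WIN1252 encoding of 'Tournée', and nothing here is taken from the driver's code.

```typescript
import { Buffer } from 'node:buffer';

// 'Tournée' as a WIN1252 application would store it: é is the single byte 0xE9.
const win1252Bytes = Buffer.from([0x54, 0x6f, 0x75, 0x72, 0x6e, 0xe9, 0x65]);

// Decoding as UTF-8: 0xE9 announces a multi-byte sequence, but the next byte
// (0x65, 'e') is not a continuation byte, so the decoder emits U+FFFD instead.
const asUtf8 = win1252Bytes.toString('utf8');

// Decoding byte-for-byte: 0xE9 becomes U+00E9, which is é.
const asLatin1 = win1252Bytes.toString('latin1');

console.log(asUtf8);   // Tourn\uFFFDe (shown as Tourn�e)
console.log(asLatin1); // Tournée
```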

Solution

Add a charset option to ConnectOptions that:

  1. Sets the DPB lc_ctype to the specified charset instead of hardcoded utf8
  2. Propagates the charset to the data reader/writer so string encoding/decoding matches the connection charset

Usage

const attachment = await client.connect('host:database', {
  username: 'SYSDBA',
  password: 'masterkey',
  charset: 'WIN1252',
});

How it works

  • mapCharsetToEncoding() maps Firebird charset names to Node.js BufferEncoding values (utf8 for UTF8, latin1 for all single-byte charsets)
  • latin1 encoding in Node.js provides a 1:1 byte-to-codepoint mapping, which correctly handles any single-byte Firebird charset (WIN1252, ISO8859_1, WIN1250, etc.)
  • The encoding is stored on AbstractAttachment and passed through to createDataReader() and createDataWriter() via StatementImpl.prepare()
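
As a rough illustration of the mapping described above, a helper of this shape would satisfy the stated rules. This is a sketch only: the body, the `NodeEncoding` type, and the case-insensitive comparison are assumptions, not the code from this PR.

```typescript
// Hypothetical stand-in for Node's BufferEncoding, limited to the two values
// the description mentions.
type NodeEncoding = 'utf8' | 'latin1';

// Assumed behavior: UTF8 (or no charset) maps to 'utf8'; every other charset
// name is treated as single-byte and maps to 'latin1'.
function mapCharsetToEncoding(charset?: string): NodeEncoding {
  const name = (charset ?? 'UTF8').trim().toUpperCase();
  return name === 'UTF8' ? 'utf8' : 'latin1';
}

console.log(mapCharsetToEncoding());          // utf8
console.log(mapCharsetToEncoding('WIN1252')); // latin1
```

Note that under this scheme a multi-byte charset other than UTF8 would also fall through to `latin1`, so a real helper would likely need an explicit list of supported names.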

Changes

  • node-firebird-driver:

    • ConnectOptions: add optional charset property
    • createDpb(): use options.charset instead of hardcoded 'utf8'
    • mapCharsetToEncoding(): new helper to map Firebird charset → Node.js encoding
    • AbstractAttachment: add encoding property
    • createDataReader() / createDataWriter(): accept encoding parameter
  • node-firebird-driver-native:

    • AttachmentImpl.connect(): set encoding from mapCharsetToEncoding(options.charset)
    • StatementImpl.prepare(): pass attachment.encoding to reader/writer

Backward compatible

When charset is not specified, the behavior is identical to before (defaults to utf8).

Add charset option to ConnectOptions allowing users to specify the
connection character set used in the DPB (lc_ctype parameter).
The charset is also propagated to the data reader and writer so that
string encoding/decoding matches the connection charset.

This is essential for legacy Firebird databases (commonly created with
Delphi/IBX) where columns use charset NONE. In these databases, text is
stored as raw bytes in the application's encoding (typically WIN1252)
without any charset metadata on the columns.

With the current hardcoded 'utf8' charset, the driver tells Firebird to
communicate in UTF-8, but Firebird does not transliterate charset NONE
columns. The raw WIN1252 bytes are then incorrectly decoded as UTF-8,
corrupting accented characters (e.g. 'Tournée' becomes 'Tourn�e').

By setting charset: 'WIN1252' in ConnectOptions, Firebird sends the
correct bytes and the driver decodes them using the matching Node.js
encoding (latin1, which provides 1:1 byte-to-codepoint mapping for
single-byte charsets).

Changes:
- ConnectOptions: add optional charset property
- createDpb(): use options.charset instead of hardcoded 'utf8'
- mapCharsetToEncoding(): map Firebird charset names to Node.js encodings
- AbstractAttachment: store encoding from connection charset
- createDataReader(): accept encoding parameter for string decoding
- createDataWriter(): accept encoding parameter for string encoding
- AttachmentImpl: set encoding on connect using mapCharsetToEncoding()
- StatementImpl: pass attachment.encoding to reader/writer

Backward compatible: defaults to 'utf8' when charset is not specified.
@asfernandes
Owner

Aren't Node.js strings assumed to be UTF-8?
How would it work with strings that are just bytes?

@reinhardt1053
Author

Aren't Node.js strings assumed to be UTF-8?

JavaScript strings are Unicode internally, but the key issue is how raw bytes from the wire are decoded into JS strings and how JS strings are encoded back to bytes when writing. At the moment, with charset NONE columns, Firebird sends raw bytes without transliteration. The byte 0xE9 (which is é in WIN1252) is not valid as a single-byte UTF-8 sequence, so StringDecoder('utf8') replaces it with �.

How would it work with strings that are just bytes?

With NONE columns the data is indeed just bytes, and the driver can't know the encoding. That's why it's left to the user to specify it via the charset option: the user knows what encoding their application uses (e.g. Delphi apps typically use WIN1252). The driver then uses the corresponding Node.js encoding (latin1) to decode and encode the strings.
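
The byte-preserving property this relies on can be checked directly. The sketch below uses only `Buffer` and is not driver code: it round-trips all 256 byte values through a JS string with `latin1` and, for contrast, with `utf8`.

```typescript
import { Buffer } from 'node:buffer';

// Every possible byte value, 0x00 through 0xFF.
const allBytes = Buffer.from(Array.from({ length: 256 }, (_, i) => i));

// latin1 maps each byte to the code point with the same number and back again,
// so the round trip returns the original bytes.
const viaLatin1 = Buffer.from(allBytes.toString('latin1'), 'latin1');
const latin1Lossless = viaLatin1.equals(allBytes);

// utf8 replaces invalid sequences with U+FFFD while decoding, so the original
// bytes cannot be recovered afterwards.
const viaUtf8 = Buffer.from(allBytes.toString('utf8'), 'utf8');
const utf8Lossless = viaUtf8.equals(allBytes);

console.log(latin1Lossless); // true
console.log(utf8Lossless);   // false
```

This shows only that bytes survive the round trip; whether each byte appears in the JS string as the character the application intended is a separate question.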

@asfernandes
Owner

Usage of latin1 is wrong there.
Looks like TextDecoder would be the correct way.
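
One way to see the difference between the two approaches: Node's `latin1` maps each byte to the code point of the same value, whereas the WHATWG Encoding Standard's windows-1252 table maps bytes in 0x80-0x9F to other characters (0x80 is U+20AC, the euro sign). The sketch below assumes a Node.js build with full ICU, which `TextDecoder` needs for the `windows-1252` label; it is an illustration, not a proposed patch.

```typescript
import { Buffer } from 'node:buffer';
import { TextDecoder } from 'node:util';

// In WIN1252, 0x80 is the euro sign and 0xE9 is é.
const bytes = Uint8Array.of(0x80, 0xe9);

// Buffer 'latin1': byte value becomes code point value, so 0x80 turns into
// the control character U+0080 rather than the euro sign.
const viaLatin1 = Buffer.from(bytes).toString('latin1');

// TextDecoder with the charset's own table (requires full ICU).
const viaTextDecoder = new TextDecoder('windows-1252').decode(bytes);

// Print the code points each approach produced for the same two bytes.
const toHex = (s: string) =>
  Array.from(s, (c) => 'U+' + c.codePointAt(0)!.toString(16).toUpperCase().padStart(4, '0'));

console.log(toHex(viaLatin1));
console.log(toHex(viaTextDecoder));
```

Both approaches agree on 0xE9, which is why the examples in the description come out right with `latin1`. `TextDecoder` only covers reading, though: `TextEncoder` always produces UTF-8, so the write path would need a different mechanism.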
