fix non-ASCII filename mangling caused by NBSP in page titles #1930

Open
Self-Perfection wants to merge 1 commit into gildas-lormeau:master from Self-Perfection:fix/nbsp-filename-mangling

@Self-Perfection

Firefox's browser.downloads.download() rejects filenames containing NBSP (U+00A0) and narrow NBSP (U+202F) with "illegal characters", even though both are valid on every modern filesystem and accepted by Chromium (see https://bugzilla.mozilla.org/show_bug.cgi?id=2030811).

The existing catch block in download-util.js had a fallback that replaced every run of non-ASCII characters with the replacement character on any "illegal characters" error, so a single invisible NBSP from typographic markup would silently destroy an entire Cyrillic, CJK, or Arabic filename.
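To make the failure mode concrete, here is a minimal sketch of the kind of fallback described above. The function name `stripNonAsciiRuns` and the `"_"` default are assumptions for illustration only; the actual code in download-util.js may differ in detail.

```javascript
// Hypothetical stand-in for the legacy fallback (not the real
// download-util.js code): every run of non-ASCII characters is
// collapsed into a single replacement character.
function stripNonAsciiRuns(filename, replacementCharacter = "_") {
  return filename.replace(/[^\x00-\x7F]+/g, replacementCharacter);
}

// "Pies w" + NBSP + "łóżku", written with escapes so the NBSP is visible.
const title = "Pies w\u00A0\u0142\u00F3\u017Cku";

// The NBSP and the three Polish letters form one non-ASCII run.
console.log(stripNonAsciiRuns(title)); // "Pies w_ku"
```

Because the NBSP sits directly next to the Polish letters, all four characters are consumed as one run, which matches the "Before" result in the repro.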

Add a targeted retry branch (before the legacy non-ASCII strip) that handles six specific codepoints rejected by browser engines:

  • U+00A0 NBSP, U+202F narrow NBSP → replaced with regular space
  • U+00AD soft hyphen, U+200B ZWSP, U+FEFF BOM, U+2060 word joiner → removed
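The mapping in the list above could be sketched roughly as follows. The function name `normalizeRejectedCodepoints` is hypothetical and not taken from the patch; only the six codepoints and their handling come from the description.

```javascript
// Hypothetical helper illustrating the mapping described above;
// the name is not taken from the actual patch.
function normalizeRejectedCodepoints(filename) {
  return filename
    // U+00A0 NBSP and U+202F narrow NBSP become a regular space.
    .replace(/[\u00A0\u202F]/g, " ")
    // U+00AD soft hyphen, U+200B ZWSP, U+FEFF BOM and U+2060 word joiner
    // are invisible, so they are simply removed.
    .replace(/[\u00AD\u200B\uFEFF\u2060]/g, "");
}

// "Pies w" + NBSP + "łóżku": only the NBSP changes.
console.log(normalizeRejectedCodepoints("Pies w\u00A0\u0142\u00F3\u017Cku"));
```

All other non-ASCII characters pass through untouched, which is the point of handling these codepoints before the broader strip.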

The first two are Gecko-specific rejections; the other four are rejected by both Gecko and Chromium. On Chromium the new branch never fires (no error is thrown), so behavior is unchanged.
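The ordering of the two fallbacks can be illustrated with a small sketch. Everything here is an assumption for illustration: `downloadWithFallbacks` is a hypothetical name, the `download` argument stands in for a wrapper around browser.downloads.download(), and the error check is a simple substring match on "illegal characters". The real catch block may be structured differently.

```javascript
// Sketch of the retry order only; names and structure are hypothetical.
// `download` is any async function taking { filename, ... }.
async function downloadWithFallbacks(download, options) {
  const isIllegalCharsError = (error) =>
    String(error && error.message).includes("illegal characters");
  try {
    // Engines that accept the filename never reach the fallbacks.
    return await download(options);
  } catch (error) {
    if (!isIllegalCharsError(error)) throw error;
    // Step 1: targeted fix for the six known codepoints.
    const targeted = options.filename
      .replace(/[\u00A0\u202F]/g, " ")
      .replace(/[\u00AD\u200B\uFEFF\u2060]/g, "");
    if (targeted !== options.filename) {
      try {
        return await download({ ...options, filename: targeted });
      } catch (retryError) {
        if (!isIllegalCharsError(retryError)) throw retryError;
      }
    }
    // Step 2: legacy last resort, strip every non-ASCII run.
    const stripped = options.filename.replace(/[^\x00-\x7F]+/g, "_");
    return download({ ...options, filename: stripped });
  }
}
```

Because the targeted branch only runs inside the error handler, an engine that accepts the original filename on the first attempt sees no change in behavior.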

Minimal repro (save via SingleFile on Firefox):

<title>Pies w&nbsp;łóżku</title>

Before: "Pies w_ku (...).html" (ł ó ż eaten by non-ASCII strip)
After: "Pies w łóżku (...).html" (NBSP → space, Polish letters intact)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>