feat(wordwrap): support CJK line-breaking rules #86

Open
Chronostasys wants to merge 1 commit into muesli:master from Chronostasys:fix/cjk-word-break

@Chronostasys

Problem

CJK (Chinese, Japanese, Korean) text wrapping is broken. In CJK typography, every character is a valid line-break point — unlike Latin scripts where only spaces and explicit breakpoints allow wrapping. The current implementation treats CJK+Latin sequences without spaces as a single word, causing:

  1. Entire mixed-language segments wrap as one unit, wasting half the available line width
  2. Text like "manual(手动触发),很可能没跑。" at limit=12 renders as a single long line that overflows, instead of breaking at CJK character boundaries

Before (limit=12)

manual(手动触发),很可能没跑。

The entire string is one "word" (no spaces between CJK chars) → never breaks.
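
A minimal sketch of why this happens, assuming word detection that splits only on whitespace. `strings.Fields` stands in for the space-based logic here; it is not the actual `Write()` code, and `spaceWords` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// spaceWords splits s on whitespace only, which is how a wrapper that is
// unaware of CJK break rules sees the input. (Hypothetical helper.)
func spaceWords(s string) []string {
	return strings.Fields(s)
}

func main() {
	words := spaceWords("manual(手动触发),很可能没跑。")
	// The string contains no whitespace, so it comes back as a single word.
	fmt.Println(len(words)) // prints 1
}
```

With only one "word" to place, a wrapper that never breaks inside words has nowhere to insert a newline.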

After (limit=12)

manual(手动
触发),很可
能没跑。

Each CJK character is a break point. CJK↔Latin boundaries also break.

Changes

Minimal changes to the Write() method in wordwrap.go:

  • isCJK(r rune) bool: Detects CJK characters by Unicode range (Han, Hiragana, Katakana, Hangul, CJK punctuation, fullwidth forms).
  • CJK characters are flushed immediately as individual words, making each one a valid break point — standard CJK typography rule.
  • CJK↔non-CJK boundaries trigger a word flush, enabling breaks between scripts (e.g., "这是" | "manual" | "触发").
  • Non-CJK behavior is unchanged; all existing wordwrap tests pass.

Test Cases

Added TestWordWrapCJK with 11 cases covering:

  • Pure CJK text (each char is a break point)
  • CJK mixed with Latin (boundary detection)
  • CJK punctuation (fullwidth forms)
  • Limit=0 passthrough (no wrap)
  • Latin-only (unchanged behavior)

=== RUN   TestWordWrapCJK
--- PASS: TestWordWrapCJK (0.00s)
=== RUN   TestWordWrapCJKNoWrap
--- PASS: TestWordWrapCJKNoWrap (0.00s)
=== RUN   TestWordWrapCJKString
--- PASS: TestWordWrapCJKString (0.00s)

All existing tests also pass (except a pre-existing failure in truncate unrelated to this change).

In CJK (Chinese, Japanese, Korean) typography, each character is a
valid line-break point — unlike Latin scripts where only spaces and
explicit breakpoints allow wrapping. The original implementation treats
CJK+Latin sequences without spaces as a single word, causing entire
mixed-language segments like "manual(手动触发)" to wrap as one unit
and waste half the available line width.

Changes to Write():
- Add isCJK() to detect CJK characters by Unicode range (Han, Hiragana,
  Katakana, Hangul, CJK punctuation, fullwidth forms).
- CJK characters are immediately flushed as individual words, making
  each one a valid break point (standard CJK typography rule).
- CJK↔non-CJK boundaries trigger a word flush, enabling breaks between
  scripts (e.g., "这是" | "manual" | "触发").
- Non-CJK behavior is completely unchanged.

Tests: add TestWordWrapCJK with 11 cases covering pure CJK, CJK+Latin
mix, CJK punctuation, and boundary detection.