Fix normalize() to remove spaces before Thai tone marks and non-base characters (conservative approach) #1222

Copilot · 2026-01-21T19:03:54Z

What does this change

Adds conservative handling for spaces before Thai tone marks and non-base characters (above/below vowels) in text normalization, while avoiding the creation of new words that could distort text meaning.

What was wrong

The normalize() function removed duplicate spaces but didn't handle spaces inserted before non-base characters, a common typo in Thai text:

from pythainlp.util import normalize

# Space between consonant and below vowel + tone mark
normalize('พ ุ่มดอกไม้')  # Returns 'พ ุ่มดอกไม้' (space remains)

Additionally, the initial implementation could unintentionally create new words by merging characters that should remain separated, leading to incorrect tokenization.
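Because the stray space sits before a combining character, it is easy to miss when the text is rendered. The following is only an illustrative sketch, using the standard library and a hypothetical helper named describe(), that lists the code points of the problematic input so the space becomes visible:

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


# "\u0e1e \u0e38\u0e48\u0e21" is the first syllable of the issue example:
# PHO PHAN, a space, SARA U, MAI EK, MO MA
for entry in describe("\u0e1e \u0e38\u0e48\u0e21"):
    print(entry)
# U+0E1E THAI CHARACTER PHO PHAN
# U+0020 SPACE
# U+0E38 THAI CHARACTER SARA U
# U+0E48 THAI CHARACTER MAI EK
# U+0E21 THAI CHARACTER MO MA
```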

How this fixes it

Added the _RE_REMOVE_SPACES_BEFORE_NONBASE regex, a conservative pattern that removes spaces only after consonants that are not preceded by a vowel, using the negative lookbehind (?<![{thai_vowels}][{thai_consonants}]).
Implemented remove_spaces_before_marks(), which applies the conservative pattern to fix obvious typos like "พ ุ่ม" while preserving word boundaries like "ภาพ ุ่".
Enhanced remove_dangling() to also remove dangling non-base characters that appear after a space, covering cases where a space separates a complete word from dangling characters.
Integrated the new step into the normalize() pipeline after remove_dup_spaces() and before remove_repeat_vowels().

# Original issue cases - fixed
normalize('พ ุ่มดอกไม้')  # Now returns 'พุ่มดอกไม้'
normalize('เค้้้าเดินไปสนามหญา้หนา้บา้น')  # Returns 'เค้าเดินไปสนามหญ้าหน้าบ้าน'

# Conservative handling - preserves word boundaries
normalize('มีรูปภาพ ุ่มากในห้อง')  # Returns 'มีรูปภาพ มากในห้อง'
# Tokenization: ['มี', 'รูปภาพ', ' ', 'มาก', 'ใน', 'ห้อง'] ✓
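The conservative space-removal rule described above can be sketched roughly as follows. This is not the code merged in this PR: the character ranges, the pattern name, and the function remove_spaces_before_marks_sketch() are assumptions made for illustration, and PyThaiNLP's own thai_vowels and thai_consonants constants may cover different characters.

```python
import re

# Assumed character classes (illustrative only, not PyThaiNLP's constants)
CONSONANTS = "\u0e01-\u0e2e"                   # KO KAI .. HO NOKHUK
VOWELS = "\u0e30-\u0e39\u0e40-\u0e45"          # SARA A .. SARA UU, SARA E .. LAKKHANGYAO
NONBASE = "\u0e31\u0e34-\u0e3a\u0e47-\u0e4e"   # above/below vowels, tone marks, other marks

SPACES_BEFORE_NONBASE = re.compile(
    rf"(?<![{VOWELS}][{CONSONANTS}])"  # skip when "vowel + consonant" precedes the space
    rf"(?<=[{CONSONANTS}])"            # a consonant must sit directly before the space
    rf" +"                             # the stray space(s) to drop
    rf"(?=[{NONBASE}])"                # a non-base character must follow
)


def remove_spaces_before_marks_sketch(text: str) -> str:
    """Drop spaces between a bare consonant and a following non-base character."""
    return SPACES_BEFORE_NONBASE.sub("", text)


# Obvious typo: the consonant has no vowel before it, so the space is removed.
fixed = remove_spaces_before_marks_sketch("พ ุ่มดอกไม้")

# Likely word boundary: the consonant follows a vowel, so the text is left unchanged.
kept = remove_spaces_before_marks_sketch("มีรูปภาพ ุ่มากในห้อง")
```

Both lookbehinds are fixed-width, which Python's re module requires. The first one is what makes the rule conservative: a consonant that already closes a vowel-bearing syllable is assumed to end a word, so the space after it is treated as intentional.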

Your checklist for this pull request

Passed code styles and structures
Passed code linting checks and unit tests

Original prompt

This section details the original issue you should resolve

<issue_title>Text normalization not working in some cases</issue_title>
<issue_description>Describe the bug
Text normalization not working in some cases such as 'เค้้้าเดินไปสนามหญา้หนา้บา้น' output 'เค้้้าเดินไปสนามหญ้าหน้าบ้าน' and 'พ ุ่มดอกไม้ในสนามหญา้หนา้บา้น' output 'พ ุ่มดอกไม้ในสนามหญ้าหน้าบ้าน'.

To Reproduce
Steps to reproduce the behavior:

from pythainlp.util import normalize

normalize('เค้้้าเดินไปสนามหญา้หนา้บา้น')

normalize('พ ุ่มดอกไม้ในสนามหญา้หนา้บา้น')

Desktop (please complete the following information):
Colab. PyThaiNLP 2.2.0dev0</issue_description>

Comments on the Issue (you are @copilot in this section)
@bact To generalize, the current normalize() cannot handle (some of) these cases: 1. Tone mark repetition 2. Spaces between consonant and tone mark
Correct?</comment_new>
<comment_new>@wannaphong

Correct.</comment_new>
<comment_new>@bact
#389 should fix the (1) case (tone mark repetition)

For (2), the space thing, it is possible to have this kind of text: "มีรูปภาพ ุ่มากในห้อง".
A normalization will create "มีรูปภาพุ่มากในห้อง", which may lead to different word tokens after segmentation. So I'm not sure whether we should do this normalization or not.

Chars like tone marks and above/below vowels are often found at the beginning of a line or sentence because the user typed them accidentally and the text editor or text field does not show them clearly enough. If we try to remove spaces between consonants and these chars, we may mistakenly remove a space that the author intentionally put there.

For case (2), I'm not sure whether we should try to remove the space, because doing so might create a new word,
as in the example "มีรูปภาพ ุ่มากในห้อง" -> "มีรูปภาพุ่มากในห้อง".

There may also be cases where tone marks or above/below vowels were typed at the start of a sentence and went unnoticed. If we normalize, would two sentences that were deliberately separated by a space end up joined together?
@bact Tone mark repetition is now covered.
But the spaces between a consonant and a tone mark are not yet.</comment_new>
<comment_new>@wannaphong
The normalize('เค้้้าเดินไปสนามหญา้หนา้บา้น') case is fixed in PyThaiNLP 2.2.0.</comment_new>
<comment_new>@wannaphong
Google Colab notebook for reproducing the bug: https://colab.research.google.com/drive/1f3urpSHHH_yy5bXYx6s0fQWynGzvqiTz?usp=sharing</comment_new>

Fixes Text normalization not working in some cases #384

github-actions · 2026-01-21T19:06:11Z

Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible.


bact · 2026-01-21T19:17:51Z

@copilot please also consider the case in this thread #384 (comment)

If not handled with care, the normalization may unintentionally create a new word and distort the meaning of the text.

In some cases, the space should be removed.
In some other cases, it could be the dangling characters (like tone marks) that should be removed instead.

coveralls · 2026-01-21T19:18:05Z

coverage: 53.656% (+0.008%) from 53.648%
when pulling 2c317ed on copilot/fix-text-normalization-issue
into b4e9173 on dev.

tests/core/test_util.py

bact · 2026-01-21T22:42:25Z

@copilot

Recheck the remove_spaces_before_marks() function to make sure that it does not introduce new tokens into the text.

Take this text as an example:

"มีรูปภาพ ุ่มากในห้อง"

(1) If we remove spaces before above/below vowels and tone marks, we will have this:

"มีรูปภาพุ่มากในห้อง" (มี,รูป,ภา,พุ่,มาก,ใน,ห้อง)

(2) If we instead remove the dangling characters after the space, we will have this:

"มีรูปภาพ มากในห้อง" (มี,รูปภาพ, ,มาก,ใน,ห้อง)

(1) and (2) give different sets of tokens.
And arguably (2) is likely to be more correct.

Please try to deal with these cases carefully and conservatively.

wannaphong · 2026-01-22T09:38:16Z

@copilot run again

Copilot · 2026-01-22T09:39:46Z

@copilot run again

Could you clarify what you'd like me to run again? The tests are passing and the original issue cases are working correctly. However, if you're referring to addressing @bact's concern about the "มีรูปภาพ ุ่มากในห้อง" case creating new tokens, I can investigate a more conservative approach that checks context before removing spaces.

bact · 2026-01-22T11:10:12Z

@copilot Do these 2 tasks, A and B:

--

Task A:

Recheck the remove_spaces_before_marks() function to make sure that it does not introduce superfluous new tokens that may change the meaning of the text.

Take this text as an example:

"มีรูปภาพ ุ่มากในห้อง"

(1) If we remove spaces before above/below vowels and tone marks, we will have this:

"มีรูปภาพุ่มากในห้อง" (มี,รูป,ภา,พุ่,มาก,ใน,ห้อง)

(2) If we instead remove the dangling characters after the space, we will have this:

"มีรูปภาพ มากในห้อง" (มี,รูปภาพ, ,มาก,ใน,ห้อง)

(1) and (2) give different sets of tokens.
And arguably (2) is likely to be more correct.

Please try to deal with these cases carefully and more conservatively.

--

Task B:

For any changes introduced in this PR, check for unused imports and unused variables.

- Update remove_spaces_before_marks() to only remove space after consonants not preceded by vowels
- Add removal of dangling chars after spaces in remove_dangling()
- Remove unused import remove_spaces_before_marks from tests
- Fix import ordering per ruff recommendations

Co-authored-by: bact <128572+bact@users.noreply.github.com>

Copilot · 2026-01-22T11:16:41Z

@copilot Do these 2 tasks, A and B:

--
...

Completed both tasks in commit 2c317ed:

Task A - Conservative normalization:

Modified remove_spaces_before_marks() to only remove spaces after consonants not preceded by vowels (e.g., fixes "พ ุ่ม" but preserves "ภาพ ุ่")
Enhanced remove_dangling() to also remove dangling characters after spaces
This handles the "มีรูปภาพ ุ่มากในห้อง" case correctly: it now outputs "มีรูปภาพ มากในห้อง" (tokens: มี,รูปภาพ, ,มาก,ใน,ห้อง) instead of creating "มีรูปภาพุ่มากในห้อง"

Task B - Unused imports:

Removed unused import remove_spaces_before_marks from test file
Fixed import ordering per ruff recommendations

All tests pass including the original issue cases and the conservative tokenization case.
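As a rough illustration of the dangling-character step mentioned under Task A, the sketch below strips non-base characters that have no base consonant because they sit at the start of the text or directly after a space. The character range, the pattern name, and remove_dangling_sketch() are assumptions for illustration and are not taken from the merged code.

```python
import re

# Assumed set of Thai non-base characters: above/below vowels, tone marks, other marks
NONBASE = "\u0e31\u0e34-\u0e3a\u0e47-\u0e4e"

# Non-base characters at the start of the text or right after a space
# have nothing to attach to, so they are treated as dangling.
DANGLING_AFTER_SPACE = re.compile(rf"(?:^|(?<= ))[{NONBASE}]+")


def remove_dangling_sketch(text: str) -> str:
    """Remove non-base characters that follow a space or start the text."""
    return DANGLING_AFTER_SPACE.sub("", text)


# The space is kept and only the orphaned vowel and tone mark are dropped,
# so the word before the space is not merged with what follows.
cleaned = remove_dangling_sketch("มีรูปภาพ ุ่มากในห้อง")
```

Order matters in a design like this: if the dangling step ran before the conservative space removal, an obvious typo such as "พ ุ่ม" would lose its vowel and tone mark instead of being rejoined.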

sonarqubecloud · 2026-01-22T11:17:13Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Initial plan

12785dc

Copilot AI assigned Copilot and wannaphong Jan 21, 2026

Copilot started work on behalf of wannaphong January 21, 2026 19:04 View session

Fix text normalization to remove spaces before tone marks and non-bas…

a0fd76b

…e characters Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix text normalization not working in some cases~~ Fix normalize() to remove spaces before Thai tone marks and non-base characters Jan 21, 2026

Copilot AI requested a review from wannaphong January 21, 2026 19:13

Copilot finished work on behalf of wannaphong January 21, 2026 19:13

bact approved these changes Jan 21, 2026

View reviewed changes

Copilot started work on behalf of bact January 21, 2026 19:18 View session

bact requested changes Jan 21, 2026

View reviewed changes

tests/core/test_util.py Outdated Show resolved Hide resolved

Copilot AI requested a review from bact January 21, 2026 19:21

Copilot started work on behalf of bact January 21, 2026 19:25 View session

Copilot AI requested a review from bact January 21, 2026 19:27

Copilot started work on behalf of bact January 21, 2026 21:33 View session

bact removed their request for review January 21, 2026 22:29

Copilot started work on behalf of bact January 21, 2026 22:42 View session

Copilot started work on behalf of wannaphong January 22, 2026 09:38 View session

Copilot finished work on behalf of wannaphong January 22, 2026 09:40

Copilot started work on behalf of bact January 22, 2026 11:10 View session

Copilot AI changed the title ~~Fix normalize() to remove spaces before Thai tone marks and non-base characters~~ Fix normalize() to remove spaces before Thai tone marks and non-base characters (conservative approach) Jan 22, 2026

Copilot finished work on behalf of bact January 22, 2026 11:21

bact approved these changes Jan 22, 2026

View reviewed changes

bact marked this pull request as ready for review January 22, 2026 16:45

wannaphong approved these changes Jan 22, 2026

View reviewed changes

bact merged commit c441c66 into dev Jan 22, 2026
32 checks passed

bact deleted the copilot/fix-text-normalization-issue branch January 22, 2026 23:42
