Contributor

Copilot AI commented Jan 21, 2026

What does this change

Adds conservative handling for spaces before Thai tone marks and non-base characters (above/below vowels) in text normalization, while avoiding the creation of new words that could distort text meaning.

What was wrong

The normalize() function removed duplicate spaces but didn't handle spaces inserted before non-base characters, a common typo in Thai text:

from pythainlp.util import normalize

# Space between consonant and below vowel + tone mark
normalize('พ ุ่มดอกไม้')  # Returns 'พ ุ่มดอกไม้' (space remains)

Additionally, the initial implementation could unintentionally create new words by merging characters that should remain separated, leading to incorrect tokenization.

How this fixes it

  • Added the _RE_REMOVE_SPACES_BEFORE_NONBASE regex, a conservative pattern that only removes spaces after consonants not preceded by vowels (using the negative lookbehind (?<![{thai_vowels}][{thai_consonants}]))
  • Implemented remove_spaces_before_marks() applying the conservative pattern to fix obvious typos like "พ ุ่ม" while preserving word boundaries like "ภาพ ุ่"
  • Enhanced remove_dangling() to also remove dangling non-base characters that appear after spaces, handling cases where spaces separate complete words from dangling characters
  • Integrated into normalize() pipeline after remove_dup_spaces(), before remove_repeat_vowels()
# Original issue cases - fixed
normalize('พ ุ่มดอกไม้')  # Now returns 'พุ่มดอกไม้'
normalize('เค้้้าเดินไปสนามหญา้หนา้บา้น')  # Returns 'เค้าเดินไปสนามหญ้าหน้าบ้าน'

# Conservative handling - preserves word boundaries
normalize('มีรูปภาพ ุ่มากในห้อง')  # Returns 'มีรูปภาพ มากในห้อง'
# Tokenization: ['มี', 'รูปภาพ', ' ', 'มาก', 'ใน', 'ห้อง'] ✓
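
To illustrate the conservative rule described above, here is a minimal sketch. It is not the code merged in this PR: the character classes, the pattern, and the name remove_spaces_before_marks_sketch are assumptions made for illustration, and the lookbehind is simplified to check only the single character before the consonant.

```python
import re

# Assumed character classes, written as regex ranges (not copied from PyThaiNLP).
CONSONANTS = "\u0e01-\u0e2e"                # KO KAI .. HO NOKHUK (also covers RU and LU)
VOWELS = "\u0e30-\u0e39\u0e40-\u0e45\u0e47"  # spacing and above/below vowels
NONBASE = "\u0e31\u0e34-\u0e3a\u0e47-\u0e4e"  # above/below vowels, tone marks, other marks

# A consonant that is NOT preceded by a vowel, then one or more spaces,
# then a non-base character. Only the spaces are dropped.
_SPACE_BEFORE_NONBASE = re.compile(
    rf"(?<![{VOWELS}])([{CONSONANTS}]) +(?=[{NONBASE}])"
)


def remove_spaces_before_marks_sketch(text: str) -> str:
    """Hypothetical helper: join a lone consonant with the mark that follows it."""
    return _SPACE_BEFORE_NONBASE.sub(r"\1", text)
```

Under these assumptions, a consonant at the start of a word (as in "พ ุ่ม") is joined with its mark, while a consonant that closes a syllable after a vowel (as in "ภาพ ุ่") keeps the space, so the word boundary survives for a later step to handle.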

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit test
Original prompt

This section details the original issue you should resolve

<issue_title>Text normalization not working in some cases</issue_title>
<issue_description>Describe the bug
Text normalization does not work in some cases. For example, 'เค้้้าเดินไปสนามหญา้หนา้บา้น' outputs 'เค้้้าเดินไปสนามหญ้าหน้าบ้าน', and 'พ ุ่มดอกไม้ในสนามหญา้หนา้บา้น' outputs 'พ ุ่มดอกไม้ในสนามหญ้าหน้าบ้าน'.

To Reproduce
Steps to reproduce the behavior:

  1. from pythainlp.util import normalize
  2. normalize('เค้้้าเดินไปสนามหญา้หนา้บา้น')
  3. normalize('พ ุ่มดอกไม้ในสนามหญา้หนา้บา้น')

Desktop (please complete the following information):
Colab. PyThaiNLP 2.2.0dev0</issue_description>
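
The first reported case combines two symptoms: a tone mark typed several times in a row, and a tone mark typed after SARA AA instead of before it. The sketch below shows one way such input could be repaired. It is an illustration only, not PyThaiNLP's implementation; the function name fix_tone_marks_sketch and the choice to cover only the four common tone marks are assumptions.

```python
import re

# Assumed scope: the four common tone marks, MAI EK .. MAI CHATTAWA.
TONE_MARKS = "\u0e48-\u0e4b"
SARA_AA = "\u0e32"

# The same tone mark repeated two or more times.
_REPEATED_TONE = re.compile(rf"([{TONE_MARKS}])\1+")
# A tone mark typed after SARA AA; it belongs before the vowel.
_TONE_AFTER_SARA_AA = re.compile(rf"({SARA_AA})([{TONE_MARKS}])")


def fix_tone_marks_sketch(text: str) -> str:
    """Hypothetical helper: collapse repeated tone marks, then reorder them."""
    text = _REPEATED_TONE.sub(r"\1", text)
    return _TONE_AFTER_SARA_AA.sub(r"\2\1", text)
```

Collapsing runs first keeps the reordering rule simple, because at most one tone mark can then follow the vowel.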

Comments on the Issue (you are @copilot in this section)

@bact
To generalize, the current normalize() cannot handle (some of) these cases:
1. Tone mark repetition
2. Spaces between consonant and tone mark

Correct?</comment_new>
<comment_new>@wannaphong

Correct.</comment_new>
<comment_new>@bact
#389 should fix the (1) case (tone mark repetition)

For (2), the space thing, it is possible to have this kind of text: "มีรูปภาพ ุ่มากในห้อง".
A normalization would create "มีรูปภาพุ่มากในห้อง", which may lead to different word tokens after segmentation. So I'm not sure whether we should do this normalization or not.

Characters like tone marks and above/below vowels are often found at the beginning of a line or sentence, because the user typed them accidentally and the text editor or text field doesn't show them clearly enough. If we try to remove spaces between consonants and these characters, we may mistakenly remove a space that the author intentionally put there.

For case (2), I'm not sure whether we should try to remove the space, because doing so might create a new word.
As in the example: "มีรูปภาพ ุ่มากในห้อง" -> "มีรูปภาพุ่มากในห้อง"

Or there may be cases where tone marks or above/below vowels were typed at the start of a sentence and went unnoticed. If we normalize, would that join two sentences that were deliberately separated by a space?

@bact Tone mark repetition is now covered.

But spaces between a consonant and a tone mark are not yet covered.</comment_new>
<comment_new>@wannaphong
The case normalize('เค้้้าเดินไปสนามหญา้หนา้บา้น') is fixed in PyThaiNLP 2.2.0.</comment_new>
<comment_new>@wannaphong
Google Colab notebook for testing the bug: https://colab.research.google.com/drive/1f3urpSHHH_yy5bXYx6s0fQWynGzvqiTz?usp=sharing</comment_new>



@github-actions

Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible.

…e characters

Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix text normalization not working in some cases" to "Fix normalize() to remove spaces before Thai tone marks and non-base characters" Jan 21, 2026
Copilot AI requested a review from wannaphong January 21, 2026 19:13
@bact
Member

bact commented Jan 21, 2026

@copilot please also consider the case in this thread #384 (comment)

If not handled with care, the normalization may unintentionally create a new word and distort the meaning of the text.

In some cases, the space should be removed.
In some other cases, it could be the dangling characters (like tone marks) that should be removed instead.

@coveralls

coveralls commented Jan 21, 2026

Coverage Status

coverage: 53.656% (+0.008%) from 53.648%
when pulling 2c317ed on copilot/fix-text-normalization-issue
into b4e9173 on dev.

@bact
Member

bact commented Jan 21, 2026

@copilot

Recheck the remove_spaces_before_marks() function to make sure that it does not introduce new tokens in the text.

Take this text as an example:

"มีรูปภาพ ุ่มากในห้อง"

(1) If we remove spaces before above/below vowels and tone marks, we will have this:

"มีรูปภาพุ่มากในห้อง" (มี,รูป,ภา,พุ่,มาก,ใน,ห้อง)

(2) If we instead remove the dangling characters after the space, we will have this:

"มีรูปภาพ มากในห้อง" (มี,รูปภาพ, ,มาก,ใน,ห้อง)

(1) and (2) give different sets of tokens.
And arguably (2) is likely to be more correct.

Please try to deal with these cases carefully and conservatively.
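
Option (2) above can be sketched as follows. This is a hedged illustration, not the PR's remove_dangling() code: the NONBASE class and the name remove_dangling_after_space_sketch are assumptions, and the sketch treats any non-base character at the start of the text or directly after whitespace as dangling.

```python
import re

# Assumed set of non-base characters: above/below vowels, tone marks, other marks.
NONBASE = "\u0e31\u0e34-\u0e3a\u0e47-\u0e4e"

# A run of non-base characters with no base character to attach to:
# either at the very start of the text or directly after whitespace.
_DANGLING = re.compile(rf"(?:^|(?<=\s))[{NONBASE}]+")


def remove_dangling_after_space_sketch(text: str) -> str:
    """Hypothetical helper: drop dangling marks and keep the space itself."""
    return _DANGLING.sub("", text)
```

With this rule, "มีรูปภาพ ุ่มากในห้อง" keeps its space and loses only the stray marks. Note that applied on its own to "พ ุ่ม" it would also delete the marks, so a step that joins obvious typos would need to run before it.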

@wannaphong
Member

@copilot run again

Contributor Author

Copilot AI commented Jan 22, 2026

@copilot run again

Could you clarify what you'd like me to run again? The tests are passing and the original issue cases are working correctly. However, if you're referring to addressing @bact's concern about the "มีรูปภาพ ุ่มากในห้อง" case creating new tokens, I can investigate a more conservative approach that checks context before removing spaces.

@bact
Member

bact commented Jan 22, 2026

@copilot Do these 2 tasks, A and B:

--

Task A:

Recheck the remove_spaces_before_marks() function to make sure that it does not introduce superfluous new tokens that may change the meaning of the text.

Take this text as an example:

"มีรูปภาพ ุ่มากในห้อง"

(1) If we remove spaces before above/below vowels and tone marks, we will have this:

"มีรูปภาพุ่มากในห้อง" (มี,รูป,ภา,พุ่,มาก,ใน,ห้อง)

(2) If we instead remove the dangling characters after the space, we will have this:

"มีรูปภาพ มากในห้อง" (มี,รูปภาพ, ,มาก,ใน,ห้อง)

(1) and (2) give different sets of tokens.
And arguably (2) is likely to be more correct.

Please try to deal with these cases carefully and more conservatively.

--

Task B:

For any changes introduced in this PR, check for unused imports and unused variables.

- Update remove_spaces_before_marks() to only remove space after consonants not preceded by vowels
- Add removal of dangling chars after spaces in remove_dangling()
- Remove unused import remove_spaces_before_marks from tests
- Fix import ordering per ruff recommendations

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 22, 2026

@copilot Do these 2 tasks, A and B:

--
...

Completed both tasks in commit 2c317ed:

Task A - Conservative normalization:

  • Modified remove_spaces_before_marks() to only remove spaces after consonants not preceded by vowels (e.g., fixes "พ ุ่ม" but preserves "ภาพ ุ่")
  • Enhanced remove_dangling() to also remove dangling characters after spaces
  • This handles the "มีรูปภาพ ุ่มากในห้อง" case correctly: it now outputs "มีรูปภาพ มากในห้อง" (tokens: มี,รูปภาพ, ,มาก,ใน,ห้อง) instead of creating "มีรูปภาพุ่มากในห้อง"

Task B - Unused imports:

  • Removed unused import remove_spaces_before_marks from test file
  • Fixed import ordering per ruff recommendations

All tests pass including the original issue cases and the conservative tokenization case.

@sonarqubecloud

Copilot AI changed the title from "Fix normalize() to remove spaces before Thai tone marks and non-base characters" to "Fix normalize() to remove spaces before Thai tone marks and non-base characters (conservative approach)" Jan 22, 2026
@bact bact marked this pull request as ready for review January 22, 2026 16:45
@bact bact merged commit c441c66 into dev Jan 22, 2026
32 checks passed
@bact bact deleted the copilot/fix-text-normalization-issue branch January 22, 2026 23:42

Development

Successfully merging this pull request may close these issues.

Text normalization not working in some cases

4 participants