-
Notifications
You must be signed in to change notification settings - Fork 288
Fix normalize() to remove spaces before Thai tone marks and non-base characters (conservative approach) #1222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hello @Copilot, thank you for submitting a PR! We will respond as soon as possible. สวัสดี @Copilot ขอบคุณที่ส่ง PR เข้ามา เราจะตอบกลับให้เร็วที่สุดเท่าที่จะทำได้ |
…e characters Co-authored-by: wannaphong <8536487+wannaphong@users.noreply.github.com>
|
@copilot please also consider the case in this thread #384 (comment) If not handled with care, the normalization may create a new word unintentionally and distorted the meaning of the text. In some cases, the space should be removed. |
|
Recheck the remove_space_before_marks() function to make sure that it does not introduce new token in the text . Take this text as an example: "มีรูปภาพ ุ่มากในห้อง" (1) If we remove spaces before above/below vowels and tone marks, we will have this: "มีรูปภาพุ่มากในห้อง" (มี,รูป,ภา,พุ่,มาก,ใน,ห้อง) (2) If we instead remove the dangling characters after the space, we will have this: "มีรูปภาพ มากในห้อง" (มี,รูปภาพ, ,มาก,ใน,ห้อง) (1) and (2) give different sets of tokens. Please try to deal with these cases carefully and conservatively. |
|
@copilot run again |
Could you clarify what you'd like me to run again? The tests are passing and the original issue cases are working correctly. However, if you're referring to addressing @bact's concern about the "มีรูปภาพ ุ่มากในห้อง" case creating new tokens, I can investigate a more conservative approach that checks context before removing spaces. |
|
@copilot Do these 2 tasks, A and B: -- Task A: Recheck the remove_space_before_marks() function to make sure that it does not introduce superfluous new token that may change the meaning of the text. Take this text as an example: "มีรูปภาพ ุ่มากในห้อง" (1) If we remove spaces before above/below vowels and tone marks, we will have this: "มีรูปภาพุ่มากในห้อง" (มี,รูป,ภา,พุ่,มาก,ใน,ห้อง) (2) If we instead remove the dangling characters after the space, we will have this: "มีรูปภาพ มากในห้อง" (มี,รูปภาพ, ,มาก,ใน,ห้อง) (1) and (2) give different sets of tokens. Please try to deal with these cases carefully and more conservatively. -- Task B: For any changes introduced in this PR, check for unused imports and unused variables. |
- Update remove_spaces_before_marks() to only remove space after consonants not preceded by vowels - Add removal of dangling chars after spaces in remove_dangling() - Remove unused import remove_spaces_before_marks from tests - Fix import ordering per ruff recommendations Co-authored-by: bact <128572+bact@users.noreply.github.com>
Completed both tasks in commit 2c317ed: Task A - Conservative normalization:
Task B - Unused imports:
All tests pass including the original issue cases and the conservative tokenization case. |
|



What does this changes
Adds conservative handling for spaces before Thai tone marks and non-base characters (above/below vowels) in text normalization, while avoiding the creation of new words that could distort text meaning.
What was wrong
The
normalize()function removed duplicate spaces but didn't handle spaces inserted before non-base characters, a common typo in Thai text:Additionally, the initial implementation could unintentionally create new words by merging characters that should remain separated, leading to incorrect tokenization.
How this fixes it
_RE_REMOVE_SPACES_BEFORE_NONBASEregex with conservative pattern that only removes spaces after consonants not preceded by vowels (using negative lookbehind(?<![{thai_vowels}][{thai_consonants}]))remove_spaces_before_marks()applying the conservative pattern to fix obvious typos like "พ ุ่ม" while preserving word boundaries like "ภาพ ุ่"remove_dangling()to also remove dangling non-base characters that appear after spaces, handling cases where spaces separate complete words from dangling charactersnormalize()pipeline afterremove_dup_spaces(), beforeremove_repeat_vowels()Your checklist for this pull request
Original prompt
✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.