Improve tofu handling and the maintenance and checking scripts #1285

Open
danny0838 wants to merge 30 commits into BYVoid:master from danny0838:opt-maint-2

@danny0838 danny0838 commented Jun 6, 2026

Contributor

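Improve the Python maintenance scripts

Similar to #1264. The main changes:

  • Optimize the algorithms and code.
  • Handle internal data as a dict, automatically detecting duplicate values and reporting them as errors.
  • Support basic add and remove operations on the data.
  • Rename the test script to test_common.py, to follow Python conventions and so the tests can be run with something like python -m unittest discover ....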
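
As a rough illustration of the dict-based approach (not the code in this PR), the sketch below loads tab-separated `key<TAB>value1 value2` lines into a dict and raises on a repeated key or a repeated value within an entry. The names `load_table` and `DuplicateError`, and the treatment of `#` lines as comments, are assumptions made for this example.

```python
class DuplicateError(ValueError):
    """Raised when a key, or a value within one entry, is repeated."""


def load_table(lines):
    """Parse 'key<TAB>value1 value2 ...' lines into a dict of key -> list of values."""
    table = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        key, _, raw_values = line.partition("\t")
        values = raw_values.split(" ")
        if key in table:
            raise DuplicateError(f"line {lineno}: duplicate key {key!r}")
        if len(set(values)) != len(values):
            raise DuplicateError(f"line {lineno}: duplicate value for {key!r}")
        table[key] = values
    return table


table = load_table(["后\t后 後", "发\t發 髮"])
print(table["发"])
```

Because a dict lookup is constant time, duplicate keys are caught while loading rather than by a separate pass over the file.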

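Improve tofu handling

The @tofu-risk annotation is no longer used in dictionary files. Instead, scheme/AllowedSmpChars.txt defines every character that may render as tofu together with its replacement character, and a script automatically generates STCharactersBase.txt, which uses the replacement characters, and STCharactersExt.txt, which uses the tofu-prone characters.

To accommodate related phrases that may be added in the future, STPhrasesBase.txt and STPhrasesExt.txt are generated the same way.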
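
One way to picture the Base/Ext generation is sketched below. This is a guess at the idea, not the PR's script: the function name `split_base_ext`, the in-memory shapes, and the rule that Ext holds only the entries that differ from Base are all assumptions, and the sample characters are placeholders rather than real dictionary data.

```python
def split_base_ext(entries, replacements):
    """Split entries into a tofu-free base table and an extension table.

    entries: dict mapping a key to a list of values.
    replacements: dict mapping a tofu-prone character to its fallback character.
    """
    trans = str.maketrans(replacements)
    base, ext = {}, {}
    for key, values in entries.items():
        # Replace tofu-prone characters, dropping duplicates while keeping order.
        safe = list(dict.fromkeys(v.translate(trans) for v in values))
        base[key] = safe
        if safe != values:
            # Keep the original values only where they differ from the base.
            ext[key] = values
    return base, ext


# Placeholder data: U+20000 stands in for a tofu-prone character.
entries = {"甲": ["\U00020000"], "丙": ["丁"]}
base, ext = split_base_ext(entries, {"\U00020000": "乙"})
print(base)
print(sorted(ext))
```

Under this assumed design, a system with limited font coverage can load only the base table, while one with full coverage can layer the extension table on top.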

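Improve the dictionary checking scripts

  • Drop the original dictionary/*.cpp and use scripts/test_dictionary.py to check that the dictionaries conform to the specification.
  • Use Python's high-level language features to reuse existing logic and simplify the code, which shrinks from about 1000 lines to about 300 and reads more directly.
  • The checks can be run directly with Python or through bazel test.
  • Check dictionary sort order directly with Python and bazel test, removing the separate GitHub workflow.
  • The list of allowed SMP characters is now defined in scheme/AllowedSmpChars.txt, and only the output side is checked. In addition, when an input SMP character may render as tofu and has a replacement, a corresponding entry using the replacement must be added.
  • Remove the exemption of {TS,ST}Characters.txt from the SMP output check, so that font compatibility is checked uniformly through scheme/AllowedSmpChars.txt.
  • Move the CJK compatibility character substitution table to scheme/CJKCompatibilityIdeographs.txt, simplify its format, and check only the output side.
  • The phrase_character_dependency test now also checks for mismatched input and output lengths.
  • Fix a bug in the check for conflicts between STPhrases and regional phrases, and add the missing phrases.
  • Make error messages easier to understand; each test now reports all offending files and line numbers in one run.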
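
Two of the checks above, the output-side character check and the length check, can be sketched as follows. This is an illustration under assumptions, not the contents of test_dictionary.py: `find_problems` is a hypothetical name, and "SMP" is read loosely here as any code point above U+FFFF.

```python
def find_problems(entries, allowed_chars):
    """Collect every problem instead of stopping at the first one.

    entries: dict mapping a key to a list of output values.
    allowed_chars: set of supplementary-plane characters permitted in values.
    """
    problems = []
    for key, values in entries.items():
        for value in values:
            # Only the output side (values) is checked for supplementary characters.
            bad = [ch for ch in value if ord(ch) > 0xFFFF and ch not in allowed_chars]
            if bad:
                problems.append((key, value, "disallowed char(s): " + "".join(bad)))
            # A character-by-character conversion should preserve length.
            if len(value) != len(key):
                problems.append((key, value, "input/output length mismatch"))
    return problems


entries = {"ab": ["cd", "c"], "x": ["\U00020000"]}
for key, value, reason in find_problems(entries, allowed_chars=set()):
    print(f"{key!r} -> {value!r}: {reason}")
```

Returning a list rather than raising immediately matches the stated goal of reporting all offending entries in a single run.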

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d9a7c434e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread package.json

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2322f8b6b6

Comment thread data/scripts/tofu.py Outdated
@danny0838 danny0838 force-pushed the opt-maint-2 branch 3 times, most recently from 7972543 to dded0b5 Compare June 6, 2026 11:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dded0b5623

Comment thread node/configs.gypi
Comment on lines +9 to +13
"../data/config/hk2sp.json",
"../data/config/hk2t.json",
"../data/config/jp2t.json",
"../data/config/s2hk.json",
"../data/config/s2hkp.json",

[P2] Register new configs in the npm CLI

Adding hk2sp.json and s2hkp.json to the packaged assets is not enough for the npm CLI to load them by mode name: node/cli.js::resolveConfigPath only treats entries present in BUILT_IN_CONFIGS as asset-relative, and otherwise resolves -c s2hkp/-c s2hkp.json against the user's current working directory. As a result, these newly documented/packaged configs work through native OpenCC but fail from the npm CLI unless the caller passes an absolute path; add them to the CLI's built-in config list/stem set alongside the existing configs.
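
The lookup behaviour described in this comment can be modelled in a few lines. The sketch is written in Python purely for illustration and is not the code in node/cli.js; the contents of `BUILT_IN_CONFIGS` and the directory paths are assumptions.

```python
import posixpath

# Illustrative subset only; the real list lives in node/cli.js.
BUILT_IN_CONFIGS = {"s2t", "t2s"}


def resolve_config_path(name, assets_dir, cwd):
    """Model of the described lookup: built-in names map to packaged assets,
    anything else is treated as a path relative to the working directory."""
    stem = name[:-len(".json")] if name.endswith(".json") else name
    if stem in BUILT_IN_CONFIGS:
        return posixpath.join(assets_dir, stem + ".json")
    return posixpath.join(cwd, name)


print(resolve_config_path("s2t", "/pkg/config", "/home/user"))
print(resolve_config_path("s2hkp", "/pkg/config", "/home/user"))
```

In this model the second call resolves under the working directory because `s2hkp` is missing from the set, which is the failure mode the comment describes; adding the new stems to the set makes them resolve to the packaged assets.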

Comment on lines +130 to +132
'hk2sp_jieba.json',
's2hk_jieba.json',
's2hkp_jieba.json',

[P2] Register new Jieba configs in the npm CLI

These new plugin configs are copied into the opencc-jieba package, but the npm CLI only recognizes Jieba mode names listed in OPTIONAL_JIEBA_CONFIGS in node/cli.js; otherwise resolveConfigPath turns -c s2hkp_jieba or -c hk2sp_jieba.json into an absolute path under the caller's cwd, which bypasses resolveJiebaConfigPath and fails to load the packaged plugin config. Add the new Jieba config names to that CLI allowlist when packaging them here.

@frankslin frankslin marked this pull request as draft June 6, 2026 12:18
@danny0838 danny0838 force-pushed the opt-maint-2 branch 12 times, most recently from d0ccabc to 151c4b3 Compare June 8, 2026 12:21
@danny0838 danny0838 marked this pull request as ready for review June 8, 2026 12:49

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 151c4b32ce

Comment thread data/scripts/tofu.py Outdated
@danny0838 danny0838 force-pushed the opt-maint-2 branch 2 times, most recently from 18ddad9 to 318b9bd Compare June 8, 2026 22:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 318b9bd5c7

Comment thread data/scripts/test_dictionary.py Outdated
@danny0838 danny0838 force-pushed the opt-maint-2 branch 2 times, most recently from 26dfb45 to 9cb624c Compare June 9, 2026 16:58
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5eab16af2

Comment thread data/config/t2s.json
{ "type": "ocd2", "file": "TSPhrases.ocd2" },
{
"type": "ocd2",
"file": "TSPhrasesExt.ocd2",

[P2] Keep inline config compilation working

These new generated dictionary names are now present in standard configs, but scripts/compile_to_inline_config.py still only knows how to load source .txt files from data/dictionary (plus reversed dictionaries and the old TSCharactersExt special case). With this config, the documented command python3 scripts/compile_to_inline_config.py -c data/config/t2s.json -d data/dictionary ... now fails with Dictionary file not found for: TSPhrasesExt, so inline-config users cannot compile any of the updated *2s configs unless the script is taught to generate/load the Base/Ext dictionaries or these sources are made available.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9da2b67a3d

Comment thread data/config/t2s.json
danny0838 and others added 28 commits June 14, 2026 14:17
- `opencc` raises in such a case.
- Make public core attributes more explicit.
- Prevent an error when dumping a new dict.
- For easier test discovery using something like `python -m unittest discover scripts/`.
- Keys and values are now accessed like a dict.
- Rework comment handling:
  - Comment blocks for an entry are accessed via `entry.block`.
  - Header and footer are now accessed via `Table.header` and `Table.footer`.
- Support auto-deduplication.
- Implement `dump(sort=bool)` and remove `Table.sort()`.
- Define tofu chars in a separate `AllowedSmpChars.txt` scheme file rather than modifying the source dict file.
- Separate tofu-related files into `TSCharacters{Base,Ext}.txt` and `TSPhrases{Base,Ext}.txt`.
- Check for values only.
- Declare allowed chars in `scheme/AllowedSmpChars.txt`.
- Move CJK compatibility table to `scheme/CJKCompatibilityIdeographs.txt` and reformat to be more compact.
- Check only chars in entry values.
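
The `Table` notes in the commit messages above (dict-like access, `Table.header` and `Table.footer`, and `dump(sort=bool)`) suggest an interface roughly like the sketch below. Only those names come from the commit messages; the constructor, the storage layout, and the choice of code point order for sorting are assumptions, and per-entry comment blocks are left out.

```python
class Table(dict):
    """Minimal dict-backed table: key -> list of values, plus header and footer lines."""

    def __init__(self, header=(), footer=()):
        super().__init__()
        self.header = list(header)
        self.footer = list(footer)

    def dump(self, sort=False):
        """Serialize to text; with sort=True, entries are ordered by code point."""
        keys = sorted(self) if sort else list(self)
        lines = list(self.header)
        lines += [f"{key}\t{' '.join(self[key])}" for key in keys]
        lines += self.footer
        return "\n".join(lines) + "\n"


table = Table(header=["# sample header"])
table["b"] = ["y"]
table["a"] = ["x", "z"]
print(table.dump(sort=True), end="")
```

Passing `sort` to `dump` leaves the in-memory insertion order untouched, which is one plausible reason to prefer it over a mutating `Table.sort()`.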
2 participants