Improve cross-language keyword matching#75
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces cross-language support for keyword extraction and search by prioritizing English keywords while maintaining original-language fallbacks. Key changes include updated LLM prompts for English output, a repair mechanism for non-English keywords, and a fallback system that appends tokens from the original question to the search set. The relevance scoring logic was also enhanced to handle mixed-language keyword sets. Feedback suggests refining the token normalization and regex patterns to avoid mangling technical terms like 'C++' or 'C#', which would otherwise be stripped by aggressive punctuation filtering.
| value = strings.TrimFunc(value, func(r rune) bool { | ||
| return unicode.IsPunct(r) || unicode.IsSymbol(r) | ||
| }) |
There was a problem hiding this comment.
The use of unicode.IsPunct and unicode.IsSymbol in strings.TrimFunc is too aggressive for technical search keywords. It will strip meaningful characters from terms like C++ (becoming c) or C# (becoming c). Since this engine is intended to index technical content, these suffixes should be preserved. Consider trimming only a specific set of non-technical punctuation like .,;:!?()[]{}.
| value = strings.TrimFunc(value, func(r rune) bool { | |
| return unicode.IsPunct(r) || unicode.IsSymbol(r) | |
| }) | |
| value = strings.TrimFunc(value, func(r rune) bool { | |
| return strings.ContainsRune(".,;:!?()[]{}", r) | |
| }) |
| return false | ||
| } | ||
|
|
||
| var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#-]{2,}|[^\x00-\x7F]+`) |
There was a problem hiding this comment.
The regex pattern for ASCII tokens is missing the + character, which is common in technical terms like C++. Additionally, since the pattern requires at least 2 characters ({2,}), a term like C++ would be completely ignored by this fallback mechanism if + is not included in the character class.
| var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#-]{2,}|[^\x00-\x7F]+`) | |
| var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#+-]{2,}|[^\x00-\x7F]+`) |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements cross-language search capabilities by prioritizing English keyword extraction while maintaining original-language fallbacks. Key changes include updated LLM prompts for translation, a repair mechanism for non-English keywords, and enhanced relevance scoring for mixed-language sets. Documentation and tests were also updated to reflect these improvements. Feedback was provided regarding the normalization of non-English tokens, specifically suggesting the inclusion of full-width punctuation in the trimming logic to improve search recall.
| value = strings.TrimFunc(value, func(r rune) bool { | ||
| return strings.ContainsRune(".,;:!?()[]{}", r) | ||
| }) |
There was a problem hiding this comment.
The normalizeSearchToken function doesn't trim full-width punctuation like ? from fallback search tokens. This could lead to poor search recall, as a search for "分割區?" will likely not match documents containing just 分割區.
Consider expanding the set of trimmed characters to include common full-width punctuation to make the fallback search terms more robust. Note that this change will require updating the test expectations in TestExtractKeywordsPromptsForEnglishAndAddsOriginalLanguageFallbacks.
| value = strings.TrimFunc(value, func(r rune) bool { | |
| return strings.ContainsRune(".,;:!?()[]{}", r) | |
| }) | |
| value = strings.TrimFunc(value, func(r rune) bool { | |
| return strings.ContainsRune(".,;:!?()[]{},。?!", r) | |
| }) |
Closes #74
Summary
Tests