Improve cross-language keyword matching #75

Merged
DemonGiggle merged 2 commits into main from fix/issue-74-keyword-language-mismatch on May 10, 2026

Conversation

@DemonGiggle
Owner

Closes #74

Summary

  • align quote tag extraction and recall keyword extraction around English keyword terms
  • repair non-English query keyword outputs into English when models do not follow the first prompt
  • append original-language question tokens as fallback search terms
  • adjust relevance scoring for mixed translated/original-language fallback sets
  • document the cross-language keyword behavior
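
The fallback step in the summary (English keywords first, original-language question tokens appended) could look roughly like the sketch below. The helper name `appendOriginalTokens` and its de-duplication rule are assumptions for illustration; only the token pattern is taken from the review thread in this PR, and the real code in core/engine.go may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Token pattern as quoted in the review thread (the variant that includes '+').
var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#+-]{2,}|[^\x00-\x7F]+`)

// appendOriginalTokens is a hypothetical helper. It keeps the English
// keywords first, then appends tokens from the original question that are
// not already present (compared case-insensitively).
func appendOriginalTokens(english []string, question string) []string {
	seen := make(map[string]bool)
	var out []string
	add := func(term string) {
		key := strings.ToLower(strings.TrimSpace(term))
		if key == "" || seen[key] {
			return
		}
		seen[key] = true
		out = append(out, key)
	}
	for _, kw := range english {
		add(kw)
	}
	for _, tok := range originalQuestionTokenPattern.FindAllString(question, -1) {
		add(tok)
	}
	return out
}

func main() {
	terms := appendOriginalTokens([]string{"partition", "resize"}, "如何調整 分割區 大小")
	fmt.Println(terms) // [partition resize 如何調整 分割區 大小]
}
```

Keeping the English terms at the front means a scorer can still weight them differently from the appended fallbacks if it wants to.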

Tests

  • go test ./...


@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces cross-language support for keyword extraction and search by prioritizing English keywords while maintaining original-language fallbacks. Key changes include updated LLM prompts for English output, a repair mechanism for non-English keywords, and a fallback system that appends tokens from the original question to the search set. The relevance scoring logic was also enhanced to handle mixed-language keyword sets. Feedback suggests refining the token normalization and regex patterns to avoid mangling technical terms like 'C++' or 'C#', which would otherwise be stripped by aggressive punctuation filtering.
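
One way a repair step might decide that keyword output needs a second pass is to look for letters outside the ASCII range. The sketch below is only an assumption about how such a check could work; `needsEnglishRepair` is a hypothetical name, and the PR's actual repair logic is not shown in this conversation.

```go
package main

import (
	"fmt"
	"unicode"
)

// needsEnglishRepair is a hypothetical check. It reports true when any
// keyword contains a letter outside the ASCII range, which suggests the
// model answered in the question's language instead of English.
// Note: this also flags accented Latin words such as "café".
func needsEnglishRepair(keywords []string) bool {
	for _, kw := range keywords {
		for _, r := range kw {
			if r > unicode.MaxASCII && unicode.IsLetter(r) {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(needsEnglishRepair([]string{"partition", "resize"})) // false
	fmt.Println(needsEnglishRepair([]string{"partition", "分割區"}))    // true
}
```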

Comment thread core/engine.go
Comment on lines +581 to +583
value = strings.TrimFunc(value, func(r rune) bool {
return unicode.IsPunct(r) || unicode.IsSymbol(r)
})

high

The use of unicode.IsPunct and unicode.IsSymbol in strings.TrimFunc is too aggressive for technical search keywords. It will strip meaningful characters from terms like C++ (becoming c) or C# (becoming c). Since this engine is intended to index technical content, these suffixes should be preserved. Consider trimming only a specific set of non-technical punctuation like .,;:!?()[]{}.

Suggested change
- value = strings.TrimFunc(value, func(r rune) bool {
-     return unicode.IsPunct(r) || unicode.IsSymbol(r)
- })
+ value = strings.TrimFunc(value, func(r rune) bool {
+     return strings.ContainsRune(".,;:!?()[]{}", r)
+ })
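
To make the difference concrete, here is a small standalone comparison of the two trimming strategies. The function names `trimAllPunct` and `trimAllowlist` are made up for this sketch, and it does not lowercase, so the stripped results show as `C` rather than `c`.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// trimAllPunct mirrors the original predicate: every leading or trailing
// punctuation or symbol rune is removed.
func trimAllPunct(s string) string {
	return strings.TrimFunc(s, func(r rune) bool {
		return unicode.IsPunct(r) || unicode.IsSymbol(r)
	})
}

// trimAllowlist mirrors the suggested predicate: only runes in a fixed
// set of sentence punctuation are removed.
func trimAllowlist(s string) string {
	return strings.TrimFunc(s, func(r rune) bool {
		return strings.ContainsRune(".,;:!?()[]{}", r)
	})
}

func main() {
	for _, term := range []string{"C++", "C#", "(C++)", "go."} {
		fmt.Printf("%q -> all: %q, allowlist: %q\n",
			term, trimAllPunct(term), trimAllowlist(term))
	}
	// "C++"   -> all: "C",  allowlist: "C++"
	// "C#"    -> all: "C",  allowlist: "C#"
	// "(C++)" -> all: "C",  allowlist: "C++"
	// "go."   -> all: "go", allowlist: "go"
}
```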

Comment thread core/engine.go Outdated
return false
}

var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#-]{2,}|[^\x00-\x7F]+`)

medium

The regex pattern for ASCII tokens is missing the + character, which is common in technical terms like C++. Additionally, since the pattern requires at least 2 characters ({2,}), a term like C++ would be completely ignored by this fallback mechanism if + is not included in the character class.

Suggested change
- var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#-]{2,}|[^\x00-\x7F]+`)
+ var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#+-]{2,}|[^\x00-\x7F]+`)
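
The effect of adding `+` to the character class can be checked in isolation. The variable names `withoutPlus` and `withPlus` and the sample question are assumptions for this sketch; the two patterns are the ones quoted in the comment.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Pattern before the suggestion: '+' is not an allowed token character.
	withoutPlus = regexp.MustCompile(`[A-Za-z0-9_./#-]{2,}|[^\x00-\x7F]+`)
	// Pattern after the suggestion: '+' is allowed, so "C++" is one token.
	withPlus = regexp.MustCompile(`[A-Za-z0-9_./#+-]{2,}|[^\x00-\x7F]+`)
)

func main() {
	question := "C++ 分割區 resize"
	// "C" alone is shorter than the {2,} minimum, so the term is dropped.
	fmt.Printf("%q\n", withoutPlus.FindAllString(question, -1)) // ["分割區" "resize"]
	fmt.Printf("%q\n", withPlus.FindAllString(question, -1))    // ["C++" "分割區" "resize"]
}
```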

@DemonGiggle
Owner Author

/gemini review

@DemonGiggle merged commit 0b07d96 into main on May 10, 2026 (7 checks passed)

@gemini-code-assist (bot) left a comment

Code Review

This pull request implements cross-language search capabilities by prioritizing English keyword extraction while maintaining original-language fallbacks. Key changes include updated LLM prompts for translation, a repair mechanism for non-English keywords, and enhanced relevance scoring for mixed-language sets. Documentation and tests were also updated to reflect these improvements. Feedback was provided regarding the normalization of non-English tokens, specifically suggesting the inclusion of full-width punctuation in the trimming logic to improve search recall.

Comment thread core/engine.go
Comment on lines +581 to +583
value = strings.TrimFunc(value, func(r rune) bool {
return strings.ContainsRune(".,;:!?()[]{}", r)
})

medium

The normalizeSearchToken function doesn't trim full-width punctuation such as ? from fallback search tokens. This could lead to poor search recall, as a search for "分割區?" will likely not match documents containing just 分割區.

Consider expanding the set of trimmed characters to include common full-width punctuation to make the fallback search terms more robust. Note that this change will require updating the test expectations in TestExtractKeywordsPromptsForEnglishAndAddsOriginalLanguageFallbacks.

Suggested change
- value = strings.TrimFunc(value, func(r rune) bool {
-     return strings.ContainsRune(".,;:!?()[]{}", r)
- })
+ value = strings.TrimFunc(value, func(r rune) bool {
+     return strings.ContainsRune(".,;:!?()[]{},。?!", r)
+ })

@DemonGiggle deleted the fix/issue-74-keyword-language-mismatch branch on May 10, 2026 07:55
Development

Successfully merging this pull request may close these issues.

the language of keyword insertion and the language of keyword query don't match

2 participants