Improve cross-language keyword matching by DemonGiggle · Pull Request #75 · DemonGiggle/iRecall

DemonGiggle · 2026-05-10T05:03:50Z

Closes #74

Summary

align quote tag extraction and recall keyword extraction around English keyword terms
repair non-English query keyword outputs into English when models do not follow the first prompt
append original-language question tokens as fallback search terms
adjust relevance scoring for mixed translated/original-language fallback sets
document the cross-language keyword behavior
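The fallback step in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `appendFallbackTerms` and `tokenPattern` are hypothetical names, the lowercasing and de-duplication rules are assumptions, and the token pattern is borrowed from the review discussion further down.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenPattern is an assumed tokenizer: ASCII technical tokens of two or more
// characters, or runs of non-ASCII characters.
var tokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#+-]{2,}|[^\x00-\x7F]+`)

// appendFallbackTerms is a hypothetical helper. It keeps the English keywords
// first, then appends tokens from the original question that are not already
// in the set, so original-language terms act only as fallbacks.
func appendFallbackTerms(english []string, question string) []string {
	seen := make(map[string]bool, len(english))
	terms := make([]string, 0, len(english))
	for _, kw := range english {
		key := strings.ToLower(strings.TrimSpace(kw))
		if key == "" || seen[key] {
			continue
		}
		seen[key] = true
		terms = append(terms, key)
	}
	for _, tok := range tokenPattern.FindAllString(question, -1) {
		key := strings.ToLower(tok)
		if seen[key] {
			continue
		}
		seen[key] = true
		terms = append(terms, key)
	}
	return terms
}

func main() {
	// English keywords as an LLM might return them, plus the original question.
	fmt.Printf("%q\n", appendFallbackTerms([]string{"disk", "Partition"}, "磁碟 partition"))
}
```

Keeping the English terms ahead of the fallbacks preserves their priority if a later scoring step weights terms by position; whether the actual scoring does so is not stated here.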

Tests

go test ./...

gemini-code-assist

Code Review

This pull request introduces cross-language support for keyword extraction and search by prioritizing English keywords while maintaining original-language fallbacks. Key changes include updated LLM prompts for English output, a repair mechanism for non-English keywords, and a fallback system that appends tokens from the original question to the search set. The relevance scoring logic was also enhanced to handle mixed-language keyword sets. Feedback suggests refining the token normalization and regex patterns to avoid mangling technical terms like 'C++' or 'C#', which would otherwise be stripped by aggressive punctuation filtering.

gemini-code-assist · 2026-05-10T05:05:39Z

+	value = strings.TrimFunc(value, func(r rune) bool {
+		return unicode.IsPunct(r) || unicode.IsSymbol(r)
+	})


The use of unicode.IsPunct and unicode.IsSymbol in strings.TrimFunc is too aggressive for technical search keywords. It will strip meaningful characters from terms like C++ (becoming c) or C# (becoming c). Since this engine is intended to index technical content, these suffixes should be preserved. Consider trimming only a specific set of non-technical punctuation like .,;:!?()[]{}.

Suggested change

-	value = strings.TrimFunc(value, func(r rune) bool {
-		return unicode.IsPunct(r) || unicode.IsSymbol(r)
-	})
+	value = strings.TrimFunc(value, func(r rune) bool {
+		return strings.ContainsRune(".,;:!?()[]{}", r)
+	})
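The difference between the two trimming strategies is easiest to see side by side. The sketch below wraps each predicate in a small helper; `trimAllPunct` and `trimSentencePunct` are illustrative names that do not appear in the PR.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// trimAllPunct mirrors the originally proposed predicate: strip every leading
// and trailing punctuation or symbol rune.
func trimAllPunct(value string) string {
	return strings.TrimFunc(value, func(r rune) bool {
		return unicode.IsPunct(r) || unicode.IsSymbol(r)
	})
}

// trimSentencePunct mirrors the suggested predicate: strip only a fixed set of
// sentence-level punctuation, leaving runes such as + and # in place.
func trimSentencePunct(value string) string {
	return strings.TrimFunc(value, func(r rune) bool {
		return strings.ContainsRune(".,;:!?()[]{}", r)
	})
}

func main() {
	for _, term := range []string{"C++", "C#", "(what?)", ".NET"} {
		fmt.Printf("%q: all=%q sentence=%q\n", term, trimAllPunct(term), trimSentencePunct(term))
	}
}
```

One caveat: `.` is still in the narrower trim set, so a leading-dot term such as `.NET` would lose its dot under either predicate.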

gemini-code-assist · 2026-05-10T05:05:39Z

+	return false
+}
+
+var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#-]{2,}|[^\x00-\x7F]+`)


The regex pattern for ASCII tokens is missing the + character, which is common in technical terms like C++. Additionally, since the pattern requires at least 2 characters ({2,}), a term like C++ would be completely ignored by this fallback mechanism if + is not included in the character class.

Suggested change

-var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#-]{2,}|[^\x00-\x7F]+`)
+var originalQuestionTokenPattern = regexp.MustCompile(`[A-Za-z0-9_./#+-]{2,}|[^\x00-\x7F]+`)
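A quick way to check the effect of adding `+` to the character class is to run both patterns over the same input. The variable names `withoutPlus` and `withPlus` and the sample question are assumptions made for this sketch; 分割區 is Chinese for "partition".

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// withoutPlus is the pattern as first submitted.
	withoutPlus = regexp.MustCompile(`[A-Za-z0-9_./#-]{2,}|[^\x00-\x7F]+`)
	// withPlus is the pattern from the suggested change.
	withPlus = regexp.MustCompile(`[A-Za-z0-9_./#+-]{2,}|[^\x00-\x7F]+`)
)

func main() {
	question := "C++ 分割區？"

	// Without + in the class, "C" is a one-character run and fails {2,},
	// so only the non-ASCII run is returned.
	fmt.Printf("%q\n", withoutPlus.FindAllString(question, -1))

	// With + in the class, "C++" is a three-character run and is kept.
	fmt.Printf("%q\n", withPlus.FindAllString(question, -1))
}
```

In both cases the non-ASCII alternative returns the whole run, including the trailing full-width question mark, so any punctuation cleanup has to happen in a later normalization step.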

DemonGiggle · 2026-05-10T06:34:15Z

/gemini review

gemini-code-assist

Code Review

This pull request implements cross-language search capabilities by prioritizing English keyword extraction while maintaining original-language fallbacks. Key changes include updated LLM prompts for translation, a repair mechanism for non-English keywords, and enhanced relevance scoring for mixed-language sets. Documentation and tests were also updated to reflect these improvements. Feedback was provided regarding the normalization of non-English tokens, specifically suggesting the inclusion of full-width punctuation in the trimming logic to improve search recall.

gemini-code-assist · 2026-05-10T06:47:01Z

+	value = strings.TrimFunc(value, func(r rune) bool {
+		return strings.ContainsRune(".,;:!?()[]{}", r)
+	})


The normalizeSearchToken function doesn't trim full-width punctuation like ？ from fallback search tokens. This could lead to poor search recall, as a search for "分割區？" will likely not match documents containing just 分割區.

Consider expanding the set of trimmed characters to include common full-width punctuation to make the fallback search terms more robust. Note that this change will require updating the test expectations in TestExtractKeywordsPromptsForEnglishAndAddsOriginalLanguageFallbacks.

Suggested change

-	value = strings.TrimFunc(value, func(r rune) bool {
-		return strings.ContainsRune(".,;:!?()[]{}", r)
-	})
+	value = strings.TrimFunc(value, func(r rune) bool {
+		return strings.ContainsRune(".,;:!?()[]{}，。？！", r)
+	})
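As a rough check of this suggestion, the sketch below applies both trim sets to the token from the comment. `trimSet`, `asciiTrimSet`, and `wideTrimSet` are illustrative names, not identifiers from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// asciiTrimSet is the ASCII-only set from the previous revision.
	asciiTrimSet = ".,;:!?()[]{}"
	// wideTrimSet adds the four full-width marks named in the suggestion.
	wideTrimSet = asciiTrimSet + "，。？！"
)

// trimSet strips leading and trailing runes that appear in set.
func trimSet(value, set string) string {
	return strings.TrimFunc(value, func(r rune) bool {
		return strings.ContainsRune(set, r)
	})
}

func main() {
	token := "分割區？"
	fmt.Printf("ascii=%q wide=%q\n", trimSet(token, asciiTrimSet), trimSet(token, wideTrimSet))
}
```

Only the four listed full-width marks are covered; other CJK punctuation, such as full-width brackets or the ideographic comma, would need to be added separately if it matters for recall.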

Improve cross-language keyword matching

c19c564

gemini-code-assist Bot reviewed May 10, 2026

Address keyword review and flaky TUI test

65a868b

DemonGiggle merged commit 0b07d96 into main May 10, 2026
7 checks passed

gemini-code-assist Bot reviewed May 10, 2026

DemonGiggle deleted the fix/issue-74-keyword-language-mismatch branch May 10, 2026 07:55

DemonGiggle merged 2 commits into main from fix/issue-74-keyword-language-mismatch
