
fix(trends): use TF-IDF to filter out low-signal words (#329) #338

Open
axisrow wants to merge 1 commit into main from fix/trends-stopwords-329

Conversation


@axisrow axisrow commented Apr 1, 2026

Summary

  • Replace raw word frequency with TF-IDF scoring via TfidfVectorizer(max_df=0.85) to automatically suppress ubiquitous words like "hello", "если", "буду" that appear in most messages
  • Add scikit-learn>=1.4 dependency
  • Rewrite get_trending_topics to use TF-IDF with configurable thresholds
  • Update tests for the new filtering logic and add a regression test for "Trends showing non important words" (#329)

Test plan

  • All 23 tests in test_trend_service.py pass
  • Lint check passes
  • Regression test verifies "hello" (100% docs) is filtered, "криптовалюта" (20% docs) appears in trends

Closes #329

🤖 Generated with Claude Code

Replace raw word frequency with TfidfVectorizer(max_df=0.85) to
automatically suppress ubiquitous words like "hello", "если", "буду"
that appear in most messages. This eliminates the need for a
hardcoded stop-words list.

- Add scikit-learn>=1.4 dependency
- Rewrite get_trending_topics to use TF-IDF scoring
- Update tests for new filtering logic
- Add regression test for issue #329

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 1, 2026 15:30

Copilot AI left a comment


Pull request overview

Updates trending-topic extraction to reduce low-signal “ubiquitous” words by switching from raw frequency counting to TF‑IDF scoring.

Changes:

  • Reworked TrendService.get_trending_topics to compute trends using TfidfVectorizer (with max_df/min_df and a RU/EN token pattern).
  • Updated/added unit tests to reflect TF‑IDF-based filtering and added a regression test for #329.
  • Added scikit-learn as a project dependency.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
src/services/trend_service.py Replaces frequency-based keyword extraction with TF‑IDF scoring and returns top-scoring terms.
tests/test_trend_service.py Adjusts tests for TF‑IDF semantics and adds a regression test ensuring ubiquitous words are suppressed.
pyproject.toml Adds scikit-learn dependency required for TF‑IDF vectorization.


Comment on lines 58 to +61
```python
batch_size = 5000
offset = 0
word_counts: dict[str, int] = {}
stop_words = {
    "и", "в", "на", "с", "по", "не", "это", "то", "что",
    "как", "из", "за", "от", "для", "или", "но", "а",
    "the", "and", "is", "in", "to", "of", "a", "for",
}
texts: list[str] = []
```

Copilot AI Apr 1, 2026

get_trending_topics now accumulates all matching message texts into texts before running TF‑IDF, which can become a large in-memory list for big databases and defeats the previous batch/memory-saving approach. Consider adding an upper bound (e.g., cap to last N messages / most recent rows), sampling, or persisting a precomputed rolling corpus so this cannot OOM the process for large datasets.
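A minimal sketch of the capping idea (`cap_corpus` and `MAX_CORPUS` are hypothetical names, not in the codebase): keep only the newest N texts before vectorizing.

```python
MAX_CORPUS = 50_000  # hypothetical upper bound; tune to available memory


def cap_corpus(texts: list[str], max_size: int = MAX_CORPUS) -> list[str]:
    """Keep only the newest max_size texts (texts accumulate oldest-first)."""
    return texts[-max_size:]
```

Sampling or a precomputed rolling corpus would bound memory more aggressively; slicing the tail is the smallest change that prevents unbounded growth.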

Comment on lines +82 to +89
```python
vectorizer = TfidfVectorizer(
    token_pattern=r"(?u)\b[а-яёa-z]{4,}\b",  # 4+ characters, RU + EN
    max_df=0.85,  # words in >85% of messages are noise, dropped automatically
    min_df=2,  # words in fewer than 2 messages are too rare
)
try:
    tfidf_matrix = vectorizer.fit_transform(texts)
except ValueError:
```
Copilot AI Apr 1, 2026

TfidfVectorizer.fit_transform is CPU-bound and is executed inside an async def, which will block the event loop when called from async web routes/CLI handlers. Consider offloading the TF‑IDF computation to a worker thread (e.g., asyncio.to_thread) or moving this work to a sync context/background job so concurrent requests/tasks aren’t stalled.
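A minimal sketch of the suggested offload, assuming an async context that already holds the `texts` list (`compute_tfidf` is an illustrative name):

```python
import asyncio

from sklearn.feature_extraction.text import TfidfVectorizer


async def compute_tfidf(texts: list[str]):
    """Run the CPU-bound fit in a worker thread so the event loop stays free."""
    vectorizer = TfidfVectorizer(max_df=0.85, min_df=2)
    tfidf_matrix = await asyncio.to_thread(vectorizer.fit_transform, texts)
    return vectorizer, tfidf_matrix
```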

Comment on lines +96 to +100
```python
top_indices = scores.argsort()[::-1][:limit]
return [
    TrendingTopic(keyword=feature_names[i], count=int(tfidf_matrix.getcol(i).nnz))
    for i in top_indices
]
```
Copilot AI Apr 1, 2026

The returned TrendingTopic.count value has changed semantics: it’s now nnz (document frequency) rather than total occurrences/mentions, but callers format it as “упоминаний” and tests refer to it as a count. Either rename the field to reflect document frequency, adjust user-facing labels, or compute an actual occurrence count to preserve the existing contract.

Comment on lines +82 to +86
```python
vectorizer = TfidfVectorizer(
    token_pattern=r"(?u)\b[а-яёa-z]{4,}\b",  # 4+ characters, RU + EN
    max_df=0.85,  # words in >85% of messages are noise, dropped automatically
    min_df=2,  # words in fewer than 2 messages are too rare
)
```
Copilot AI Apr 1, 2026

PR description mentions “configurable thresholds”, but max_df=0.85, min_df=2, and the token pattern are hard-coded here. If these are intended to be configurable, consider adding parameters (with defaults) or reading from config so behavior can be tuned without code changes.
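If the thresholds are meant to be tunable, a hypothetical factory with defaults matching the PR could look like this (`build_vectorizer` and `DEFAULT_TOKEN_PATTERN` are illustrative names, not in the codebase):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

DEFAULT_TOKEN_PATTERN = r"(?u)\b[а-яёa-z]{4,}\b"  # 4+ characters, RU + EN


def build_vectorizer(
    max_df: float = 0.85,
    min_df: int = 2,
    token_pattern: str = DEFAULT_TOKEN_PATTERN,
) -> TfidfVectorizer:
    # Defaults mirror the currently hard-coded values; config can override them
    return TfidfVectorizer(token_pattern=token_pattern, max_df=max_df, min_df=min_df)
```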

Comment on lines +65 to +70
```python
# Generate messages with unique words that appear in multiple docs
# Each word appears in 2 messages to pass min_df=2
messages = []
for i in range(50):
    messages.append({"text": f"unique{i:03d}word sharedword"})
    messages.append({"text": f"unique{i:03d}word anotherword"})
```
Copilot AI Apr 1, 2026

This test no longer meaningfully exercises the limit behavior: unique{i:03d}word contains digits, so it won’t match the service’s token_pattern (letters only) and will be dropped; that leaves only a couple shared terms, so len(result) <= 5 will pass regardless of limit logic. Use alphabetic-only tokens and ensure there are more than limit eligible terms after min_df/max_df filtering.

Suggested change:

```python
# Generate messages with alphabetic-only words that appear in multiple docs
# Each word appears in 2 messages to pass min_df=2 and stays below max_df
keywords = [
    "alpha", "bravo", "charlie", "delta", "echo",
    "foxtrot", "golf", "hotel", "india", "juliet",
]
messages = []
for kw in keywords:
    messages.append({"text": f"{kw} sharedword"})
    messages.append({"text": f"{kw} anotherword"})
```

Comment on lines 119 to +133
```python
mock_db.execute_fetchall = AsyncMock(
    return_value=[
        {"text": "validword 123numeric"},
        {"text": "test123"},
        {"text": "validword другоеслово"},
        {"text": "validword иноеслово"},  # repeated so validword passes min_df=2
        {"text": "простоеслово ещеслово"},  # no validword, keeps it under max_df
    ]
)

result = await service.get_trending_topics(days=7, limit=10)

# Non-alpha words like "123numeric" should be filtered by token_pattern
words = [t.keyword for t in result]
assert "123numeric" not in words
assert "test123" not in words
assert "validword" in words  # 3/5 = 60% of docs, below max_df=0.85
```
Copilot AI Apr 1, 2026

This test asserts that digit-containing tokens are filtered, but the provided messages don’t contain 123numeric or test123, so the assertions are vacuously true and won’t catch regressions. Include digit-containing tokens in at least 2 messages (to satisfy min_df=2) and assert they are still excluded by token_pattern.



Development

Successfully merging this pull request may close these issues.

Trends showing non important words
