fix(trends): use TF-IDF to filter out low-signal words (#329) by axisrow · Pull Request #338 · axisrow/tg_content_factory

axisrow · 2026-04-01T15:30:15Z

Summary

Replace raw word frequency with TF-IDF scoring via TfidfVectorizer(max_df=0.85) to automatically suppress ubiquitous words like "hello", "если", "буду" that appear in most messages
Add scikit-learn>=1.4 dependency
Rewrite get_trending_topics to use TF-IDF with configurable thresholds
Update tests for new filtering logic + add regression test for Trends showing non important words #329

Test plan

All 23 tests in test_trend_service.py pass
Lint check passes
Regression test verifies "hello" (100% docs) is filtered, "криптовалюта" (20% docs) appears in trends

Closes #329

🤖 Generated with Claude Code

Replace raw word frequency with TfidfVectorizer(max_df=0.85) to automatically suppress ubiquitous words like "hello", "если", "буду" that appear in most messages. This eliminates the need for a hardcoded stop-words list. - Add scikit-learn>=1.4 dependency - Rewrite get_trending_topics to use TF-IDF scoring - Update tests for new filtering logic - Add regression test for issue #329 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Updates trending-topic extraction to reduce low-signal “ubiquitous” words by switching from raw frequency counting to TF‑IDF scoring.

Changes:

Reworked TrendService.get_trending_topics to compute trends using TfidfVectorizer (with max_df/min_df and a RU/EN token pattern).
Updated/added unit tests to reflect TF‑IDF-based filtering and added a regression test for #329.
Added scikit-learn as a project dependency.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File	Description
`src/services/trend_service.py`	Replaces frequency-based keyword extraction with TF‑IDF scoring and returns top-scoring terms.
`tests/test_trend_service.py`	Adjusts tests for TF‑IDF semantics and adds a regression test ensuring ubiquitous words are suppressed.
`pyproject.toml`	Adds `scikit-learn` dependency required for TF‑IDF vectorization.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-01T15:34:13Z

src/services/trend_service.py

        batch_size = 5000
        offset = 0
-        word_counts: dict[str, int] = {}
-        stop_words = {
-            "и", "в", "на", "с", "по", "не", "это", "то", "что",
-            "как", "из", "за", "от", "для", "или", "но", "а",
-            "the", "and", "is", "in", "to", "of", "a", "for",
-        }
+        texts: list[str] = []
+


get_trending_topics now accumulates all matching message texts into texts before running TF‑IDF, which can become a large in-memory list for big databases and defeats the previous batch/memory-saving approach. Consider adding an upper bound (e.g., cap to last N messages / most recent rows), sampling, or persisting a precomputed rolling corpus so this cannot OOM the process for large datasets.

Copilot · 2026-04-01T15:34:13Z

src/services/trend_service.py

+        vectorizer = TfidfVectorizer(
+            token_pattern=r"(?u)\b[а-яёa-z]{4,}\b",  # 4+ символов, RU + EN
+            max_df=0.85,  # слова в >85% сообщений — шум, убираются автоматически
+            min_df=2,  # слова менее чем в 2 сообщениях — слишком редкие
+        )
+        try:
+            tfidf_matrix = vectorizer.fit_transform(texts)
+        except ValueError:


TfidfVectorizer.fit_transform is CPU-bound and is executed inside an async def, which will block the event loop when called from async web routes/CLI handlers. Consider offloading the TF‑IDF computation to a worker thread (e.g., asyncio.to_thread) or moving this work to a sync context/background job so concurrent requests/tasks aren’t stalled.

Copilot · 2026-04-01T15:34:14Z

src/services/trend_service.py

+        top_indices = scores.argsort()[::-1][:limit]
+        return [
+            TrendingTopic(keyword=feature_names[i], count=int(tfidf_matrix.getcol(i).nnz))
+            for i in top_indices
+        ]


The returned TrendingTopic.count value has changed semantics: it’s now nnz (document frequency) rather than total occurrences/mentions, but callers format it as “упоминаний” and tests refer to it as a count. Either rename the field to reflect document frequency, adjust user-facing labels, or compute an actual occurrence count to preserve the existing contract.

Copilot · 2026-04-01T15:34:14Z

src/services/trend_service.py

+        vectorizer = TfidfVectorizer(
+            token_pattern=r"(?u)\b[а-яёa-z]{4,}\b",  # 4+ символов, RU + EN
+            max_df=0.85,  # слова в >85% сообщений — шум, убираются автоматически
+            min_df=2,  # слова менее чем в 2 сообщениях — слишком редкие
+        )


PR description mentions “configurable thresholds”, but max_df=0.85, min_df=2, and the token pattern are hard-coded here. If these are intended to be configurable, consider adding parameters (with defaults) or reading from config so behavior can be tuned without code changes.

Copilot · 2026-04-01T15:34:14Z

tests/test_trend_service.py

+    # Generate messages with unique words that appear in multiple docs
+    # Each word appears in 2 messages to pass min_df=2
+    messages = []
+    for i in range(50):
+        messages.append({"text": f"unique{i:03d}word sharedword"})
+        messages.append({"text": f"unique{i:03d}word anotherword"})


This test no longer meaningfully exercises the limit behavior: unique{i:03d}word contains digits, so it won’t match the service’s token_pattern (letters only) and will be dropped; that leaves only a couple shared terms, so len(result) <= 5 will pass regardless of limit logic. Use alphabetic-only tokens and ensure there are more than limit eligible terms after min_df/max_df filtering.

Suggested change

# Generate messages with unique words that appear in multiple docs

# Each word appears in 2 messages to pass min_df=2

messages = []

for i in range(50):

messages.append({"text": f"unique{i:03d}word sharedword"})

messages.append({"text": f"unique{i:03d}word anotherword"})

# Generate messages with alphabetic-only words that appear in multiple docs

# Each word appears in 2 messages to pass min_df=2 and stays below max_df

keywords = [

"alpha",

"bravo",

"charlie",

"delta",

"echo",

"foxtrot",

"golf",

"hotel",

"india",

"juliet",

]

messages = []

for kw in keywords:

messages.append({"text": f"{kw} sharedword"})

messages.append({"text": f"{kw} anotherword"})

Copilot · 2026-04-01T15:34:15Z

tests/test_trend_service.py

    mock_db.execute_fetchall = AsyncMock(
        return_value=[
-            {"text": "validword 123numeric"},
-            {"text": "test123"},
+            {"text": "validword другоеслово"},
+            {"text": "validword иноеслово"},  # повтор для min_df=2
+            {"text": "простоеслово ещеслово"},  # без validword, чтобы не попало под max_df
        ]
    )

    result = await service.get_trending_topics(days=7, limit=10)

-    # Non-alpha words like "123numeric" should be filtered
    words = [t.keyword for t in result]
+    # Non-alpha words filtered by token_pattern
    assert "123numeric" not in words
    assert "test123" not in words
-    assert "validword" in words
+    assert "validword" in words  # 2/3 = 67% < max_df 85%


This test asserts that digit-containing tokens are filtered, but the provided messages don’t contain 123numeric or test123, so the assertions are vacuously true and won’t catch regressions. Include digit-containing tokens in at least 2 messages (to satisfy min_df=2) and assert they are still excluded by token_pattern.

Copilot AI review requested due to automatic review settings April 1, 2026 15:30

Copilot started reviewing on behalf of axisrow April 1, 2026 15:30 View session

Copilot AI reviewed Apr 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(trends): use TF-IDF to filter out low-signal words (#329)#338

fix(trends): use TF-IDF to filter out low-signal words (#329)#338
axisrow wants to merge 1 commit intomainfrom
fix/trends-stopwords-329

axisrow commented Apr 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

-    # Generate messages with unique words that appear in multiple docs
-    # Each word appears in 2 messages to pass min_df=2
-    messages = []
-    for i in range(50):
-        messages.append({"text": f"unique{i:03d}word sharedword"})
-        messages.append({"text": f"unique{i:03d}word anotherword"})
+    # Generate messages with alphabetic-only words that appear in multiple docs
+    # Each word appears in 2 messages to pass min_df=2 and stays below max_df
+    keywords = [
+        "alpha",
+        "bravo",
+        "charlie",
+        "delta",
+        "echo",
+        "foxtrot",
+        "golf",
+        "hotel",
+        "india",
+        "juliet",
+    ]
+    messages = []
+    for kw in keywords:
+        messages.append({"text": f"{kw} sharedword"})
+        messages.append({"text": f"{kw} anotherword"})

Conversation

axisrow commented Apr 1, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants