chapters/2.TextAnalysis/tfidf.qmd: 5 additions & 7 deletions
@@ -43,15 +43,13 @@ $$\text{TF}(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{
 $$\text{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$

 A word gets a **high TF-IDF score** when:
+
 - It appears frequently in a particular document (high TF)
 - It appears in few other documents (high IDF)

 A word gets a **low TF-IDF score** when:
-- It appears in many documents (low IDF), even if it's frequent in one document

-::: {.callout-note title="Why logarithm in IDF?" collapse="true"}
-The logarithm in the IDF calculation dampens the effect of very rare words. Without it, a word that appears in only one document would dominate the scores. The log transformation ensures that the IDF increases more slowly as words become rarer, creating a more balanced weighting system.
-:::
+- It appears in many documents (low IDF), even if it's frequent in one document

 ## Calculating TF-IDF

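As a rough worked illustration of the TF and IDF definitions touched by this hunk, consider a term that occurs 5 times in a 100-term document and appears in 2 of 10 documents. These counts are invented for the example. The natural logarithm is an assumption, since the source writes only $\log$, and the TF denominator is assumed to be the total number of terms in the document, in line with the chapter's description of `tf` as a proportion.

$$\text{TF}(t, d) = \frac{5}{100} = 0.05, \qquad \text{IDF}(t) = \ln\left(\frac{10}{2}\right) \approx 1.609, \qquad \text{TF}(t, d) \times \text{IDF}(t) \approx 0.080$$

Under the same assumptions, a term that appears in all 10 documents scores zero no matter how often it occurs in any one of them, which matches the "low TF-IDF score" bullet:

$$\text{IDF}(t) = \ln\left(\frac{10}{10}\right) = 0 \quad\Rightarrow\quad \text{TF}(t, d) \times \text{IDF}(t) = 0$$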
@@ -80,6 +78,7 @@ head(comments_tfidf, 15)
 ```

 The resulting data frame includes:
+
 - `tf`: Term frequency (proportion of times the word appears in that season)
 - `idf`: Inverse document frequency (how rare the word is across seasons)