Commit d27edff

refine TF-IDF explanation and formatting in the analysis chapter
1 parent 8918542 commit d27edff

1 file changed: +5 -7 lines changed
chapters/2.TextAnalysis/tfidf.qmd

Lines changed: 5 additions & 7 deletions
@@ -43,15 +43,13 @@ $$\text{TF}(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{
 $$\text{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$
 
 A word gets a **high TF-IDF score** when:
+
 - It appears frequently in a particular document (high TF)
 - It appears in few other documents (high IDF)
 
 A word gets a **low TF-IDF score** when:
-- It appears in many documents (low IDF), even if it's frequent in one document
 
-::: {.callout-note title="Why logarithm in IDF?" collapse="true"}
-The logarithm in the IDF calculation dampens the effect of very rare words. Without it, a word that appears in only one document would dominate the scores. The log transformation ensures that the IDF increases more slowly as words become rarer, creating a more balanced weighting system.
-:::
+- It appears in many documents (low IDF), even if it's frequent in one document
 
 ## Calculating TF-IDF

@@ -80,6 +78,7 @@ head(comments_tfidf, 15)
 ```
 
 The resulting data frame includes:
+
 - `tf`: Term frequency (proportion of times the word appears in that season)
 - `idf`: Inverse document frequency (how rare the word is across seasons)
 - `tf_idf`: The product of TF and IDF
@@ -140,11 +139,10 @@ top_tfidf_s1 <- comments_tfidf %>%
 arrange(desc(tf_idf)) %>%
 head(15)
 
-# Compare
-cat("=== Top 15 words by frequency (Season 1) ===\n")
+# Top 15 words by frequency (Season 1)
 print(top_freq_s1)
 
-cat("\n=== Top 15 words by TF-IDF (Season 1) ===\n")
+# Top 15 words by TF-IDF (Season 1)
 print(top_tfidf_s1 %>% select(word, n, tf_idf))
 ```

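
As a worked illustration of the TF and IDF formulas touched by this commit, consider a hypothetical word (the numbers are made up, not taken from the chapter's data) that accounts for 30 of the 1,000 words in one season's comments and occurs in 2 of 8 seasons. The natural logarithm is assumed here, which is what tidytext's `bind_tf_idf()` uses; the diff does not show which function produced `comments_tfidf`.

$$\text{TF} = \frac{30}{1000} = 0.03, \qquad \text{IDF} = \ln\left(\frac{8}{2}\right) \approx 1.386, \qquad \text{TF-IDF} \approx 0.03 \times 1.386 \approx 0.042$$

If the same word occurred in all 8 seasons, the IDF would be $\ln(8/8) = 0$, so its TF-IDF would be 0 however frequent it is in a single season. That is the case the "low TF-IDF score" bullet describes.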