
Commit 8918542

committed
tfidf chapter
1 parent f5c8767 commit 8918542

File tree

1 file changed: +172 −0 lines changed

chapters/2.TextAnalysis/tfidf.qmd

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
---
title: "TF-IDF: Finding Distinctive Vocabulary"
engine: knitr
format:
  html:
    fig-width: 10
    fig-height: 12
    dpi: 300
editor_options:
  chunk_output_type: inline
---

```{r}
#| include: false
# Setup: load the packages and data needed to render this document in the CI/CD pipeline
library(tidyverse)
library(tidytext)

comments <- readr::read_csv("../../data/clean/comments_preprocessed.csv")
```
So far, we have explored word frequencies and n-grams to understand common terms and phrases in our text data. However, simply counting words has a limitation: some words are frequent because they appear often across *all* documents, not because they are particularly meaningful for a specific document or group.

For example, in our Severance dataset, words like "season," "episode," and "show" might appear frequently in comments about both Season 1 and Season 2. While these words are common, they don't help us understand what makes each season's discussion *distinctive*.

This is where **TF-IDF (Term Frequency-Inverse Document Frequency)** becomes useful. TF-IDF is a statistical measure that evaluates how important a word is to a document in a collection of documents (corpus). It helps us identify words that are frequent in one document but rare across the entire corpus: precisely the words that make a document unique.

## Understanding TF-IDF

TF-IDF combines two metrics:

1. **Term Frequency (TF)**: How often a word appears in a document
2. **Inverse Document Frequency (IDF)**: How rare a word is across all documents

The formula is:

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

Where:

$$\text{TF}(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total terms in document } d}$$

$$\text{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$

A word gets a **high TF-IDF score** when:

- It appears frequently in a particular document (high TF)
- It appears in few other documents (high IDF)

A word gets a **low TF-IDF score** when:

- It appears in many documents (low IDF), even if it's frequent in one document

::: {.callout-note title="Why logarithm in IDF?" collapse="true"}
The logarithm in the IDF calculation dampens the effect of very rare words. Without it, a word that appears in only one document would dominate the scores. The log transformation ensures that the IDF increases more slowly as words become rarer, creating a more balanced weighting system.
:::
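To make the formulas concrete, here is a small sketch that computes TF, IDF, and TF-IDF by hand on a toy two-document corpus. The documents, words, and counts are invented for illustration; note that `bind_tf_idf()` (which we use later) applies the natural logarithm, as we do here.

```{r}
library(dplyr)
library(tidytext)

# Toy corpus (invented data): one row per (document, word) with raw counts
toy <- tibble::tribble(
  ~doc, ~word,    ~n,
  "d1", "show",    4,
  "d1", "innie",   2,
  "d2", "show",    3,
  "d2", "finale",  3
)

n_docs <- dplyr::n_distinct(toy$doc)

manual <- toy %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%           # term frequency within each document
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n_docs / n())) %>%   # n() = number of documents containing the word
  ungroup() %>%
  mutate(tf_idf = tf * idf)

manual
```

Because "show" appears in both documents, its IDF is $\log(2/2) = 0$ and its TF-IDF vanishes, while "innie" (only in `d1`) scores $(2/6) \times \log 2$. This hand computation matches what `bind_tf_idf()` produces on the same counts.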
## Calculating TF-IDF

In our case, we want to compare the vocabulary between Season 1 and Season 2 comments. We'll treat each season as a "document" and calculate TF-IDF to find which words are distinctive to each season.

First, we need to extract season information from the `id` column and tokenize the comments:

```{r}
# Count words per season
comments_tfidf <- comments %>%
  mutate(season = str_extract(id, "s[12]")) %>% # Extract season (s1 or s2)
  unnest_tokens(word, comments) %>%             # Tokenize into words
  count(season, word, sort = TRUE)              # Count words per season

head(comments_tfidf)
```

Now we can apply the `bind_tf_idf()` function from the `tidytext` package, which automatically calculates TF, IDF, and TF-IDF for us:

```{r}
# Apply TF-IDF calculation
comments_tfidf <- comments_tfidf %>%
  bind_tf_idf(word, season, n)

head(comments_tfidf, 15)
```
The resulting data frame includes three new columns:

- `tf`: Term frequency (the proportion of that season's words accounted for by this word)
- `idf`: Inverse document frequency (how rare the word is across seasons)
- `tf_idf`: The product of TF and IDF

Let's examine the top words by TF-IDF for each season:

```{r}
# Top 10 distinctive words per season
comments_tfidf %>%
  group_by(season) %>%
  slice_max(tf_idf, n = 10)
```

Notice how these words are much more specific and meaningful than the most frequent words alone. These are the words that truly characterize each season's discussion.

## Visualizing Distinctive Vocabulary

To better understand the distinctive vocabulary of each season, we can create a visualization comparing the top TF-IDF words:

```{r}
# Prepare data for visualization
top_tfidf_words <- comments_tfidf %>%
  group_by(season) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, season))

# Plot distinctive vocabulary by season
ggplot(top_tfidf_words, aes(tf_idf, word, fill = season)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~season, scales = "free") +
  scale_y_reordered() +
  labs(
    x = "TF-IDF",
    y = NULL,
    title = "Distinctive Vocabulary by Season"
  ) +
  theme_minimal()
```

This visualization shows which words are most characteristic of each season's discussions. Words with higher TF-IDF scores are those that appear frequently in one season but not in the other, making them useful markers of distinctive content.
## Comparing TF-IDF to Raw Frequency

To appreciate the value of TF-IDF, let's compare it to simple word counts. We'll look at the top words by frequency versus the top words by TF-IDF for Season 1:

```{r}
# Top words by raw frequency for Season 1
top_freq_s1 <- comments %>%
  filter(grepl("^s1", id)) %>%
  unnest_tokens(word, comments) %>%
  count(word, sort = TRUE) %>%
  head(15)

# Top words by TF-IDF for Season 1
top_tfidf_s1 <- comments_tfidf %>%
  filter(season == "s1") %>%
  arrange(desc(tf_idf)) %>%
  head(15)

# Compare
cat("=== Top 15 words by frequency (Season 1) ===\n")
print(top_freq_s1)

cat("\n=== Top 15 words by TF-IDF (Season 1) ===\n")
print(top_tfidf_s1 %>% select(word, n, tf_idf))
```

The raw frequency list likely includes many words that are common across both seasons, while the TF-IDF list highlights words that are specifically important to Season 1 discussions.
## When to Use TF-IDF

TF-IDF is particularly useful for:

1. **Document comparison**: Identifying what makes each document unique in a collection
2. **Feature extraction**: Preparing text data for machine learning by emphasizing distinctive words
3. **Topic discovery**: Finding characteristic vocabulary for different groups or categories
4. **Search and retrieval**: Ranking documents by relevance to a query (search engines use variations of TF-IDF)
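As a sketch of the feature-extraction use case, tidy per-document TF-IDF scores can be pivoted into a wide matrix with one row per document and one TF-IDF-weighted column per word, ready for modelling functions. The documents and counts below are invented for illustration:

```{r}
library(dplyr)
library(tidyr)
library(tidytext)

# Invented word counts for two small documents
word_counts <- tibble::tribble(
  ~doc, ~word,    ~n,
  "d1", "innie",   5,
  "d1", "waffle",  2,
  "d2", "finale",  4,
  "d2", "innie",   1
)

# Wide TF-IDF feature matrix: one row per document, one column per word;
# words absent from a document get a 0
tfidf_features <- word_counts %>%
  bind_tf_idf(word, doc, n) %>%
  select(doc, word, tf_idf) %>%
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)

tfidf_features
```

For larger corpora, tidytext's `cast_dtm()` and `cast_sparse()` produce sparse-matrix versions of the same idea, which scale better than a dense data frame.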
::: {.callout-tip title="Limitations of TF-IDF"}
While TF-IDF is powerful, it has some limitations:

- **No semantic understanding**: It treats words as independent units and doesn't recognize synonyms or context
- **Corpus dependency**: TF-IDF scores depend on the entire corpus, so adding or removing documents changes the scores
- **Document length bias**: It can be affected by differences in document length (though the TF normalization partially addresses this)

For more advanced semantic analysis, techniques such as word embeddings or transformer models may be more appropriate.
:::

TF-IDF bridges the gap between simple word counting and more sophisticated text analysis techniques. By weighting words by both their local importance (within a document) and their global rarity (across the corpus), it helps us discover the vocabulary that truly distinguishes different parts of our text data.
