---
title: "TF-IDF: Finding Distinctive Vocabulary"
engine: knitr
format:
  html:
    fig-width: 10
    fig-height: 12
    dpi: 300
editor_options:
  chunk_output_type: inline
---

```{r}
#| include: false
# This is just to render the document correctly in the CI/CD pipeline
library(tidyverse)
library(tidytext)

comments <- readr::read_csv("../../data/clean/comments_preprocessed.csv")
```

So far, we have explored word frequencies and n-grams to understand common terms and phrases in our text data. However, simply counting words has a limitation: some words are frequent because they appear often across *all* documents, not because they are particularly meaningful for a specific document or group.

For example, in our Severance dataset, words like "season," "episode," and "show" might appear frequently in comments about both Season 1 and Season 2. While these words are common, they don't help us understand what makes each season's discussion *distinctive*.

This is where **TF-IDF (Term Frequency-Inverse Document Frequency)** becomes useful. TF-IDF is a statistical measure that evaluates how important a word is to a document in a collection of documents (corpus). It helps us identify words that are frequent in one document but rare across the entire corpus—precisely the words that make a document unique.

## Understanding TF-IDF

TF-IDF combines two metrics:

1. **Term Frequency (TF)**: How often a word appears in a document
2. **Inverse Document Frequency (IDF)**: How rare a word is across all documents

The formula is:

$$\text{TF-IDF} = \text{TF} \times \text{IDF}$$

Where:

$$\text{TF}(t, d) = \frac{\text{count of term } t \text{ in document } d}{\text{total terms in document } d}$$

$$\text{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$

A word gets a **high TF-IDF score** when:

- It appears frequently in a particular document (high TF)
- It appears in few other documents (high IDF)

A word gets a **low TF-IDF score** when:

- It appears in many documents (low IDF), even if it's frequent in one document

::: {.callout-note title="Why logarithm in IDF?" collapse="true"}
The logarithm in the IDF calculation dampens the effect of very rare words. Without it, a word that appears in only one document would dominate the scores. The log transformation ensures that IDF increases more slowly as words become rarer, creating a more balanced weighting system.
:::
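
To make the formula concrete, here is a by-hand sketch on a tiny invented corpus (the documents, words, and counts below are made up purely for illustration). The `bind_tf_idf()` function we use later performs the same arithmetic:

```{r}
# Tiny invented corpus: one row per (document, word) pair with its count
toy <- tribble(
  ~document, ~word,    ~n,
  "doc1",    "innie",   4,
  "doc1",    "show",    6,
  "doc2",    "show",    5,
  "doc2",    "waffle",  3
)

n_docs <- n_distinct(toy$document)

toy %>%
  group_by(document) %>%
  mutate(tf = n / sum(n)) %>%         # TF: share of the document's tokens
  group_by(word) %>%
  mutate(idf = log(n_docs / n())) %>% # IDF: log(total docs / docs with word)
  ungroup() %>%
  mutate(tf_idf = tf * idf)
```

Here "show", which appears in both documents, gets $\text{IDF} = \log(2/2) = 0$ and therefore a TF-IDF of zero, no matter how frequent it is.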

## Calculating TF-IDF

In our case, we want to compare the vocabulary between Season 1 and Season 2 comments. We'll treat each season as a "document" and calculate TF-IDF to find which words are distinctive to each season.

First, we need to extract season information from the `id` column and tokenize the comments:

```{r}
# Step 1 of the TF-IDF calculation: word counts per season
comments_tfidf <- comments %>%
  mutate(season = str_extract(id, "s[12]")) %>% # Extract season (s1 or s2)
  unnest_tokens(word, comments) %>%             # Tokenize into words
  count(season, word, sort = TRUE)              # Count words per season

head(comments_tfidf)
```

Now we can apply the `bind_tf_idf()` function from the `tidytext` package, which automatically calculates TF, IDF, and TF-IDF for us:

```{r}
# Apply TF-IDF calculation
comments_tfidf <- comments_tfidf %>%
  bind_tf_idf(word, season, n)

head(comments_tfidf, 15)
```

The resulting data frame includes three new columns:

- `tf`: term frequency (the proportion of that season's word tokens accounted for by the word)
- `idf`: inverse document frequency (how rare the word is across seasons)
- `tf_idf`: the product of TF and IDF

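Because we are treating the two seasons as our only two "documents", any word that appears in both seasons gets $\text{IDF} = \log(2/2) = 0$, and therefore a TF-IDF of exactly zero, no matter how frequent it is. We can check this directly:

```{r}
# Words that occur in both seasons have idf = log(2/2) = 0,
# so their tf_idf is exactly zero regardless of their frequency
comments_tfidf %>%
  filter(idf == 0) %>%
  arrange(desc(tf)) %>%
  head(5)
```

This is why generic words shared by both seasons drop out of the TF-IDF rankings entirely.
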
Let's examine the top words by TF-IDF for each season:

```{r}
# Top 10 distinctive words per season
comments_tfidf %>%
  group_by(season) %>%
  slice_max(tf_idf, n = 10)
```

Notice how much more specific and meaningful these words are than the raw most-frequent terms. These are the words that truly characterize each season's discussion.

## Visualizing Distinctive Vocabulary

To better understand the distinctive vocabulary of each season, we can create a visualization comparing the top TF-IDF words:

```{r}
# Prepare data for visualization
top_tfidf_words <- comments_tfidf %>%
  group_by(season) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, season))

# Plot distinctive vocabulary by season
ggplot(top_tfidf_words, aes(tf_idf, word, fill = season)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~season, scales = "free") +
  scale_y_reordered() +
  labs(
    x = "TF-IDF",
    y = NULL,
    title = "Distinctive Vocabulary by Season"
  ) +
  theme_minimal()
```

This visualization clearly shows which words are most characteristic of each season's discussions. Words with higher TF-IDF scores are those that appear frequently in one season but not in the other, making them useful markers of distinctive content.
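
A small detail in the plotting code is worth flagging: `reorder_within()` (from `tidytext`) lets each facet sort its own bars by temporarily appending the group name to every word with a `___` separator, and `scale_y_reordered()` strips that suffix off the axis labels again. You can peek at the encoded factor levels directly:

```{r}
# reorder_within() encodes per-facet ordering as "word___season";
# scale_y_reordered() removes the suffix when drawing the axis
head(levels(top_tfidf_words$word), 3)
```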

## Comparing TF-IDF to Raw Frequency

To appreciate the value of TF-IDF, let's compare it to simple word counts. We'll look at the top words by frequency versus the top words by TF-IDF for Season 1:

```{r}
# Top words by raw frequency for Season 1
top_freq_s1 <- comments %>%
  filter(grepl("^s1", id)) %>%
  unnest_tokens(word, comments) %>%
  count(word, sort = TRUE) %>%
  head(15)

# Top words by TF-IDF for Season 1
top_tfidf_s1 <- comments_tfidf %>%
  filter(season == "s1") %>%
  arrange(desc(tf_idf)) %>%
  head(15)

# Compare
cat("=== Top 15 words by frequency (Season 1) ===\n")
print(top_freq_s1)

cat("\n=== Top 15 words by TF-IDF (Season 1) ===\n")
print(top_tfidf_s1 %>% select(word, n, tf_idf))
```

The raw frequency list likely includes many words that are common across both seasons, while the TF-IDF list highlights words that are specifically important to Season 1 discussions.
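
To make that concrete, we can check which high-frequency words never make the TF-IDF top 15 at all; these are the generic terms that TF-IDF down-weights (to exactly zero, if they occur in both seasons):

```{r}
# High-frequency Season 1 words that drop out of the TF-IDF ranking
setdiff(top_freq_s1$word, top_tfidf_s1$word)
```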

## When to Use TF-IDF

TF-IDF is particularly useful for:

1. **Document comparison**: Identifying what makes each document unique in a collection
2. **Feature extraction**: Preparing text data for machine learning by emphasizing distinctive words (see the sketch after this list)
3. **Topic discovery**: Finding characteristic vocabulary for different groups or categories
4. **Search and retrieval**: Ranking documents by relevance to a query (search engines use variations of TF-IDF)

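As an illustration of point 2, here is one possible (deliberately minimal) sketch: reshape the per-season scores into a wide document-term matrix, with one row per "document" (season), one column per word, and TF-IDF values in the cells:

```{r}
# Reshape per-season TF-IDF scores into a wide feature matrix;
# (season, word) pairs with no count become 0
tfidf_features <- comments_tfidf %>%
  select(season, word, tf_idf) %>%
  pivot_wider(names_from = word, values_from = tf_idf, values_fill = 0)

dim(tfidf_features)
```

With only two seasons this matrix has just two rows, so it is purely illustrative here; with more documents (say, one per comment thread) it could feed into a classifier or clustering routine.
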
::: {.callout-tip title="Limitations of TF-IDF"}
While TF-IDF is powerful, it has some limitations:

- **No semantic understanding**: It treats words as independent units and doesn't recognize synonyms or context
- **Corpus dependency**: Scores depend on the entire corpus, so adding or removing documents changes them
- **Document length bias**: Results can be affected by differences in document length (though the TF normalization partially addresses this)

For more advanced semantic analysis, techniques like word embeddings or transformer models might be more appropriate.
:::

TF-IDF bridges the gap between simple word counting and more sophisticated text analysis techniques. By weighting words according to both their local importance (in a document) and their global rarity (across the corpus), it helps us discover the vocabulary that truly distinguishes different parts of our text data.