Commit a9b0fce

Merge pull request #54 from UCSB-Library-Research-Data-Services/rcomm
Text Analysis Chapter
2 parents 478421f + abd8e00 commit a9b0fce

15 files changed: +6887 -637 lines

.github/workflows/quarto.yml

Lines changed: 6 additions & 0 deletions

@@ -31,6 +31,12 @@ jobs:
       - uses: r-lib/actions/setup-r@v2
         with:
           use-public-rspm: true
+
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libglpk-dev libxml2-dev libcurl4-openssl-dev libssl-dev
+
       - name: Restore R packages via renv (if present)
         run: |
           if [ -f renv.lock ]; then

.gitignore

Lines changed: 3 additions & 1 deletion

@@ -11,4 +11,6 @@ _site/
 .DS_Store
 .Rproj.user
 
-/data/raw/
+/data/raw/
+
+.RData

_quarto-ci.yml

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
 execute:
-  enabled: false
+  enabled: true
   freeze: auto
   cache: true

_quarto.yml

Lines changed: 7 additions & 5 deletions

@@ -4,6 +4,8 @@ project:
   render:
     - "*.qmd"
     - "!scripts/"
+    - "!data/"
+    - "workbook.qmd"
 
 execute:
   enabled: false
@@ -51,11 +53,11 @@ website:
       contents:
         - href: chapters/2.TextAnalysis/introduction.qmd
           text: Text Analysis
-        - href: chapters/2.TextAnalysis/common_processes.qmd
-          text: Common Text Analysis Processes
-        - href: chapters/2.TextAnalysis/corpus_analysis.qmd
-          text: Corpus-level Analysis
-        - href: chapters/2.TextAnalysis/frequency_analysis.qmd
+        - href: chapters/2.TextAnalysis/word_frequencies.qmd
+          text: Basic Word Frequencies
+        - href: chapters/2.TextAnalysis/ngrams.qmd
+          text: N-grams and Collocations
+        - href: chapters/2.TextAnalysis/tfidf.qmd
           text: Frequency Analysis
     - section: "Sentiment Analysis"
       contents:

chapters/2.TextAnalysis/common_processes.qmd

Whitespace-only changes.

chapters/2.TextAnalysis/corpus_analysis.qmd

Whitespace-only changes.

chapters/2.TextAnalysis/frequency_analysis.qmd

Whitespace-only changes.
chapters/2.TextAnalysis/introduction.qmd

Lines changed: 22 additions & 2 deletions

@@ -1,3 +1,23 @@
 ---
-title: "Introduction to Text Analysis"
----
+title: "What is Text Analysis?"
+engine: knitr
+format:
+  html:
+    fig-width: 10
+    fig-height: 12
+    dpi: 300
+editor_options:
+  chunk_output_type: inline
+---
+
+Text analysis is an umbrella term for the techniques, methods, and approaches used to extract the meaning, structure, or general characteristics of a text by analyzing its constituent words and symbols and their relationships to a context, period, trend, or intention.
+
+Thanks to the widespread availability of computers and the steady growth of computing power, computational methods for text analysis have become prevalent, allowing researchers to analyze large corpora of texts and to extend these techniques beyond academic research to purposes such as commercial text processing, sentiment analysis, or information retrieval.
+
+Building on these foundations, this episode focuses on the introductory analytical techniques that establish common ground for more complex tasks such as sentiment analysis, language modeling, topic modeling, or text generation.
+
+::: {.callout-note title="NLP"}
+Although Natural Language Processing (NLP) is sometimes used as a synonym for text analysis, text analysis encompasses both computational and non-computational approaches to analyzing text. NLP is primarily concerned with the interaction between computers and human language: it focuses on developing algorithms and models that enable machines to understand, interpret, and generate human language.
+:::
+
+

chapters/2.TextAnalysis/ngrams.qmd

Lines changed: 274 additions & 0 deletions

---
title: "N-grams and Word Sequences"
engine: knitr
format:
  html:
    fig-width: 10
    fig-height: 12
    dpi: 300
editor_options:
  chunk_output_type: inline
---

```{r}
#| include: false
# Setup: load the packages and preprocessed comments used throughout this chapter
# (this chunk also allows the document to render in the CI/CD pipeline)
library(tidyverse)
library(tidytext)

comments <- readr::read_csv("../../data/clean/comments_preprocessed.csv")
```

As you may have noticed, counting words is useful for exploring common terms in a text corpus, but it does not capture the context in which words are used. To gain deeper insight into the relationships between words, we can analyze sequences of words, known as **n-grams**. N-grams are contiguous sequences of *n* items (words) from a given text: a bigram is a sequence of two words, while a trigram is a sequence of three.
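To make this concrete before applying it to the comments data, here is a small illustration on an invented sentence (the sentence and the `toy` object are made up for demonstration):

```{r}
# Toy illustration with an invented sentence (not from the comments data)
toy <- tibble(text = "the severance finale was absolutely stunning")

toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)   # 5 bigrams
toy %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)  # 4 trigrams
```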
## Creating N-grams

Because creating n-grams involves tokenizing text into sequences of words, we can use the `unnest_tokens()` function from the `tidytext` package again, but this time specifying the `token` argument to create n-grams.

```{r}
# Creating bigrams (2-grams) from the comments
ngrams <- comments %>%
  unnest_tokens(ngrams, comments, token = "ngrams", n = 2) # bigrams

ngrams
```

The resulting `ngrams` data frame contains bigrams extracted from the comments. Each row represents a bigram, that is, two consecutive words from the original text.

By changing the value of `n` in the `unnest_tokens()` function, we can create trigrams (3-grams), four-grams, and so on, depending on our analysis needs.

```{r}
# Creating trigrams (3-grams) from the comments
trigrams <- comments %>%
  unnest_tokens(ngrams, comments, token = "ngrams", n = 3) # trigrams

trigrams
```

## Next Word Prediction Using N-grams

One practical application of n-grams is building simple predictive text models. For instance, we can create a function that predicts the next word based on a given word using bigrams.

```{r}
# Function to predict the next word based on a given word using bigrams
next_word <- function(word, ngrams_df) {
  # Split each bigram into its two words and keep those whose first word matches
  matches <- ngrams_df %>%
    separate(ngrams, into = c("w1", "w2"), sep = " ", remove = FALSE) %>%
    filter(w1 == word) %>%
    pull(w2)
  # Return the most frequent second word (if several words tie, all are returned)
  freq <- table(matches)
  nw <- max(freq)
  return(names(freq[freq == nw]))
}
```

This function takes a word and the n-grams data frame as inputs, finds all bigrams whose first word matches the input word, and returns the most frequently occurring second word as the predicted next word.

We can see how the function works with an example:

```{r}
type_any_word <- "ben"

next_word(type_any_word, ngrams)
```

We can even play with a simple loop to see how the prediction evolves:

```{r}
current_word <- "wow"
for (i in 1:5) {
  predicted_word <- next_word(current_word, ngrams)
  cat(current_word, "->", predicted_word, "\n")
  current_word <- predicted_word
}
```

If you have played with this code, you might have noticed that the predictions can lead to repetitive or nonsensical sequences. This is a limitation of simple n-gram models that lack additional context or smoothing techniques. We can check whether using trigrams, which condition on two words of context, improves the predictions:

```{r}
# Function to predict the next word based on a given two-word phrase using trigrams
next_word_trigram <- function(phrase, trigrams_df) {
  words <- unlist(strsplit(phrase, " "))
  if (length(words) != 2) {
    stop("Please provide a two-word phrase.")
  }
  # Split each trigram into its three words and keep those that start with the phrase
  matches <- trigrams_df %>%
    separate(ngrams, into = c("w1", "w2", "w3"), sep = " ", remove = FALSE) %>%
    filter(w1 == words[1], w2 == words[2]) %>%
    pull(w3)
  # Return the most frequent third word
  freq <- table(matches)
  nw <- max(freq)
  return(names(freq[freq == nw]))
}
```

To use this function, provide a two-word phrase, for instance "best show":

```{r}
type_any_phrase <- "best show"
next_word_trigram(type_any_phrase, trigrams)
```
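Even with trigrams, the predictor always returns the most frequent continuation, so repeated runs produce identical and sometimes circular output. As a quick aside, here is a sketch of one simple workaround (not part of the workshop material): sample the next word in proportion to bigram frequency rather than always taking the top candidate.

```{r}
# Hypothetical variant: sample the next word weighted by bigram frequency
# instead of always returning the single most frequent continuation
next_word_sampled <- function(word, ngrams_df) {
  candidates <- ngrams_df %>%
    separate(ngrams, into = c("w1", "w2"), sep = " ", remove = FALSE) %>%
    filter(w1 == word) %>%
    count(w2, sort = TRUE)
  if (nrow(candidates) == 0) return(NA_character_)
  sample(candidates$w2, size = 1, prob = candidates$n)
}

# Each run can give a different continuation; use set.seed() for reproducibility
next_word_sampled("wow", ngrams)
```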
## From N-grams to Collocations

While n-grams capture all consecutive word sequences, not all of them are equally meaningful. **Collocations** are word combinations that occur together more frequently than would be expected by chance. They represent meaningful multi-word expressions like "strong coffee," "make a decision," or, in our data, perhaps "plot twist" or "character development."

The key difference:

- **N-grams**: mechanical extraction of all consecutive words
- **Collocations**: statistically significant word pairs that carry specific meaning

### Identifying Collocations

To find collocations, we need to measure how "associated" two words are. One common metric is **Pointwise Mutual Information (PMI)**, which compares how often words appear together versus how often we would expect them to appear together if they were independent.

::: {.callout-note title="Other Collocation Metrics" collapse="true"}
While we use PMI in this workshop, there are several other statistical measures commonly used to identify collocations:

- **Chi-square (χ²)**: Tests the independence of two words by comparing observed vs. expected frequencies. Higher values indicate stronger association.
- **Log-likelihood ratio (G²)**: Similar to chi-square but more reliable for small sample sizes. Commonly used in corpus linguistics.
- **T-score**: Measures the confidence in the association between two words. Less sensitive to low-frequency pairs than PMI.
- **Dice coefficient**: Measures the overlap between two words' contexts. Values range from 0 to 1.

Each metric has different strengths. PMI favors rare but strongly associated pairs, while the t-score is more conservative and favors frequent collocations. The choice depends on your research goals and corpus characteristics.
:::

First, let's separate our bigrams and count them:

```{r}
library(tidyr)

# Separate bigrams into individual words and count
bigram_counts <- ngrams %>%
  separate(ngrams, into = c("word1", "word2"), sep = " ", remove = FALSE) %>%
  count(word1, word2, sort = TRUE)

head(bigram_counts, 10)
```

Now we'll calculate PMI for each bigram. PMI is defined as:

$$\text{PMI}(w_1, w_2) = \log_2\left(\frac{P(w_1, w_2)}{P(w_1) \times P(w_2)}\right)$$

Where:

- $P(w_1, w_2)$ is the probability of the bigram occurring
- $P(w_1)$ and $P(w_2)$ are the probabilities of each word occurring independently
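To get a feel for the formula before applying it to the whole corpus, here is a toy calculation with made-up counts (the numbers are purely illustrative):

```{r}
# Toy PMI calculation (invented counts, for illustration only):
# suppose word A occurs 50 times, word B 20 times, and the bigram "A B" 10 times,
# out of roughly 10,000 words and 10,000 bigrams
p_ab <- 10 / 10000   # P(w1, w2)
p_a  <- 50 / 10000   # P(w1)
p_b  <- 20 / 10000   # P(w2)

log2(p_ab / (p_a * p_b))   # about 6.64: the pair occurs ~100x more often than chance
```

A PMI of 0 means the pair occurs exactly as often as independence would predict, and negative values mean it occurs less often than expected. With that intuition in place, we can compute PMI for every bigram in the corpus: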
```{r}
library(dplyr)

# Calculate individual word frequencies
word_freqs <- comments %>%
  unnest_tokens(word, comments) %>%
  count(word, name = "word_count")

# Total number of words in the corpus
total_words <- sum(word_freqs$word_count)

# Total number of bigrams
total_bigrams <- sum(bigram_counts$n)

# Calculate PMI
collocations <- bigram_counts %>%
  left_join(word_freqs, by = c("word1" = "word")) %>%
  rename(word1_count = word_count) %>%
  left_join(word_freqs, by = c("word2" = "word")) %>%
  rename(word2_count = word_count) %>%
  mutate(
    # Probability of the bigram
    p_bigram = n / total_bigrams,
    # Probability of each word
    p_word1 = word1_count / total_words,
    p_word2 = word2_count / total_words,
    # PMI calculation
    pmi = log2(p_bigram / (p_word1 * p_word2))
  ) %>%
  arrange(desc(pmi))

head(collocations, 15)
```

High PMI values indicate strong collocations, that is, word pairs that appear together far more often than chance would predict.
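For comparison, the Dice coefficient mentioned in the callout above can be computed directly from the counts we have already joined into `collocations`. This is a sketch for comparison only, not part of the workshop pipeline:

```{r}
# Dice coefficient: 2 * joint count / (count of word1 + count of word2), ranging from 0 to 1
collocations %>%
  mutate(dice = 2 * n / (word1_count + word2_count)) %>%
  select(word1, word2, n, pmi, dice) %>%
  arrange(desc(dice)) %>%
  head(15)
```

Comparing the PMI and Dice rankings is a quick way to see how the choice of association measure changes which word pairs rise to the top.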
### Visualizing Collocations

Let's visualize the strongest collocations to see what meaningful phrases emerge from our Severance comments:

```{r}
library(ggplot2)

# Top 20 collocations by PMI
top_collocations <- collocations %>%
  head(20) %>%
  unite(bigram, word1, word2, sep = " ")

ggplot(top_collocations, aes(x = reorder(bigram, pmi), y = pmi)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Collocations by PMI",
    x = "Bigram",
    y = "Pointwise Mutual Information"
  ) +
  theme_minimal()
```

### Using Collocations for Smarter Prediction

Remember our simple n-gram predictor that sometimes got stuck in loops? We can create a more "intelligent" predictor using collocations instead of raw frequency counts. The idea is simple: instead of picking the most frequent next word, we pick the word with the highest PMI (the strongest association).

```{r}
# Function to predict the next word using collocation strength (PMI)
next_word_collocation <- function(word, collocations_df, min_freq = 2) {
  candidates <- collocations_df %>%
    filter(word1 == word, n >= min_freq, pmi > 0) %>%
    arrange(desc(pmi))

  # Return the word with the highest PMI, or NA if there are no matches
  if (nrow(candidates) > 0) {
    return(candidates$word2[1])
  } else {
    return(NA)
  }
}
```

Let's compare the two approaches side by side:

```{r}
# Compare frequency-based vs. collocation-based prediction
test_word <- "mark"

freq_prediction <- next_word(test_word, ngrams)
colloc_prediction <- next_word_collocation(test_word, collocations)

cat("Frequency-based predictor:", test_word, "->", freq_prediction, "\n")
cat("Collocation-based predictor:", test_word, "->", colloc_prediction, "\n")
```

Now let's run both predictors in a loop and see which produces more meaningful sequences:

```{r}
# Frequency-based prediction
current_word <- "wow"
for (i in 1:10) {
  predicted_word <- next_word(current_word, ngrams)
  cat(current_word, "->", predicted_word, "\n")
  current_word <- predicted_word
}

# Collocation-based prediction
current_word <- "wow"
for (i in 1:10) {
  predicted_word <- next_word_collocation(current_word, collocations)
  if (is.na(predicted_word)) {
    cat(current_word, "-> (no strong collocation found)\n")
    break
  }
  cat(current_word, "->", predicted_word, "\n")
  current_word <- predicted_word
}
```

Both approaches share the same structure: each looks up a likely next word given the current one. The collocation-based predictor, however, leverages statistical associations between words, which can lead to more contextually relevant predictions. This is an example of how different text analysis techniques can produce different results from the same data, depending on the method used.
