Commit a9b0fce

Merge pull request #54 from UCSB-Library-Research-Data-Services/rcomm
Text Analysis Chapter
2 parents 478421f + abd8e00 commit a9b0fce

15 files changed: +6887 -637 lines

.github/workflows/quarto.yml

Lines changed: 6 additions & 0 deletions

@@ -31,6 +31,12 @@ jobs:
       - uses: r-lib/actions/setup-r@v2
         with:
           use-public-rspm: true
+
+      - name: Install system dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y libglpk-dev libxml2-dev libcurl4-openssl-dev libssl-dev
+
       - name: Restore R packages via renv (if present)
         run: |
           if [ -f renv.lock ]; then

.gitignore

Lines changed: 3 additions & 1 deletion

@@ -11,4 +11,6 @@ _site/
 .DS_Store
 .Rproj.user
 
-/data/raw/
+/data/raw/
+
+.RData

_quarto-ci.yml

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
 execute:
-  enabled: false
+  enabled: true
   freeze: auto
   cache: true

_quarto.yml

Lines changed: 7 additions & 5 deletions

@@ -4,6 +4,8 @@ project:
   render:
     - "*.qmd"
     - "!scripts/"
+    - "!data/"
+    - "workbook.qmd"
 
 execute:
   enabled: false
@@ -51,11 +53,11 @@ website:
       contents:
         - href: chapters/2.TextAnalysis/introduction.qmd
           text: Text Analysis
-        - href: chapters/2.TextAnalysis/common_processes.qmd
-          text: Common Text Analysis Processes
-        - href: chapters/2.TextAnalysis/corpus_analysis.qmd
-          text: Corpus-level Analysis
-        - href: chapters/2.TextAnalysis/frequency_analysis.qmd
+        - href: chapters/2.TextAnalysis/word_frequencies.qmd
+          text: Basic Word Frequencies
+        - href: chapters/2.TextAnalysis/ngrams.qmd
+          text: N-grams and Collocations
+        - href: chapters/2.TextAnalysis/tfidf.qmd
           text: Frequency Analysis
     - section: "Sentiment Analysis"
       contents:

chapters/2.TextAnalysis/common_processes.qmd

Whitespace-only changes.

chapters/2.TextAnalysis/corpus_analysis.qmd

Whitespace-only changes.

chapters/2.TextAnalysis/frequency_analysis.qmd

Whitespace-only changes.
chapters/2.TextAnalysis/introduction.qmd

Lines changed: 22 additions & 2 deletions

@@ -1,3 +1,23 @@
 ---
-title: "Introduction to Text Analysis"
----
+title: "What is Text Analysis?"
+engine: knitr
+format:
+  html:
+    fig-width: 10
+    fig-height: 12
+    dpi: 300
+editor_options:
+  chunk_output_type: inline
+---
+
+Text analysis is an umbrella term for the techniques, methods, and approaches used to extract the meaning, structure, or general characteristics of a text by analyzing its constituent words and symbols and their relationships to a context, period, trend, or intention.
+
+Thanks to the widespread availability of computers and the steady growth of computing power, computational methods for text analysis have become prevalent, allowing researchers to analyze large corpora of texts and to extend these techniques beyond academic research to purposes such as commercial text processing, sentiment analysis, or information retrieval.
+
+Building on these foundations, this episode focuses on the introductory analytical techniques that establish common ground for more complex tasks such as sentiment analysis, language modeling, topic modeling, or text generation.
+
+::: {.callout-note title="NLP"}
+Although Natural Language Processing (NLP) is sometimes used as a synonym for text analysis, text analysis encompasses both computational and non-computational approaches to analyzing text. NLP is primarily concerned with the interaction between computers and human language: it focuses on developing algorithms and models that enable machines to understand, interpret, and generate human language.
+:::
+
+

chapters/2.TextAnalysis/ngrams.qmd

Lines changed: 274 additions & 0 deletions

---
title: "N-grams and Word Sequences"
engine: knitr
format:
  html:
    fig-width: 10
    fig-height: 12
    dpi: 300
editor_options:
  chunk_output_type: inline
---

```{r}
#| include: false
# Setup: load the packages and preprocessed comments used throughout this chapter
# (this chunk also allows the document to render in the CI/CD pipeline)
library(tidyverse)
library(tidytext)

comments <- readr::read_csv("../../data/clean/comments_preprocessed.csv")
```

As you may have noticed, counting words is useful for exploring common terms in a text corpus, but it does not capture the context in which words are used. To gain deeper insight into the relationships between words, we can analyze sequences of words, known as **n-grams**. N-grams are contiguous sequences of *n* items (words) from a given text: a bigram is a sequence of two words, while a trigram is a sequence of three.
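To make this concrete before applying it to the comments data, here is a small illustration on an invented sentence (the sentence and the `toy` object are made up for demonstration):

```{r}
# Toy illustration with an invented sentence (not from the comments data)
toy <- tibble(text = "the severance finale was absolutely stunning")

toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)   # 5 bigrams
toy %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)  # 4 trigrams
```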
## Creating N-grams

Because creating n-grams involves tokenizing text into sequences of words, we can use the `unnest_tokens()` function from the `tidytext` package again, but this time specifying the `token` argument to create n-grams.

```{r}
# Creating bigrams (2-grams) from the comments
ngrams <- comments %>%
  unnest_tokens(ngrams, comments, token = "ngrams", n = 2) # bigrams

ngrams
```

The resulting `ngrams` data frame contains bigrams extracted from the comments. Each row represents a bigram, that is, two consecutive words from the original text.

By changing the value of `n` in the `unnest_tokens()` function, we can create trigrams (3-grams), four-grams, and so on, depending on our analysis needs.

```{r}
# Creating trigrams (3-grams) from the comments
trigrams <- comments %>%
  unnest_tokens(ngrams, comments, token = "ngrams", n = 3) # trigrams

trigrams
```

## Next Word Prediction Using N-grams

One practical application of n-grams is building simple predictive text models. For instance, we can create a function that predicts the next word based on a given word using bigrams.

```{r}
# Function to predict the next word based on a given word using bigrams
next_word <- function(word, ngrams_df) {
  # Split each bigram into its two words and keep those whose first word matches
  matches <- ngrams_df %>%
    separate(ngrams, into = c("w1", "w2"), sep = " ", remove = FALSE) %>%
    filter(w1 == word) %>%
    pull(w2)
  # Return the most frequent second word (if several words tie, all are returned)
  freq <- table(matches)
  nw <- max(freq)
  return(names(freq[freq == nw]))
}
```

This function takes a word and the n-grams data frame as inputs, finds all bigrams whose first word matches the input word, and returns the most frequently occurring second word as the predicted next word.

We can see how the function works with an example:

```{r}
type_any_word <- "ben"

next_word(type_any_word, ngrams)
```

We can even play with a simple loop to see how the prediction evolves:

```{r}
current_word <- "wow"
for (i in 1:5) {
  predicted_word <- next_word(current_word, ngrams)
  cat(current_word, "->", predicted_word, "\n")
  current_word <- predicted_word
}
```

If you have played with this code, you might have noticed that the predictions can lead to repetitive or nonsensical sequences. This is a limitation of simple n-gram models that lack additional context or smoothing techniques. We can check whether using trigrams, which condition on two words of context, improves the predictions:

```{r}
# Function to predict the next word based on a given two-word phrase using trigrams
next_word_trigram <- function(phrase, trigrams_df) {
  words <- unlist(strsplit(phrase, " "))
  if (length(words) != 2) {
    stop("Please provide a two-word phrase.")
  }
  # Split each trigram into its three words and keep those that start with the phrase
  matches <- trigrams_df %>%
    separate(ngrams, into = c("w1", "w2", "w3"), sep = " ", remove = FALSE) %>%
    filter(w1 == words[1], w2 == words[2]) %>%
    pull(w3)
  # Return the most frequent third word
  freq <- table(matches)
  nw <- max(freq)
  return(names(freq[freq == nw]))
}
```

To use this function, provide a two-word phrase, for instance "best show":

```{r}
type_any_phrase <- "best show"
next_word_trigram(type_any_phrase, trigrams)
```
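Even with trigrams, the predictor always returns the most frequent continuation, so repeated runs produce identical and sometimes circular output. As a quick aside, here is a sketch of one simple workaround (not part of the workshop material): sample the next word in proportion to bigram frequency rather than always taking the top candidate.

```{r}
# Hypothetical variant: sample the next word weighted by bigram frequency
# instead of always returning the single most frequent continuation
next_word_sampled <- function(word, ngrams_df) {
  candidates <- ngrams_df %>%
    separate(ngrams, into = c("w1", "w2"), sep = " ", remove = FALSE) %>%
    filter(w1 == word) %>%
    count(w2, sort = TRUE)
  if (nrow(candidates) == 0) return(NA_character_)
  sample(candidates$w2, size = 1, prob = candidates$n)
}

# Each run can give a different continuation; use set.seed() for reproducibility
next_word_sampled("wow", ngrams)
```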
## From N-grams to Collocations

While n-grams capture all consecutive word sequences, not all of them are equally meaningful. **Collocations** are word combinations that occur together more frequently than would be expected by chance. They represent meaningful multi-word expressions like "strong coffee," "make a decision," or, in our data, perhaps "plot twist" or "character development."

The key difference:

- **N-grams**: mechanical extraction of all consecutive words
- **Collocations**: statistically significant word pairs that carry specific meaning

### Identifying Collocations

To find collocations, we need to measure how "associated" two words are. One common metric is **Pointwise Mutual Information (PMI)**, which compares how often words appear together versus how often we would expect them to appear together if they were independent.

::: {.callout-note title="Other Collocation Metrics" collapse="true"}
While we use PMI in this workshop, there are several other statistical measures commonly used to identify collocations:

- **Chi-square (χ²)**: Tests the independence of two words by comparing observed vs. expected frequencies. Higher values indicate stronger association.
- **Log-likelihood ratio (G²)**: Similar to chi-square but more reliable for small sample sizes. Commonly used in corpus linguistics.
- **T-score**: Measures the confidence in the association between two words. Less sensitive to low-frequency pairs than PMI.
- **Dice coefficient**: Measures the overlap between two words' contexts. Values range from 0 to 1.

Each metric has different strengths. PMI favors rare but strongly associated pairs, while the t-score is more conservative and favors frequent collocations. The choice depends on your research goals and corpus characteristics.
:::

First, let's separate our bigrams and count them:

```{r}
library(tidyr)

# Separate bigrams into individual words and count
bigram_counts <- ngrams %>%
  separate(ngrams, into = c("word1", "word2"), sep = " ", remove = FALSE) %>%
  count(word1, word2, sort = TRUE)

head(bigram_counts, 10)
```

Now we'll calculate PMI for each bigram. PMI is defined as:

$$\text{PMI}(w_1, w_2) = \log_2\left(\frac{P(w_1, w_2)}{P(w_1) \times P(w_2)}\right)$$

Where:

- $P(w_1, w_2)$ is the probability of the bigram occurring
- $P(w_1)$ and $P(w_2)$ are the probabilities of each word occurring independently
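To get a feel for the formula before applying it to the whole corpus, here is a toy calculation with made-up counts (the numbers are purely illustrative):

```{r}
# Toy PMI calculation (invented counts, for illustration only):
# suppose word A occurs 50 times, word B 20 times, and the bigram "A B" 10 times,
# out of roughly 10,000 words and 10,000 bigrams
p_ab <- 10 / 10000   # P(w1, w2)
p_a  <- 50 / 10000   # P(w1)
p_b  <- 20 / 10000   # P(w2)

log2(p_ab / (p_a * p_b))   # about 6.64: the pair occurs ~100x more often than chance
```

A PMI of 0 means the pair occurs exactly as often as independence would predict, and negative values mean it occurs less often than expected. With that intuition in place, we can compute PMI for every bigram in the corpus: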
```{r}
library(dplyr)

# Calculate individual word frequencies
word_freqs <- comments %>%
  unnest_tokens(word, comments) %>%
  count(word, name = "word_count")

# Total number of words in the corpus
total_words <- sum(word_freqs$word_count)

# Total number of bigrams
total_bigrams <- sum(bigram_counts$n)

# Calculate PMI
collocations <- bigram_counts %>%
  left_join(word_freqs, by = c("word1" = "word")) %>%
  rename(word1_count = word_count) %>%
  left_join(word_freqs, by = c("word2" = "word")) %>%
  rename(word2_count = word_count) %>%
  mutate(
    # Probability of the bigram
    p_bigram = n / total_bigrams,
    # Probability of each word
    p_word1 = word1_count / total_words,
    p_word2 = word2_count / total_words,
    # PMI calculation
    pmi = log2(p_bigram / (p_word1 * p_word2))
  ) %>%
  arrange(desc(pmi))

head(collocations, 15)
```

High PMI values indicate strong collocations, that is, word pairs that appear together far more often than chance would predict.
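For comparison, the Dice coefficient mentioned in the callout above can be computed directly from the counts we have already joined into `collocations`. This is a sketch for comparison only, not part of the workshop pipeline:

```{r}
# Dice coefficient: 2 * joint count / (count of word1 + count of word2), ranging from 0 to 1
collocations %>%
  mutate(dice = 2 * n / (word1_count + word2_count)) %>%
  select(word1, word2, n, pmi, dice) %>%
  arrange(desc(dice)) %>%
  head(15)
```

Comparing the PMI and Dice rankings is a quick way to see how the choice of association measure changes which word pairs rise to the top.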
### Visualizing Collocations

Let's visualize the strongest collocations to see what meaningful phrases emerge from our Severance comments:

```{r}
library(ggplot2)

# Top 20 collocations by PMI
top_collocations <- collocations %>%
  head(20) %>%
  unite(bigram, word1, word2, sep = " ")

ggplot(top_collocations, aes(x = reorder(bigram, pmi), y = pmi)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Collocations by PMI",
    x = "Bigram",
    y = "Pointwise Mutual Information"
  ) +
  theme_minimal()
```

### Using Collocations for Smarter Prediction

Remember our simple n-gram predictor that sometimes got stuck in loops? We can create a more "intelligent" predictor using collocations instead of raw frequency counts. The idea is simple: instead of picking the most frequent next word, we pick the word with the highest PMI (the strongest association).

```{r}
# Function to predict the next word using collocation strength (PMI)
next_word_collocation <- function(word, collocations_df, min_freq = 2) {
  candidates <- collocations_df %>%
    filter(word1 == word, n >= min_freq, pmi > 0) %>%
    arrange(desc(pmi))

  # Return the word with the highest PMI, or NA if there are no matches
  if (nrow(candidates) > 0) {
    return(candidates$word2[1])
  } else {
    return(NA)
  }
}
```

Let's compare the two approaches side by side:

```{r}
# Compare frequency-based vs. collocation-based prediction
test_word <- "mark"

freq_prediction <- next_word(test_word, ngrams)
colloc_prediction <- next_word_collocation(test_word, collocations)

cat("Frequency-based predictor:", test_word, "->", freq_prediction, "\n")
cat("Collocation-based predictor:", test_word, "->", colloc_prediction, "\n")
```

Now let's run both predictors in a loop and see which produces more meaningful sequences:

```{r}
# Frequency-based prediction
current_word <- "wow"
for (i in 1:10) {
  predicted_word <- next_word(current_word, ngrams)
  cat(current_word, "->", predicted_word, "\n")
  current_word <- predicted_word
}

# Collocation-based prediction
current_word <- "wow"
for (i in 1:10) {
  predicted_word <- next_word_collocation(current_word, collocations)
  if (is.na(predicted_word)) {
    cat(current_word, "-> (no strong collocation found)\n")
    break
  }
  cat(current_word, "->", predicted_word, "\n")
  current_word <- predicted_word
}
```

Both approaches share the same structure: each looks up a likely next word given the current one. The collocation-based predictor, however, leverages statistical associations between words, which can lead to more contextually relevant predictions. This is an example of how different text analysis techniques can produce different results from the same data, depending on the method used.
