
Commit 09e89e3

fix rendering issues
1 parent 74b45bb commit 09e89e3

3 files changed: +204, -164 lines changed

Lines changed: 182 additions & 6 deletions
@@ -1,5 +1,5 @@
---
-title: "What is Text Analysis?"
title: "Basic Text Analysis"
engine: knitr
format:
  html:

@@ -10,14 +10,190 @@ editor_options:
  chunk_output_type: inline
---

-Text analysis is an umbrella concept that involves multiple techniques, methods, and approaches for "extracting" the meaning, structure, or general characteristics of a text by analyzing its constitutive words and symbols, and their relationships with a context, epoch, trend, intention, etc.

-Thanks to the massification of computers and the miniaturization of computer power, computational methods for text analysis have become prevalent in certain contexts, allowing researchers to analyze large corpora of texts and also extrapolate those concepts for purposes beyond academic research, such as commercial text processing, sentiment analysis, or information retrieval.

-Building on these foundations, this episode focuses on the introductory analytical techniques that establish common ground for more complex tasks such as sentiment analysis, language modeling, topic modeling, or text generation.

-::: {.callout-note title="NLP"}
-Although Natural Language Processing (NLP) is sometimes used as a synonym for text analysis, Text Analysis encompasses both computational and non-computational approaches to analyzing text. NLP is primarily concerned with the interaction between computers and human language. It focuses on developing algorithms and models that enable machines to understand, interpret, and generate human language.
-:::

In this chapter, we will explore some basic techniques for analyzing text data.

## Importing the Data

```{r}
#| output: false
# Load necessary libraries
library(tidyverse)
library(tidytext)
```

```{r}
#| output: false
# Load the text data
comments <- readr::read_csv("../../data/clean/comments_preprocessed.csv") # Adjust the path to your data location
```
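
If the document is run from a different working directory, the relative path above can break. One option — a sketch that assumes the `here` package is installed and that the project keeps the same `data/clean/` layout — is to build the path from the project root:

```{r}
#| eval: false
# Alternative: resolve the path from the project root with the here package
library(here)
comments <- readr::read_csv(here("data", "clean", "comments_preprocessed.csv"))
```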

Explore the first few rows of the dataset to understand its structure.

```{r}
head(comments)
```

We can see that the dataset is imported as a tibble with three columns: `...1`, `id`, and `comments`. We are going to focus on the `comments` column for our text analysis, but the `id` column can be useful for grouping or filtering the data by season.
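
For example, if the `id` values encode the season (they start with "s1" or "s2", as later code in this chapter assumes), we can quickly check how many comments each season contributes — a small sketch:

```{r}
#| eval: false
# Quick check: how many comments come from each season, based on the id prefix
comments %>%
  mutate(season = substr(id, 1, 2)) %>%
  count(season)
```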

## Word Frequency Analysis

Word frequency analysis is one of the most fundamental techniques in text analysis. It helps us understand which words appear most often in our texts and can reveal important patterns about content, style, and themes.

### Word Counts

We can start by calculating the frequency of each word in our corpus. This involves tokenizing the text into individual words, counting the occurrences of each word, and then sorting them by frequency.

**Tokenization** is the process of breaking down text into smaller units, such as words or phrases. In this case, we will use the `unnest_tokens()` function from the `tidytext` package to tokenize our comments into words.

::: {.callout-note title="Why not use strsplit()?" collapse="true"}
While the `strsplit()` function can be used for basic tokenization, it lacks the conveniences provided by `unnest_tokens()`, which strips punctuation, converts text to lowercase, and returns a tidy one-token-per-row tibble that works directly with `dplyr` verbs (stop words can then be removed with a simple `anti_join()`). Using `unnest_tokens()` makes tokenization more consistent and less error-prone, especially for larger datasets.
:::
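
For comparison, a rough base-R version might look like the sketch below (lowercasing, stripping punctuation, and splitting on whitespace by hand); it is shown only to illustrate what `unnest_tokens()` takes care of for us:

```{r}
#| eval: false
# A hand-rolled base-R tokenizer (for illustration only)
raw_text <- tolower(gsub("[[:punct:]]", " ", comments$comments))
raw_tokens <- unlist(strsplit(raw_text, "\\s+"))
raw_tokens <- raw_tokens[raw_tokens != ""] # drop empty strings left by the split
head(raw_tokens)
```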

```{r}
# Tokenizing the comments into words
tokens <- comments %>%
  unnest_tokens(word, comments)

head(tokens)
```

Note that the resulting `tokens` tibble contains a column named `word`, which holds the individual words extracted from each comment.

With this tokenized data, we can now count words. For instance, we can simply count the occurrences of each word:

```{r}
# Counting word frequencies
word_counts <- tokens %>%
  count(word, sort = TRUE)

head(word_counts)
```

This will give us a list of words along with their corresponding frequencies, sorted in descending order. We can also visualize the most common words using a bar plot or a word cloud.

```{r}
# Visualizing the top 20 most common words
top_words <- word_counts %>%
  top_n(20)

ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 20 Most Common Words", x = "Words", y = "Frequency")
```
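
With twenty words along the x-axis, the labels can overlap; an optional variation (shown here only as a sketch) flips the bars so the words read horizontally:

```{r}
#| eval: false
# Optional: horizontal bars are easier to read with many categories
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Most Common Words", x = "Words", y = "Frequency")
```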

We can also create a word cloud to visualize word frequencies in a more engaging way.

```{r}
#| fig-height: 6
# Creating a word cloud
library(ggwordcloud)

ggplot(word_counts %>% top_n(100), aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  labs(title = "Word Cloud")
```

As expected, even in a preprocessed corpus, some words become dominant due to their frequent usage. In this case, "severance", "season", and "finale" pop up as the most frequent words. To get a more meaningful analysis, we can filter out these common words.

```{r}
# Filtering out common words for a more meaningful word cloud
# Note: unnest_tokens() lowercases tokens, so the filter terms are lowercase too
common_words <- c("severance", "season", "appletv", "apple", "tv", "show", "finale", "episode") # you can expand this list as needed

filtered_word_counts <- word_counts %>%
  filter(!word %in% common_words)

# Creating a filtered word cloud
ggplot(filtered_word_counts %>% top_n(100), aes(label = word, size = n)) +
  geom_text_wordcloud() +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  labs(title = "Filtered Word Cloud")
```

Now we have a more distributed word cloud that highlights other significant words in the corpus.
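
If very common function words still dominate, generic English stop words can be filtered out as well. A minimal sketch using the `stop_words` table bundled with `tidytext` (an optional step, separate from the workflow above):

```{r}
#| eval: false
# Remove standard English stop words in addition to the custom list
word_counts_no_stop <- word_counts %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% common_words)

head(word_counts_no_stop)
```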

### Words by Season

A simple but effective way to analyze text data is to compare word frequencies across different categories or groups. In this case, we can compare the word frequencies between different seasons of the show.

```{r}
#| fig-height: 5
# Filtering and creating word clouds by season

season_1_tokens <- tokens %>%
  filter(grepl("^s1", id)) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% common_words)

season_2_tokens <- tokens %>%
  filter(grepl("^s2", id)) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% common_words)

library(patchwork)

# Plot only the 20 most frequent words per season
p1 <- ggplot(season_1_tokens %>% top_n(20, n), aes(label = word, size = n)) +
  geom_text_wordcloud(color = "darkblue") +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  labs(title = "Season 1")

p2 <- ggplot(season_2_tokens %>% top_n(20, n), aes(label = word, size = n)) +
  geom_text_wordcloud(color = "darkred") +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  labs(title = "Season 2")

p1 + p2
```
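
Word clouds give a quick visual impression; for a more quantitative comparison, we can also look at each word's relative frequency within its season. A short sketch, assuming (as above) that `id` starts with the season prefix:

```{r}
#| eval: false
# Relative word frequencies per season
season_freqs <- tokens %>%
  filter(!word %in% common_words) %>%
  mutate(season = substr(id, 1, 2)) %>%
  count(season, word) %>%
  group_by(season) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Ten most frequent words in each season, by proportion
season_freqs %>%
  group_by(season) %>%
  slice_max(prop, n = 10)
```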

We can even select the 50 most frequent words per season and extract those that are unique to each season.

```{r}
#| fig-height: 5
# Finding unique words per season
top_50_s1 <- season_1_tokens %>%
  top_n(50, n) %>%
  pull(word)
top_50_s2 <- season_2_tokens %>%
  top_n(50, n) %>%
  pull(word)

unique_s1 <- setdiff(top_50_s1, top_50_s2)
unique_s2 <- setdiff(top_50_s2, top_50_s1)

unique_s1_tokens <- season_1_tokens %>%
  filter(word %in% unique_s1)
unique_s2_tokens <- season_2_tokens %>%
  filter(word %in% unique_s2)

# Displaying unique word clouds for each season
p3 <- ggplot(unique_s1_tokens, aes(label = word, size = n)) +
  geom_text_wordcloud(color = "lightblue") +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  labs(title = "Unique Words - Season 1")

p4 <- ggplot(unique_s2_tokens, aes(label = word, size = n)) +
  geom_text_wordcloud(color = "lightcoral") +
  scale_size_area(max_size = 20) +
  theme_minimal() +
  coord_fixed(ratio = 1) +
  labs(title = "Unique Words - Season 2")

p3 + p4
```

This analysis allows us to see which words are more prominent in each season, providing insight into the themes and topics that stand out in each one.

With these basic text analysis techniques, we can start to uncover patterns and insights from our text data. Although simple, these methods help us explore the content and structure of the text, set the stage for more advanced analyses, and even give feedback on the quality of the pre-processing steps applied to the data.

Lines changed: 0 additions & 156 deletions
@@ -1,156 +0,0 @@
---
title: "Basic Text Analysis"
engine: knitr
format:
  html:
    fig-width: 10
    fig-height: 12
    dpi: 300
editor_options:
  chunk_output_type: inline
---

In this chapter, we will explore some basic techniques for analyzing text data.

## Importing the Data

```{r}
# Load necessary libraries
library(tidyverse)
library(tidytext)
```

```{r}
# Load the text data
comments <- readr::read_csv("data/clean/comments_preprocessed.csv") # Adjust the path to your data location
```

Explore the first few rows of the dataset to understand its structure.

```{r}
head(comments)
```

We can see that the dataset is imported as a tibble with three columns: `...1`, `id`, and `comments`. We are going to focus on the `comments` column for our text analysis, but the `id` column can be useful for grouping or filtering the data by season.

## Word Frequency Analysis

Word frequency analysis is one of the most fundamental techniques in text analysis. It helps us understand which words appear most often in our texts and can reveal important patterns about content, style, and themes.

### Word Counts

We can start by calculating the frequency of each word in our corpus. This involves tokenizing the text into individual words, counting the occurrences of each word, and then sorting them by frequency.

**Tokenization** is the process of breaking down text into smaller units, such as words or phrases. In this case, we will use the `unnest_tokens()` function from the `tidytext` package to tokenize our comments into words.

::: {.callout-note title="Why not use strsplit()?" collapse="true"}
While the `strsplit()` function can be used for basic tokenization, it lacks the conveniences provided by `unnest_tokens()`, which strips punctuation, converts text to lowercase, and returns a tidy one-token-per-row tibble that works directly with `dplyr` verbs (stop words can then be removed with a simple `anti_join()`). Using `unnest_tokens()` makes tokenization more consistent and less error-prone, especially for larger datasets.
:::

```{r}
# Tokenizing the comments into words
tokens <- comments %>%
  unnest_tokens(word, comments)

head(tokens)
```

Note that the resulting `tokens` tibble contains a column named `word`, which holds the individual words extracted from each comment.

With this tokenized data, we can now count words. For instance, we can simply count the occurrences of each word:

```{r}
# Counting word frequencies
word_counts <- tokens %>%
  count(word, sort = TRUE)

head(word_counts)
```

This will give us a list of words along with their corresponding frequencies, sorted in descending order. We can also visualize the most common words using a bar plot or a word cloud.

```{r}
# Visualizing the top 20 most common words
top_words <- word_counts %>%
  top_n(20)

ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 20 Most Common Words", x = "Words", y = "Frequency")
```

We can also create a word cloud to visualize word frequencies in a more engaging way.

```{r}
# Creating a word cloud
library(wordcloud2)
wordcloud2(data = word_counts, size = 1)
```

As expected, even in a preprocessed corpus, some words become dominant due to their frequent usage. In this case, "severance", "season", and "finale" pop up as the most frequent words. To get a more meaningful analysis, we can filter out these common words.

```{r}
# Filtering out common words for a more meaningful word cloud
common_words <- c("severance", "season", "appleTV", "apple", "tv", "show", "finale", "episode") # you can expand this list as needed

filtered_word_counts <- word_counts %>%
  filter(!word %in% common_words)

# Creating a filtered word cloud
wordcloud2(data = filtered_word_counts, size = 1)
```

Now we have a more distributed word cloud that highlights other significant words in the corpus.

### Words by Season

A simple but effective way to analyze text data is to compare word frequencies across different categories or groups. In this case, we can compare the word frequencies between different seasons of the show.

```{r}
# Filtering and creating word clouds by season

season_1_tokens <- tokens %>%
  filter(grepl("^s1", id)) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% common_words) %>%
  top_n(20)

season_2_tokens <- tokens %>%
  filter(grepl("^s2", id)) %>%
  count(word, sort = TRUE) %>%
  filter(!word %in% common_words) %>%
  top_n(20)

# Displaying word clouds for each season
wordcloud2(data = season_1_tokens, size = 1, color = "random-light", backgroundColor = "black")
wordcloud2(data = season_2_tokens, size = 1, color = "random-light", backgroundColor = "black")
```

We can even select the 50 most frequent words per season and extract those that are unique to each season.

```{r}
# Finding unique words per season
top_50_s1 <- season_1_tokens %>%
  top_n(50) %>%
  pull(word)
top_50_s2 <- season_2_tokens %>%
  top_n(50) %>%
  pull(word)

unique_s1 <- setdiff(top_50_s1, top_50_s2)
unique_s2 <- setdiff(top_50_s2, top_50_s1)

unique_s1_tokens <- season_1_tokens %>%
  filter(word %in% unique_s1)
unique_s2_tokens <- season_2_tokens %>%
  filter(word %in% unique_s2)

# Displaying unique word clouds for each season
wordcloud2(data = unique_s1_tokens, size = 1, color = "random-light", backgroundColor = "black")
wordcloud2(data = unique_s2_tokens, size = 1, color = "random-light", backgroundColor = "black")
```

This analysis allows us to see which words are more prominent in each season, providing insight into the themes and topics that stand out in each one.

With these basic text analysis techniques, we can start to uncover patterns and insights from our text data. Although simple, these methods help us explore the content and structure of the text, set the stage for more advanced analyses, and even give feedback on the quality of the pre-processing steps applied to the data.

Lines changed: 22 additions & 2 deletions
@@ -1,3 +1,23 @@
---
-title: "Introduction to Text Analysis"
title: "What is Text Analysis?"
engine: knitr
format:
  html:
    fig-width: 10
    fig-height: 12
    dpi: 300
editor_options:
  chunk_output_type: inline
---

Text analysis is an umbrella term for the many techniques, methods, and approaches used to "extract" the meaning, structure, or general characteristics of a text by analyzing its constituent words and symbols and their relationship to a context, epoch, trend, or intention.

Thanks to the widespread availability of computers and the steady growth of computing power, computational methods for text analysis have become prevalent, allowing researchers to analyze large corpora of texts and to extend these techniques beyond academic research to areas such as commercial text processing, sentiment analysis, and information retrieval.

Building on these foundations, this episode focuses on the introductory analytical techniques that establish common ground for more complex tasks such as sentiment analysis, language modeling, topic modeling, or text generation.

::: {.callout-note title="NLP"}
Although Natural Language Processing (NLP) is sometimes used as a synonym for text analysis, text analysis encompasses both computational and non-computational approaches to analyzing text. NLP is primarily concerned with the interaction between computers and human language: it focuses on developing algorithms and models that enable machines to understand, interpret, and generate human language.
:::