Commit a980f6c

added code and steps

1 parent 95b8dc2 commit a980f6c
File tree

4 files changed, +217 -14 lines changed

chapters/3.SentimentAnalysis/emotion.qmd

Lines changed: 148 additions & 3 deletions
@@ -11,7 +11,7 @@ The `syuzhet` package implements the [National Research Council Canada (NRC) Emo
This framework uses eight categories of emotions based on Robert Plutchik's theory of the emotional wheel, a foundational model that illustrates the relationships between human emotions from a psychological perspective. Plutchik’s wheel identifies eight primary emotions: anger, disgust, sadness, surprise, fear, trust, joy, and anticipation. As illustrated in Figure ? below, these emotions are organized into four pairs of opposites on the wheel. Emotions positioned diagonally across from each other represent opposites, while adjacent emotions share similarities, reflecting a positive correlation.

![Plutchik’s wheel of emotions. Image from: Zeng, X., Chen, Q., Chen, S., & Zuo, J. (2021). Emotion label enhancement via emotion wheel and lexicon. *Mathematical Problems in Engineering*, *2021*(1), 6695913. <https://doi.org/10.1155/2021/6695913>](images/emotion_wheel.jpg){fig-align="center" width="376"}
The NRC Emotion Lexicon was developed as part of research into affective computing and sentiment analysis using a combination of manual annotation and crowdsourcing. Human annotators evaluated thousands of words, indicating which emotions were commonly associated with each word. This method ensured that the lexicon captured human-perceived emotional associations, rather than relying solely on statistical co-occurrences in text.

@@ -22,9 +22,154 @@ You may use NRC's lexicon Tableau dashboard to explore words associated with

```{=html}
<iframe width="780" height="500" src="https://public.tableau.com/views/NRC-Emotion-Lexicon-viz1/NRCEmotionLexicon-viz1?:embed=y&:loadOrderID=0&:display_count=no&:showVizHome=no" title="NRC Lexicon Interactive Visualization"></iframe>
```

Now that we have a better understanding of this package, let's get back to business and perform emotion detection on our data.

#### Emotion Detection with Syuzhet's NRC Lexicon

##### Break Text into Sentences

``` r
sentences <- get_sentences(comments$comments)
```

The `get_sentences()` function splits your text into individual sentences.

This allows us to analyze emotions at a finer level — rather than by entire comments, we examine each sentence separately. For example, the comment “I love the show. The ending made me sad.” becomes two sentences.
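
As a quick illustration, we can run `get_sentences()` on that example comment directly (a standalone check, not part of the pipeline):

``` r
library(syuzhet)

# Split one example comment into its component sentences
example <- get_sentences("I love the show. The ending made me sad.")
length(example)  # the comment yields two sentences
```
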

##### Compute Emotion Scores per Sentence

``` r
emotion_score <- get_nrc_sentiment(sentences)
```

The `get_nrc_sentiment()` function assigns emotion and sentiment scores (based on the NRC lexicon) to each sentence. Each sentence receives a count for each of the eight emotions: the number of its words associated with that emotion (0 when the emotion is absent). The output also includes positive and negative sentiment scores.
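
To see the shape of this output, here is a quick standalone check on a single made-up sentence; the columns are one per NRC emotion plus the two sentiment columns:

``` r
library(syuzhet)

# One row per input sentence; eight emotion columns plus negative/positive
demo <- get_nrc_sentiment("I love this wonderful, exciting show")
colnames(demo)
```
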

##### Review Summary of Emotion Scores

Let's now compute basic statistics (min, max, mean, etc.) for each emotion column to get an overview of how frequent each emotion is in our example dataset.

``` r
summary(emotion_score)
```

##### Rejoin Sentences

After sentence-level analysis, we want to link each emotion score back to its **original comment or ID**.

``` r
# Note: this one-to-one assignment assumes each comment splits into exactly
# one sentence; if a comment contains several sentences, the two objects
# will have different lengths and the assignment will fail
comments$comments <- sentences
emotion_data <- bind_cols(comments, emotion_score)
```

`bind_cols()` merges the original `comments` data frame with the new `emotion_score` table.
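
If your comments can contain more than one sentence, the one-to-one assignment above will not line up. One way to keep each sentence tied to its source id is to split comment by comment; here is a sketch, using a tiny made-up stand-in for `comments`:

``` r
library(syuzhet)

# Tiny stand-in for the `comments` data frame (hypothetical ids and text)
comments <- data.frame(
  id = c("s1_001", "s1_002"),
  comments = c("I love the show. The ending made me sad.", "Great pacing."),
  stringsAsFactors = FALSE
)

# Split each comment separately so every sentence keeps its source id
sentence_list <- lapply(comments$comments, get_sentences)
sentence_df <- data.frame(
  id = rep(comments$id, lengths(sentence_list)),
  sentence = unlist(sentence_list),
  stringsAsFactors = FALSE
)
```

With `sentence_df` in hand, the output of `get_nrc_sentiment(sentence_df$sentence)` can be merged back with `bind_cols()` just as above, and the `id` column preserves the link to each original comment.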

##### Summarize Emotion Counts Across All Sentences

Now, let's count **how many times each emotion appears** overall.

``` r
emotion_summary <- emotion_data %>%
  select(anger:trust) %>%                  # keep only the emotion columns
  summarise(across(everything(), sum)) %>% # sum counts per emotion
  pivot_longer(
    cols = everything(),
    names_to = "emotion",
    values_to = "count"
  ) %>%                                    # long format for easy plotting
  arrange(desc(count))                     # sort emotions, most frequent first
```

##### Plot the Overall Emotion Distribution

``` r
ggplot(emotion_summary, aes(x = emotion, y = count, fill = emotion)) +
  geom_col(show.legend = FALSE) +                         # bar plot of emotion counts
  geom_text(aes(label = count), hjust = -0.2, size = 2) + # add count labels
  scale_fill_manual(values = brewer.pal(8, "Paired")) +   # one color per emotion
  theme_minimal(base_size = 12) +                         # clean theme
  labs(title = "Overall Emotion Distribution",
       x = "Emotion", y = "Total Count") +                # titles and axis labels
  coord_flip()                                            # flip axes for readability
```

##### Add a “Season” Variable (Grouping) and Summarize

Let's now add a new column called `season` by looking at the ID pattern — for example, `s1_` means season 1 and `s2_` means season 2. This makes it easy to compare the emotional tone across seasons.

``` r
emotion_seasons <- emotion_data %>%
  mutate(season = ifelse(grepl("^s1_", id), "s1",
                  ifelse(grepl("^s2_", id), "s2", NA)))
```

Time to aggregate the total count of each emotion within each season.

``` r
emotion_by_season <- emotion_seasons %>%
  group_by(season) %>%
  summarise(across(anger:positive, \(x) sum(x, na.rm = TRUE)))
```

##### Compare Emotions by Season (Visualization)

``` r
emotion_long <- emotion_by_season %>%
  pivot_longer(cols = anger:positive, names_to = "emotion", values_to = "count")

ggplot(emotion_long, aes(x = reorder(emotion, -count), y = count, fill = season)) +
  geom_col(position = "dodge") +                   # separate bars for each season
  geom_text(aes(label = count),
            position = position_dodge(width = 0.9), # align labels with dodged bars
            hjust = -0.2, size = 2) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 12) +
  labs(title = "Emotion Distribution by Season",
       x = "Emotion", y = "Total Count", fill = "Season") +
  coord_flip()
```

##### Emotion Co-occurrence

Now, let's explore which emotions tend to occur together, revealing patterns of emotional co-occurrence in the text.

``` r
# Select only emotion columns (excluding overall positive/negative sentiment)
emotion_matrix <- emotion_data %>%
  select(anger:trust)

# Compute the correlation matrix for emotions
# Pearson correlation shows how strongly two emotions co-occur
co_occurrence <- cor(emotion_matrix, method = "pearson")

# Remove diagonal values to avoid coloring the perfect self-correlation
diag(co_occurrence) <- NA

# Convert the correlation matrix to long format for ggplot
co_occurrence_long <- as.data.frame(as.table(co_occurrence))
colnames(co_occurrence_long) <- c("emotion1", "emotion2", "correlation")

# Plot the co-occurrence heatmap
ggplot(co_occurrence_long, aes(x = emotion1, y = emotion2, fill = correlation)) +
  geom_tile(color = "white") +  # draw grid tiles
  scale_fill_gradient2(
    mid = "white", high = "red", midpoint = 0,
    # note: correlations below 0 fall outside these limits and render as grey
    limits = c(0, 1), na.value = "grey95", name = "Correlation"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # rotate x-axis labels
  labs(
    title = "Emotion Co-occurrence Heatmap",
    x = "Emotion",
    y = "Emotion"
  )
```

##### Saving Our Work

After performing all the calculations and visualizations, it’s important to save the results so they can be reused or shared.

``` r
write_csv(emotion_data, "output/sentiment_emotion_results.csv")
```

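
`write_csv()` assumes the `output/` folder already exists; if it might not, a small guard avoids an error (this assumes the working directory is the project root):

``` r
# Create the output folder if it is missing before saving results
if (!dir.exists("output")) dir.create("output")
```
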
#### Final Thoughts

You might be wondering: if the **`syuzhet`** package also computes polarity, why did we choose **`sentimentr`** in our pipeline? The reason is that syuzhet does not inherently account for valence shifters. In the original syuzhet implementation, words are scored in isolation—so “good” = +1, “bad” = −1—regardless of nearby negations or intensifiers. For example, “not good” would still be counted as +1. Because **`sentimentr`** adjusts sentiment scores for negators and amplifiers, polarity results are more nuanced, robust, and reliable.

chapters/3.SentimentAnalysis/introduction.qmd

Lines changed: 18 additions & 5 deletions
@@ -4,7 +4,7 @@ title: "Introduction to Sentiment Analysis"

Now that we have completed all the key preprocessing steps and our example dataset is in much better shape, we can finally proceed with sentiment analysis.

![Image from Canva](images/sentiment.png){fig-align="center" width="500"}

## What is Sentiment Analysis?

@@ -23,12 +23,25 @@ Our analysis pipeline will follow a two-step approach. First, we will compute ba

Let’s start by installing and loading the necessary packages, then bringing in the cleaned dataset so we can begin our sentiment analysis. We will discuss the role of each package in the next episodes.

``` r
# Install packages (uncomment any you have not installed yet)
install.packages("sentimentr")
install.packages("syuzhet")
# install.packages("dplyr")
# install.packages("tidyr")
# install.packages("readr")
# install.packages("ggplot2")
# install.packages("RColorBrewer")
# install.packages("stringr")

# Load all packages
library(sentimentr)
library(syuzhet)
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(RColorBrewer)
library(stringr)

# Load Data
comments <- readr::read_csv("../data/clean/comments_preprocessed.csv")
```

chapters/3.SentimentAnalysis/polarity.qmd

Lines changed: 51 additions & 6 deletions
@@ -23,28 +23,73 @@ Words like “but,” “however,” and “although” also influence the senti

With this approach, we can explore more confidently whether the show’s viewers felt positive, neutral, or negative about it.

#### Computing Polarity with sentimentr (Valence Shifters Capability)

##### Calculating sentiment scores

``` r
sentiment_scores <- sentiment_by(comments$comments)
```

Here we’re using the **`sentiment_by()`** function, which looks at each comment and calculates a **sentiment score** representing how positive or negative the language is.

So after running this, we get a new object called `sentiment_scores` with the average sentiment for every comment.
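
To see the valence-shifter handling for ourselves, here is a toy comparison on two made-up sentences (not from our dataset):

``` r
library(sentimentr)

# sentimentr flips polarity when a positive word is negated,
# so "not good" should score below zero while "good" scores above it
good     <- sentiment_by("This show is good.")$ave_sentiment
not_good <- sentiment_by("This show is not good.")$ave_sentiment
c(good = good, not_good = not_good)
```
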
##### Adding those scores back to our dataset

``` r
polarity <- comments %>%
  mutate(score = sentiment_scores$ave_sentiment,
         sentiment_label = case_when(
           score > 0.1  ~ "positive",
           score < -0.1 ~ "negative",
           TRUE         ~ "neutral"
         ))
```

Now we’re using the **`dplyr`** package to make our dataset more informative. We take our `comments` dataset, and with **`mutate()`**, we add two new columns: `score` and `sentiment_label`. The little rule inside **`case_when()`** decides what label to give. The small buffer around zero (±0.1) helps us avoid overreacting to tiny fluctuations.
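
The labeling rule is easy to sanity-check on a few made-up scores (hypothetical values, not from our data):

``` r
library(dplyr)

# Scores chosen to land on each side of the +/-0.1 buffer
toy <- tibble(score = c(0.45, 0.05, -0.02, -0.30))
toy <- toy %>%
  mutate(sentiment_label = case_when(
    score > 0.1  ~ "positive",
    score < -0.1 ~ "negative",
    TRUE         ~ "neutral"
  ))
toy$sentiment_label  # "positive" "neutral" "neutral" "negative"
```
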

Let's now take a look at the `sentiment_scores` data frame:

<add output>

To get a sense of the overall mood of our dataset, let's run:

``` r
table(polarity$sentiment_label)
```

#### Plotting Scores

Next, let's plot some results and histograms to check the distribution per season:

``` r
# Visualize overall distribution
ggplot(polarity, aes(x = score)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", color = "white") +
  theme_minimal() +
  labs(title = "Sentiment Score Distribution", x = "Average Sentiment", y = "Count")

# Extract season info (s1, s2) into a new column
polarity_seasons <- mutate(polarity,
                           season = str_extract(id, "s\\d+"))

# Histogram comparison by season
ggplot(polarity_seasons, aes(x = score, fill = season)) +
  geom_histogram(binwidth = 0.1, position = "dodge", color = "white") +
  theme_minimal() +
  labs(title = "Sentiment Score Distribution by Season",
       x = "Average Sentiment", y = "Count") +
  scale_fill_brewer(palette = "Set1")
```

#### Saving Things

``` r
# Save results
write_csv(polarity, "output/polarity_results.csv")
```

We could have spent more time refining these plots, but this is sufficient for our initial exploration. In pairs, review the plots and discuss what they reveal about viewers’ perceptions of the *Severance* show.

Well, that’s only part of the story. Now we move on to emotion detection to discover what else we can learn from the data.
