
Commit a0ad0ba

Merge pull request #7 from UCSB-Library-Research-Data-Services/renata
Renata
2 parents 7d24b82 + 8916f9c commit a0ad0ba

File tree

7 files changed: +57 −10 lines changed

_quarto.yml

Lines changed: 4 additions & 4 deletions
@@ -42,10 +42,10 @@ website:
           text: Lemmatization
         - href: chapters/1.Preprocessing/06_conclusion.qmd
           text: Conclusion
-        - href: chapters/2.TextAnalysis/introduction.qmd
-          text: Text Analysis
-        - href: chapters/3.SentimentAnalysis/introduction.qmd
-          text: Sentiment Analysis
+        #- href: chapters/2.TextAnalysis/introduction.qmd
+        #text: Text Analysis
+        #- href: chapters/3.SentimentAnalysis/introduction.qmd
+        #text: Sentiment Analysis
     - about.qmd

   page-footer:

chapters/1.Preprocessing/01_introduction.qmd

Lines changed: 24 additions & 1 deletion
@@ -28,4 +28,27 @@ Key text preprocessing steps include normalization (noise reduction), stop words

 Source: Data Literacy Series <https://perma.cc/L8U5-ZEXD>

-In the next chapters, we'll dive deeper into this pipeline to prepare the data further analysis.
+In the next chapters, we'll dive deeper into this pipeline to prepare the data for further analysis. But first, let's take a quick look at the data so we can get a better grasp of the challenge at hand.
+
+## Getting Things Started
+
+The data we pulled for this exercise comes from real social media posts, meaning it is inherently messy, and we know that even before going in. Because it is derived from natural language, this kind of data is unstructured and often filled with inconsistencies and irregularities.
+
+Before we can apply any meaningful analysis or modeling, it’s crucial to visually inspect the data to get a sense of what we’re working with. Eyeballing the raw text helps us identify common patterns, potential noise, and areas that will require careful preprocessing to ensure the downstream tasks are effective and reliable.
+
+Time to launch RStudio and our example!
+
+Open `worksheet.qmd`. Let's install the required packages (via the console) and load them (run the code chunk). Next, let's read in the `comments.csv` file and take a quick look at it! (FIXME: RENV AND PROJECT FOLDER?)
+
+``` r
+# Inspecting the data
+library(readr)  # provides read_csv()
+
+comments <- read_csv("comments.csv")
+head(comments$text)
+```
+
+::: {.callout-note icon="false"}
+# 💬 Discussion
+
+Working in pairs or trios, look briefly at the data and discuss the challenges that may arise when attempting to analyze this dataset in its current form. What are potential areas of friction that could compromise the results?
+:::
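For a slightly broader first look than `head()`, here is a minimal sketch of extra checks one could run in the same chunk, assuming `comments.csv` has a `text` column as above (the dplyr usage is an assumption about which packages the worksheet loads):

``` r
library(readr)  # read_csv()
library(dplyr)  # glimpse()

comments <- read_csv("comments.csv")

# Column names, types, and a preview of values in each column
glimpse(comments)

# How many posts, and how many have missing or empty text?
nrow(comments)
sum(is.na(comments$text) | comments$text == "")
```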

chapters/1.Preprocessing/02_normalization.qmd

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ Just as a gardener would prune dead branches, enrich the soil, and care for the
 As we've seen, the main goal of normalization is to remove irrelevant punctuation and content, and to standardize the data in order to reduce noise. Below are some key actions we’ll be performing during this workshop:

 | Action | Why it matters? |
-|-------------|-----------------------------------------------------------|
+|------------------------------|--------------------------------------------------------------------------------------------------------|
 | Remove URLs | URLs often contain irrelevant noise and don't contribute meaningful content for analysis. |
 | Remove Punctuation & Symbols | Punctuation marks and other symbols, including those used extensively in social media for mentioning (\@) or tagging (#), rarely add value in most NLP tasks and can interfere with tokenization (as we will cover in a bit) or word matching. |
 | Remove Numbers | Numbers are noise in most contexts and, unless specifically relevant (e.g., in financial or medical texts), don't contribute much to the analysis. In NLP tasks where they are considered important, they can be replaced with dummy tokens (e.g., \<NUMBER\>) or converted into their written form (e.g., 100 becomes one hundred). |
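The actions in the table above map naturally onto a few calls from the stringr package. A minimal sketch, assuming a character vector `text` of raw posts (the example string is invented):

``` r
library(stringr)

text <- "Loved the finale!!! https://example.com/ep10 #Severance @fan123 10/10"

text |>
  str_remove_all("https?://\\S+") |>      # remove URLs
  str_remove_all("[0-9]+") |>             # remove numbers (when not needed)
  str_replace_all("[[:punct:]]", " ") |>  # remove punctuation and symbols (@, #, !)
  str_to_lower() |>                       # standardize case
  str_squish()                            # collapse leftover whitespace
#> [1] "loved the finale severance fan"
```

Note that the order of these steps matters: stripping punctuation before removing URLs, for instance, would break the URL pattern.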
@@ -53,7 +53,7 @@ Another important step is to properly handle contractions. In everyday language,
 So, while it may seem like a small step, it often leads to cleaner data, leaner models, and more accurate results. First, however, we need to ensure that apostrophes are handled correctly. It's not uncommon to encounter messy text where nonstandard characters are used in place of the straight apostrophe ('). Such inconsistencies can disrupt contraction expansion.

 | Character | Unicode | Notes |
-|-------------|-------------|----------------------------------------------|
+|-----------|---------|---------------------------------------------------------|
 | `'` | U+0027 | Standard straight apostrophe, used in most dictionaries |
 | `’` | U+2019 | Right single quotation mark (curly apostrophe) |
 | `‘` | U+2018 | Left single quotation mark |
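Standardizing those characters first makes contraction expansion predictable. A sketch assuming a character vector `text`; the `replace_contraction()` call assumes the textclean package is installed:

``` r
library(stringr)

text <- c("I can\u2019t wait for Season 3", "It\u2019s a masterpiece")

# Map curly single quotes (U+2018, U+2019) to the straight apostrophe (U+0027)
text <- str_replace_all(text, "[\u2018\u2019]", "'")

# Expand contractions, e.g., "can't" -> "cannot", "it's" -> "it is"
textclean::replace_contraction(text)
```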

chapters/1.Preprocessing/03_tokenization.qmd

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ Tokenization in NLP differs from applications in security and blockchain. It cor
 Text can be tokenized into sentences, words, subwords, or even characters, depending on project goals and analysis plan. Here is a summary of these approaches:

 | **Type** | **Description** | **Example** | **Common Use Cases** |
-|-----------------|--------------------|-----------------|-----------------|
+|----------------------------|---------------------------------------------|----------------------------------------------------------------------------|------------------------------------------------------------------|
 | **Sentence Tokenization** | Splits text into individual sentences | `"I love NLP. It's fascinating!"` → `["I love NLP.", "It's fascinating!"]` | Ideal for tasks like summarization, machine translation, and sentiment analysis at the sentence level |
 | **Word Tokenization** | Divides text into individual words | `"I love NLP"` → `["I", "love", "NLP"]` | Works well for languages with clear word boundaries, such as English |
 | **Character Tokenization** | Breaks text down into individual characters | `"NLP"` → `["N", "L", "P"]` | Useful for languages without explicit word boundaries or for very fine-grained text analysis |
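In R, all three granularities are available through `unnest_tokens()` from the tidytext package. A minimal sketch (the one-row tibble is an invented example, not the workshop data):

``` r
library(dplyr)
library(tidytext)

df <- tibble(id = 1, text = "I love NLP. It's fascinating!")

df |> unnest_tokens(word, text)                           # word tokens (the default)
df |> unnest_tokens(sentence, text, token = "sentences")  # sentence tokens
df |> unnest_tokens(char, text, token = "characters")     # character tokens
```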
chapters/1.Preprocessing/06_conclusion.qmd

Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
+---
+title: "Conclusion"
+editor: visual
+---
+
+In this workshop, we navigated the challenges of preprocessing unstructured social media data, highlighting how messy, inconsistent, and noisy real-world datasets can be. One key takeaway is the importance of thoroughly assessing the data in the context of your project goals before diving into processing, and of staying mindful that the order of operations influences the outcome.
+
+Not all cleaning or transformation steps are universally beneficial, and decisions should be guided by what is meaningful for your analysis or model objectives. Emojis, for example, can convey sentiment, irony, or context that may be essential for analysis, so decisions on whether to remove, convert, or retain them should be goal-driven.
+
+Similarly, numbers such as dates, prices, or statistics can carry meaningful information, but they can also introduce noise if misinterpreted or inconsistently formatted. Thoughtful handling of these elements ensures that preprocessing enhances the dataset’s usefulness rather than stripping away valuable signals.
+
+Overly aggressive text cleaning removes content that is vital to the context, meaning, or nuance of a text and can damage the performance of natural language processing (NLP) models. Which steps cause this problem depends on the end goal of your NLP task.
+
+While preprocessing is a key step, it can do more harm than good if performed incorrectly or planned poorly. In short, preprocessing is not merely a mechanical phase in the pipeline but a thoughtful design choice that shapes the quality, interpretability, and trustworthiness of all subsequent tasks.
+
+By critically evaluating the data and aligning preprocessing strategies with the end goals, we can ensure that the cleaned dataset not only becomes more manageable but also more valuable for deriving actionable insights. Ultimately, thoughtful data assessment is just as important as the technical preprocessing steps themselves.
+
+::: callout-tip
+## 🤓 Suggested Readings
+
+Chai, C. P. (2023). Comparison of text preprocessing methods. *Natural Language Engineering*, 29(3), 509-553. <https://doi.org/10.1017/S1351324922000213>
+
+Siino, M., Tinnirello, I., & La Cascia, M. (2024). Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers. *Information Systems*, 121, 102342. <https://doi.org/10.1016/j.is.2023.102342>
+:::

chapters/1.Preprocessing/_6_conclusion.qmd

Whitespace-only changes.

index.qmd

Lines changed: 2 additions & 2 deletions
@@ -47,11 +47,11 @@ That’s it! After updating, restart your computer to make sure RStudio finds th

 ## Access to Data

-For this lesson we will analyze a dataset of social media posts related to the Apple TV series *Severance*. The dataset was collected using [Brandwatch](https://www.brandwatch.com/){target='_blank'} (via UCSB Library subscription), and it includes posts from the two days following the finales of Season 1 (April 2022) and Season 2 (March 2025). The dataset contains ~9,000 posts stored in a CSV file.
+For this lesson we will analyze a dataset of social media posts related to the Apple TV series *Severance*. The dataset was collected using [Brandwatch](https://www.brandwatch.com/){target='_blank'} (via UCSB Library subscription), and it includes posts from the two days following the finales of Season 1 (April 2022) and Season 2 (March 2025). The dataset contains over 5,800 posts stored in a CSV file.

 The dataset is available for download from this link: [Severance Dataset](https://ucsb.box.com/s/z6buv80wmgqm1wb389o1j6vl9k3ldapv){target='_blank'}. You will need an active UCSB NetID and password to access the file (the same you use for your UCSB email).

-Please download the file `comments.csv` and save it in your RStudio project folder.
+Please download the file `comments_variables.csv` and save it in your RStudio project folder. *(FIXME - SHOULD WE PROVIDE THE FOLDER WITH RENV?)*

 ## R Skill Level
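After downloading, a quick check from the R console confirms the file is where the lesson expects it, assuming the `comments_variables.csv` name from the diff above:

``` r
# Run from the RStudio project root; should print TRUE
file.exists("comments_variables.csv")
```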
