`chapters/1.Preprocessing/01_introduction.qmd` (+24 −1)

@@ -28,4 +28,27 @@ Key text preprocessing steps include normalization (noise reduction), stop words

Source: Data Literacy Series <https://perma.cc/L8U5-ZEXD>

In the next chapters, we'll dive deeper into this pipeline to prepare the data for further analysis, but before that, let's take a quick look at the data so we can get a better grasp of the challenge at hand.

## Getting Things Started

The data we pulled for this exercise comes from real social media posts, meaning it is inherently messy, and we know that even before going in. Because it is derived from natural language, this kind of data is unstructured and often filled with inconsistencies and irregularities.
Before we can apply any meaningful analysis or modeling, it’s crucial to visually inspect the data to get a sense of what we’re working with. Eyeballing the raw text helps us identify common patterns, potential noise, and areas that will require careful preprocessing to ensure the downstream tasks are effective and reliable.

Time to launch RStudio and open our example!

Open `worksheet.qmd`. Let's install the required packages (via the console) and load them (run the code chunk). Next, let's read in the `comments.csv` file and take a quick look at it! (FIXME: RENV AND PROJECT FOLDER?)

```r
# Inspecting the data
library(readr)  # provides read_csv(); also loaded with the packages above

comments <- read_csv("comments.csv")
head(comments$text)
```

::: {.callout-note icon="false"}
# 💬 Discussion

Working in pairs or trios, look briefly at the data and discuss the challenges that may arise when attempting to analyze this dataset in its current form. What potential areas of friction could compromise the results?
:::

`chapters/1.Preprocessing/02_normalization.qmd` (+2 −2)

@@ -16,7 +16,7 @@ Just as a gardener would prune dead branches, enrich the soil, and care for the

As we've seen, the main goal of normalization is to remove irrelevant punctuation and content, and to standardize the data in order to reduce noise. Below are some key actions we'll be performing during this workshop:

| Remove URLs | URLs often contain irrelevant noise and don't contribute meaningful content for analysis. |
| Remove Punctuation & Symbols | Punctuation marks and other symbols, including those used extensively in social media for mentioning (\@) or tagging (#), rarely add value in most NLP tasks and can interfere with tokenization (as we will cover in a bit) or word matching. |
| Remove Numbers | Numbers are often noise and, unless specifically relevant (e.g., in financial or medical texts), don't contribute much to the analysis. In NLP tasks where numbers are considered important, they can be replaced with dummy tokens (e.g., \<NUMBER\>) or even converted into their written form (e.g., 100 becomes one hundred). |
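
To make these actions concrete, here is a minimal sketch using `stringr`; the example post and regular expressions are illustrative, not the workshop's exact code:

```r
library(stringr)

post <- "Check this out https://t.co/abc123 @user #Severance 100 times!!!"

post <- str_remove_all(post, "https?://\\S+")  # remove URLs
post <- str_remove_all(post, "[[:punct:]]")    # remove punctuation & symbols (@, #, !)
post <- str_remove_all(post, "[[:digit:]]+")   # remove numbers
str_squish(post)                               # collapse leftover whitespace
#> [1] "Check this out user Severance times"
```
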
@@ -53,7 +53,7 @@ Another important step is to properly handle contractions. In everyday language,

So, while it may seem like a small step, it often leads to cleaner data, leaner models, and more accurate results. First, however, we need to ensure that apostrophes are handled correctly. It's not uncommon to encounter messy text where nonstandard characters are used in place of the straight apostrophe ('), and these inconsistencies can disrupt contraction expansion.
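
As a rough sketch (assuming `stringr`; the set of lookalike characters is illustrative, not exhaustive), apostrophes can be standardized before expanding contractions:

```r
library(stringr)

text <- c("I don’t know", "it`s fine", "we´re here")

# Replace curly, backtick, and acute-accent lookalikes with the straight apostrophe
str_replace_all(text, "[’‘`´]", "'")
#> [1] "I don't know" "it's fine"    "we're here"
```
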
`chapters/1.Preprocessing/03_tokenization.qmd` (+1 −1)

@@ -8,7 +8,7 @@ Tokenization in NLP differs from applications in security and blockchain. It cor

Text can be tokenized into sentences, words, subwords, or even characters, depending on project goals and the analysis plan. Here is a summary of these approaches:

|**Type**|**Description**|**Example**|**Common Use Cases**|
|---|---|---|---|
|**Sentence Tokenization**| Splits text into individual sentences |`"I love NLP. It's fascinating!"` → `["I love NLP.", "It's fascinating!"]`| Ideal for tasks like summarization, machine translation, and sentiment analysis at the sentence level |
|**Word Tokenization**| Divides text into individual words |`"I love NLP"` → `["I", "love", "NLP"]`| Works well for languages with clear word boundaries, such as English |
|**Character Tokenization**| Breaks text down into individual characters |`"NLP"` → `["N", "L", "P"]`| Useful for languages without explicit word boundaries or for very fine-grained text analysis |
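
As a quick, hedged illustration of the first two approaches (assuming the `tidytext` package; the tiny example table is ours, not part of the dataset):

```r
library(tibble)
library(tidytext)

example <- tibble(text = "I love NLP. It's fascinating!")

# Word tokenization (by default also lowercases and strips punctuation)
unnest_tokens(example, output = word, input = text, token = "words")

# Sentence tokenization
unnest_tokens(example, output = sentence, input = text, token = "sentences")
```
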
In this workshop, we navigated the challenges of preprocessing unstructured social media data, highlighting how messy, inconsistent, and noisy real-world datasets can be. One key takeaway is the importance of thoroughly assessing the data in the context of your project goals before diving into processing, and of being mindful that the order of the preprocessing steps does influence the outcome.

Not all cleaning or transformation steps are universally beneficial, and decisions should be guided by what is meaningful for your analysis or model objectives. Emojis, for example, can convey sentiment, irony, or context that may be essential for analysis, so decisions on whether to remove, convert, or retain them should be goal-driven.
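
As one hedged sketch of those options (assuming `stringr` for removal and the `textclean` package for conversion; neither is prescribed by this workshop):

```r
library(stringr)

x <- "Loved the finale 😭🔥"

# Option 1: remove emojis (treat them as noise); \p{So} matches "Symbol, other"
str_squish(str_remove_all(x, "\\p{So}"))

# Option 2: convert emojis to word equivalents (retain their sentiment signal)
# textclean::replace_emoji(x)
```
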
Similarly, numbers such as dates, prices, or statistics can carry meaningful information, but they can also introduce noise if misinterpreted or inconsistently formatted. Thoughtful handling of these elements ensures that preprocessing enhances the dataset’s usefulness rather than stripping away valuable signals.
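
As a small illustrative sketch (again with `stringr`), numbers can be swapped for a dummy token rather than deleted outright:

```r
library(stringr)

# Keep the fact that a number was present while dropping its exact value
str_replace_all("Episode 9 aired March 21, 2025", "\\d+", "<NUMBER>")
#> [1] "Episode <NUMBER> aired March <NUMBER>, <NUMBER>"
```
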
Overly aggressive text cleaning removes content that is vital to the context, meaning, or nuance of a text and can damage the performance of natural language processing (NLP) models. The specific steps that lead to this problem depend on the end goal of your NLP task.

While preprocessing is a key step, if poorly planned or performed incorrectly it can do more harm than good to the analysis. In short, preprocessing is not merely a mechanical phase in the pipeline but a thoughtful design choice that shapes the quality, interpretability, and trustworthiness of all subsequent tasks.

By critically evaluating the data and aligning preprocessing strategies with the end goals, we can ensure that the cleaned dataset not only becomes more manageable but also more valuable for deriving actionable insights. Ultimately, thoughtful data assessment is just as important as the technical preprocessing steps themselves.

::: callout-tip
## 🤓 Suggested Readings

Chai, C. P. (2023). Comparison of text preprocessing methods. *Natural Language Engineering*, *29*(3), 509–553. <https://doi.org/10.1017/S1351324922000213>

Siino, M., Tinnirello, I., & La Cascia, M. (2024). Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers. *Information Systems*, *121*, 102342. <https://doi.org/10.1016/j.is.2023.102342>
:::

`index.qmd` (+2 −2)

@@ -47,11 +47,11 @@ That’s it! After updating, restart your computer to make sure RStudio finds th

## Access to Data

For this lesson we will analyze a dataset of social media posts related to the Apple TV series *Severance*. The dataset was collected using [Brandwatch](https://www.brandwatch.com/){target='_blank'} (via UCSB Library subscription), and it includes posts from the two days following the finales of Season 1 (April 2022) and Season 2 (March 2025). The dataset contains over 5,800 posts stored in a CSV file.

The dataset is available for download from this link: [Severance Dataset](https://ucsb.box.com/s/z6buv80wmgqm1wb389o1j6vl9k3ldapv){target='_blank'}. You will need an active UCSB NetID and password to access the file (the same you use for your UCSB email).

Please download the file `comments_variables.csv` and save it in your RStudio project folder. *(FIXME: SHOULD WE PROVIDE THE FOLDER WITH RENV?)*