chapters/1.Preprocessing/01_introduction.qmd
29 additions & 6 deletions
@@ -12,7 +12,9 @@ title: "Introduction to Text Preprocessing"
12
12
13
13
These three elements are co-dependent and work together. Preprocessing prepares messy text for analysis, NLP provides the computational tools to process and interpret the text, and text analysis applies these insights to answer meaningful questions. Without preprocessing, NLP models struggle with noise; without NLP, text analysis would be limited to surface-level counts; and without text analysis, NLP would have little purpose beyond technical processing. Together, they transform raw text into structured, actionable insights.

From messy to analysis-ready text
14
14
15
-
Of course, text isn’t quite as “analysis-ready” as numbers. Have you ever looked at raw text data and thought, *where do I even start*? That’s the challenge: before computers can process it meaningfully, text usually needs some cleaning and preparation. It’s extra work, but it’s also the foundation of any meaningful analysis. The exciting part is what happens next; once the text is shaped and structured, it can reveal insights you’d never notice just by skimming. And here’s the real advantage: computers can process enormous amounts of text not only faster but often more effectively than humans, allowing us to see patterns and connections that would otherwise stay hidden.
15
+
Of course, text isn’t quite as “analysis-ready” as numbers. Have you ever looked at raw text data and thought, *where do I even start*? That’s the challenge: before computers can analyze text meaningfully, it usually needs to be cleaned and standardized. To a computer, “Happy,” “happy,” and “HAPPY!!!” are three different strings; preprocessing reconciles them.
16
+
17
+
It’s extra work, but it’s also the foundation of any meaningful analysis. The exciting part is what happens next; once the text is shaped and structured, it can reveal insights you’d never notice just by skimming. And here’s the real advantage: computers can process enormous amounts of text not only faster but often more effectively than humans, allowing us to see patterns and connections that would otherwise stay hidden.
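To make the “Happy” example concrete, here is a minimal sketch in base R, so it runs without any of the workshop packages. The variable names are illustrative only, and the two steps shown (lowercasing and stripping punctuation) are just one simple form of standardization, not the full pipeline used later in the workshop.

```r
# Three raw spellings of the same word, as in the example above
raw_words <- c("Happy", "happy", "HAPPY!!!")

# Before preprocessing: every spelling counts as a distinct string
distinct_before <- length(unique(raw_words))

# Lowercase everything, then drop punctuation characters
clean_words <- gsub("[[:punct:]]", "", tolower(raw_words))

# After preprocessing: all three spellings collapse to one value
distinct_after <- length(unique(clean_words))

print(clean_words)
print(c(before = distinct_before, after = distinct_after))
```

Counting distinct values before and after is a quick way to check that a cleaning step actually merged the variants you expected it to merge.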
16
18
17
19
### Garbage in, Garbage out
18
20
@@ -36,17 +38,38 @@ The data we pulled for this exercise comes from real social media posts, meaning
36
38
37
39
Before we can apply any meaningful analysis or modeling, it’s crucial to visually inspect the data to get a sense of what we’re working with. Eyeballing the raw text helps us identify common patterns, potential noise, and areas that will require careful preprocessing to ensure the downstream tasks are effective and reliable.
38
40
41
+
### Getting Files and Launching RStudio
42
+
43
+
Time to launch RStudio and our example! Click on this [link](https://ucsb.box.com/s/z6buv80wmgqm1wb389o1j6vl9k3ldapv) to download the `text-preprocessing` subfolder from the `text-analysis-series` folder. Among other files, this subfolder contains the dataset we will be using (`comments.csv`); a Quarto worksheet named `preprocessing_worksheet` (a `.qmd` file; learn more about [Quarto](https://quarto.org/)), where we will do our coding; and an `renv.lock` file (learn more about [renv](https://rstudio.github.io/renv/articles/renv.html)) listing all the R packages, and their versions, that we’ll use during the workshop. This setup provides a self-contained environment, so you can run everything needed for the session without installing or changing any packages that might affect your other R projects.
44
+
45
+
After downloading this subfolder, double-click the project file `text-preprocessing.Rproj` to launch RStudio. Then find and open the file `preprocessing_worksheet` in RStudio.
39
46
40
-
Time to launch RStudio and our example!
41
-
Open the `worksheet.qmd`. Let's install the required packages (via the console) and load them (run the code chunk). Next, let's inspect the `comments.csv` file and take a quick look at it! (FIXME: RENV AND PROJECT FOLDER?)
47
+
In your R console, run `renv::restore()`. This reads the `renv.lock` file and installs the specific package versions used in the project.
48
+
49
+
### Loading Packages & Inspecting the Data
50
+
51
+
Let's start by loading all the required packages that are pre-installed in the project:
42
52
43
53
```r
44
-
# Inspecting the data
54
+
library(tidyverse) # general data manipulation
55
+
library(tidytext) # tokenization and text processing
56
+
library(stringr) # string manipulation
57
+
library(stringi) # emoji handling
58
+
library(dplyr) # data wrangling
59
+
library(textclean) # expand contractions
60
+
library(emo) # emoji dictionary
61
+
library(textstem) # lemmatization
62
+
```
63
+
64
+
Alright! With all the necessary packages loaded, let's take a look at the dataset we’ll be working with:
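As a rough sketch of what a first look might involve, the snippet below reads a CSV file and prints its structure and first few rows. It uses base R only, and `peek_at_csv` is a hypothetical helper written for illustration; it is not part of the worksheet, whose own chunk may use different functions. The path assumes `comments.csv` sits in the project root.

```r
# Hypothetical helper (not part of the worksheet): read a CSV and show its shape
peek_at_csv <- function(path, n = 5) {
  data <- read.csv(path, stringsAsFactors = FALSE)
  str(data)             # column names, types, and a few sample values
  print(head(data, n))  # the first n rows
  invisible(data)       # return the data frame without printing it again
}

# Assumed location: the project root. Skipped quietly if the file is not there.
csv_path <- "comments.csv"
if (file.exists(csv_path)) {
  comments <- peek_at_csv(csv_path)
}
```

Looking at both the structure and a handful of rows makes it easier to spot which column holds the text and what kinds of noise it contains.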
You’ll notice that we’ve pre-populated a code chunk with patterns to save you from the tedious task of typing out regular expressions (regex for short). Don’t worry about them for now; we’ll come back to them shortly.