UCSB-Library-Research-Data-Services
diff --git a/_quarto-ci.yml b/_quarto-ci.yml
--- a/_quarto-ci.yml
+++ b/_quarto-ci.yml
@@ -1,4 +1,4 @@
 execute:
-  enabled: true
+  enabled: false
   freeze: auto
   cache: true
diff --git a/_quarto.yml b/_quarto.yml
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -71,4 +71,3 @@ format:
     code-tools: true
 
 
-
diff --git a/chapters/1.Preprocessing/01_introduction.qmd b/chapters/1.Preprocessing/01_introduction.qmd
--- a/chapters/1.Preprocessing/01_introduction.qmd
+++ b/chapters/1.Preprocessing/01_introduction.qmd
@@ -12,7 +12,9 @@ title: "Introduction to Text Preprocessing"
 
 These three elements are co-dependent and work together. Preprocessing prepares messy text for analysis, NLP provides the computational tools to process and interpret the text, and text analysis applies these insights to answer meaningful questions. Without preprocessing, NLP models struggle with noise; without NLP, text analysis would be limited to surface-level counts; and without text analysis, NLP would have little purpose beyond technical processing. Together, they transform raw text into structured, actionable insights.From messy to analysis-ready text
 
-Of course, text isn’t quite as “analysis-ready” as numbers. Have you ever looked at raw text data and thought, *where do I even start*? That’s the challenge: before computers can process it meaningfully, text usually needs some cleaning and preparation. It’s extra work, but it’s also the foundation of any meaningful analysis. The exciting part is what happens next; once the text is shaped and structured, it can reveal insights you’d never notice just by skimming. And here’s the real advantage: computers can process enormous amounts of text not only faster but often more effectively than humans, allowing us to see patterns and connections that would otherwise stay hidden.
+Of course, text isn’t quite as “analysis-ready” as numbers. Have you ever looked at raw text data and thought, *where do I even start*? That’s the challenge: before computers can process text meaningfully, it usually needs to be cleaned and standardized. To a computer, “Happy,” “happy,” and “HAPPY!!!” are three different words; preprocessing fixes that.
+
+It’s extra work, but it’s also the foundation of any meaningful analysis. The exciting part is what happens next; once the text is shaped and structured, it can reveal insights you’d never notice just by skimming. And here’s the real advantage: computers can process enormous amounts of text not only faster but often more effectively than humans, allowing us to see patterns and connections that would otherwise stay hidden.
 
 ### Garbage in, Garbage out
 
@@ -36,17 +38,38 @@ The data we pulled for this exercise comes from real social media posts, meaning
 
 Before we can apply any meaningful analysis or modeling, it’s crucial to visually inspect the data to get a sense of what we’re working with. Eyeballing the raw text helps us identify common patterns, potential noise, and areas that will require careful preprocessing to ensure the downstream tasks are effective and reliable.
 
+### Getting Files and Launching RStudio
+
+Time to launch RStudio and our example! Click this [link](https://ucsb.box.com/s/z6buv80wmgqm1wb389o1j6vl9k3ldapv) to download the `text-preprocessing` subfolder from the `text-analysis-series` folder. Among other files, this subfolder contains the dataset we will be using, `comments.csv`; a worksheet named `preprocessing_worksheet`, in `.qmd` format (the Quarto file extension; learn more about [Quarto](https://quarto.org/)), where we will be doing some coding; and an `renv.lock` file (learn more about [renv](https://rstudio.github.io/renv/articles/renv.html)) listing all the R packages (and their versions) we’ll use during the workshop. This setup ensures a self-contained environment, so you can run everything needed for the session without installing or changing any packages that might affect your other R projects.
+
+After downloading this subfolder, double-click the project file `text-preprocessing.Rproj` to launch RStudio. Then find and open the file `preprocessing_worksheet` in RStudio.
 
-Time to launch RStudio and our example!
-Open the `worksheet.qmd`. Let's install the required packages (via the console) and load them (run the code chunk). Next, let's inspect the `comments.csv` file and take a quick look at it! (FIXME: RENV AND PROJECT FOLDER?)
+In your R console, run `renv::restore()` to read the `renv.lock` file and install the specific package versions used in the project.
+
+### Loading Packages & Inspecting the Data
+
+Let's start by loading all the required packages that are pre-installed in the project:
 
 ``` r
-# Inspecting the data
+library(tidyverse)    # general data manipulation
+library(tidytext)     # tokenization and text processing
+library(stringr)      # string manipulation
+library(stringi)      # emoji handling
+library(dplyr)        # data wrangling
+library(textclean)    # expand contractions
+library(emo)          # emoji dictionary
+library(textstem)     # lemmatization
+```
+
+Alright! With all the necessary packages loaded, let's take a look at the dataset we’ll be working with:
 
-comments <- read_csv("comments.csv")
-head(comments$text)
+``` r
+# Inspecting the data
+comments <- readr::read_csv("./data/raw/comments.csv")
 ```
 
+You’ll notice that we’ve pre-populated a code chunk with patterns to save you from the tedious task of typing out regular expressions (regex for short). Don’t worry about them for now; we’ll come back to them shortly.
+
 ::: {.callout-note icon="false"}
 # 💬 Discussion
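
Reviewer note on the paragraph added to `01_introduction.qmd`: the claim that a computer treats “Happy,” “happy,” and “HAPPY!!!” as different words can be checked with a minimal base R sketch. The sample vector and the variable names `raw` and `clean` are illustrative assumptions, not part of the worksheet, and the workshop may well use `stringr` or `textclean` for these steps instead.

```r
# Hypothetical sample taken from the example in the added paragraph
raw <- c("Happy", "happy", "HAPPY!!!")

# Without preprocessing, every variant counts as a distinct token
n_before <- length(unique(raw))

# Lowercase the text, then strip punctuation characters
clean <- gsub("[[:punct:]]", "", tolower(raw))

# After standardizing, the variants collapse into a single token
n_after <- length(unique(clean))

print(n_before)
print(n_after)
```

Lowercasing before removing punctuation is an arbitrary ordering here; for this sample the two steps are independent, so either order gives the same result.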