Commit 202c9f1

Merge pull request #38 from UCSB-Library-Research-Data-Services/renata
added POS and new code to rejoin sentences before applying lemmas
2 parents 715b542 + 442206b

2 files changed: +28 −33 lines changed

chapters/1.Preprocessing/05_lemmatization.qmd

Lines changed: 28 additions & 33 deletions
@@ -11,7 +11,9 @@ For this reason, we will stick with lemmatization and skip stemming in our pipel
 
 An important thing to consider is that we treat words as separate units (tokens), as we saw in the previous episode. For example, think about the word "leaves": it could be the plural of the noun "leaf" or the third-person form of the verb "leave". That is a good reminder to always apply part-of-speech (POS) tagging, because lemmatization algorithms use a lexicon with linguistic rules based on pre-determined tags to avoid misinterpretation.
 
-::: {.callout-note icon="false"}
+**Part of Speech (POS)** refers to the grammatical category that a word belongs to, indicating its syntactic function and role within a sentence. For example, the word *run* can serve as a verb in *“I like to run every morning”* or as a noun in *“I went for a long run”*. Without POS information, an NLP system might incorrectly treat *run* as always being a verb, producing inaccurate results. By applying POS tagging, systems can correctly recognize each word’s role, ensuring more accurate text processing.
+
+:::: {.callout-note icon="false"}
 ## 🧠 Knowledge Check
 
 In pairs or groups of three, apply **lemmatization** to the following sentence. Identify the base forms (lemmas) of each word:
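To make the "leaves" ambiguity concrete, here is a minimal sketch with the `textstem` helpers this episode uses; the sample words are illustrative, and the exact lemmas returned depend on the package's lexicon:

``` r
library(textstem)

# On isolated tokens the lexicon picks a single lemma per word,
# with no POS context to separate the noun "leaves" from the verb "leaves"
lemmatize_words(c("leaves", "telling", "captivating"))

# On whole strings the words at least remain in their original context
lemmatize_strings("the story leaves the reader captivated")
```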
@@ -29,28 +31,40 @@ After applying lemmatization, the sentence should look like:
 
 *Note*: Adverbs and prepositions usually remain unchanged because they are already in their simplest dictionary form and do not have a more basic lemma.
 :::
-:::
+::::
 
 Alright, back to our pipeline: we will now convert words to their dictionary form, remove any remaining noise, and finalize our preprocessing steps.
 
-We will be using the `lemmatize_words()` function from the **`textstem`** package. And add a new column called `word_lemmatized` to your `comments_clean` dataset, containing the base form of each word:
+## Rebuilding Sentences (Comments)
+
+After tokenization, our data consists of individual words. However, to apply lemmatization while taking each word’s part of speech (POS) into account, we first need to reconstruct sentences; otherwise, the lemmatizer would operate on isolated tokens without context, which can lead to incorrect or less accurate base forms.
+
+To ensure the words are reassembled in the correct order for each original text, we rely on the ID column. The ID column is crucial: it tracks which words belong to which original text, preventing misalignment when reconstructing our comments into sentences, especially in large or complex datasets.
+
+``` r
+rejoined <- nonstopwords %>%
+  group_by(id) %>% # group all tokens from the same comment
+  summarise(comments = paste(word, collapse = " "), .groups = "drop")
+```
+
+## Applying Lemmatization
+
+Next, we will create a new data frame named `lemmatized` using the `lemmatize_strings()` function from the **`textstem`** package, replacing the `comments` column with the dictionary form of each word.
 
 ``` r
 # Applying Lemmas
-lemmatized <- nonstopwords %>%
-  mutate(word = lemmatize_words(word))
+lemmatized <- rejoined %>%
+  mutate(comments = lemmatize_strings(comments))
 ```
 
 Great! Let's take a look at the lemmatized data frame. For example, words such as "telling" and "captivating" were converted into "tell" and "captivate".
 
-![](images/output-lemmas-issue.png){width="422"}
+![](images/output_lemmasentences.png){width="416"}
 
-Wait a second! If we look closely, we’ll notice an outlier lemma. Do you see the number two in the last row of the screenshot above? This is a known issue with the `textstem` package. While it hasn’t been fully resolved yet, we can apply a workaround to address it:
+Wait a second! If we look closely, we’ll notice an outlier lemma. Do you see the number two in the third row? This is a known issue with the `textstem` package. While it hasn’t been fully resolved yet, we can apply a workaround to address it:
 
 ``` r
-# Load the full dictionary
 custom_dict <- as.data.frame(lexicon::hash_lemmas)
-# Look for the word "second" in the dictonary
 
 # Find rows where token is "second"
 idx <- custom_dict$token == "second"
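To see the rejoin and lemmatize steps working together, here is a minimal, self-contained sketch on made-up data; the `toy` tibble is a hypothetical stand-in for the course's `nonstopwords` tokens:

``` r
library(dplyr)
library(textstem)

# Hypothetical stand-in for nonstopwords: one row per token, keyed by comment id
toy <- tibble::tibble(
  id   = c(1, 1, 1, 2, 2),
  word = c("stories", "were", "captivating", "loved", "telling")
)

toy %>%
  group_by(id) %>%                                  # one group per original comment
  summarise(comments = paste(word, collapse = " "), # rebuild the running text
            .groups = "drop") %>%
  mutate(comments = lemmatize_strings(comments))    # lemmatize the rejoined strings
```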
@@ -59,40 +73,21 @@ idx <- custom_dict$token == "second"
 custom_dict$lemma[idx] <- custom_dict$token[idx]
 
 # Now lemmatize your text with the patched dictionary
-lemmatized_nonumbers <- nonstopwords
+lemmatized_nonumbers <- rejoined %>%
+  mutate(comments = lemmatize_strings(comments, dictionary = custom_dict))
 ```
 
-Let's see that row (20) once again in the new `lemmatized_nonumbers` dataframe we have created.
+Let's look at the third row once again in the new `lemmatized_nonumbers` data frame we have created.
 
 Alright! Problem solved. Keep in mind, however, that this would apply to most words referring to numbers, but to save time let's address only this specific case.
 
-## Rebuilding Sentences
-
-But we’re not done yet. After tokenization, our data consists of individual words. We still need to reconstruct full sentences from these lemmatized words so that each row represents a complete piece of text. To ensure the words are reassembled in the correct order for each original text, we rely on the ID column. Having an ID column is crucial because it allows us to track which words belong to which original text, preventing confusion or misalignment when reconstructing sentences, especially in large or complex datasets.
-
-``` r
-# Reconstruct sentences from lemmatized words
-preprocessed <- lemmatized %>%
-  group_by(id, text) %>%
-  summarise(text_preprocessed = paste(word, collapse = " "), .groups = "drop")
-```
-
-After you use `group_by(id, text)`, each group contains all the lemmatized words that belong to the same original text. The `summarise()` function then takes each group and creates one summary row per group.
-
-Inside `summarise`, `text_preprocessed = paste(word, collapse = " ")` takes all the words in the group and joins them together into a single string, with a space between each word. This produces a full sentence (or comment) instead of separate words. We will also ensure to describe a different path to save our progress to a folder named `preprocessed` under the `data` folder.
-
 ## Saving your Work for Analysis
 
-let's save it as a new file named `comments_preprocessed`:
+Let's save our work as a new file named `comments_preprocessed`:
 
 ``` r
-# Select only important columns
-output <- preprocessed %>%
-  select(id, text_preprocessed)
-
 # Save to CSV
-write.csv(output, "./data/preprocessed/comments_preprocessed.csv")
+write.csv(lemmatized_nonumbers, "./data/preprocessed/comments_preprocessed.csv")
 ```
 
 ## Before we go
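To check that the dictionary patch behaves as intended, here is a small sketch contrasting the stock dictionary with the patched one; it relies only on the `textstem` and `lexicon` packages already used above:

``` r
library(textstem)

# Stock dictionary: the lexicon maps "second" to the number lemma
lemmatize_strings("the second chapter was great")

# Patched dictionary: "second" is kept as-is
custom_dict <- as.data.frame(lexicon::hash_lemmas)
idx <- custom_dict$token == "second"
custom_dict$lemma[idx] <- custom_dict$token[idx]
lemmatize_strings("the second chapter was great", dictionary = custom_dict)
```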
@@ -101,4 +96,4 @@ If you notice that the output object saved as `comments_preprocessed` contains o
 
 ![](images/output-missingID.png)
 
-Well done! That concludes all our preprocessing steps. Let's now cover some important considerations for your future text preprocessing projects.
+Well done! That concludes all our preprocessing steps. Let's now cover some important considerations for your future text preprocessing projects.
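As a quick sanity check on the saved output, a short read-back sketch, assuming your working directory matches the paths used above:

``` r
# Read the saved file back and inspect the first rows
check <- read.csv("./data/preprocessed/comments_preprocessed.csv")
head(check)
```

Note that `write.csv()` also stores row numbers as an unnamed first column by default; passing `row.names = FALSE` when saving omits them.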
[Second changed file: binary image, 99.5 KB — likely images/output_lemmasentences.png]
