`chapters/1.Preprocessing/05_lemmatization.qmd`
For this reason, we will stick with lemmatization and skip stemming in our pipeline.
An important thing to consider is that we look at words as separate units (tokens), as we saw in the previous episode. For example, think about the word "leaves": it could represent either the plural of the noun "leaf" or the third-person form of the verb "leave". That is a good reminder to always apply part-of-speech (POS) tagging, because lemmatization algorithms rely on a lexicon with linguistic rules based on pre-determined tags to avoid misinterpretation.
**Part of Speech (POS)** refers to the grammatical category that a word belongs to, indicating its syntactic function and role within a sentence. For example, the word *run* can serve as a verb in *“I like to run every morning”* or as a noun in *“I went for a long run”*. Without POS information, an NLP system might incorrectly treat *run* as always being a verb, producing inaccurate results. By applying POS tagging, systems can correctly recognize each word’s role, ensuring more accurate text processing.
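For instance, here is a minimal sketch with the **`textstem`** package (which we will use below): its `lemmatize_words()` function looks tokens up in a lemma dictionary one by one, without any sentence context, so an ambiguous token such as *leaves* always maps to the same lemma no matter which part of speech it actually plays.

```r
library(textstem)

# Dictionary lookup is token-by-token, with no sentence context:
# "leaves" gets a single lemma whether it was the noun
# ("the leaves fell") or the verb ("she leaves early").
lemmatize_words(c("leaves", "running", "better"))
```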
:::: {.callout-note icon="false"}
## 🧠 Knowledge Check
In pairs or groups of three, apply **lemmatization** to the following sentence. Identify the base forms (lemmas) of each word:
*Note*: Adverbs and prepositions usually remain unchanged because they are already in their simplest dictionary form and do not have a more basic lemma.
:::
::::
Alright, back to our pipeline, we will now convert words to their dictionary form, remove any remaining noise, and finalize our preprocessing steps.
## Rebuilding Sentences (Comments)
After tokenization, our data consists of individual words. However, to lemmatize while taking each word's part of speech (POS) into account, we first need to reconstruct sentences; otherwise, the lemmatizer would operate on isolated tokens without context, which can lead to incorrect or less accurate base forms.
To ensure the words are reassembled in the correct order for each original text, we rely on the ID column. Having an ID column is crucial because it allows us to track which words belong to which original text, preventing confusion or misalignment when reconstructing our comments into sentences, especially in large or complex datasets.
```r
# Rebuild one comment per id from the individual tokens
rejoined <- nonstopwords %>%
  group_by(id) %>%                                       # group all tokens from the same sentence
  summarise(comments = paste(word, collapse = " ")) %>%  # join the words back into a single string
  ungroup()
```
Next, we will create a new dataframe named `lemmatized` using the `lemmatize_strings()` function from the **`textstem`** package, transforming the `comments` column so that it contains the dictionary form of each word.
```r
# Applying Lemmas
library(textstem)

lemmatized <- rejoined %>%
  mutate(comments = lemmatize_strings(comments))
```
Great! Let's take a look at the lemmatized data frame. For example, words such as "telling" and "captivating" were converted into "tell" and "captivate".
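A quick way to take that look in your own session is to print the first few rows:

```r
# Peek at the first rows of the lemmatized comments
head(lemmatized)
```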
Wait a second! If we look closely, we’ll notice an outlier lemma. Do you see the number two in the third row? This is a known issue with the `textstem` package. While it hasn’t been fully resolved yet, we can apply a workaround to address it:
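One possible workaround (a minimal sketch, assuming the issue is that the word *two* was turned into the numeral `2`) is to substitute the word form back in with the **`stringr`** package, storing the result in a new `lemmatized_nonumbers` dataframe:

```r
library(stringr)

# Replace the standalone numeral "2" with the word "two"
lemmatized_nonumbers <- lemmatized %>%
  mutate(comments = str_replace_all(comments, "\\b2\\b", "two"))
```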
Let's look at the third row once again in the new dataframe we have created.
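For example, with **`dplyr`**:

```r
# Inspect the third row of the corrected dataframe
lemmatized_nonumbers %>% slice(3)
```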
Alright! Problem solved. Keep in mind, however, that the same issue can affect most words referring to numbers; to save time, we will address only this specific case.
## Saving your Work for Analysis
Let's save our work as a new file named `comments_preprocessed`:
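A minimal sketch with **`readr`**, assuming the output goes to a `preprocessed` folder under `data` (create it first if it doesn't exist):

```r
library(readr)

# Save the preprocessed comments for the analysis chapters
write_csv(lemmatized_nonumbers, "data/preprocessed/comments_preprocessed.csv")
```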