Commit 0b066f7

referring to comments as in the raw dataset
1 parent 9957e6f commit 0b066f7

1 file changed: +3 -3 lines changed

chapters/1.Preprocessing/05_lemmatization.qmd

Lines changed: 3 additions & 3 deletions
@@ -35,11 +35,11 @@ After applying lemmatization, the sentence should look like:

 Alright, back to our pipeline, we will now convert words to their dictionary form, remove any remaining noise, and finalize our preprocessing steps.

-## Rebuilding Sentences
+## Rebuilding Sentences (Comments)

 After tokenization, our data consists of individual words. However, in order to preserve the ability to apply lemmatization while taking into account each word’s part of speech (POS), we need to first reconstruct sentences; otherwise, the lemmatizer would operate on isolated tokens without context, which can lead to incorrect or less accurate base forms.

-To ensure the words are reassembled in the correct order for each original text, we rely on the ID column. Having an ID column is crucial because it allows us to track which words belong to which original text, preventing confusion or misalignment when reconstructing sentences, especially in large or complex datasets.
+To ensure the words are reassembled in the correct order for each original text, we rely on the ID column. Having an ID column is crucial because it allows us to track which words belong to which original text, preventing confusion or misalignment when reconstructing our comments into sentences, especially in large or complex datasets.

 ``` r
 rejoined <- nonstopwords %>%
@@ -49,7 +49,7 @@ rejoined <- nonstopwords %>%

 ## Applying Lemmatization

-Next, we will be using creating a new dataframe named `lemmatized` using the `lemmatize_strings()` function from the **`textstem`** package, and a new column called `sentences` to it, containing the dictonary form of each word.
+Next, we will be using creating a new dataframe named `lemmatized` using the `lemmatize_strings()` function from the **`textstem`** package, and a new column called `comments` to it, containing the dictonary form of each word.

 ``` r
 # Applying Lemmas
