
Commit d991d72

committed
fix: unnamed-chunk-1
1 parent 733db69 commit d991d72

File tree

1 file changed: +14 −14 lines changed


chapters/1.Preprocessing/02_normalization.qmd

Lines changed: 14 additions & 14 deletions
@@ -15,16 +15,16 @@ Just as a gardener would prune dead branches, enrich the soil, and care for the
 
 The main goal of normalization is to remove irrelevant content and standardize the data in order to reduce noise. Below are some key actions we’ll be performing during this workshop:
 
-| Action | Why it matters? |
-|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Remove URLs | URLs often contain irrelevant noise and don't contribute meaningful content for analysis. |
-| Remove Punctuation & Symbols | Punctuation marks and other symbols, including those used extensively in social media for mentions (\@) or tags (#), rarely add value in most NLP tasks and can interfere with tokenization (as we will cover in a bit) or word matching. |
-| Remove Numbers | Numbers are often noise and don't contribute much to the analysis unless specifically relevant (e.g., in financial or medical texts). In NLP tasks where they do matter, consider replacing them with dummy tokens (e.g., \<NUMBER\>) or converting them into their written form (e.g., 100 becomes one hundred). |
-| Normalize Whitespaces | Ensures consistent word boundaries and avoids issues during tokenization or frequency analysis. |
-| Convert to Lowercase | Prevents case variations from splitting word counts (e.g., “AppleTV” ≠ "APPLETV" ≠ “appleTV” ≠ “appletv”), improving model consistency. |
-| Convert Emojis to Text | Emojis play a unique role in text analysis, as they often convey sentiment. Rather than removing them, we will convert them into their corresponding text descriptions. |
-
-
-::: {.callout-note icon="false"}
+| Action | Why it matters? |
+|-------------|-----------------------------------------------------------|
+| Remove URLs | URLs often contain irrelevant noise and don't contribute meaningful content for analysis. |
+| Remove Punctuation & Symbols | Punctuation marks and other symbols, including those used extensively in social media for mentions (\@) or tags (#), rarely add value in most NLP tasks and can interfere with tokenization (as we will cover in a bit) or word matching. |
+| Remove Numbers | Numbers are often noise and don't contribute much to the analysis unless specifically relevant (e.g., in financial or medical texts). In NLP tasks where they do matter, consider replacing them with dummy tokens (e.g., \<NUMBER\>) or converting them into their written form (e.g., 100 becomes one hundred). |
+| Normalize Whitespaces | Ensures consistent word boundaries and avoids issues during tokenization or frequency analysis. |
+| Convert to Lowercase | Prevents case variations from splitting word counts (e.g., “AppleTV” ≠ "APPLETV" ≠ “appleTV” ≠ “appletv”), improving model consistency. |
+| Convert Emojis to Text | Emojis play a unique role in text analysis, as they often convey sentiment. Rather than removing them, we will convert them into their corresponding text descriptions. |
+
+
+:::: {.callout-note icon="false"}
 ## 🧠 Knowledge Check
 
 In pairs or groups of three, identify the techniques you would consider using to normalize and reduce noise in the following sentence:
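The actions in the table above could be sketched in base R roughly as follows; `normalize_text` is an illustrative helper, not part of the workshop code, and URLs are removed before punctuation so the URL pattern can still match them intact:

``` r
# Minimal sketch of the normalization steps from the table above (base R only).
normalize_text <- function(text) {
  text <- gsub("https?://\\S+|www\\.\\S+", " ", text, perl = TRUE)  # remove URLs
  text <- gsub("[[:punct:]]", " ", text)   # remove punctuation & symbols (@, #, ...)
  text <- gsub("[[:digit:]]+", " ", text)  # remove numbers
  text <- tolower(text)                    # convert to lowercase
  text <- gsub("\\s+", " ", trimws(text))  # normalize whitespace
  text
}

normalize_text("Check https://example.com NOW!!! #hype 100 times")
#> [1] "check now hype times"
```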
@@ -40,7 +40,7 @@ After applying them the sentence should look like:
 
 *omg \[face scream in fear\] I can not believe it this is crazy unreal \[exploding head\]*
 :::
-:::
+::::
 
 A caveat when working with emojis is that they are figurative and highly contextual, and there may be important generational and cultural variability in how people interpret them. For example, some countries use the Folded Hands Emoji (🙏) as a sign of thanks, while others may see it as a religious expression. Some use it positively, to express gratitude, hope, or respect, while others use it negatively, to convey submission or begging.
4646

@@ -143,7 +143,7 @@ Now that we have normalized variations of apostrophes, we can properly handle co
 So, while it may seem like a small step, it often leads to cleaner data, leaner models, and more accurate results. First, however, we need to ensure that apostrophes are handled correctly. It's not uncommon to encounter messy text where nonstandard characters are used in place of the straight apostrophe ('), and such inconsistencies can disrupt contraction expansion.
 
 | Character | Unicode | Notes |
-|-----------|---------|---------------------------------------------------------|
+|-------------|-------------|----------------------------------------------|
 | `'` | U+0027 | Standard straight apostrophe, used in most dictionaries |
 | `’` | U+2019 | Right single quotation mark (curly apostrophe) |
 | `‘` | U+2018 | Left single quotation mark |
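Mapping the nonstandard characters from this table back to the straight apostrophe could look like the sketch below; the helper name is illustrative, not from the workshop:

``` r
# Sketch: replace curly quotation marks (U+2019, U+2018) with the
# straight apostrophe (U+0027) before expanding contractions.
normalize_apostrophes <- function(text) {
  gsub("[\u2019\u2018]", "'", text)
}

normalize_apostrophes("don\u2019t")
#> [1] "don't"
```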
@@ -256,7 +256,7 @@ replace_emojis <- function(text, emoji_dict) {
 
 Wait, we are not done yet! We still have to add the `replace_emojis` function, based on our loaded dictionary, into our code chunk. This will replace the emojis with their corresponding text on our dataset:
 
-```{r}
+``` r
 replace_emojis(emoji_dict) %>%
 ```
 
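One possible body for a `replace_emojis(text, emoji_dict)` function, consistent with the signature shown in the hunk header above, is sketched here; the two-entry dictionary is hypothetical, not the workshop's loaded dictionary:

``` r
# Hypothetical emoji dictionary: emoji character -> text description.
emoji_dict <- c("\U0001F631" = "[face screaming in fear]",
                "\U0001F92F" = "[exploding head]")

# Replace each emoji in `text` with its description from `emoji_dict`.
replace_emojis <- function(text, emoji_dict) {
  for (emoji in names(emoji_dict)) {
    text <- gsub(emoji, emoji_dict[[emoji]], text, fixed = TRUE)
  }
  text
}

replace_emojis("omg \U0001F631 unreal \U0001F92F", emoji_dict)
#> [1] "omg [face screaming in fear] unreal [exploding head]"
```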

@@ -333,4 +333,4 @@ Let's re-run the code chunk and check how those emojis were taken care of. With
 Bai, Q., Dan, Q., Mu, Z., & Yang, M. (2019). A systematic review of emoji: Current research and future perspectives. *Frontiers in Psychology*, *10*. <https://doi.org/10.3389/fpsyg.2019.02221>
 
 Graham, P. V. (2024). Emojis: An Approach to Interpretation. *UC L. SF Commc'n and Ent. J.*, *46*, 123. <https://repository.uclawsf.edu/cgi/viewcontent.cgi?article=1850&context=hastings_comm_ent_law_journal>
-:::
+:::
