rcurty · rcurty · commit 3c50c2d12367 · 2025-10-20T10:09:40.000-07:00
diff --git a/chapters/1.Preprocessing/02_normalization.qmd b/chapters/1.Preprocessing/02_normalization.qmd
@@ -16,7 +16,7 @@ Just as a gardener would prune dead branches, enrich the soil, and care for the
 The main goal of normalization is to remove irrelevant content and standardize the data in order to reduce noise. Below are some key actions we’ll be performing during this workshop:
 
 | Action                       | Why it matters?                                                                                                                                                                                                                                                                                                                                                         |
-|-------------|-----------------------------------------------------------|
+|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | Remove URLs                  | URLs often contain irrelevant noise and don't contribute meaningful content for analysis.                                                                                                                                                                                                                                                                               |
 | Remove Punctuation & Symbols | Punctuation marks and other symbols, including those extensively used in social media for mentioning (\@) or tagging (#), rarely add value in most NLP tasks and can interfere with tokenization (as we will cover in a bit) or word matching.                                                                                                                          |
 | Remove Numbers               | Numbers can be noise in most contexts and, unless specifically relevant (e.g., in financial or medical texts), don't contribute much to the analysis. However, in NLP tasks where they are considered important, they can be replaced with dummy tokens (e.g., \<NUMBER\>) or converted into their written form (e.g., 100 becomes one hundred).                   |
@@ -143,7 +143,7 @@ Now that we have normalized variations of apostrophes, we can properly handle co
 So, while it may seem like a small step, it often leads to cleaner data, leaner models, and more accurate results. First, however, we need to ensure that apostrophes are handled correctly. Messy text often uses nonstandard characters in place of the straight apostrophe ('), and such inconsistencies can disrupt contraction expansion.
 
 | Character | Unicode | Notes                                                   |
-|-------------|-------------|----------------------------------------------|
+|-----------|---------|---------------------------------------------------------|
 | `'`       | U+0027  | Standard straight apostrophe, used in most dictionaries |
 | `’`       | U+2019  | Right single quotation mark (curly apostrophe)          |
 | `‘`       | U+2018  | Left single quotation mark                              |
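
The character table above can be turned into a small normalization step. The following is a minimal sketch in Python, not the workshop's own code: the names `APOSTROPHE_VARIANTS` and `normalize_apostrophes` are hypothetical, and only the two variants listed in these rows are mapped to the straight apostrophe.

```python
# Hypothetical helper for illustration; not taken from the workshop materials.
# Maps the nonstandard characters from the table to the straight apostrophe (U+0027).
APOSTROPHE_VARIANTS = {
    "\u2019": "'",  # right single quotation mark (curly apostrophe)
    "\u2018": "'",  # left single quotation mark
}

# Build the translation table once so it can be reused across many strings.
_APOSTROPHE_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_apostrophes(text: str) -> str:
    """Replace the mapped apostrophe variants with the straight apostrophe."""
    return text.translate(_APOSTROPHE_TABLE)


print(normalize_apostrophes("I can\u2019t believe it\u2019s raining"))
```

Running this step before contraction expansion means a lookup keyed on the straight apostrophe (for example `can't`) also matches text that was typed with curly quotes. Additional variants can be handled by adding entries to the mapping.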