Commit 8916f9c

wordsmithing + suggested reads
1 parent 4eeafd1 commit 8916f9c

1 file changed

+14
-2
lines changed

chapters/1.Preprocessing/06_conclusion.qmd

Lines changed: 14 additions & 2 deletions
@@ -3,10 +3,22 @@ title: "Conclusion"
3 3
editor: visual
4 4
---
5 5

6-
In this workshop, we navigated the challenges of preprocessing of social media data, highlighting how messy, inconsistent, and noisy real-world datasets can be. One key takeaway is the importance of thoroughly assessing the data in the context of your project goals before diving into processing.
6+
In this workshop, we navigated the challenges of preprocessing unstructured social media data, highlighting how messy, inconsistent, and noisy real-world datasets can be. One key takeaway is the importance of thoroughly assessing the data in the context of your project goals before diving into processing, and of keeping in mind that the order in which steps are applied influences the outcome.
7 7

8-
Not all cleaning or transformation steps are universally beneficial; decisions should be guided by what is meaningful for your analysis or model objectives. Emojis, for example, can convey sentiment, irony, or context that may be essential for analysis, so decisions on whether to remove, convert, or retain them should be goal-driven. Similarly, numbers such as dates, prices, or statisticscan carry meaningful information, but they can also introduce noise if misinterpreted or inconsistently formatted. Thoughtful handling of these elements ensures that preprocessing enhances the dataset’s usefulness rather than stripping away valuable signals.
8+
Not all cleaning or transformation steps are universally beneficial, and decisions should be guided by what is meaningful for your analysis or model objectives. Emojis, for example, can convey sentiment, irony, or context that may be essential for analysis, so decisions on whether to remove, convert, or retain them should be goal-driven.
9+
10+
Similarly, numbers such as dates, prices, or statistics can carry meaningful information, but they can also introduce noise if misinterpreted or inconsistently formatted. Thoughtful handling of these elements ensures that preprocessing enhances the dataset’s usefulness rather than stripping away valuable signals.
9 11

10 12
Overly aggressive text cleaning removes content that is vital to the context, meaning, or nuance of a text and can damage the performance of natural language processing (NLP) models. The specific steps that lead to this problem depend on the end goal of your NLP task. 
11 13

14+
While preprocessing is considered a key step, if performed incorrectly or poorly planned, it can do more harm than good to the analysis. In short, preprocessing is not merely a mechanical phase in the pipeline but a thoughtful design choice that shapes the quality, interpretability, and trustworthiness of all subsequent tasks.
15+
12 16
By critically evaluating the data and aligning preprocessing strategies with the end goals, we can ensure that the cleaned dataset becomes not only more manageable but also more valuable for deriving actionable insights. Ultimately, thoughtful data assessment is just as important as the technical preprocessing steps themselves.
17+
18+
::: callout-tip
19+
## 🤓 Suggested Readings
20+
21+
Chai, C. P. (2023). Comparison of text preprocessing methods. *Natural Language Engineering*, *29*(3), 509-553. <https://doi.org/10.1017/S1351324922000213>
22+
23+
Siino, M., Tinnirello, I., & La Cascia, M. (2024). Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers. *Information Systems*, *121*, 102342. <https://doi.org/10.1016/j.is.2023.102342>
24+
:::
