Description
To improve the efficiency of our models, we often preprocess the documents before running the models themselves.
This includes methods such as segmenting and tokenizing the text (breaking the document into sentences and words); removing 'stop-words', which are frequent words in the language that don't contribute to the meaning of the text; stemming and/or lemmatizing words; and more.
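As a rough illustration of these steps, here is a minimal, dependency-free sketch. Everything in it is an assumption made for the example: the function names, the tiny stop-word list, and the suffix-stripping rule, which is only a crude stand-in for a real stemmer or lemmatizer. In practice NLTK or spaCy would replace each of these functions.

```python
import re

# Tiny illustrative stop-word list (an assumption); NLTK and spaCy ship fuller ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase a sentence and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in, not the Porter stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess_text(text):
    """Run the steps in order and return one token list per sentence."""
    return [
        [naive_stem(t) for t in remove_stop_words(tokenize(s))]
        for s in split_sentences(text)
    ]
```

Keeping each step in its own small function means each one can be unit-tested separately and swapped for a library implementation later.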
For more information about the process, we encourage you to read chapters 3 and 5 of the NLTK book: http://www.nltk.org/book/
Besides NLTK, another possible package is spaCy: https://spacy.io/usage/spacy-101#section-lightning-tour
To sum up: the script, which should be divided into testable functions, should receive a dataset with a documents column as input and return the dataset with an additional column, preprocessed_docs, which contains the preprocessing result.
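One possible shape for that interface is sketched below. It assumes the dataset is a plain dict mapping column names to equal-length lists, used here as a stand-in for whatever table type the project actually uses. The names add_preprocessed_column and simple_preprocess are hypothetical, and the placeholder preprocessing only lowercases and splits on whitespace.

```python
def simple_preprocess(document):
    """Placeholder preprocessing (an assumption): lowercase and split on whitespace."""
    return document.lower().split()


def add_preprocessed_column(dataset, preprocess=simple_preprocess):
    """Return a copy of the dataset with a new 'preprocessed_docs' column.

    `dataset` is assumed to be a dict mapping column names to equal-length lists.
    The input is left unmodified.
    """
    if "documents" not in dataset:
        raise KeyError("dataset must contain a 'documents' column")
    # Copy every column so the caller's dataset is not mutated.
    result = {name: list(values) for name, values in dataset.items()}
    result["preprocessed_docs"] = [preprocess(doc) for doc in dataset["documents"]]
    return result


dataset = {"id": [1, 2], "documents": ["Hello World", "Stop words matter"]}
processed = add_preprocessed_column(dataset)
```

Passing the preprocessing function as a parameter keeps the column logic testable on its own. If the dataset turns out to be a pandas DataFrame, the same idea could be written as df.assign(preprocessed_docs=df["documents"].apply(preprocess)).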