Write a script under src/papers/features that preprocesses the NIPS dataset #6

To improve the efficiency of our models, we often preprocess the documents before running the modeling step itself.

This includes methods such as segmenting and tokenizing the text (breaking the document into sentences and words); removing 'stop-words', which are frequent words in the language that contribute little to the meaning of the text; stemming and/or lemmatizing words; and more.
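The steps listed above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function names, the tiny stop-word list, and the suffix-stripping "stemmer" are all hypothetical stand-ins, and a real script would more likely rely on the tokenizers, stop-word lists, and stemmers or lemmatizers that NLTK or spaCy provide.

```python
import re

# Tiny illustrative stop-word list (an assumption); NLTK and spaCy ship fuller ones.
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to"})


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


def crude_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess_text(text):
    """Segment, tokenize, remove stop words, and stem a single document."""
    tokens = []
    for sentence in split_sentences(text):
        for token in remove_stop_words(tokenize(sentence)):
            tokens.append(crude_stem(token))
    return tokens
```

Keeping each step in its own small function makes it easy to test the steps separately and to swap any one of them for a library-backed version later.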

For more info about the process, we encourage you to visit chapters 3 & 5 in http://www.nltk.org/book/

Besides NLTK, another possible package is spaCy: https://spacy.io/usage/spacy-101#section-lightning-tour

To sum up: the script, which should be divided into testable functions, should receive a dataset with a documents column as input, and return the dataset with an additional column, preprocessed_docs, which contains the preprocessing result.
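One possible shape for that entry point is sketched below. The names add_preprocessed_docs and preprocess_document are hypothetical, as is the small stop-word list; only the documents and preprocessed_docs column names come from the description above. The sketch assumes the dataset supports column lookup, column assignment, and .copy(), which holds for a plain dict of columns and should also hold for a pandas DataFrame.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the project's real list).
STOP_WORDS = frozenset({"a", "an", "and", "are", "in", "is", "of", "the", "to"})


def preprocess_document(text):
    """Lowercase, tokenize on alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def add_preprocessed_docs(dataset, preprocess=preprocess_document):
    """Return a copy of the dataset with a 'preprocessed_docs' column added.

    The input is left unmodified; 'preprocess' is applied to every entry
    of the 'documents' column.
    """
    if "documents" not in dataset:
        raise KeyError("dataset must contain a 'documents' column")
    result = dataset.copy()
    result["preprocessed_docs"] = [preprocess(doc) for doc in dataset["documents"]]
    return result


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the NIPS dataset.
    toy = {"documents": ["The brain is a network.", "Learning of kernels"]}
    print(add_preprocessed_docs(toy)["preprocessed_docs"])
```

Passing the preprocessing function as a parameter keeps the column handling testable on its own, independently of whichever NLP library ends up doing the actual work.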
