Write a script under src/papers/features that preprocesses the NIPS dataset #6

@liadmagen

To improve the efficiency of our models, we often preprocess the documents before feeding them into the modeling step itself.

This includes methods such as segmenting and tokenizing the text (breaking the document into sentences and words); removing 'stop-words', which are frequent words in the language that don't contribute to the meaning of the text; stemming and/or lemmatizing words; and more.
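As a rough illustration of the first few steps, here is a minimal standard-library sketch. The function names, the regex-based sentence splitter and tokenizer, and the tiny `STOP_WORDS` set are all assumptions made for illustration; they are not part of the project, and a real script would more likely delegate these steps (plus stemming or lemmatizing) to nltk or spacy.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real script would load a full list from nltk or spacy instead.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def split_sentences(text):
    """Naively split text into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Lowercase a sentence and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text):
    """Return one list of filtered tokens per sentence of the text."""
    return [remove_stop_words(tokenize(s)) for s in split_sentences(text)]
```

Keeping each step in its own small function makes it possible to unit-test the steps separately and to swap any one of them for a library implementation later.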

For more information about the process, see chapters 3 and 5 of the NLTK book: http://www.nltk.org/book/

Besides nltk, another possible package is spacy: https://spacy.io/usage/spacy-101#section-lightning-tour

To sum up: the script, which should be divided into testable functions, should receive a dataset with a documents column as input and return the dataset with an additional column, preprocessed_docs, which contains the preprocessing result.
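The input/output contract could look roughly like the sketch below. It assumes the dataset is held as a list of dict rows (for example, as produced by `csv.DictReader`); the issue does not say how the dataset is loaded, so this representation, the function names, and the placeholder `simple_preprocess` step are all hypothetical.

```python
def simple_preprocess(doc):
    """Placeholder preprocessing step: lowercase and split on whitespace."""
    return doc.lower().split()


def add_preprocessed_docs(rows, preprocess=simple_preprocess):
    """Return copies of the rows with a 'preprocessed_docs' column added.

    Each row is assumed to be a dict with a 'documents' key. The input
    rows are left unmodified.
    """
    return [
        {**row, "preprocessed_docs": preprocess(row["documents"])}
        for row in rows
    ]


if __name__ == "__main__":
    dataset = [{"id": "1", "documents": "Deep Learning for Text"}]
    print(add_preprocessed_docs(dataset))
```

Passing the preprocessing step in as a parameter keeps the column-adding function trivial to test on its own. If the dataset turns out to be a pandas DataFrame, the equivalent would presumably be a single `apply` over the documents column.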
