This is a project to test out nbdev
Preprocess.lemmatize[source]
Preprocess.lemmatize()
Returns stemmed or lemmatized documents with punctuation and stopwords removed
get_freq[source]
get_freq(preprocessed_documents)
Returns a list of per-document term frequencies (one Counter per document) and a vocabulary list
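As a rough illustration of what this step computes, here is a minimal sketch built on `collections.Counter`. The name `get_freq_sketch` is hypothetical and this is not the package's own code. The sorted vocabulary is an assumption made for reproducibility; the package output shown in the usage example is unordered.

```python
from collections import Counter


def get_freq_sketch(preprocessed_documents):
    """Hypothetical stand-in for get_freq: count tokens per document."""
    # One Counter per document, mapping token -> number of occurrences.
    doc_freq = [Counter(tokens) for tokens in preprocessed_documents]
    # Vocabulary: every distinct token across all documents.
    # Sorting is an assumption; it only makes the order deterministic.
    vocabulary = sorted(set().union(*preprocessed_documents))
    return doc_freq, vocabulary


doc_freq, vocabulary = get_freq_sketch([["hello", "world"], ["hello"]])
```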
form_matrix[source]
form_matrix(doc_freq,vocabulary)
Returns a matrix of tf-idf vectors.
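The exact weighting used by the package is not documented here. The sketch below assumes one common variant: term frequency normalised by document length, multiplied by `log(N / df)`. The name `form_matrix_sketch` is hypothetical.

```python
import math
from collections import Counter


def form_matrix_sketch(doc_freq, vocabulary):
    """Hypothetical stand-in for form_matrix: one tf-idf row per document."""
    n_docs = len(doc_freq)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(1 for counts in doc_freq if term in counts)
          for term in vocabulary}
    matrix = []
    for counts in doc_freq:
        total = sum(counts.values())
        row = []
        for term in vocabulary:
            # Assumed weighting: tf = count / document length, idf = log(N / df).
            tf = counts.get(term, 0) / total if total else 0.0
            idf = math.log(n_docs / df[term]) if df[term] else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix


docs = [Counter({"hello": 1, "world": 1}), Counter({"hello": 1})]
matrix = form_matrix_sketch(docs, ["hello", "world"])
```

With this variant, a term that occurs in every document gets a weight of zero, because `log(N / N)` is 0.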
get_query_vec[source]
get_query_vec(preprocessed_query,vocab,doc_freq)
Returns the tf-idf vector of the input query
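A query can be projected into the same vector space by reusing the documents' vocabulary and document frequencies. The sketch below assumes the weighting `tf * log(N / df)`, which may differ from what the package does. Query tokens that are not in the vocabulary contribute nothing to the vector. The name `get_query_vec_sketch` is hypothetical.

```python
import math
from collections import Counter


def get_query_vec_sketch(preprocessed_query, vocab, doc_freq):
    """Hypothetical stand-in for get_query_vec: tf-idf vector for one query."""
    n_docs = len(doc_freq)
    counts = Counter(preprocessed_query)
    total = len(preprocessed_query)
    vector = []
    for term in vocab:
        # idf comes from the document collection, not from the query.
        df = sum(1 for doc_counts in doc_freq if term in doc_counts)
        tf = counts[term] / total if total else 0.0
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(tf * idf)
    return vector


docs = [Counter({"hello": 1, "world": 1}), Counter({"hello": 1})]
query_vec = get_query_vec_sketch(["world", "unknown"], ["hello", "world"], docs)
```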
get_cos_sim[source]
get_cos_sim(matrix,vector)
Returns the 10 most similar documents, ranked by cosine similarity between each document vector and the query vector
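Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch of the ranking step follows. The name `get_cos_sim_sketch`, the `top_k` parameter, and the `(index, score)` return format are assumptions for illustration, not the package's interface.

```python
import math


def get_cos_sim_sketch(matrix, vector, top_k=10):
    """Hypothetical stand-in for get_cos_sim: rank rows by cosine similarity."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    query_norm = norm(vector)
    scores = []
    for idx, row in enumerate(matrix):
        denom = norm(row) * query_norm
        dot = sum(a * b for a, b in zip(row, vector))
        # An all-zero row or query has no direction; score it as 0.
        scores.append((idx, dot / denom if denom else 0.0))
    # Highest similarity first, then keep at most top_k results.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


ranked = get_cos_sim_sketch([[1, 0], [0, 1], [1, 1]], [1, 0], top_k=2)
```

Each entry in `ranked` pairs a document's row index with its similarity score, so the indices can be mapped back to the original documents.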
pip install nbdev_testing
documents = ["Hello world", "NLP is fun", "We work at the bank"]
text = Preprocess(documents)
preprocessed = text.lemmatize()
preprocessed
[['hello', 'world'], ['NLP', 'fun'], ['-PRON-', 'work', 'bank']]
document_frequency, vocabulary = get_freq(preprocessed)
document_frequency
[Counter({'hello': 1, 'world': 1}),
Counter({'NLP': 1, 'fun': 1}),
Counter({'-PRON-': 1, 'work': 1, 'bank': 1})]
vocabulary
['NLP', 'world', 'fun', 'work', 'bank', 'hello', '-PRON-']