NLTK's Punkt tokenizer implements a language-independent, unsupervised algorithm for sentence boundary detection. This algorithm should be ported to Penelope, along with hooks for easy retraining on a corpus and an optional list of language-specific special-case regexes.
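To make the requested hooks concrete, here is a minimal sketch of what the interface could look like. It is written in Python purely for illustration, and it is not the Punkt algorithm: the abbreviation heuristic is a deliberately simplified stand-in, and the names `SentenceSplitter`, `train`, `split`, `special_cases`, and `min_count` are hypothetical, not part of NLTK or Penelope.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional, Pattern, Set


class SentenceSplitter:
    """Toy splitter showing the two hooks: corpus retraining and special-case regexes.

    Hypothetical interface; the learning step is a simplified heuristic,
    not the actual Punkt algorithm.
    """

    def __init__(self, special_cases: Optional[Iterable[str]] = None) -> None:
        self.abbreviations: Set[str] = set()
        # Language-specific patterns; a period inside any match never ends a sentence.
        self.special_cases: List[Pattern[str]] = [
            re.compile(p) for p in (special_cases or [])
        ]

    def train(self, corpus: str, min_count: int = 2) -> None:
        """Retraining hook: relearn abbreviation candidates from raw text.

        Assumed heuristic: a word ending in '.' that is followed by a
        lowercase letter or digit at least `min_count` times is an abbreviation.
        """
        counts: Counter = Counter()
        for m in re.finditer(r"\b(\w+)\.\s+(?=[a-z0-9])", corpus):
            counts[m.group(1).lower()] += 1
        self.abbreviations = {w for w, c in counts.items() if c >= min_count}

    def _is_protected(self, text: str, boundary: "re.Match[str]") -> bool:
        pos = boundary.start()
        if text[pos] == ".":
            word = re.search(r"\w+$", text[:pos])
            if word and word.group().lower() in self.abbreviations:
                return True
        return any(
            sm.start() <= pos < sm.end()
            for pattern in self.special_cases
            for sm in pattern.finditer(text)
        )

    def split(self, text: str) -> List[str]:
        sentences: List[str] = []
        start = 0
        # Every run of terminal punctuation followed by whitespace is a candidate.
        for m in re.finditer(r"[.!?]+\s+", text):
            if self._is_protected(text, m):
                continue
            sentences.append(text[start:m.end()].strip())
            start = m.end()
        tail = text[start:].strip()
        if tail:
            sentences.append(tail)
        return sentences


# Usage: retrain on a small corpus and add one English-specific special case.
splitter = SentenceSplitter(special_cases=[r"\bDr\."])
splitter.train("It took approx. five hours. We paid approx. ten dollars. Then we left.")
result = splitter.split("Dr. Lee waited approx. two hours. Then she left!")
# result == ["Dr. Lee waited approx. two hours.", "Then she left!"]
```

The point of the sketch is the separation of concerns: `train` can be rerun on any corpus without touching the special cases, and the regex list covers patterns the learned statistics miss (here, a title followed by a capitalized name). A real port would replace the body of `train` with Punkt's collocation and abbreviation statistics.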