
Example datasets

cjrd edited this page Jun 27, 2012 · 11 revisions

# Topic Models Email List Archive

This dataset consists of 1887 email messages from the Topic-models mailing list archive, sent between September 2006 and May 2012. Quoted text in reply emails has been (mostly) scrubbed by removing every line that begins with '>'. Signatures have also been (mostly) removed by dropping all text that follows a sequence of dashes, e.g.

---

John Smith
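
The scrubbing described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the dataset: the function name `scrub_message` is hypothetical, and the rule that a signature starts at a line made up only of two or more dashes is an assumption, since the exact pattern is not given here.

```python
import re

# Assumption: a signature delimiter is a line consisting only of two or more
# dashes, optionally followed by whitespace. The real rule may differ.
SIGNATURE_DELIMITER = re.compile(r"^-{2,}\s*$")


def scrub_message(body: str) -> str:
    """Drop quoted lines and everything after a dash-only delimiter line."""
    kept = []
    for line in body.splitlines():
        if SIGNATURE_DELIMITER.match(line):
            break  # treat the rest of the message as a signature
        if line.startswith(">"):
            continue  # quoted text from an earlier message
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    message = "Hi all,\n> old quoted text\nI agree with this.\n---\n\nJohn Smith"
    print(scrub_message(message))
    # Hi all,
    # I agree with this.
```

A simple line-based rule like this is cheap but imperfect, which is consistent with the "(mostly)" caveats above: indented quotes or signatures without a dash line would slip through.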

# Coursera PGM Video Transcripts

This dataset consists of the 92 video transcripts from Coursera's free Probabilistic Graphical Models course.

# AP Articles

This dataset consists of 1085 Associated Press articles sampled at random from the 2046 AP articles provided as sample data for Dave Blei's LDA implementation.
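
A random subset of this kind could be drawn along the following lines. This is only a sketch: the page does not say how the sampling was actually done, and the helper name `sample_documents` and the fixed seed are assumptions added for reproducibility.

```python
import random


def sample_documents(documents, k, seed=0):
    """Return k documents drawn without replacement from `documents`."""
    rng = random.Random(seed)  # hypothetical seed, so the subset is repeatable
    return rng.sample(documents, k)


if __name__ == "__main__":
    # Placeholder document IDs standing in for the full article collection.
    all_articles = [f"article_{i}" for i in range(2046)]
    subset = sample_documents(all_articles, 1085)
    print(len(subset))
```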

# NSF Grants

This dataset consists of a random subset of 1166 NSF grant abstracts from the NSF Research Awards Corpus, covering 2000 to 2003.

# New York Times

This dataset consists of 845 semi-processed New York Times articles taken from David Newman's Topic Modeling Tool. It has been removed from the online version.
