- What is the question?
- What was the approach?
- What problems did I encounter?
- What results did I get?
- What new ideas did this generate?
a) incorrect regex -> missing or extra words
b) vernacular words
c) extra data (title and author randomly placed in text)
d) translator bias
e) time-frame of the public-domain (out-of-copyright) texts
f) various forms of words (verb tenses, possessive, etc.)
d) Translatability - context
g) Time-period, gender - context
h) Genre - context
i) Semantics (the same word used for different meaning) - context
e) Only non-copyright books
a) Empty space as a word (stopwords/html not yet caught)
e) Restricted to books on Project Gutenberg
j) Extraneous words (Gutenberg)
k) Words that don't carry significant meaning alone (would/could/will)
i) Semantics (the same word used for different meaning) - context
- Multiple Contexts
i) Semantics (the same word used for different meaning) - context
f) various forms of words (verb tenses, possessive, etc.)
- Missing context
l) missing pages
m) table of contents (not regular text)
?) words in a vacuum (no information on word frequency in relation to other words)
- Incorrect/tampered/filtered
c) copyright
j) legalese, formatting words
n) incorrect digital copies
a) empty characters
c) The Gutenberg books include lengthy licenses, which should not be counted in the analysis
g) Books written in different eras
h) Books of differing genres
l) Partial/Incomplete Books (e.g. missing chapters)
d) Incorrect translations
a) Results change depending on whether words containing punctuation count as the same word as the version without punctuation
o) wrong author
p) wrong unit (chapter not whole novel?)
q) why exclude stopwords
r) what about handwriting?
s) lost word order by looking into frequencies
t) how about infrequent words or looking at the entire distribution
?) words in a vacuum (no information on word frequency in relation to other words)
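
Items c) and j) above (license text and other Gutenberg boilerplate inflating the counts) could be handled with a pre-processing step along these lines. This is a minimal sketch, not the method used in the exercise: the function name `strip_gutenberg_boilerplate` is hypothetical, and the marker patterns are an assumption, since the exact wording of the start/end lines varies between Gutenberg files.

```python
import re

# Assumed marker patterns: many Project Gutenberg plain-text files wrap the
# book body in "*** START OF ... ***" / "*** END OF ... ***" lines, but the
# exact wording differs between files, so these may need adjusting.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return only the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()
```

This only removes the header and the license; title/author lines or a table of contents inside the body (items c and m) would still need separate handling.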
a) Empty space as a word (stopwords/html not yet caught)
a) Results change depending on whether words containing punctuation count as the same word as the version without punctuation
a) empty characters
a) incorrect regex -> missing or extra words
b) vernacular words
c) The Gutenberg books include lengthy licenses, which should not be counted in the analysis
c) copyright
c) extra data (title and author randomly placed in text)
d) Incorrect translations
d) Translatability - context
d) translator bias
e) Only non-copyright books
e) Restricted to books on Project Gutenberg
e) time-frame of the public-domain (out-of-copyright) texts
f) various forms of words (verb tenses, possessive, etc.)
f) various forms of words (verb tenses, possessive, etc.)
g) Books written in different eras
g) Time-period, gender - context
h) Books of differing genres
h) Genre - context
i) Semantics (the same word used for different meaning) - context
i) Semantics (the same word used for different meaning) - context
i) Semantics (the same word used for different meaning) - context
j) Extraneous words (Gutenberg)
j) legalese, formatting words
k) Words that don't carry significant meaning alone (would/could/will)
l) Partial/Incomplete Books (e.g. missing chapters)
l) missing pages
m) table of contents (not regular text)
n) incorrect digital copies
o) wrong author
p) wrong unit (chapter not whole novel?)
q) why exclude stopwords
r) what about handwriting?
s) lost word order by looking into frequencies
t) how about infrequent words or looking at the entire distribution
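
Several of the a), k) and q) items (empty strings counted as words, punctuation splitting one word into several variants, whether to drop stopwords) come down to tokenization choices. The sketch below shows one possible set of choices so their effect can be compared. Everything in it is an assumption for illustration: the names `word_counts` and `relative_frequencies`, the token regex, and the tiny stopword set are hypothetical, not the ones used in the original analysis.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "would", "could", "will"}

# A token is a run of ASCII letters, optionally with internal straight
# apostrophes ("dog's"). Because only letter runs match, empty strings and
# bare punctuation can never become tokens. Accented letters and curly
# apostrophes are NOT handled by this simple pattern.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def word_counts(text, remove_stopwords=True):
    """Lowercase the text, extract tokens, and count them."""
    tokens = TOKEN_RE.findall(text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)


def relative_frequencies(counts):
    """Express each count as a share of all counted tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

Calling `word_counts(text, remove_stopwords=False)` alongside the default makes it possible to see how much the stopword decision in q) changes the ranking, and `relative_frequencies` gives each word's share of the total rather than a raw count. Different word forms (item f) are still counted separately here; merging them would need stemming or lemmatization.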