Presentations

What is the question?
What was the approach?
What problems did I encounter?
What results did I get?
What new ideas did this generate?

Data problems

T1 From Alok/Deytia/Tyler/Kapil

a) incorrect regex -> missing or extra words

b) vernacular words

c) extra data (title and author randomly placed in text)

d) translator bias

e) time-frame of the public (out of copyright) texts

f) various forms of words (verb tenses, possessives, etc.)
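
Items a) and f) above can be illustrated with a small tokenizer. This is only a sketch under assumptions, not the regex any team actually used: the pattern `WORD_RE` and the helpers `tokenize`, `strip_possessive`, and `count_words` are hypothetical names. It assumes lowercase ASCII letters and straight apostrophes, so accented letters and curly quotes would need a wider pattern.

```python
import re
from collections import Counter

# Hypothetical pattern: a word is a run of letters, optionally joined by a
# single internal apostrophe or hyphen (e.g. "don't", "well-known").
WORD_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens in order."""
    return WORD_RE.findall(text.lower())


def strip_possessive(token):
    """Map a token ending in 's (e.g. "whale's") to its base form.

    Assumption: this also shortens contractions such as "it's" to "it".
    """
    return token[:-2] if token.endswith("'s") else token


def count_words(text):
    """Count words after folding possessives into their base form."""
    return Counter(strip_possessive(t) for t in tokenize(text))
```

Folding verb tenses together would need a stemmer or lemmatizer, which this sketch deliberately leaves out.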

T2 From Chunyn/John/RJ

d) Translatability - context

g) Time-period, gender - context

h) Genre - context

i) Semantics (the same word used for different meaning) - context

e) Only non-copyright books

T3 DJ/Mark/Sam

a) Empty space as a word (stopwords/html not yet caught)

e) Restricted to books on Project Gutenberg

j) Extraneous words (Gutenberg)

k) Words that don't carry significant meaning alone (would/could/will)

i) Semantics (the same word used for different meaning) - context
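
Items a) and k) above (empty strings counted as words, and words that carry little meaning alone) could be handled by a filtering pass like the one below. It is a minimal sketch: `STOPWORDS` is a tiny illustrative set, not a real stopword list, and `clean_tokens` is a hypothetical helper name.

```python
# Illustrative stopword set only; a real analysis would use a fuller list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "would", "could", "will"}


def clean_tokens(tokens, stopwords=STOPWORDS):
    """Drop empty or whitespace-only tokens and stopwords, keeping order."""
    cleaned = []
    for tok in tokens:
        tok = tok.strip().lower()
        if not tok:
            continue  # empty string left behind by splitting on single spaces
        if tok in stopwords:
            continue  # word judged to carry little meaning on its own
        cleaned.append(tok)
    return cleaned


# Splitting on a single space turns repeated spaces into empty tokens.
raw = "the whale  would  dive".split(" ")
print(clean_tokens(raw))
```

Passing `stopwords=set()` keeps every non-empty token, which makes it easy to compare results with and without stopword removal.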

T4 Camille/Mohammad/Bryan

Multiple Contexts

i) Semantics (the same word used for different meaning) - context

f) various forms of words (verb tenses, possessives, etc.)

Missing context

l) missing pages

m) table of contents (not regular text)

?) words in a vacuum (no information on word frequency in relation to other words)

Incorrect/tampered/filtered

c) copyright

j) legalese, formatting words

n) incorrect digital copies

a) empty characters
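
One possible response to the "words in a vacuum" item above is to report each word's share of all tokens instead of its raw count, so that texts of different lengths can be compared. The function name `relative_frequencies` is hypothetical, and this is a sketch of one normalization among several that could be used.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return each word's count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # no tokens, so no frequencies to report
    return {word: count / total for word, count in counts.items()}
```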

T5 David/Josh/Chris/Sadika

c) The Gutenberg books include lengthy licenses, which should not be counted in the analysis

g) Books written in different eras

h) Books of differing genres

l) Partial/Incomplete Books (e.g. missing chapters)

d) Incorrect translations

a) Results change depending on whether a word containing punctuation counts as the same word as its unpunctuated form
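
For item c) above, the license text could be cut away before counting. The sketch below assumes the file contains marker lines of the form `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` (or the `THIS` variant). That wording is an assumption and should be checked against the actual files, and `strip_gutenberg_boilerplate` is a hypothetical helper name.

```python
import re

# Assumed marker lines; verify the exact wording against the downloaded texts.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    body_start = start.end() if start else 0
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip()
```

Titles, author lines, and tables of contents inside the body are not removed by this step and would need separate handling.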

Additional

o) wrong author

p) wrong unit (chapter not whole novel?)

q) why exclude stopwords

r) what about handwriting?

s) lost word order by looking into frequencies

t) how about infrequent words or looking at the entire distribution
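
Items s) and t) above suggest two small extensions, sketched here with hypothetical helper names: counting adjacent word pairs (bigrams) keeps a little of the word order, and a frequency spectrum summarizes the whole distribution, including rare words, instead of only the top of the list.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs, which preserves local word order."""
    return Counter(zip(tokens, tokens[1:]))


def frequency_spectrum(tokens):
    """Map each frequency to the number of distinct words occurring that often."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())


tokens = "the cat saw the cat run".split()
print(bigram_counts(tokens).most_common(1))
print(frequency_spectrum(tokens))
```

In the spectrum, the entry for frequency 1 counts the words that occur exactly once, which is one way to look at the infrequent words raised in item t).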

Summary of issues

?) words in a vacuum (no information on word frequency in relation to other words)

a) Empty space as a word (stopwords/html not yet caught)

a) Results change depending on whether a word containing punctuation counts as the same word as its unpunctuated form

a) empty characters

a) incorrect regex -> missing or extra words

b) vernacular words

c) The Gutenberg books include lengthy licenses, which should not be counted in the analysis

c) copyright

c) extra data (title and author randomly placed in text)

d) Incorrect translations

d) Translatability - context

d) translator bias

e) Only non-copyright books

e) Restricted to books on Project Gutenberg

e) time-frame of the public (out of copyright) texts

f) various forms of words (verb tenses, possessives, etc.)

g) Books written in different eras

g) Time-period, gender - context

h) Books of differing genres

h) Genre - context

i) Semantics (the same word used for different meaning) - context

j) Extraneous words (Gutenberg)

j) legalese, formatting words

k) Words that don't carry significant meaning alone (would/could/will)

l) Partial/Incomplete Books (e.g. missing chapters)

l) missing pages

m) table of contents (not regular text)

n) incorrect digital copies

o) wrong author

p) wrong unit (chapter not whole novel?)

q) why exclude stopwords

r) what about handwriting?

s) lost word order by looking into frequencies

t) how about infrequent words or looking at the entire distribution
