Presentations

  • What is the question?
  • What was the approach?
  • What problems did I encounter?
  • What results did I get?
  • What new ideas did this generate?

Data problems

T1 From Alok/Deytia/Tyler/Kapil

a) incorrect regex -> missing or extra words

b) vernacular words

c) extra data (title and author randomly placed in text)

d) translator bias

e) time-frame of the public (out of copyright) texts

f) various forms of words (verb tenses, possessive, etc)

T2 From Chunyn/John/RJ

d) Translatability - context

g) Time-period, gender - context

h) Genre - context

i) Semantics (the same word used for different meaning) - context

e) Only non-copyright books

T3 DJ/Mark/Sam

a) Empty space as a word (stopwords/html not yet caught)

e) Restricted to books on Project Gutenberg

j) Extraneous words (Gutenberg)

k) Words that don't carry significant meaning alone (would/could/will)

i) Semantics (the same word used for different meaning) - context
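
Items a) and k) above can be illustrated with a small sketch. It assumes a naive regex split, which is one common way an empty string ends up counted as a word, and it uses a deliberately tiny, hypothetical stopword list (`STOPWORDS`) that includes the modal verbs the team mentioned. The function name `count_words` is also made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (not from any library).
STOPWORDS = {"the", "a", "and", "of", "would", "could", "will"}

def count_words(text, stopwords=STOPWORDS):
    """Count words, dropping empty tokens and stopwords."""
    # Splitting on runs of non-letters leaves "" at the edges of the text.
    raw = re.split(r"[^a-z']+", text.lower())
    # Trim stray apostrophes left over from quotation marks.
    stripped = (t.strip("'") for t in raw)
    # Drop the empty strings and anything in the stopword list.
    tokens = [t for t in stripped if t and t not in stopwords]
    return Counter(tokens)

counts = count_words("  The whale would surface, and the whale will dive. ")
print(counts.most_common())
```

Without the `if t` filter, the empty string would show up as a "word" in the counts, which is the symptom described in a).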

T4 Camille/Mohammad/Bryan

  • Multiple Contexts

i) Semantics (the same word used for different meaning) - context

f) various forms of words (verb tenses, possessive, etc)

  • Missing context

l) missing pages

m) table of contents (not regular text)

?) words in a vacuum (no information on word frequency in relation to other words)

  • Incorrect/tampered/filtered

c) copyright

j) legalese, formatting words

n) incorrect digital copies

a) empty characters

T5 David/Josh/Chris/Sadika

c) The Gutenberg books include lengthy licenses, which should not be counted in the analysis

g) Books written in different eras

h) Books of differing genres

l) Partial/Incomplete Books (e.g., missing chapters)

d) Incorrect translations

a) Results change depending on whether a word with punctuation attached counts as the same word as the version without punctuation
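
Items c) and a) above lend themselves to a short sketch. It assumes the license text is separated from the book by lines starting with `*** START OF` and `*** END OF`; many Project Gutenberg plain-text files use markers of this kind, but the exact wording varies, so treat the patterns as an assumption to check against your own files. The names `strip_license` and `tokenize` are hypothetical.

```python
import re

# Assumed marker format; real files vary, so verify against your corpus.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)

def strip_license(text):
    """Keep only the text between the start and end marker lines."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0      # no marker: keep from the top
    stop = end.start() if end else len(text)  # no marker: keep to the end
    return text[begin:stop].strip()

def tokenize(text):
    """Lowercase words; punctuation is dropped, inner apostrophes are kept."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

sample = (
    "License preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Hello, hello! Don't stop.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License terms\n"
)
print(tokenize(strip_license(sample)))
```

With this tokenizer, "Hello," and "hello!" count as the same word. Whether that is the right choice depends on the question being asked, which is the point of a).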

Additional

o) wrong author

p) wrong unit (chapter not whole novel?)

q) why exclude stopwords

r) what about handwriting?

s) lost word order by looking into frequencies

t) how about infrequent words or looking at the entire distribution
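
Items s) and t) above can be made concrete with a minimal sketch. It assumes the text is already tokenized into a list of words. Adjacent word pairs (bigrams) recover a little of the word order that plain frequencies discard, and a "frequency of frequencies" table looks at the whole distribution, including words that occur only once. The function names are hypothetical.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs, which keeps some word-order information."""
    return Counter(zip(tokens, tokens[1:]))

def frequency_spectrum(tokens):
    """Map each count n to the number of distinct words occurring n times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())

tokens = "the cat sat on the mat the cat".split()
print(bigram_counts(tokens).most_common(1))
print(sorted(frequency_spectrum(tokens).items()))
```

In the spectrum, the entry for 1 is the number of words that appear exactly once, the infrequent words that a top-N frequency list never shows.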


Summary of issues

?) words in a vacuum (no information on word frequency in relation to other words)

a) Empty space as a word (stopwords/html not yet caught)

a) Results change depending on whether a word with punctuation attached counts as the same word as the version without punctuation

a) empty characters

a) incorrect regex -> missing or extra words

b) vernacular words

c) The Gutenberg books include lengthy licenses, which should not be counted in the analysis

c) copyright

c) extra data (title and author randomly placed in text)

d) Incorrect translations

d) Translatability - context

d) translator bias

e) Only non-copyright books

e) Restricted to books on Project Gutenberg

e) time-frame of the public (out of copyright) texts

f) various forms of words (verb tenses, possessive, etc)

f) various forms of words (verb tenses, possessive, etc)

g) Books written in different eras

g) Time-period, gender - context

h) Books of differing genres

h) Genre - context

i) Semantics (the same word used for different meaning) - context

i) Semantics (the same word used for different meaning) - context

i) Semantics (the same word used for different meaning) - context

j) Extraneous words (Gutenberg)

j) legalese, formatting words

k) Words that don't carry significant meaning alone (would/could/will)

l) Partial/Incomplete Books (e.g., missing chapters)

l) missing pages

m) table of contents (not regular text)

n) incorrect digital copies

o) wrong author

p) wrong unit (chapter not whole novel?)

q) why exclude stopwords

r) what about handwriting?

s) lost word order by looking into frequencies

t) how about infrequent words or looking at the entire distribution