Adding gmb dataset #39
Open
tejasvaidhyadev wants to merge 16 commits into JuliaText:master from tejasvaidhyadev:adding_GMB_Dataset
Commits (16):
3c86c51 adding GMB.jl (tejasvaidhyadev)
18b3939 adding GMB_DataDeps (tejasvaidhyadev)
45f0867 checksumupdate (tejasvaidhyadev)
c21c4dd updating module (tejasvaidhyadev)
65b49f1 Update src/GMB.jl (tejasvaidhyadev)
d78f5be Update src/GMB.jl (tejasvaidhyadev)
0ebe1e8 Update src/GMB.jl (tejasvaidhyadev)
35fda3b Update src/GMB_DataDeps.jl (tejasvaidhyadev)
2bb03e8 indentation correction with use of glob (tejasvaidhyadev)
9fa1deb updated_docs with GMB (tejasvaidhyadev)
64896b0 updated make.jl (tejasvaidhyadev)
535b8f6 updating Read me (tejasvaidhyadev)
b322cf4 adding tests (tejasvaidhyadev)
c96c371 updating docs (tejasvaidhyadev)
c3bb93f updated GMB.jl (tejasvaidhyadev)
891d5c2 added about POS (tejasvaidhyadev)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| # GMB | ||
| This dataset is an extract from the GMB corpus, which is tagged, annotated, | ||
| and built specifically to train classifiers to predict named entities such as names and locations. | ||
|
|
||
| GMB is a fairly large corpus with a lot of annotations. | ||
| Unfortunately, GMB is not perfect. It is not a gold-standard corpus: it is not completely human-annotated and it is not considered 100% correct. | ||
| The corpus was created using existing annotators and then corrected by humans where needed. | ||
|
|
||
| The Groningen Meaning Bank (GMB) consists of public domain English texts with corresponding syntactic and semantic representations. | ||
| The GMB is developed at the [University of Groningen](https://www.rug.nl/). | ||
| A multi-lingual version of the GMB is the [Parallel Meaning Bank](https://pmb.let.rug.nl/). A thorough description of the GMB can be found in the Handbook of Linguistic Annotation. | ||
|
|
||
| The part-of-speech tagset used is the Penn Treebank tagset, as listed in Ann Taylor, Mitchell Marcus and Beatrice Santorini (2003): [The Penn Treebank: An Overview](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.8216), Section 1.1. | ||
|
|
||
|
|
||
|
|
||
| For more details, see the [GMB website](https://gmb.let.rug.nl/about.php). | ||
|
|
||
| ```julia | ||
|
|
||
| julia> Data = load(GMB()) | ||
| 37789-element Array{Array{PosTaggedWord,1},1}: | ||
| [PosTaggedWord("NNS", "Families"), PosTaggedWord("IN", "of"), PosTaggedWord("NNS", "soldiers"), PosTaggedWord("VBN", "killed"), PosTaggedWord("IN", "in"), PosTaggedWord("DT", "the"), PosTaggedWord("NN", "conflict"), PosTaggedWord("VBD", "joined"), PosTaggedWord("DT", "the"), PosTaggedWord("NNS", "protesters") … PosTaggedWord("CD", "One"), PosTaggedWord("NN", "Terrorist"), PosTaggedWord("RQU", "\""), PosTaggedWord("CC", "and"), PosTaggedWord("LQU", "\""), PosTaggedWord("VB", "Stop"), PosTaggedWord("DT", "the"), PosTaggedWord("NNS", "Bombings"), PosTaggedWord(".", "."), PosTaggedWord("LQU", "\"")] | ||
|
|
||
| [PosTaggedWord("PRP", "They"), PosTaggedWord("VBD", "marched"), PosTaggedWord("IN", "from"), PosTaggedWord("DT", "the"), PosTaggedWord("NNS", "Houses"), PosTaggedWord("IN", "of"), PosTaggedWord("NN", "Parliament"), PosTaggedWord("TO", "to"), PosTaggedWord("DT", "a"), PosTaggedWord("NN", "rally"), PosTaggedWord("IN", "in"), PosTaggedWord("NNP", "Hyde"), PosTaggedWord("NNP", "Park"), PosTaggedWord(".", ".")] | ||
|
|
||
| [PosTaggedWord("NNS", "Police"), PosTaggedWord("VBD", "put"), PosTaggedWord("DT", "the"), PosTaggedWord("NN", "number"), PosTaggedWord("IN", "of"), PosTaggedWord("NNS", "marchers"), PosTaggedWord("IN", "at"), PosTaggedWord("CD", "10,000"), PosTaggedWord("IN", "while"), PosTaggedWord("NNS", "organizers"), PosTaggedWord("VBD", "claimed"), PosTaggedWord("PRP", "it"), PosTaggedWord("VBD", "was"), PosTaggedWord("CD", "100,000"), PosTaggedWord(".", ".")] | ||
|
|
||
| ⋮ | ||
|
|
||
| [PosTaggedWord("IN", "At"), PosTaggedWord("JJ", "last"), PosTaggedWord("DT", "the"), PosTaggedWord("NNP", "Goatherd"), PosTaggedWord("VBD", "threw"), PosTaggedWord("DT", "a"), PosTaggedWord("NN", "stone"), PosTaggedWord(",", ","), PosTaggedWord("CC", "and"), PosTaggedWord("VBG", "breaking") … PosTaggedWord(",", ","), PosTaggedWord("VBD", "begged"), PosTaggedWord("DT", "the"), PosTaggedWord("NNP", "Goat"), PosTaggedWord("RB", "not"), PosTaggedWord("TO", "to"), PosTaggedWord("VB", "tell"), PosTaggedWord("PRP\$", "his"), PosTaggedWord("NN", "master"), PosTaggedWord(".", ".")] | ||
|
|
||
| [PosTaggedWord("DT", "The"), PosTaggedWord("NNP", "Goat"), PosTaggedWord("VBD", "replied"), PosTaggedWord(",", ","), PosTaggedWord("LQU", "\""), PosTaggedWord("WRB", "Why"), PosTaggedWord(",", ","), PosTaggedWord("PRP", "you"), PosTaggedWord("JJ", "silly"), PosTaggedWord("NN", "fellow") … PosTaggedWord("DT", "the"), PosTaggedWord("NN", "horn"), PosTaggedWord("MD", "will"), PosTaggedWord("VB", "speak"), PosTaggedWord("IN", "though"), PosTaggedWord("PRP", "I"), PosTaggedWord("VB", "be"), PosTaggedWord("JJ", "silent"), PosTaggedWord(".", "."), PosTaggedWord("LQU", "\"")] | ||
|
|
||
| [PosTaggedWord("VBP", "Do"), PosTaggedWord("RB", "not"), PosTaggedWord("VB", "attempt"), PosTaggedWord("TO", "to"), PosTaggedWord("VB", "hide"), PosTaggedWord("NNS", "things"), PosTaggedWord("WDT", "which"), PosTaggedWord("MD", "can"), PosTaggedWord("RB", "not"), PosTaggedWord("VB", "be"), PosTaggedWord("JJ", "hid"), PosTaggedWord(".", ".")] | ||
|
|
||
| julia> Data[1] | ||
| 30-element Array{PosTaggedWord,1}: | ||
| PosTaggedWord("NNS", "Families") | ||
| PosTaggedWord("IN", "of") | ||
| PosTaggedWord("NNS", "soldiers") | ||
| PosTaggedWord("VBN", "killed") | ||
| PosTaggedWord("IN", "in") | ||
| PosTaggedWord("DT", "the") | ||
| PosTaggedWord("NN", "conflict") | ||
| PosTaggedWord("VBD", "joined") | ||
| PosTaggedWord("DT", "the") | ||
| PosTaggedWord("NNS", "protesters") | ||
| PosTaggedWord("WP", "who") | ||
| PosTaggedWord("VBD", "carried") | ||
| PosTaggedWord("NNS", "banners") | ||
| PosTaggedWord("IN", "with") | ||
| ⋮ | ||
| PosTaggedWord("IN", "as") | ||
| PosTaggedWord("LQU", "\"") | ||
| PosTaggedWord("NNP", "Bush") | ||
| PosTaggedWord("NN", "Number") | ||
| PosTaggedWord("CD", "One") | ||
| PosTaggedWord("NN", "Terrorist") | ||
| PosTaggedWord("RQU", "\"") | ||
| PosTaggedWord("CC", "and") | ||
| PosTaggedWord("LQU", "\"") | ||
| PosTaggedWord("VB", "Stop") | ||
| PosTaggedWord("DT", "the") | ||
| PosTaggedWord("NNS", "Bombings") | ||
| PosTaggedWord(".", ".") | ||
| PosTaggedWord("LQU", "\"") | ||
|
|
||
| ``` | ||
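
The documentation above shows that `load(GMB())` returns a vector of sentences, each a vector of tagged words. As a rough illustration of working with that shape, here is a minimal sketch that counts how often each tag occurs. `TaggedToken` and `tag_counts` are hypothetical names introduced only so the example runs without CorpusLoaders; the real `PosTaggedWord` type may expose its fields differently.

```julia
# Hypothetical stand-in for a tagged word (tag first, word second, as in the output above).
struct TaggedToken
    tag::String
    word::String
end

# Count tag occurrences over a vector of sentences (each a vector of tokens).
function tag_counts(sentences)
    counts = Dict{String,Int}()
    for sent in sentences, tok in sent
        counts[tok.tag] = get(counts, tok.tag, 0) + 1
    end
    return counts
end

# Tiny hand-written sample in the same nested shape as the loaded corpus.
data = [
    [TaggedToken("PRP", "They"), TaggedToken("VBD", "marched"), TaggedToken(".", ".")],
    [TaggedToken("NNS", "Police"), TaggedToken("VBD", "put"),
     TaggedToken("DT", "the"), TaggedToken("NN", "number")],
]

counts = tag_counts(data)
```

The same loop should carry over to the real corpus if the tag is read through whatever accessor CorpusLoaders provides for it.
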
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| struct GMB{S} | ||
| filepath :: Vector{S} | ||
| end | ||
|
|
||
| function GMB(dirpath) | ||
|
oxinabox marked this conversation as resolved.
|
||
| @assert(isdir(dirpath), dirpath) | ||
| paths = glob("data/*/*/en.tags", dirpath) | ||
| GMB(paths) | ||
| end | ||
|
|
||
| GMB() = GMB(datadep"GMB 2.2.0") | ||
|
|
||
| MultiResolutionIterators.levelname_map(::Type{GMB}) = [ | ||
| :doc=>1, :contextfile=>1, :context=>1, :document=>1, | ||
| :para=>2, :paragraph=>2, | ||
| :sent=>3, :sentence=>3, | ||
| :word=>4, :token=>4, | ||
| :char=>5, :character=>5 | ||
| ] | ||
|
|
||
| function parse_gmb_tagged_word(line::AbstractString) | ||
| tokens_tags = split(line, '\t') | ||
| return PosTaggedWord(tokens_tags[2], tokens_tags[1]) | ||
| end | ||
|
|
||
| function parse_gmb(filename) | ||
| local sent = [] | ||
| sents = @NestedVector(PosTaggedWord, 2)() | ||
|
|
||
| function new_sentence() | ||
| sent = @NestedVector(PosTaggedWord, 1)() | ||
| push!(sents, sent) | ||
| end | ||
|
|
||
|
|
||
| # words | ||
| get_tagged(line) = push!(sent, parse_gmb_tagged_word(line)) | ||
|
|
||
| # parse | ||
| for line in eachline(filename) | ||
| if isempty(line) | ||
| new_sentence() | ||
| else | ||
| get_tagged(line) | ||
| end | ||
| end | ||
| return sents | ||
| end | ||
|
|
||
| function load(corpus::GMB) | ||
| ch = @NestedVector(PosTaggedWord, 2)() | ||
| for fn in corpus.filepath | ||
| document = parse_gmb(fn) | ||
| append!(ch, document) | ||
| end | ||
| return ch | ||
| end | ||
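
One thing to check in `parse_gmb` above: `sent` starts as an untyped `[]` that is never pushed to `sents`, and a new sentence is only opened when a blank line is read. As written, any tokens that precede the first blank line in a file appear to be dropped, and a trailing blank line adds an empty sentence. The following is a minimal sketch of one way to avoid both, not the PR's implementation. `TaggedToken`, `parse_tagged_line` and `parse_sentences` are hypothetical names, and the code reads from any `IO` so it can run without the corpus files.

```julia
# Hypothetical stand-in for PosTaggedWord so the sketch is self-contained.
struct TaggedToken
    tag::String
    word::String
end

# One tab-separated line: column 1 is the token, column 2 is the POS tag.
function parse_tagged_line(line::AbstractString)
    fields = split(line, '\t')
    return TaggedToken(String(fields[2]), String(fields[1]))
end

# Split tagged lines into sentences at blank lines, keeping the first sentence
# and skipping empty ones.
function parse_sentences(io::IO)
    sents = Vector{Vector{TaggedToken}}()
    sent = TaggedToken[]
    for line in eachline(io)
        if isempty(line)
            # A blank line closes the current sentence, if it has any tokens.
            isempty(sent) || push!(sents, sent)
            sent = TaggedToken[]
        else
            push!(sent, parse_tagged_line(line))
        end
    end
    # Keep the last sentence when the input does not end with a blank line.
    isempty(sent) || push!(sents, sent)
    return sents
end

sample = "They\tPRP\nmarched\tVBD\n.\t.\n\nPolice\tNNS\nput\tVBD\n"
sents = parse_sentences(IOBuffer(sample))
```

Adopting something like this would change the sentence count and the first element shown in the documentation output, so that output would need to be regenerated.
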
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| using DataDeps | ||
|
|
||
|
|
||
| for (ver, checksum) in [("1.0.0", "e151d953a0316c5712a52d56a5702f24cc1dc8f22425955821113437ec43a3b8"), | ||
| ("1.1.0", "3830e7071e43ca9e659d51f2f7c5e5afea9e233993251e9f45d628caa6a372c6"), | ||
| ("2.0.0", "30a700e2509eb1a484357a1f1e5f7f06ef8e9516267413061b7dfccdf8ba4215"), | ||
| ("2.1.0", "e4bd7d43f7b2c1618f896784c2b7df3acde3bfe93ef4fd6e5a7a196f54b6a4f9"), | ||
| ("2.2.0", "dd12f2617f745ea3cafa348c60ee374c804be238d184bcf91db7bd9f90261625")] | ||
|
|
||
| register(DataDep("GMB $ver", | ||
| """ | ||
| Website: https://gmb.let.rug.nl/data.php | ||
| Original Authors: Bos, Johan and Basile, Valerio and Evang, Kilian and Venhuizen, Noortje and Bjerva, Johannes | ||
|
|
||
| The Groningen Meaning Bank (GMB) consists of public domain English texts with corresponding syntactic and semantic representations. | ||
| The GMB is developed at the University of Groningen. A multi-lingual version of the GMB is the Parallel Meaning Bank. | ||
| A thorough description of the GMB can be found in the Handbook of Linguistic Annotation. | ||
|
|
||
| Please cite the following publication if you use the corpora: | ||
| Bos, Johan and Basile, Valerio and Evang, Kilian and Venhuizen, Noortje and Bjerva, Johannes. " Handbook of Linguistic Annotation, Publisher: Springer Netherlands, Editors: Nancy Ide, James Pustejovsky, pp.463-496." | ||
| """, | ||
| "https://gmb.let.rug.nl/releases/gmb-$(ver).zip", | ||
| checksum; | ||
| post_fetch_method = function (fn) | ||
| unpack(fn) | ||
| innerdir = "gmb-$(ver)" | ||
| innerfiles = readdir(innerdir) | ||
| # Move everything to current directory, under same name | ||
| mv.(joinpath.(innerdir, innerfiles), innerfiles) | ||
| rm(innerdir) | ||
| end | ||
| )) | ||
| end |
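
The `post_fetch_method` above unpacks the archive and then moves the contents of the inner directory up one level. A minimal sketch of that flattening step on its own, with an explicit check that the expected inner directory exists, might look like the code below. `flatten_innerdir` and the directory name `gmb-x.y.z` are hypothetical and used only for the demonstration in a temporary directory; the actual directory name inside the GMB archives is an assumption that should be verified against a real download.

```julia
# Move every entry of `destdir/innername` up into `destdir`, then remove the
# emptied inner directory. Fails loudly if the inner directory is missing.
function flatten_innerdir(destdir::AbstractString, innername::AbstractString)
    innerdir = joinpath(destdir, innername)
    isdir(innerdir) || error("expected unpacked directory not found: $innerdir")
    for name in readdir(innerdir)
        mv(joinpath(innerdir, name), joinpath(destdir, name))
    end
    rm(innerdir)
    return destdir
end

# Demonstration on a throwaway directory tree.
root = mktempdir()
mkpath(joinpath(root, "gmb-x.y.z", "data"))
write(joinpath(root, "gmb-x.y.z", "README"), "demo")
flatten_innerdir(root, "gmb-x.y.z")
```

The explicit `isdir` check makes a misspelled directory name surface as a clear error message rather than a failure inside `readdir`.
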
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,21 @@ | ||
| using CorpusLoaders | ||
| using Test | ||
| using Base.Iterators | ||
| using MultiResolutionIterators | ||
| using DataDeps | ||
|
|
||
| @testset "Using flatten_levels" for path in [datadep"GMB 1.0.0", datadep"GMB 1.1.0", datadep"GMB 2.0.0", datadep"GMB 2.1.0", datadep"GMB 2.2.0"] | ||
| train = load(GMB(path)) | ||
| docs = train[1:5] | ||
|
|
||
| words = full_consolidate(flatten_levels(docs, (!lvls)(GMB, :word))) | ||
| @test length(words) > length(docs) | ||
| @test typeof(words) == Vector{CorpusLoaders.PosTaggedWord} | ||
|
|
||
| plain_words = word.(words) | ||
| @test typeof(plain_words) <: Vector{String} | ||
|
|
||
| pos_tags = part_of_speech.(words) | ||
| @test typeof(pos_tags) <: Vector{String} | ||
|
|
||
| end |