Commit 0aac077

update the bmk results in readme

Author: guoyongzhi
1 parent e54ff1b, commit 0aac077

3 files changed: +33 -21 lines changed


README.md

Lines changed: 29 additions & 18 deletions
@@ -36,36 +36,47 @@ langid("This is a test.")
 langprob("这是一个测试。", topk=3)
 ```
 ```julia
-["zho" => 0.157798836477618,
-"mar" => 0.11718444394383595,
-"ben" => 0.10440699125820749,]
+["zho" => 0.75607236497363
+"jpn" => 0.036749305182980266
+"tat" => 0.015681619153487716]
 ```
 # Benchmark
 
-We tested three language identification packages: `LanguageIdentification.jl`, [`Languages.jl`](https://github.com/JuliaText/Languages.jl), and [`LanguageDetect.jl`](https://github.com/SeanLee97/LanguageDetect.jl) on a hold-out test set. The test set was sourced from [`tatoeba`](https://tatoeba.org) and [`wikipedia`](https://www.wikipedia.org/) and comprised of the 50 languages supported by this package.
+We tested four language identification packages: `LanguageIdentification.jl`, [`Languages.jl`](https://github.com/JuliaText/Languages.jl), [`LanguageDetect.jl`](https://github.com/SeanLee97/LanguageDetect.jl), and [`LanguageFinder`](https://github.com/nusretipek/LanguageFinder/tree/main/src) on a hold-out test set. The test set was sourced from [`tatoeba`](https://tatoeba.org) and [`wikipedia`](https://www.wikipedia.org/) and comprised of the 50 languages supported by this package. The complete test report can be found [here](https://github.com/guo-yong-zhi/langid_expirement/blob/main/benchmarks/compare/compare.md).
 
 - tatoeba
 
 | | ara | bel | ben | bul | cat | ces | dan | deu | ell | eng | epo | fas | fin | fra | hau | hbs | heb | hin | hun | ido | ina | isl | ita | jpn | kab | kor | kur | lat | lit | mar | mkd | msa | nds | nld | nor | pol | por | ron | rus | slk | spa | swa | swe | tat | tgl | tur | ukr | vie | yid | zho |
 |---------------------------|------------|------------|-------------|------------|------------|------------|------------|------------|-------------|------------|------------|------------|------------|------------|------------|------------|-------------|------------|------------|------------|------------|------------|------------|------------|------------|-------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|-------------|------------|-------------|
-| **LanguageIdentification.jl** | **98.96%** | **97.21%** | **100.00%** | **83.52%** | **93.75%** | **93.07%** | **84.68%** | **98.96%** | **100.00%** | **99.08%** | **97.86%** | **99.04%** | **99.04%** | **97.58%** | **98.81%** | 23.06% | 98.76% | 88.85% | **99.04%** | **90.62%** | **95.30%** | **99.55%** | **96.99%** | 99.97% | **99.43%** | **100.00%** | **99.20%** | **96.53%** | **99.29%** | **88.13%** | **92.96%** | **97.88%** | **96.37%** | **97.76%** | **85.40%** | **99.31%** | **97.68%** | **97.49%** | **91.13%** | **93.26%** | **93.60%** | **98.66%** | **95.50%** | **91.16%** | **98.93%** | **98.94%** | **87.51%** | **100.00%** | **99.31%** | **100.00%** |
+| LanguageIdentification.jl | **98.96%** | **97.21%** | **100.00%** | **83.52%** | **93.75%** | **93.07%** | **84.68%** | **98.96%** | **100.00%** | **99.08%** | **97.86%** | **99.04%** | **99.04%** | **97.58%** | **98.81%** | 23.06% | 98.76% | 88.85% | **99.04%** | **90.62%** | **95.30%** | **99.55%** | **96.99%** | 99.97% | **99.43%** | **100.00%** | **99.20%** | **96.53%** | **99.29%** | **88.13%** | **92.96%** | **97.88%** | **96.37%** | **97.76%** | **85.40%** | **99.31%** | **97.68%** | **97.49%** | 91.13% | **93.26%** | **93.60%** | **98.66%** | **95.50%** | **91.16%** | **98.93%** | **98.94%** | **87.51%** | **100.00%** | **99.31%** | **100.00%** |
 | Languages.jl | 85.55% | 80.47% | **100.00%** | 62.46% | - | 48.90% | 47.06% | 90.48% | 99.89% | 78.21% | 64.61% | 95.00% | 76.87% | 82.21% | 92.85% | **60.28%** | 95.75% | 62.99% | 73.99% | - | - | - | 66.27% | **99.97%** | - | 98.97% | - | - | 61.94% | 72.05% | 51.40% | 71.26% | - | 78.91% | 66.74% | 72.66% | 77.35% | 70.87% | 52.59% | - | 61.89% | - | 52.46% | - | 63.96% | 52.10% | 62.63% | 84.06% | 98.39% | 99.86% |
-| LanguageDetect.jl | 94.02% | - | **100.00%** | 64.47% | 60.10% | 70.71% | 53.28% | 81.63% | **100.00%** | 75.02% | - | 93.98% | 90.05% | 76.79% | - | 25.27% | **100.00%** | **93.44%** | 86.83% | - | - | - | 68.92% | 99.86% | - | 99.48% | - | - | 81.93% | 87.88% | 74.19% | 85.54% | - | 65.35% | 55.62% | 92.59% | 70.06% | 84.75% | 78.27% | 55.43% | 60.48% | 83.81% | 70.48% | - | 90.40% | 90.27% | 72.24% | 99.92% | - | 98.53% |
-
+| LanguageDetect.jl | 93.68% | - | **100.00%** | 64.15% | 59.86% | 70.87% | 53.14% | 81.88% | **100.00%** | 74.76% | - | 93.68% | 90.37% | 77.41% | - | 27.53% | **100.00%** | 91.60% | 86.61% | - | - | - | 69.16% | 99.85% | - | 99.48% | - | - | 81.41% | 86.60% | 74.70% | 84.67% | - | 65.07% | 54.23% | 92.97% | 69.89% | 84.12% | 78.32% | 57.26% | 60.35% | 83.89% | 70.51% | - | 90.70% | 90.33% | 71.89% | 99.75% | - | 98.53% |
+| LanguageFinder.jl | 93.11% | - | - | - | - | 69.58% | 70.80% | 91.68% | **100.00%** | 82.53% | - | 98.60% | 89.31% | 87.57% | - | - | 99.99% | **99.87%** | 73.90% | - | - | - | 82.66% | - | - | 96.38% | - | - | - | - | - | - | - | 88.80% | 29.90% | 85.74% | 68.62% | - | **93.35%** | - | 76.32% | - | 40.42% | - | - | 71.22% | 76.81% | - | - | 45.72% |
 
 - wikipedia
 
-| | ara | bel | ben | bul | cat | ces | dan | deu | ell | eng | epo | fas | fin | fra | hau | hbs | heb | hin | hun | ido | ina | isl | ita | jpn | kab | kor | kur | lat | lit | mar | mkd | msa | nds | nld | nor | pol | por | ron | rus | slk | spa | swa | swe | tat | tgl | tur | ukr | vie | yid | zho |
-|---------------------------|------------|------------|-------------|------------|-------------|------------|------------|------------|-------------|-------------|-------------|-------------|------------|-------------|------------|-------------|-------------|------------|------------|------------|------------|------------|-------------|------------|------------|-------------|------------|------------|-------------|------------|------------|------------|------------|------------|------------|-------------|------------|------------|------------|------------|-------------|------------|------------|------------|------------|------------|-------------|------------|------------|------------|
-| **LanguageIdentification.jl** | **99.50%** | **99.50%** | **100.00%** | **99.00%** | **100.00%** | **96.50%** | **98.50%** | **96.50%** | **100.00%** | **100.00%** | **100.00%** | **100.00%** | **99.50%** | **100.00%** | **99.50%** | 87.00% | **100.00%** | 91.00% | **99.00%** | **92.50%** | **97.00%** | **98.50%** | **100.00%** | 98.00% | **99.00%** | **100.00%** | **99.00%** | **98.50%** | **100.00%** | 95.50% | 97.50% | **99.50%** | **99.50%** | **97.00%** | **98.00%** | **100.00%** | **99.50%** | **90.00%** | **99.50%** | **97.00%** | **100.00%** | **99.50%** | **98.50%** | **99.00%** | **98.50%** | **98.50%** | **100.00%** | **97.00%** | **98.50%** | **99.50%** |
-| Languages.jl | 99.00% | 98.50% | 99.00% | **99.00%** | - | 92.50% | 88.50% | 96.00% | 96.50% | 99.50% | 96.00% | 98.50% | 98.00% | **100.00%** | 99.00% | **100.00%** | 99.50% | 91.00% | 93.00% | - | - | - | 98.50% | **99.50%** | - | 89.50% | - | - | 94.50% | 95.00% | **98.00%** | **99.50%** | - | 94.50% | 95.50% | 90.50% | 94.00% | 81.50% | 97.50% | - | 98.50% | - | 88.00% | - | 97.00% | 92.50% | 93.00% | 74.50% | 98.00% | 96.50% |
-| LanguageDetect.jl | **99.50%** | - | **100.00%** | 82.00% | 77.50% | 88.00% | 68.50% | 80.00% | 99.50% | 92.00% | - | 98.50% | 95.50% | 87.50% | - | 5.00% | **100.00%** | **95.50%** | 91.50% | - | - | - | 90.50% | 94.50% | - | 95.00% | - | - | 97.00% | **97.50%** | 91.00% | 97.00% | - | 77.50% | 62.00% | 95.00% | 76.50% | 76.00% | 92.50% | 81.50% | 80.00% | 96.50% | 69.50% | - | **98.50%** | 94.50% | 97.50% | 96.00% | - | 73.50% |
+| | ara | bel | ben | bul | cat | ces | dan | deu | ell | eng | epo | fas | fin | fra | hau | hbs | heb | hin | hun | ido | ina | isl | ita | jpn | kab | kor | kur | lat | lit | mar | mkd | msa | nds | nld | nor | pol | por | ron | rus | slk | spa | swa | swe | tat | tgl | tur | ukr | vie | yid | zho |
+|---------------------------|------------|------------|-------------|------------|-------------|------------|------------|------------|-------------|-------------|-------------|-------------|------------|-------------|------------|-------------|-------------|-------------|------------|------------|------------|------------|-------------|------------|------------|-------------|------------|------------|-------------|------------|------------|------------|------------|------------|------------|-------------|------------|------------|-------------|------------|-------------|------------|------------|------------|------------|------------|-------------|------------|------------|------------|
+| LanguageIdentification.jl | **99.50%** | **99.50%** | **100.00%** | **99.00%** | **100.00%** | **96.50%** | **98.50%** | **96.50%** | **100.00%** | **100.00%** | **100.00%** | **100.00%** | **99.50%** | **100.00%** | **99.50%** | 87.00% | **100.00%** | 91.00% | **99.00%** | **92.50%** | **97.00%** | **98.50%** | **100.00%** | 98.00% | **99.00%** | **100.00%** | **99.00%** | **98.50%** | **100.00%** | 95.50% | 97.50% | **99.50%** | **99.50%** | 97.00% | **98.00%** | **100.00%** | **99.50%** | **90.00%** | 99.50% | **97.00%** | **100.00%** | **99.50%** | **98.50%** | **99.00%** | **98.50%** | **98.50%** | **100.00%** | **97.00%** | **98.50%** | **99.50%** |
+| Languages.jl | 99.00% | 98.50% | 99.00% | **99.00%** | - | 92.50% | 88.50% | 96.00% | 96.50% | 99.50% | 96.00% | 98.50% | 98.00% | **100.00%** | 99.00% | **100.00%** | 99.50% | 91.00% | 93.00% | - | - | - | 98.50% | **99.50%** | - | 89.50% | - | - | 94.50% | 95.00% | **98.00%** | **99.50%** | - | 94.50% | 95.50% | 90.50% | 94.00% | 81.50% | 97.50% | - | 98.50% | - | 88.00% | - | 97.00% | 92.50% | 93.00% | 74.50% | 98.00% | 96.50% |
+| LanguageDetect.jl | **99.50%** | - | **100.00%** | 80.00% | 79.00% | 80.50% | 61.00% | 81.00% | **100.00%** | 90.00% | - | 99.00% | 94.50% | 90.00% | - | 3.50% | **100.00%** | 94.00% | 93.50% | - | - | - | 87.50% | 94.50% | - | 95.00% | - | - | 96.50% | **97.00%** | 90.00% | 96.50% | - | 74.00% | 55.50% | 94.00% | 78.50% | 74.00% | 91.00% | 77.00% | 77.50% | 95.50% | 69.00% | - | 94.50% | 93.00% | 97.50% | 96.00% | - | 74.00% |
+| LanguageFinder.jl | **99.50%** | - | - | - | - | 96.00% | **98.50%** | 95.50% | 99.50% | 99.50% | - | 99.00% | **99.50%** | **100.00%** | - | - | **100.00%** | **100.00%** | 96.00% | - | - | - | 98.50% | - | - | 94.50% | - | - | - | - | - | - | - | **98.50%** | 35.50% | 98.00% | 88.00% | - | **100.00%** | - | **100.00%** | - | 97.00% | - | - | 96.00% | 99.50% | - | - | 85.50% |
+
+We calculated the average accuracy of each package on the intersection of supported languages, and the results are as follows:
+- tatoeba
 
+| | 50 languages | 39 languages | 35 languages | 24 languages |
+|-------------------------------|--------------|--------------|--------------|--------------|
+| **LanguageIdentification.jl** | **94.58%** | **94.24%** | **93.77%** | **95.87%** |
+| Languages.jl | - | 74.72% | 73.65% | 74.14% |
+| LanguageDetect.jl | - | - | 80.81% | 80.61% |
+| LanguageFinder.jl | - | - | - | 79.70% |
 
-There are 35 languages that are supported by all three packages, and the average accuracy of the three packages on these languages is:
+- wikipedia
 
-| | tatoeba | wikipedia |
-|---------------------------|---------|---------|
-| **LanguageIdentification.jl** | **93.77%** | **98.09%** |
-| Languages.jl | 73.65% | 94.80% |
-| LanguageDetect.jl | 80.92% | 86.70% |
+| | 50 languages | 39 languages | 35 languages | 24 languages |
+|-------------------------------|--------------|--------------|--------------|--------------|
+| **LanguageIdentification.jl** | **98.20%** | **98.22%** | **98.09%** | **98.79%** |
+| Languages.jl | - | 95.12% | 94.80% | 95.02% |
+| LanguageDetect.jl | - | - | 85.49% | 86.23% |
+| LanguageFinder.jl | - | - | - | 94.75% |
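
The README text says the averages are taken "on the intersection of supported languages" but does not spell out the computation. A minimal sketch of one plausible reading follows, assuming an unweighted mean over the languages that every compared package supports. The function names `common_languages` and `average_on_intersection` are hypothetical, and the four-language subset (values copied from the new tatoeba rows) is for illustration only, so the result will not match the reported figures.

```julia
# Per-language accuracy (%) per package; `missing` marks an unsupported language.
# Illustrative subset of the tatoeba table, not the full benchmark data.
acc = Dict(
    "LanguageIdentification.jl" => Dict("ara" => 98.96, "bel" => 97.21, "ces" => 93.07, "dan" => 84.68),
    "LanguageDetect.jl"         => Dict("ara" => 93.68, "bel" => missing, "ces" => 70.87, "dan" => 53.14),
)

# Languages for which no package has a `missing` score (hypothetical helper).
function common_languages(acc)
    langs = sort!(collect(keys(first(values(acc)))))
    return [l for l in langs if all(!ismissing(scores[l]) for scores in values(acc))]
end

# Unweighted mean accuracy over the shared languages, per package.
# Assumption: every language counts equally, regardless of test-set size.
function average_on_intersection(acc)
    langs = common_languages(acc)
    return Dict(name => sum(scores[l] for l in langs) / length(langs)
                for (name, scores) in acc)
end

averages = average_on_intersection(acc)
```

Restricting every package to the same language set keeps the comparison fair: a package is not rewarded or penalized for languages it does not claim to support.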

src/detector.jl

Lines changed: 2 additions & 2 deletions
@@ -8,7 +8,7 @@ const UNK = UInt8[]
 const NGRAM = Int[]
 
 """
-supported_languages() -> Vector{String}
+supported_languages() -> Vector{String}
 
 Return a vector containing all the languages (ISO 639-3 codes) that are supported by this package.
 """
@@ -86,7 +86,7 @@ function normalize_profile!(P)
 end
 
 function loglikelihood(p_dict, logq_dict)
-    sc = 0.0
+    sc = zero(valtype(p_dict))
     for (code, p) in p_dict
         if !haskey(logq_dict, code)
             code = UNK
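
The commit does not state why `0.0` became `zero(valtype(p_dict))`. One likely effect is that the accumulator now takes the value type of the profile dictionary, so a `Float32` profile is no longer promoted to `Float64` by the `0.0` literal. The sketch below illustrates this with a simplified, hypothetical `score` function; it is not the package's `loglikelihood`, and it replaces the `UNK` fallback with a plain zero default.

```julia
# Hypothetical simplified scorer: sums p * log(q) over the keys of `p_dict`.
# Unlike the real `loglikelihood`, unseen keys contribute zero instead of UNK.
function score(p_dict, logq_dict)
    sc = zero(valtype(p_dict))          # accumulator in the dictionary's value type
    for (code, p) in p_dict
        sc += p * get(logq_dict, code, zero(valtype(logq_dict)))
    end
    return sc
end

# Same loop with a Float64 literal as the starting value, for comparison.
function score_literal(p_dict, logq_dict)
    sc = 0.0
    for (code, p) in p_dict
        sc += p * get(logq_dict, code, zero(valtype(logq_dict)))
    end
    return sc
end

p32    = Dict("ab" => 0.5f0, "bc" => 0.5f0)
logq32 = Dict("ab" => log(0.25f0), "bc" => log(0.75f0))

typeof(score(p32, logq32))          # Float32: follows the input element type
typeof(score_literal(p32, logq32))  # Float64: promoted by the 0.0 literal
```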

test/runtests.jl

Lines changed: 2 additions & 1 deletion
@@ -18,7 +18,8 @@ using Test
 langid("این یک آزمایش است.", ngram=[2, 4])
 langid("", ngram=3)
 langid(" ", ngram=3:4)
-@test sum(last.(langprob("This is a test.", topk=50))) ≈ 1.0
+langid(" ", ngram=5:7)
+@test sum(last.(langprob("This is a test.", topk=length(LI.supported_languages())))) ≈ 1.0
 @test langprob("这是一个测试。", topk=1) |> only |> first == "zho"
 @test langprob("これはテストです。", ["zho", "ara"], topk=30) |> length == 2
 LI.initialize(vocabulary=200)
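
The updated test replaces the hard-coded `topk=50` with the number of supported languages. The probabilities can only sum to 1 when the top-k list covers every language, so tying `topk` to the language count keeps the check valid if that count changes. The sketch below shows the idea with a hypothetical stand-in, `topk_pairs`, over a made-up three-language distribution; it does not call the package.

```julia
# Hypothetical stand-in for `langprob`: returns the `topk` most probable
# (language => probability) pairs from an already normalized distribution.
function topk_pairs(probs::Dict{String,Float64}; topk::Int)
    ranked = sort!(collect(probs); by=last, rev=true)
    return ranked[1:min(topk, length(ranked))]
end

# Made-up distribution for illustration only.
probs = Dict("eng" => 0.7, "deu" => 0.2, "nld" => 0.1)

full      = sum(last.(topk_pairs(probs; topk=length(probs))))  # covers all languages
truncated = sum(last.(topk_pairs(probs; topk=2)))              # drops the tail

@assert full ≈ 1.0        # approximate comparison, as floating-point sums are inexact
@assert truncated < 1.0
```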
