README.md: 28 additions & 6 deletions
@@ -7,10 +7,8 @@
7
7
import Pkg; Pkg.add("LanguageIdentification")
8
8
```
9
9
# Usage
10
-
Before using the language identification functionality, you need to initialize the package. This process involves setting some parameters that balance accuracy, speed, and memory usage. If you don't manually initialize the package, it will use default parameters. For more information, please refer to the documentation.
11
10
```julia
12
11
using LanguageIdentification
13
-
LanguageIdentification.initialize()
14
12
```
15
13
Currently, `LanguageIdentification.jl` supports the identification of 50 languages. You can list them with the following command. Each language is represented by its [ISO 639-3](https://en.wikipedia.org/wiki/ISO_639_macrolanguage) code.
This package provides a simple interface to identify the language of a given text. The package exports two functions:
24
+
This package provides simple interfaces:
27
25
- `langid`: returns the language code of the tested text.
28
26
- `langprob`: returns the probabilities of the tested text for each language.
29
27
```julia
@@ -42,7 +40,7 @@ langprob("这是一个测试。", topk=3)
42
40
```
43
41
# Benchmark
44
42
45
-
We tested four language identification packages: `LanguageIdentification.jl`, [`Languages.jl`](https://github.com/JuliaText/Languages.jl), [`LanguageDetect.jl`](https://github.com/SeanLee97/LanguageDetect.jl), and [`LanguageFinder`](https://github.com/nusretipek/LanguageFinder/tree/main/src) on a hold-out test set. The test set was sourced from [`tatoeba`](https://tatoeba.org) and [`wikipedia`](https://www.wikipedia.org/) and comprised of the 50 languages supported by this package. The complete test report can be found [here](https://github.com/guo-yong-zhi/langid_expirement/blob/main/benchmarks/compare/compare.md).
43
+
We tested four language identification packages: `LanguageIdentification.jl` (this package), [`Languages.jl`](https://github.com/JuliaText/Languages.jl), [`LanguageDetect.jl`](https://github.com/SeanLee97/LanguageDetect.jl), and [`LanguageFinder`](https://github.com/nusretipek/LanguageFinder/tree/main/src) on a hold-out test set. The test set was sourced from [`tatoeba`](https://tatoeba.org) and [`wikipedia`](https://www.wikipedia.org/) and comprised the 50 languages supported by this package. The complete test results can be found [here](https://github.com/guo-yong-zhi/langid_expirement/blob/main/benchmarks/compare/compare.md).
46
44
47
45
- tatoeba
48
46
@@ -62,7 +60,7 @@ We tested four language identification packages: `LanguageIdentification.jl`, [`
You can manually initialize the package using the `LanguageIdentification.initialize` function. By adjusting the parameters, you can achieve different balances between accuracy, speed, and memory usage. The default setting is `ngram=1:4`, `cutoff=0.85`, and `vocabulary=1000:5000`. However, this setting may not be optimal for your specific use case.
84
+
For example, the table below shows that a single-ngram setting of `ngram=4`, `cutoff=1.0`, and `vocabulary=5000` achieves better accuracy on our tatoeba test set while also being much faster than the multi-ngram setting. We chose the multi-ngram setting as the default because of its stability. You can refer to our detailed benchmark results [here](https://github.com/guo-yong-zhi/langid_expirement/tree/main/benchmarks/matrix) when tuning the parameters.
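As a minimal sketch, the single-ngram setting mentioned above might be applied as follows. It assumes that `ngram`, `cutoff`, and `vocabulary` are accepted by `LanguageIdentification.initialize` as keyword arguments, which the parameter names in this README suggest but do not confirm. Please check the documentation for the exact signature.

```julia
using LanguageIdentification

# Assumption: the three parameters are passed as keyword arguments.
# These values are the single-ngram setting described in this section.
LanguageIdentification.initialize(ngram=4, cutoff=1.0, vocabulary=5000)

# Identify a sample sentence using the re-initialized model.
langid("This is a test.")
```

Whether this setting beats the default on your own data depends on your text length and language mix, so it is worth comparing both on a small sample first.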