GitHub

Language detection

This python library can be used to detect the natural language of a given document. It is based on the relative frequency of n-grams (n-gram model) to create sets of character n-grams and use them as a feature to train the classifier.

The classifier uses the n-grams frequency (one to three) from the testing and training docments, which is weighted with “TF-IDF” (Term Frequency–Inverse Document Frequency), and then apply the classical cosine similarity measure to obtain a distance between the docments.

All the testing documents were submitted to the classifier method. The proposed method achieved 100% accuracy in language detection.

Getting started

resources\train folder has 110 language docments used for training. The docments selected are based on the Universal Declaration of Human Rights (UDHR) because it is available in a large number of languages.
resources\test folder has 10 language docments used for testing. The docments selected are extracted from Wikipedia articles written in different languages.
TrainingTesting.ipynb is the algorithm that creates the n-gram models for testing and training data. Each document is submitted to pre-processing tasks, such as segmentation, remove punctuation, lowercase letters, and tokenization.
Classifier.ipynb is the proposed method to identify the language a document is written in.

Supported languages

This library supports the identification of multilingual documents as shown in the table below. It is based on a plain text version prepared by the “UDHR in Unicode” project. [https://www.unicode.org/udhr]

Language	Language Code	Language	Language Code
Abkhaz	ab	Inuktitut	iu
Afrikaans	af	Italian	it
Albanian	sq	Japanese	ja
Amharic	am	Javanese	jv
Arabic	ar	Kanuri	kr
Armenian	hy	Khmer	km
Aymara	ay	Korean	ko
Azerbaijani, North (Cyrillic)	az-Cyrl	Kurdish	ku
Azerbaijani, North (Latin)	az-Latn	Lao	lo
Basque	eu	Latin	la
Belarusan	be	Latvian	lv
Bengali	bn	Lingala	ln
Bislama	bi	Lithuanian	lt
Bosnian (Cyrillic)	bs-Cyrl	Malay (Arabic)	ms-Arab
Bosnian (Latin)	bs-Latn	Malay (Latin)	ms-Latn
Breton	br	Maltese	mt
Bulgarian	bg	Marshallese	mh
Catalan	ca	Mongolian, Halh (Cyrillic)	mn-Cyrl
Chamorro	ch	Navajo	nv
Chinese, Mandarin (Simplified)	zh-Hans	Ndonga	ng
Chinese, Mandarin (Traditional)	zh-Hant	Norwegian, Bokmål	nb
Corsican	co	Norwegian, Nynorsk	nn
Cree	cr	Persian	fa
Croatian	hr	Polish	pl
Czech	cs	Portuguese (Brazil)	pt-BR
Danish	da	Portuguese (Portugal)	pt-PT
Dutch	nl	Romanian	ro
Dzongkha	dz	Russian	ru
English	en	Sanskrit	sa
Esperanto	eo	Slovak	sk
Estonian	et	Slovene	sl
Faroese	fo	Somali	so
Fijian	fj	Spanish	es
Finnish	fi	Swati	ss
French	fr	Swedish	sv
Frisian	fy	Tagalog	tl
Gaelic, Irish	ga	Tahitian	ty
Gaelic, Scottish	gd	Tamil	ta
Galician	gl	Tatar	tt
Ganda	lg	Thai	th
Georgian	ka	Tibetan	bo
German	de	Tonga	to
Greek (monotonic)	el-monoton	Turkish	tr
Greek (polytonic)	el-polyton	Ukrainian	uk
Guarani	gn	Urdu	ur
Gujarati	gu	Uyghur (Arabic)	ug-Arab
Hausa	ha	Uyghur (Latin)	ug-Latn
Hebrew	he	Uzbek	uz
Hindi	hi	Venda	ve
Hungarian	hu	Vietnamese	vi
Icelandic	is	Walloon	wa
Ido	io	Welsh	cy
Igbo	ig	Wolof	wo
Indonesian	id	Xhosa	xh
Interlingua	ia	Yoruba	yo

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.ipynb_checkpoints		.ipynb_checkpoints
resources		resources
testing		testing
training		training
Classifier.ipynb		Classifier.ipynb
Classifier.pdf		Classifier.pdf
README.ipynb		README.ipynb
README.md		README.md
README.pdf		README.pdf
TrainingTesting.ipynb		TrainingTesting.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Language detection

Getting started

Supported languages

About

Uh oh!

Releases

Packages

Languages

fabiobif/languageDetection

Folders and files

Latest commit

History

Repository files navigation

Language detection

Getting started

Supported languages

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages