A curated list of multilingual and non-English benchmarks for Large Language Models (LLMs) and, more broadly, for NLP models and tools.
These benchmarks focus on core linguistic competencies for individual languages, testing syntax, semantics, natural language inference, sentiment analysis, named entity recognition, and other fundamental language processing capabilities.
| Language | Date | Title | Tasks | Links |
|---|---|---|---|---|
| Amharic 🇪🇹 | 2025-06 | Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval | Passage retrieval for Amharic (Ethiopian) | [paper] |
| Arabic 🇸🇦 | 2021-04 | ALUE: Arabic Language Understanding Evaluation | SA, NLI, STS, Dialect ID, Toxicity/Offensive, QA | [paper] |
| Arabic 🇸🇦 | 2022-12 | ORCA: A Challenging Benchmark for Arabic Language Understanding | SA, NLI, QA (MRC/MC), NER, Topic/News Clf, Paraphrase | [paper] [ACL Anthology] |
| Basque 🇪🇸🇫🇷 | 2022-06 | BasqueGLUE: A Natural Language Understanding Benchmark for Basque | NER, Intent Classification, Slot Filling, Topic Classification, Sentiment Analysis, Stance Detection, QA/NLI, WiC, Coreference Resolution | [paper] [data] |
| Bengali 🇧🇩 | 2021-01 | BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla | SA, NLI, NER, Span-QA | [paper] [code] |
| Bengali 🇧🇩 | 2025-10 | Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles | Political stance detection (Government-leaning, Critique, Neutral) with 200 news articles | [paper] |
| Belarusian 🇧🇾 | 2025-06 | BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian | SA, NLI, QA, NER, Morphology, Topic Classification | [paper] |
| Bulgarian 🇧🇬 | 2023-07 | bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark | NER, POS Tagging, Sentiment, Check-Worthiness, Humor Detection, NLI, Multi-Choice QA, Factuality Classification | [paper] [code] [data] |
| Catalan 🇪🇸 | 2021-12 | The Catalan Language CLUB | NER, POS Tagging, NLI, Document Classification, QA, STS | [paper] [data] |
| Chinese 🇨🇳 | 2020-04 | CLUE: A Chinese Language Understanding Evaluation Benchmark | Short/Long Text Classification, Coreference Resolution, Semantic Similarity, Keyword Recognition, NLI, Machine Reading Comprehension | [paper] |
| Danish 🇩🇰 | 2024-05 | Towards a Danish Semantic Reasoning Benchmark | Inference, Entailment, Synonymy, Similarity, Relatedness, Word Sense Disambiguation (WiC) | [paper] |
| Dutch 🇳🇱 | 2023-12 | DUMB: A Benchmark for Smart Evaluation of Dutch Models | POS Tagging, NER, Word Sense Disambiguation, Pronoun Resolution, Causal Reasoning, NLI, Sentiment Analysis, Document Classification, Question Answering | [paper] |
| Dutch 🇳🇱 | 2025-01 | BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language | Zero-shot information retrieval for Dutch | [paper] |
| Finnish 🇫🇮 | 2020-10 | Towards Fully Bilingual Deep Language Modeling | POS Tagging, NER, Dependency Parsing, Document Classification | [paper] |
| French 🇫🇷 | 2019-12 | FLUE: French Language Understanding Evaluation | Text Classification, Paraphrase, NLI, Parsing, POS, WSD | [web] [paper] |
| French 🇫🇷 | 2025-10 | COLE: a Comprehensive Benchmark for French Language Understanding Evaluation | 23 NLU tasks: sentiment analysis, paraphrase detection, QA, NLI, grammar, definition matching, WSD, pronoun resolution | [paper] |
| German 🇩🇪 | 2024-06 | SuperGLEBer: German Language Understanding Evaluation Benchmark | NER, Document Classification, STS, QA | [paper] |
| Hindi 🇮🇳 | 2025-04 | Benchmarking and Building Zero-shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 | Zero-shot information retrieval for Hindi | [paper] |
| Hungarian 🇭🇺 | 2024-05 | HuLU: Hungarian Language Understanding Benchmark Kit | CoPA, RTE, SST, WNLI, CommitmentBank, ReCoRD QA | [paper] |
| Indonesian 🇮🇩 | 2020-12 | IndoNLU | SA, Aspect-SA, Emotion, POS, NER, NLI, Span-QA/KE | [paper] [ACL Anthology] |
| Indonesian 🇮🇩 | 2020-11 | IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP | Morpho-syntax (POS), Semantics, Discourse (7 tasks) | [paper] [ACL Anthology] |
| Italian 🇮🇹 | 2023-07 | UINAUIL: A Unified Benchmark for Italian Natural Language Understanding | Textual Entailment, Event Detection & Classification (EVENTI), Factuality Classification (FactA), Sentiment Analysis (SENTIPOLC), Irony Detection (IronITA), Hate Speech Detection (HaSpeeDe) | [paper] |
| Italian 🇮🇹 | 2025-02 | Evalita-LLM: Benchmarking Large Language Models on Italian | 10 native Italian tasks: WSD, textual entailment, sentiment analysis, hate speech, QA, NER, relation extraction, summarization | [paper] |
| Japanese 🇯🇵 | 2022-07 | JGLUE: Japanese General Language Understanding Evaluation | SA (MARC-ja), NLI (JNLI), STS (JSTS), QA (JSQuAD/JCommonsenseQA), Acceptability (JCoLA) | [paper] [ACL Anthology] |
| Norwegian 🇳🇴 | 2023-05 | NorBench -- A Benchmark for Norwegian Language Models | Morpho-syntactic tasks (POS Tagging, Lemmatization, Dependency Parsing), NER, Sentiment Analysis (Document-level, Sentence-level, Targeted), Linguistic Acceptability, Question Answering, Machine Translation, Diagnostics of Harmful Predictions (Gender Bias, Harmfulness) | [paper] [code] |
| Persian 🇮🇷 | 2020-12 | ParsiNLU: A Suite of Language Understanding Challenges for Persian | Reading Comprehension, Textual Entailment, Sentiment Analysis, Question Paraphrasing, Machine Translation, Query Paraphrasing | [paper] [ACL Anthology] |
| Polish 🇵🇱 | 2020-05 | KLEJ: Comprehensive Benchmark for Polish Language Understanding | NER, Sentence Relatedness, Textual Entailment, Cyberbullying Detection, Sentiment Analysis (In-Domain & Out-of-Domain), Question Answering, Paraphrase Detection, Sentiment Analysis (Allegro Reviews) | [paper] |
| Polish 🇵🇱 | 2022-12 | This is the way: Designing and Compiling LEPISZCZE, a Comprehensive NLP Benchmark for Polish | Sentiment Analysis, Abusive Clauses Detection, Political Advertising Detection, NLI, NER, POS Tagging, Paraphrase Classification, Punctuation Restoration, Dialogue Acts Classification | [paper] |
| Portuguese 🇵🇹🇧🇷 | 2024-04 | PORTULAN ExtraGLUE Datasets and Models | SST-2, MRPC, STS-B, MNLI, QNLI, RTE, WNLI, BoolQ, MultiRC, CoPA | [paper] |
| Romanian 🇷🇴 | 2021-12 | LiRo: Benchmark and Leaderboard for Romanian Language Tasks | Document Classification, NER, Machine Translation, Sentiment Analysis, POS Tagging, Dependency Parsing, Language Modeling, QA, STS, Gender Debiasing | [paper] [web] |
| Russian 🇷🇺 | 2020-10 | RussianSuperGLUE | Commonsense/COPA-like, RTE/NLI, QA, WSC-like, Paraphrase | [paper] [ACL Anthology] |
| Slovak 🇸🇰 | 2025-06 | skLEP: A Slovak General Language Understanding Benchmark | Sentiment Analysis, NER, Text Classification, Paraphrase Detection, Word Sense Disambiguation | [paper] |
| Slovenian 🇸🇮 | 2022-02 | Slovene SuperGLUE Benchmark: Translation and Evaluation | BoolQ, CB, COPA, MultiRC, RTE, WSC | [paper] |
| Swedish 🇸🇪 | 2023-12 | Superlim: A Swedish Language Understanding Evaluation Benchmark | Absabank-Imm, Argumentation Sentences, DaLAJ-GED, SweParaphrase, SweDN, SweFAQ, SweNLI, SweWiC, SweWinograd, SuperSim, Swedish Analogy, SweSAT, SweDiagnostics, SweWinogender | [paper] |
| Vietnamese 🇻🇳 | 2024-06 | ViGLUE: A Vietnamese General Language Understanding Benchmark and Analysis of Vietnamese Language Models | MNLI, QNLI, RTE, VNRTE, WNLI, SST2, VSFC, VSMEC, MRPC, QQP, CoLA, VToC | [paper] [code] [data] |
| Yoruba Dialects 🇳🇬 | 2024-06 | Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Regional dialect evaluation across four Yoruba language regions | [paper] |
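Most of the classification-style tasks in the table above (sentiment analysis, NLI, topic classification) are scored with plain accuracy or macro-F1. A minimal, self-contained sketch of both metrics — the `preds` and `golds` lists below are illustrative stand-ins, not data from any listed benchmark:

```python
# Sketch of scoring a GLUE-style classification task with accuracy and macro-F1.
# `preds` would come from a model; `golds` from the benchmark's labels.

def accuracy(preds, golds):
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def macro_f1(preds, golds):
    # unweighted mean of per-label F1, so rare labels count as much as common ones
    labels = set(golds) | set(preds)
    f1s = []
    for lab in labels:
        tp = sum(1 for p, g in zip(preds, golds) if p == lab and g == lab)
        fp = sum(1 for p, g in zip(preds, golds) if p == lab and g != lab)
        fn = sum(1 for p, g in zip(preds, golds) if p != lab and g == lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

golds = ["pos", "neg", "neg", "pos"]
preds = ["pos", "neg", "pos", "pos"]
print(accuracy(preds, golds))            # 0.75
print(round(macro_f1(preds, golds), 3))  # 0.733
```

Macro-F1 is the headline metric for several of the suites above precisely because it is not dominated by the majority class the way accuracy is.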
These benchmarks cover multiple languages and focus on core linguistic competencies such as syntax, semantics, natural language inference, sentiment analysis, named entity recognition, and other fundamental language processing capabilities across different language families.
| Languages | Date | Title | Tasks | Links |
|---|---|---|---|---|
| African Languages 🌍 | 2023-11 | AfroBench: How Good are Large Language Models on African Languages? | Sentiment Analysis, Topic Classification, Named Entity Recognition, Question Answering, Language Identification (64 languages, 15 tasks) | [paper] |
| Cross-lingual SEA 🌏 | 2023-09 | BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models | Sentiment Analysis, Named Entity Recognition, Natural Language Inference, Part-of-Speech Tagging, Dependency Parsing | [paper] |
| Indic Languages 🇮🇳 | 2024-11 | MILU: A Multi-task Indic Language Understanding Benchmark | Text Classification, Natural Language Inference, Named Entity Recognition, Part-of-Speech Tagging, Sentiment Analysis | [paper] |
| Turkic 🌐 | 2024-03 | Kardeş-NLU (Azeri, Kazakh, Kyrgyz, Uzbek, Uyghur) | NLI, STS, COPA | [paper] [data] |
| Nordic/Scandinavian 🌍 | 2023-04 | ScandEval: A Benchmark for Scandinavian Natural Language Processing | Sentiment analysis, linguistic acceptability, NER, QA (Danish, Swedish, Norwegian, Icelandic, Faroese) | [paper] |
| Indigenous Americas 🌎 | 2022-12 | AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas | Natural language inference (Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, Wixarika) | [paper] |
| Language Varieties (281) 🌐 | 2024-03 | DialectBench: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages | Comprehensive benchmark covering 281 language varieties and dialects worldwide | [paper] |
| Indic Languages 🇮🇳 | 2025-04 | INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages | Question answering across multiple Indic languages | [paper] |
| Text Embeddings 🌐 | 2025-02 | MMTEB: Massive Multilingual Text Embedding Benchmark | Massive multilingual text embedding evaluation across many languages | [paper] |
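The retrieval-oriented entries here and above (BEIR-NL, Hindi-BEIR, MMTEB) are typically reported with ranking metrics such as nDCG@10 rather than accuracy. A self-contained sketch of that metric — the toy relevance list is illustrative, not drawn from any of these benchmarks:

```python
import math

def dcg(rels):
    # discounted cumulative gain over a ranked list of relevance grades
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k):
    # normalize by the DCG of the ideal (relevance-sorted) ranking
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal else 0.0

# toy ranking: graded relevance of the top 5 retrieved passages for one query
print(round(ndcg_at_k([1, 0, 1, 0, 0], k=5), 3))  # 0.92
```

A benchmark score is then the mean of this value over all queries; the log discount is what rewards placing relevant passages near the top of the ranking.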
These benchmarks focus on factual knowledge, domain expertise, curriculum-based assessments, and subject-matter competency. They test models' ability to recall and apply knowledge from specific academic domains, cultural contexts, or educational curricula.
| Language | Date | Title | Tasks | Links |
|---|---|---|---|---|
| American Sign Language 🇺🇸 | 2025-05 | EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language | 200 ASL videos with sentiment and emotion labels for multimodal emotion recognition | [paper] |
| Ancient Chinese 🇨🇳 | 2023-10 | Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE | Polysemy resolution, homographic character resolution, NER, sentence segmentation, couplet prediction, poetry context/quality/appreciation/sentiment, reading comprehension, basic ancient Chinese, traditional culture, ancient medicine/literature/phonetics | [paper] |
| Ancient Chinese 🇨🇳 | 2025-03 | Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | 21 tasks: ancient Chinese/idiom/TCM RC, loan character/allegorical saying/TCM QA, book author/dynasty/collection, poetry/famous quote/idiom source tracing, inverse poetry translation, poetry/ancient Chinese translation & appreciation, idiom/prescription explanation, couplet/ci/poetry generation | [paper] |
| Arabic 🇸🇦 | 2024-02 | ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | 40 tasks covering STEM, humanities, social sciences from school exams across educational levels in North Africa, Levant, and Gulf regions | [paper] |
| Azerbaijani 🇦🇿 | 2024-06 | Open foundation models for Azerbaijani language | Azerbaijani language model evaluation across NLU tasks, text generation, and domain-specific applications | [paper] |
| Basque 🇪🇸 | 2025-01 | BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque | Social bias assessment across 8 domains: age, disability, gender, nationality, physical appearance, race/ethnicity, religion, socioeconomic status | [paper] |
| Bengali 🇧🇩 | 2025-05 | BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | 41 domains across STEM, humanities, social sciences, general knowledge; includes BnMMLU-HARD subset for difficult questions | [paper] |
| Bengali 🇧🇩 | 2025-06 | BenNumEval: A Benchmark to Assess LLMs' Numerical Reasoning Capabilities in Bengali | Numerical reasoning evaluation for Bengali | [paper] |
| Cantonese 🇭🇰🇨🇳 | 2024-08 | How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Yue-TruthfulQA, Yue-GSM8K, Yue-ARC-C, Yue-MMLU, Yue-TRANS | [paper] |
| Cantonese 🇭🇰 | 2025-03 | HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs | Cantonese-specific tasks testing Hong Kong cultural knowledge, colloquial expressions, code-switching, sentiment analysis | [paper] |
| Chinese 🇨🇳 | 2023-06 | CMMLU: Measuring massive multitask language understanding in Chinese | 67 subjects across STEM (17 tasks: physics, chemistry, math, CS, engineering), humanities (13 tasks: literature, history, philosophy), social sciences (22 tasks: law, psychology, education), other (15 tasks: Chinese food culture, driving rules, etc.) | [paper] |
| Chinese 🇨🇳 | 2024-01 | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | College-level multimodal QA across art & design, business, science, health & medicine, humanities & social science, tech & engineering requiring Chinese cultural context | [paper] |
| Chinese 🇨🇳 | 2024-09 | CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data | Multi-Choice QA, Bool QA, Fill-in-the Blank QA, Analysis QA | [paper] |
| Chinese 🇨🇳 | 2024-03 | LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models | Comprehensive knowledge evaluation across STEM, humanities, social sciences with both objective (multiple-choice) and subjective (open-ended) questions | [paper] |
| Chinese 🇨🇳 | 2025-06 | MCS-Bench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in Chinese Classical Studies | Multimodal tasks for classical Chinese: poetry appreciation, calligraphy analysis, historical artifact recognition, classical painting understanding | [paper] |
| Chinese 🇨🇳 | 2025-06 | MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Evaluation across 6 minority languages (Mongolian, Tibetan, Uyghur, Yi, Zhuang, Kazakh) covering reading comprehension, translation, cultural knowledge, linguistic understanding | [paper] |
| Classical Chinese 🇨🇳 | 2024-05 | C³Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models | 50,000 text pairs for classification, retrieval, NER, punctuation, translation across 10 domains | [paper] |
| Czech 🇨🇿 | 2024-12 | BenCzechMark: A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism | 50 tasks across 8 categories: Czech math reasoning, language modeling, NER, reading comprehension, factual knowledge, Czech language understanding, sentiment analysis, NLI | [paper] |
| Dutch 🇳🇱 | 2024-12 | Fietje: An open, efficient LLM for Dutch | Evaluation suite with reasoning, sentiment analysis, world knowledge, linguistic acceptability, WSD | [paper] |
| Filipino 🇵🇭 | 2025-02 | Batayan: A Filipino NLP benchmark for evaluating Large Language Models | 8 tasks: paraphrase identification, question answering, sentiment analysis, toxicity detection, commonsense reasoning (COPA), NLI, abstractive summarization, machine translation (Tagalog/English) | [paper] |
| Finnish 🇫🇮 | 2025-12 | FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models | Reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, alignment evaluation | [paper] |
| French 🇫🇷 | 2024-02 | DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain | Biomedical language understanding evaluation for French | [paper] |
| Hebrew 🇮🇱 | 2024-06 | HeSum: A Novel Dataset for Abstractive Text Summarization in Hebrew | Hebrew summarization benchmark with evaluation metrics | [paper] |
| Hindi 🇮🇳 | 2025-07 | Multilingual LLMs are not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation | 405 multiple-choice analogical reasoning questions from authentic Indian government exams | [paper] |
| Hong Kong (Cantonese/Chinese) 🇭🇰 | 2025-05 | Measuring Hong Kong Massive Multi-Task Language Understanding | MMLU-style benchmark across STEM, social sciences, humanities, other subjects + Cantonese-Mandarin translation tasks (both directions) | [paper] |
| Indic Languages (9) 🇮🇳 | 2025-01 | IndicMMLU-Pro: Benchmarking Indic Large Language Models | MMLU-Pro style complex reasoning across 9 languages (Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Urdu, Kannada, Punjabi) covering STEM, humanities, social sciences, business, health | [paper] |
| Indonesian 🇮🇩 | 2025-06 | NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts | Indigenous script recognition and understanding (multimodal) | [paper] |
| Indonesian 🇮🇩 | 2025-12 | Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation | Multimodal emotion recognition (text, audio, visual) with 1,944 video segments, 7 emotion categories | [paper] [code] |
| Italian 🇮🇹 | 2024-06 | Disce aut Deficere: Evaluating LLMs Proficiency on the Invalsi Italian Benchmark | Locate and Identify Information, Reconstruct Meaning, Reflect on Content/Form, Word Formation, Lexicon and Semantics, Morphology, Spelling, Syntax, Textuality and Pragmatics, Cloze (Fill-in-the-Blank), Multiple Choice (MC), Multiple Complex Choice (MCC), Unique Response (RU), Short Response (RB) | [paper] |
| Japanese 🇯🇵 | 2024-10 | JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation | 1,320 expert-level multimodal questions across 9 subjects: Japanese art (150q), Japanese heritage (150q), Japanese history (150q), world history (150q), art & psychology (90q), business (150q), science (120q), health & medicine (150q), tech & engineering (210q) | [paper] |
| Kazakh 🇰🇿 | 2025-02 | KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge | 23,000 questions covering STEM, humanities, social sciences with bilingual Kazakh-Russian context | [paper] |
| Korean 🇰🇷 | 2024-02 | KMMLU: Measuring Massive Multitask Language Understanding in Korean | 45 subjects across STEM (science, technology, engineering, mathematics), humanities & social sciences (HUMSS: history, psychology, etc.), applied sciences (aviation, gas technology, nondestructive testing), other professional-level knowledge | [paper] |
| Latvian/Giriama 🇱🇻🇰🇪 | 2025-03 | LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | Culturally relevant MMLU subset for Latvian and Giriama languages | [paper] |
| Malaysian/Malay 🇲🇾 | 2024-01 | Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding | Tatabahasa (Malay grammar) evaluation and language understanding | [paper] |
| Maltese 🇲🇹 | 2025-06 | MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | 11 tasks: sentiment, topic classification, reading comprehension, machine translation, summarization, data-to-text | [paper] |
| Norwegian 🇳🇴 | 2025-04 | NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark | 24 datasets: commonsense reasoning, reading comprehension, sentiment analysis, Norwegian language knowledge, machine translation, truthfulness, text summarization, instruction following, Norwegian-specific & world knowledge (Bokmål & Nynorsk) | [paper] |
| Polish 🇵🇱 | 2025-01 | LLMzSzŁ: a comprehensive LLM benchmark for Polish | 19K questions from Polish national exams across 154 domains, 4 exam types: middle school (math, sciences, biology, physics, Polish language), high school, professional (arts, mechanics/mining/metallurgy, agriculture/forestry) | [paper] |
| Polish 🇵🇱 | 2026-01 | Bielik 11B v3: Multilingual Large Language Model for European Languages | Evaluated on: sentiment analysis, NER, QA, belebele, flores, Polish EQ-bench, complex understanding, medical leaderboard tasks (multilingual model with Polish focus) | [paper] |
| Russian 🇷🇺 | 2024-01 | MERA: A Comprehensive LLM Evaluation in Russian | MathLogicQA, MultiQ, PARus, RCB, ruModAr, ruMultiAr, ruOpenBookQA, ruTiE, ruWorldTree, RWSD, SimpleAr, BPS, CheGeKa, LCS, ruHumanEval, ruMMLU, USE, ruDetox, ruEthics, ruHateSpeech, ruHHH | [paper] [web] |
| Sanskrit 🇮🇳 | 2024-09 | One Model is All You Need: ByT5-Sanskrit, A Unified Model for Sanskrit NLP Tasks | Unified Sanskrit benchmark for word segmentation, lemmatization, morphosyntactic tagging | [paper] |
| Swahili 🇰🇪🇹🇿 | 2024-07 | SwahBERT: Language Model of Swahili | Swahili NLP benchmark with multiple evaluation tasks | [paper] |
| Tibetan 🇨🇳 | 2025-03 | TLUE: A Tibetan Language Understanding Evaluation Benchmark | Ti-MMLU (67 subdomains across 5 domains: STEM, humanities, social sciences, China-specific, other) + Ti-SafetyBench (offensive content, physical/mental harm, illegal activities, ethics, privacy) | [paper] |
| Turkish 🇹🇷 | 2025-01 | Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Turkish adaptation of MMLU covering STEM, humanities, social sciences across elementary to professional levels | [paper] |
| Turkish 🇹🇷 | 2026-01 | TurkBench: A Benchmark for Evaluating Turkish Large Language Models | 21 subtasks: general knowledge, MMLU, reading comprehension, NLI, summarization, STS, reasoning, sentiment, NER, POS, instruction following | [paper] |
| Vietnamese 🇻🇳 | 2025-06 | VMLU Benchmarks: A Comprehensive Benchmark Toolkit for Vietnamese LLMs | Vi-MQA (multiple-choice QA), Vi-SQuAD, Vi-DROP, Vi-Dialog + subjects across elementary (math, science), middle school (biology, chemistry, math, physics), high school, university (clinical pharmacology, etc.) | [paper] |
| Yoruba 🇳🇬 | 2024-06 | Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Yoruba language understanding benchmark with cultural context | [paper] |
These multilingual benchmarks assess knowledge, reasoning, safety, truthfulness, and multimodal understanding across many languages at once, often through exam-style or culturally grounded questions.
| Languages | Date | Title | Tasks | Links |
|---|---|---|---|---|
| 29 Languages 🌐 | 2025-03 | MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | Comprehensive multitask language understanding across EN, ZH, JA, KO, FR, DE, ES, PT, AR, TH, HI, BN, SW, and 16 other languages | [paper] [web] |
| 17 Languages 🌐 | 2025-02 | BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models | Simple understanding tasks, instruction following, reasoning, long context understanding, code generation | [paper] |
| SEA Languages 🌏 | 2025-02 | SEA-HELM: Southeast Asian Holistic Evaluation of Language Models | NLP Classics, LLM-specifics, SEA Linguistics, SEA Culture, Safety (Filipino, Indonesian, Tamil, Thai, Vietnamese) | [paper] |
| SEA Languages 🌏 | 2025-02 | SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia | STEM, Humanities, Social Sciences, Other subjects (Indonesian, Thai, Vietnamese) | [paper] |
| Global/42 Languages 🌐 | 2024-12 | Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | Improved MMLU across 42 languages with cultural bias mitigation | [paper] |
| Turkic Languages 🌐 | 2025-02 | TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Comprehensive multilingual benchmark for Turkic language family | [paper] |
| EU20 🇪🇺 | 2024-10 | EU20-MMLU: A Benchmark for Evaluating Large Language Models in European Languages | Multilingual evaluation across 20 European languages | [paper] |
| Iberian Languages 🇪🇸🇵🇹 | 2025-04 | IberBench: LLM Evaluation on Iberian Languages | 101 datasets across 22 task types (Spanish, Portuguese, Catalan, Basque, Galician) | [paper] |
| African Languages 🌍 | 2024-12 | Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Scientific question answering and truthfulness evaluation for low-resource African languages | [paper] |
| African Languages 🌍 | 2024-06 | IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models | Natural Language Inference (AfriXNLI), Mathematical Reasoning (AfriMGSM), Multi-choice QA (AfriMMLU) for 17 African languages including Yoruba, Igbo, Hausa, Nigerian Pidgin | [paper] |
| Indic Languages 🇮🇳 | 2026-01 | INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects | Dialect translation, dialect detection, multi-task evaluation for Indian dialects | [paper] |
| Multimodal 11 Languages 🌐 | 2024-03 | EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark | 20,932 multiple-choice questions across 11 languages from 7 language families with images, tables, diagrams | [paper] |
| 39 Languages Multimodal 🌐 | 2024-10 | Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages | Holistic evaluation spanning 14 datasets in 47 languages with multimodal capabilities | [paper] |
| Sign Languages 🤟 | 2024-08 | FLEURS-ASL: Including American Sign Language in Massively Multilingual Multitask Evaluation | American Sign Language extension of FLORES benchmark for multimodal evaluation | [paper] |
| Code-Switching 10 Languages 🌐 | 2024-06 | Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Safety evaluation with code-switching queries combining up to 10 languages | [paper] |
| Cultural Knowledge 🌐 | 2025-06 | GIMMICK: Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | Multimodal multitask cultural knowledge evaluation across many languages | [paper] |
| Cultural Knowledge 🌐 | 2025-06 | CulturalBench: A Robust, Diverse and Challenging Benchmark for Measuring LMs' Cultural Knowledge Through Human-AI Red-Teaming | Cultural knowledge through human-AI red-teaming | [paper] |
| Hallucination Detection 🌐 | 2025-06 | CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection | Cross-lingual and cross-modal hallucination detection | [paper] |
| Knowledge Editing 🌐 | 2025-01 | MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Knowledge editing across multiple languages | [paper] |
| Truthfulness 🌐 | 2025-01 | VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability | Truthfulness assessment with multilingual transferability | [paper] |
| Safety & Bias 🇸🇬 | 2025-07 | RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Safety classification across Singlish, Chinese, Malay, Tamil with 5,000+ examples | [paper] |
| Figurative Language 🌐 | 2025-11 | FLUID QA: A Multilingual Benchmark for Figurative Language Usage in Dialogue | Figurative language usage (sarcasm, metaphor, idiom) in English, Korean, Chinese dialogue | [paper] |
| Factuality 🌐 | 2025-08 | CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation | Cross-lingual and cross-modal (speech/text) factuality in 8 languages | [paper] [code] |
| Hallucination 🌐 | 2025-05 | MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations | KG-based multilingual hallucination detection with 25.9k curated paths | [paper] |
| Hallucination 🌐 | 2025-10 | Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection | Scientific hallucination in English, French, Hindi, Italian, Spanish, Bengali, Gujarati, Malayalam, Telugu | [paper] |
| Bias & Stereotypes 🌐 | 2025-11 | Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs | Debate-style bias evaluation in 7 languages (English, Chinese, Swahili, Nigerian Pidgin, +3) with 8,400 prompts | [paper] |
| Text Detoxification 🌐 | 2025-07 | Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification | Text detoxification in English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, Amharic | [paper] |
| Multilingual Financial (5) 🌐 | 2025-06 | MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application | First multilingual financial benchmark (English, Chinese, Japanese, Spanish, Greek) across modalities | [paper] |
| Multilingual Medical 🌐 | 2024-04 | MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | First multilingual medical benchmark with reference gold explanations written by medical doctors | [paper] |
| Multilingual Medical 🌐 | 2024-12 | Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs | Multilingual ophthalmological benchmark for low and middle-income countries | [paper] |
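Many of the MMLU-style suites listed above score a model by extracting a choice letter from its free-form reply and comparing it to the answer key. A hedged sketch of that common pattern — the items, regex, and `extract_choice` helper are illustrative assumptions, not the scoring code of any specific benchmark:

```python
import re

def extract_choice(output, choices="ABCD"):
    # take the first standalone choice letter in the model's reply
    m = re.search(rf"\b([{choices}])\b", output.strip())
    return m.group(1) if m else None

# toy multiple-choice items: gold key plus a raw model reply
items = [
    {"answer": "B", "model_output": "The correct answer is B."},
    {"answer": "D", "model_output": "D"},
    {"answer": "A", "model_output": "I think the answer is C."},
]

score = sum(extract_choice(it["model_output"]) == it["answer"] for it in items) / len(items)
print(round(score, 2))  # 0.67
```

Real harnesses are usually stricter (constrained decoding or per-choice log-likelihoods rather than regex extraction), but the accuracy computation at the end is the same.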