NaiveNeuron/awesome-multilingual-llm-benchmarks
# awesome-multilingual-llm-benchmarks

A curated list of multilingual and non-English benchmarks for Large Language Models (LLMs) and, more broadly, for NLP models and tools.

## Language-specific Benchmarks

### Foundational Language Understanding Benchmarks

#### Single-Language

These benchmarks focus on core linguistic competencies for individual languages, testing syntax, semantics, natural language inference, sentiment analysis, named entity recognition, and other fundamental language processing capabilities.

| Language | Date | Title | Tasks | Links |
|---|---|---|---|---|
| Amharic 🇪🇹 | 2025-06 | Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval | Passage retrieval for Amharic (Ethiopian) | [paper] |
| Arabic 🇸🇦 | 2021-04 | ALUE: Arabic Language Understanding Evaluation | SA, NLI, STS, Dialect ID, Toxicity/Offensive, QA | [paper] |
| Arabic 🇸🇦 | 2022-12 | ORCA: A Challenging Benchmark for Arabic Language Understanding | SA, NLI, QA (MRC/MC), NER, Topic/News Clf, Paraphrase | [paper] [ACL Anthology] |
| Basque 🇪🇸🇫🇷 | 2022-06 | BasqueGLUE: A Natural Language Understanding Benchmark for Basque | NER, Intent Classification, Slot Filling, Topic Classification, Sentiment Analysis, Stance Detection, QA/NLI, WiC, Coreference Resolution | [paper] [data] |
| Bengali 🇧🇩 | 2021-01 | BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla | SA, NLI, NER, Span-QA | [paper] [code] |
| Bengali 🇧🇩 | 2025-10 | Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles | Political stance detection (Government-leaning, Critique, Neutral) with 200 news articles | [paper] |
| Belarusian 🇧🇾 | 2025-06 | BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian | SA, NLI, QA, NER, Morphology, Topic Classification | [paper] |
| Bulgarian 🇧🇬 | 2023-07 | bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark | NER, POS Tagging, Sentiment, Check-Worthiness, Humor Detection, NLI, Multi-Choice QA, Factuality Classification | [paper] [code] [data] |
| Catalan 🇪🇸 | 2021-12 | The Catalan Language CLUB | NER, POS Tagging, NLI, Document Classification, QA, STS | [paper] [data] |
| Chinese 🇨🇳 | 2020-04 | CLUE: A Chinese Language Understanding Evaluation Benchmark | Short/Long Text Classification, Coreference Resolution, Semantic Similarity, Keyword Recognition, NLI, Machine Reading Comprehension | [paper] |
| Danish 🇩🇰 | 2024-05 | Towards a Danish Semantic Reasoning Benchmark | Inference, Entailment, Synonymy, Similarity, Relatedness, Word Sense Disambiguation (WiC) | [paper] |
| Dutch 🇳🇱 | 2023-12 | DUMB: A Benchmark for Smart Evaluation of Dutch Models | POS Tagging, NER, Word Sense Disambiguation, Pronoun Resolution, Causal Reasoning, NLI, Sentiment Analysis, Document Classification, Question Answering | [paper] |
| Dutch 🇳🇱 | 2025-01 | BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language | Zero-shot information retrieval for Dutch | [paper] |
| Finnish 🇫🇮 | 2020-10 | Towards Fully Bilingual Deep Language Modeling | POS Tagging, NER, Dependency Parsing, Document Classification | [paper] |
| French 🇫🇷 | 2019-12 | FLUE: French Language Understanding Evaluation | Text Classification, Paraphrase, NLI, Parsing, POS, WSD | [web] [paper] |
| French 🇫🇷 | 2025-10 | COLE: a Comprehensive Benchmark for French Language Understanding Evaluation | 23 NLU tasks: sentiment analysis, paraphrase detection, QA, NLI, grammar, definition matching, WSD, pronoun resolution | [paper] |
| German 🇩🇪 | 2024-06 | SuperGLEBer: German Language Understanding Evaluation Benchmark | NER, Document Classification, STS, QA | [paper] |
| Hindi 🇮🇳 | 2025-04 | Benchmarking and Building Zero-shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 | Zero-shot information retrieval for Hindi | [paper] |
| Hungarian 🇭🇺 | 2024-05 | HuLU: Hungarian Language Understanding Benchmark Kit | CoPA, RTE, SST, WNLI, CommitmentBank, ReCoRD QA | [paper] |
| Indonesian 🇮🇩 | 2020-12 | IndoNLU | SA, Aspect-SA, Emotion, POS, NER, NLI, Span-QA/KE | [paper] [ACL Anthology] |
| Indonesian 🇮🇩 | 2020-11 | IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP | Morpho-syntax (POS), Semantics, Discourse (7 tasks) | [paper] [ACL Anthology] |
| Italian 🇮🇹 | 2023-07 | UINAUIL: A Unified Benchmark for Italian Natural Language Understanding | Textual Entailment, Event Detection & Classification (EVENTI), Factuality Classification (FactA), Sentiment Analysis (SENTIPOLC), Irony Detection (IronITA), Hate Speech Detection (HaSpeeDe) | [paper] |
| Italian 🇮🇹 | 2025-02 | Evalita-LLM: Benchmarking Large Language Models on Italian | 10 native Italian tasks: WSD, textual entailment, sentiment analysis, hate speech, QA, NER, relation extraction, summarization | [paper] |
| Japanese 🇯🇵 | 2022-07 | JGLUE: Japanese General Language Understanding Evaluation | SA (MARC-ja), NLI (JNLI), STS (JSTS), QA (JSQuAD/JCommonsenseQA), Acceptability (JCoLA) | [paper] [ACL Anthology] |
| Norwegian 🇳🇴 | 2023-05 | NorBench -- A Benchmark for Norwegian Language Models | Morpho-syntactic tasks (POS Tagging, Lemmatization, Dependency Parsing), NER, Sentiment Analysis (Document-level, Sentence-level, Targeted), Linguistic Acceptability, Question Answering, Machine Translation, Diagnostics of Harmful Predictions (Gender Bias, Harmfulness) | [paper] [code] |
| Persian 🇮🇷 | 2020-12 | ParsiNLU: A Suite of Language Understanding Challenges for Persian | Reading Comprehension, Textual Entailment, Sentiment Analysis, Question Paraphrasing, Machine Translation, Query Paraphrasing | [paper] [ACL Anthology] |
| Polish 🇵🇱 | 2020-05 | KLEJ: Comprehensive Benchmark for Polish Language Understanding | NER, Sentence Relatedness, Textual Entailment, Cyberbullying Detection, Sentiment Analysis (In-Domain & Out-of-Domain), Question Answering, Paraphrase Detection, Sentiment Analysis (Allegro Reviews) | [paper] |
| Polish 🇵🇱 | 2022-12 | This is the way: Designing and Compiling LEPISZCZE, a Comprehensive NLP Benchmark for Polish | Sentiment Analysis, Abusive Clauses Detection, Political Advertising Detection, NLI, NER, POS Tagging, Paraphrase Classification, Punctuation Restoration, Dialogue Acts Classification | [paper] |
| Portuguese 🇵🇹🇧🇷 | 2024-04 | PORTULAN ExtraGLUE Datasets and Models | SST-2, MRPC, STS-B, MNLI, QNLI, RTE, WNLI, BoolQ, MultiRC, CoPA | [paper] |
| Romanian 🇷🇴 | 2021-12 | LiRo: Benchmark and Leaderboard for Romanian Language Tasks | Document Classification, NER, Machine Translation, Sentiment Analysis, POS Tagging, Dependency Parsing, Language Modeling, QA, STS, Gender Debiasing | [paper] [web] |
| Russian 🇷🇺 | 2020-10 | RussianSuperGLUE | Commonsense/COPA-like, RTE/NLI, QA, WSC-like, Paraphrase | [paper] [ACL Anthology] |
| Slovak 🇸🇰 | 2025-06 | skLEP: A Slovak General Language Understanding Benchmark | Sentiment Analysis, NER, Text Classification, Paraphrase Detection, Word Sense Disambiguation | [paper] |
| Slovenian 🇸🇮 | 2022-02 | Slovene SuperGLUE Benchmark: Translation and Evaluation | BoolQ, CB, COPA, MultiRC, RTE, WSC | [paper] |
| Swedish 🇸🇪 | 2023-12 | Superlim: A Swedish Language Understanding Evaluation Benchmark | Absabank-Imm, Argumentation Sentences, DaLAJ-GED, SweParaphrase, SweDN, SweFAQ, SweNLI, SweWiC, SweWinograd, SuperSim, Swedish Analogy, SweSAT, SweDiagnostics, SweWinogender | [paper] |
| Vietnamese 🇻🇳 | 2024-06 | ViGLUE: A Vietnamese General Language Understanding Benchmark and Analysis of Vietnamese Language Models | MNLI, QNLI, RTE, VNRTE, WNLI, SST2, VSFC, VSMEC, MRPC, QQP, CoLA, VToC | [paper] [code] [data] |
| Yoruba Dialects 🇳🇬 | 2024-06 | Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Regional dialect evaluation across four Yoruba language regions | [paper] |

#### Multilingual

These benchmarks cover multiple languages and focus on core linguistic competencies such as syntax, semantics, natural language inference, sentiment analysis, named entity recognition, and other fundamental language processing capabilities across different language families.

| Languages | Date | Title | Tasks | Links |
|---|---|---|---|---|
| African Languages 🌍 | 2023-11 | AfroBench: How Good are Large Language Models on African Languages? | Sentiment Analysis, Topic Classification, Named Entity Recognition, Question Answering, Language Identification (64 languages, 15 tasks) | [paper] |
| Cross-lingual SEA 🌏 | 2023-09 | BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models | Sentiment Analysis, Named Entity Recognition, Natural Language Inference, Part-of-Speech Tagging, Dependency Parsing | [paper] |
| Indic Languages 🇮🇳 | 2024-11 | MILU: A Multi-task Indic Language Understanding Benchmark | Text Classification, Natural Language Inference, Named Entity Recognition, Part-of-Speech Tagging, Sentiment Analysis | [paper] |
| Turkic 🌍 | 2024-03 | Kardeş-NLU (Azeri, Kazakh, Kyrgyz, Uzbek, Uyghur) | NLI, STS, COPA | [paper] [data] |
| Nordic/Scandinavian 🌍 | 2023-04 | ScandEval: A Benchmark for Scandinavian Natural Language Processing | Sentiment analysis, linguistic acceptability, NER, QA (Danish, Swedish, Norwegian, Icelandic, Faroese) | [paper] |
| Indigenous Americas 🌎 | 2022-12 | AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas | Natural language inference (Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, Wixarika) | [paper] |
| Language Varieties (281) 🌍 | 2024-03 | DialectBench: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages | Comprehensive benchmark covering 281 language varieties and dialects worldwide | [paper] |
| Indic Languages 🇮🇳 | 2025-04 | INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages | Question answering across multiple Indic languages | [paper] |
| Text Embeddings 🌍 | 2025-02 | MMTEB: Massive Multilingual Text Embedding Benchmark | Massive multilingual text embedding evaluation across many languages | [paper] |

### Holistic Benchmarks

These benchmarks focus on factual knowledge, domain expertise, curriculum-based assessments, and subject-matter competency. They test models' ability to recall and apply knowledge from specific academic domains, cultural contexts, or educational curricula.
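Most of the MMLU-style exam benchmarks in this section are scored the same way: the model picks one answer letter per question, and the benchmark reports exact-match accuracy. A minimal sketch of that metric (the items here are hypothetical; each benchmark ships its own data splits and evaluation harness, e.g. via EleutherAI's lm-evaluation-harness):

```python
# Minimal sketch of exact-match accuracy for multiple-choice exam benchmarks.
# The predictions and gold labels below are hypothetical placeholders.

def accuracy(predictions, gold):
    """Fraction of questions where the predicted answer letter matches gold."""
    if len(predictions) != len(gold):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Four hypothetical exam questions: the model gets three of four right.
preds = ["A", "C", "B", "D"]
gold = ["A", "C", "D", "D"]
print(accuracy(preds, gold))  # -> 0.75
```

Benchmarks differ mainly in how the answer letter is extracted (log-likelihood over choices vs. parsing generated text), not in the metric itself.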

#### Single-Language

| Language | Date | Title | Tasks | Links |
|---|---|---|---|---|
| American Sign Language 🇺🇸 | 2025-05 | EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language | 200 ASL videos with sentiment and emotion labels for multimodal emotion recognition | [paper] |
| Ancient Chinese 🇨🇳 | 2023-10 | Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE | Polysemy resolution, homographic character resolution, NER, sentence segmentation, couplet prediction, poetry context/quality/appreciation/sentiment, reading comprehension, basic ancient Chinese, traditional culture, ancient medicine/literature/phonetics | [paper] |
| Ancient Chinese 🇨🇳 | 2025-03 | Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | 21 tasks: ancient Chinese/idiom/TCM RC, loan character/allegorical saying/TCM QA, book author/dynasty/collection, poetry/famous quote/idiom source tracing, inverse poetry translation, poetry/ancient Chinese translation & appreciation, idiom/prescription explanation, couplet/ci/poetry generation | [paper] |
| Arabic 🇸🇦 | 2024-02 | ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | 40 tasks covering STEM, humanities, social sciences from school exams across educational levels in North Africa, the Levant, and the Gulf | [paper] |
| Azerbaijani 🇦🇿 | 2024-06 | Open foundation models for Azerbaijani language | Azerbaijani language model evaluation across NLU tasks, text generation, and domain-specific applications | [paper] |
| Basque 🇪🇸 | 2025-01 | BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque | Social bias assessment across 8 domains: age, disability, gender, nationality, physical appearance, race/ethnicity, religion, socioeconomic status | [paper] |
| Bengali 🇧🇩 | 2025-05 | BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | 41 domains across STEM, humanities, social sciences, general knowledge; includes BnMMLU-HARD subset for difficult questions | [paper] |
| Bengali 🇧🇩 | 2025-06 | BenNumEval: A Benchmark to Assess LLMs' Numerical Reasoning Capabilities in Bengali | Numerical reasoning evaluation for Bengali | [paper] |
| Cantonese 🇭🇰🇨🇳 | 2024-08 | How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Yue-TruthfulQA, Yue-GSM8K, Yue-ARC-C, Yue-MMLU, Yue-TRANS | [paper] |
| Cantonese 🇭🇰 | 2025-03 | HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs | Cantonese-specific tasks testing Hong Kong cultural knowledge, colloquial expressions, code-switching, sentiment analysis | [paper] |
| Chinese 🇨🇳 | 2023-06 | CMMLU: Measuring massive multitask language understanding in Chinese | 67 subjects across STEM (17 tasks: physics, chemistry, math, CS, engineering), humanities (13 tasks: literature, history, philosophy), social sciences (22 tasks: law, psychology, education), other (15 tasks: Chinese food culture, driving rules, etc.) | [paper] |
| Chinese 🇨🇳 | 2024-01 | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | College-level multimodal QA across art & design, business, science, health & medicine, humanities & social science, tech & engineering requiring Chinese cultural context | [paper] |
| Chinese 🇨🇳 | 2024-09 | CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data | Multi-Choice QA, Bool QA, Fill-in-the-Blank QA, Analysis QA | [paper] |
| Chinese 🇨🇳 | 2024-03 | LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models | Comprehensive knowledge evaluation across STEM, humanities, social sciences with both objective (multiple-choice) and subjective (open-ended) questions | [paper] |
| Chinese 🇨🇳 | 2025-06 | MCS-Bench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in Chinese Classical Studies | Multimodal tasks for classical Chinese: poetry appreciation, calligraphy analysis, historical artifact recognition, classical painting understanding | [paper] |
| Chinese 🇨🇳 | 2025-06 | MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Evaluation across 6 minority languages (Mongolian, Tibetan, Uyghur, Yi, Zhuang, Kazakh) covering reading comprehension, translation, cultural knowledge, linguistic understanding | [paper] |
| Classical Chinese 🇨🇳 | 2024-05 | C³Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models | 50,000 text pairs for classification, retrieval, NER, punctuation, translation across 10 domains | [paper] |
| Czech 🇨🇿 | 2024-12 | BenCzechMark: A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism | 50 tasks across 8 categories: Czech math reasoning, language modeling, NER, reading comprehension, factual knowledge, Czech language understanding, sentiment analysis, NLI | [paper] |
| Dutch 🇳🇱 | 2024-12 | Fietje: An open, efficient LLM for Dutch | Evaluation suite with reasoning, sentiment analysis, world knowledge, linguistic acceptability, WSD | [paper] |
| Filipino 🇵🇭 | 2025-02 | Batayan: A Filipino NLP benchmark for evaluating Large Language Models | 8 tasks: paraphrase identification, question answering, sentiment analysis, toxicity detection, commonsense reasoning (COPA), NLI, abstractive summarization, machine translation (Tagalog/English) | [paper] |
| Finnish 🇫🇮 | 2025-12 | FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models | Reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, alignment evaluation | [paper] |
| French 🇫🇷 | 2024-02 | DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain | Biomedical language understanding evaluation for French | [paper] |
| Hebrew 🇮🇱 | 2024-06 | HeSum: A Novel Dataset for Abstractive Text Summarization in Hebrew | Hebrew summarization benchmark with evaluation metrics | [paper] |
| Hindi 🇮🇳 | 2025-07 | Multilingual LLMs are not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation | 405 multiple-choice analogical reasoning questions from authentic Indian government exams | [paper] |
| Hong Kong (Cantonese/Chinese) 🇭🇰 | 2025-05 | Measuring Hong Kong Massive Multi-Task Language Understanding | MMLU-style benchmark across STEM, social sciences, humanities, other subjects + Cantonese-Mandarin translation tasks (both directions) | [paper] |
| Indic Languages (9) 🇮🇳 | 2025-01 | IndicMMLU-Pro: Benchmarking Indic Large Language Models | MMLU-Pro style complex reasoning across 9 languages (Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Urdu, Kannada, Punjabi) covering STEM, humanities, social sciences, business, health | [paper] |
| Indonesian 🇮🇩 | 2025-06 | NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts | Indigenous script recognition and understanding (multimodal) | [paper] |
| Indonesian 🇮🇩 | 2025-12 | Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation | Multimodal emotion recognition (text, audio, visual) with 1,944 video segments, 7 emotion categories | [paper] [code] |
| Italian 🇮🇹 | 2024-06 | Disce aut Deficere: Evaluating LLMs Proficiency on the Invalsi Italian Benchmark | Locate and Identify Information, Reconstruct Meaning, Reflect on Content/Form, Word Formation, Lexicon and Semantics, Morphology, Spelling, Syntax, Textuality and Pragmatics, Cloze (Fill-in-the-Blank), Multiple Choice (MC), Multiple Complex Choice (MCC), Unique Response (RU), Short Response (RB) | [paper] |
| Japanese 🇯🇵 | 2024-10 | JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation | 1,320 expert-level multimodal questions across 9 subjects: Japanese art (150q), Japanese heritage (150q), Japanese history (150q), world history (150q), art & psychology (90q), business (150q), science (120q), health & medicine (150q), tech & engineering (210q) | [paper] |
| Kazakh 🇰🇿 | 2025-02 | KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge | 23,000 questions covering STEM, humanities, social sciences with bilingual Kazakh-Russian context | [paper] |
| Korean 🇰🇷 | 2024-02 | KMMLU: Measuring Massive Multitask Language Understanding in Korean | 45 subjects across STEM (science, technology, engineering, mathematics), humanities & social sciences (HUMSS: history, psychology, etc.), applied sciences (aviation, gas technology, nondestructive testing), other professional-level knowledge | [paper] |
| Latvian/Giriama 🇱🇻🇰🇪 | 2025-03 | LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | Culturally relevant MMLU subset for Latvian and Giriama | [paper] |
| Malaysian/Malay 🇲🇾 | 2024-01 | Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding | Tatabahasa (Malay grammar) evaluation and language understanding | [paper] |
| Maltese 🇲🇹 | 2025-06 | MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | 11 tasks: sentiment, topic classification, reading comprehension, machine translation, summarization, data-to-text | [paper] |
| Norwegian 🇳🇴 | 2025-04 | NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark | 24 datasets: commonsense reasoning, reading comprehension, sentiment analysis, Norwegian language knowledge, machine translation, truthfulness, text summarization, instruction following, Norwegian-specific & world knowledge (Bokmål & Nynorsk) | [paper] |
| Polish 🇵🇱 | 2025-01 | LLMzSzŁ: a comprehensive LLM benchmark for Polish | 19K questions from Polish national exams across 154 domains, 4 exam types: middle school (math, sciences, biology, physics, Polish language), high school, professional (arts, mechanics/mining/metallurgy, agriculture/forestry) | [paper] |
| Polish 🇵🇱 | 2026-01 | Bielik 11B v3: Multilingual Large Language Model for European Languages | Evaluated on sentiment analysis, NER, QA, Belebele, FLORES, Polish EQ-Bench, complex understanding, medical leaderboard tasks (multilingual model with Polish focus) | [paper] |
| Russian 🇷🇺 | 2024-01 | MERA: A Comprehensive LLM Evaluation in Russian | MathLogicQA, MultiQ, PARus, RCB, ruModAr, ruMultiAr, ruOpenBookQA, ruTiE, ruWorldTree, RWSD, SimpleAr, BPS, CheGeKa, LCS, ruHumanEval, ruMMLU, USE, ruDetox, ruEthics, ruHateSpeech, ruHHH | [paper] [web] |
| Sanskrit 🇮🇳 | 2024-09 | One Model is All You Need: ByT5-Sanskrit, A Unified Model for Sanskrit NLP Tasks | Unified Sanskrit benchmark for word segmentation, lemmatization, morphosyntactic tagging | [paper] |
| Swahili 🇰🇪🇹🇿 | 2024-07 | SwahBERT: Language Model of Swahili | Swahili NLP benchmark with multiple evaluation tasks | [paper] |
| Tibetan 🇨🇳 | 2025-03 | TLUE: A Tibetan Language Understanding Evaluation Benchmark | Ti-MMLU (67 subdomains across 5 domains: STEM, humanities, social sciences, China-specific, other) + Ti-SafetyBench (offensive content, physical/mental harm, illegal activities, ethics, privacy) | [paper] |
| Turkish 🇹🇷 | 2025-01 | Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Turkish adaptation of MMLU covering STEM, humanities, social sciences across elementary to professional levels | [paper] |
| Turkish 🇹🇷 | 2026-01 | TurkBench: A Benchmark for Evaluating Turkish Large Language Models | 21 subtasks: general knowledge, MMLU, reading comprehension, NLI, summarization, STS, reasoning, sentiment, NER, POS, instruction following | [paper] |
| Vietnamese 🇻🇳 | 2025-06 | VMLU Benchmarks: A Comprehensive Benchmark Toolkit for Vietnamese LLMs | Vi-MQA (multiple-choice QA), Vi-SQuAD, Vi-DROP, Vi-Dialog + subjects across elementary (math, science), middle school (biology, chemistry, math, physics), high school, university (clinical pharmacology, etc.) | [paper] |
| Yoruba 🇳🇬 | 2024-06 | Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Yoruba language understanding benchmark with cultural context | [paper] |

#### Multilingual

| Languages | Date | Title | Tasks | Links |
|---|---|---|---|---|
| 29 Languages 🌍 | 2025-03 | MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | Comprehensive multitask language understanding across EN, ZH, JA, KO, FR, DE, ES, PT, AR, TH, HI, BN, SW, and 16 other languages | [paper] [web] |
| 17 Languages 🌍 | 2025-02 | BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models | Simple understanding tasks, instruction following, reasoning, long context understanding, code generation | [paper] |
| SEA Languages 🌏 | 2025-02 | SEA-HELM: Southeast Asian Holistic Evaluation of Language Models | NLP Classics, LLM-specifics, SEA Linguistics, SEA Culture, Safety (Filipino, Indonesian, Tamil, Thai, Vietnamese) | [paper] |
| SEA Languages 🌏 | 2025-02 | SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia | STEM, Humanities, Social Sciences, Other subjects (Indonesian, Thai, Vietnamese) | [paper] |
| Global/42 Languages 🌍 | 2024-12 | Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | Improved MMLU across 42 languages with cultural bias mitigation | [paper] |
| Turkic Languages 🌍 | 2025-02 | TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Comprehensive multilingual benchmark for the Turkic language family | [paper] |
| EU20 🇪🇺 | 2024-10 | EU20-MMLU: A Benchmark for Evaluating Large Language Models in European Languages | Multilingual evaluation across 20 European languages | [paper] |
| Iberian Languages 🇪🇸🇵🇹 | 2025-04 | IberBench: LLM Evaluation on Iberian Languages | 101 datasets across 22 task types (Spanish, Portuguese, Catalan, Basque, Galician) | [paper] |
| African Languages 🌍 | 2024-12 | Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Scientific question answering and truthfulness evaluation for low-resource African languages | [paper] |
| African Languages 🌍 | 2024-06 | IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models | Natural Language Inference (AfriXNLI), Mathematical Reasoning (AfriMGSM), Multi-choice QA (AfriMMLU) for 17 African languages including Yoruba, Igbo, Hausa, Nigerian Pidgin | [paper] |
| Indic Languages 🇮🇳 | 2026-01 | INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects | Dialect translation, dialect detection, multi-task evaluation for Indian dialects | [paper] |
| Multimodal 11 Languages 🌍 | 2024-03 | EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark | 20,932 multiple-choice questions across 11 languages from 7 language families with images, tables, diagrams | [paper] |
| 39 Languages Multimodal 🌍 | 2024-10 | Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages | Holistic evaluation spanning 14 datasets in 47 languages with multimodal capabilities | [paper] |
| Sign Languages 🤟 | 2024-08 | FLEURS-ASL: Including American Sign Language in Massively Multilingual Multitask Evaluation | American Sign Language extension of the FLORES benchmark for multimodal evaluation | [paper] |
| Code-Switching 10 Languages 🌍 | 2024-06 | Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Safety evaluation with code-switching queries combining up to 10 languages | [paper] |
| Cultural Knowledge 🌍 | 2025-06 | GIMMICK: Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | Multimodal multitask cultural knowledge evaluation across many languages | [paper] |
| Cultural Knowledge 🌍 | 2025-06 | CulturalBench: A Robust, Diverse and Challenging Benchmark for Measuring LMs' Cultural Knowledge Through Human-AI Red-Teaming | Cultural knowledge through human-AI red-teaming | [paper] |
| Hallucination Detection 🌍 | 2025-06 | CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection | Cross-lingual and cross-modal hallucination detection | [paper] |
| Knowledge Editing 🌍 | 2025-01 | MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Knowledge editing across multiple languages | [paper] |
| Truthfulness 🌍 | 2025-01 | VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability | Truthfulness assessment with multilingual transferability | [paper] |
| Safety & Bias 🇸🇬 | 2025-07 | RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Safety classification across Singlish, Chinese, Malay, Tamil with 5,000+ examples | [paper] |
| Figurative Language 🌍 | 2025-11 | FLUID QA: A Multilingual Benchmark for Figurative Language Usage in Dialogue | Figurative language usage (sarcasm, metaphor, idiom) in English, Korean, Chinese dialogue | [paper] |
| Factuality 🌍 | 2025-08 | CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation | Cross-lingual and cross-modal (speech/text) factuality in 8 languages | [paper] [code] |
| Hallucination 🌍 | 2025-05 | MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations | KG-based multilingual hallucination detection with 25.9k curated paths | [paper] |
| Hallucination 🌍 | 2025-10 | Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection | Scientific hallucination in English, French, Hindi, Italian, Spanish, Bengali, Gujarati, Malayalam, Telugu | [paper] |
| Bias & Stereotypes 🌍 | 2025-11 | Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs | Debate-style bias evaluation in 7 languages (English, Chinese, Swahili, Nigerian Pidgin, +3) with 8,400 prompts | [paper] |
| Text Detoxification 🌍 | 2025-07 | Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification | Text detoxification in English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, Amharic | [paper] |
| Multilingual Financial (5) 🌍 | 2025-06 | MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application | First multilingual financial benchmark (English, Chinese, Japanese, Spanish, Greek) across modalities | [paper] |
| Multilingual Medical 🌍 | 2024-04 | MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | First multilingual medical benchmark with reference gold explanations written by medical doctors | [paper] |
| Multilingual Medical 🌍 | 2024-12 | Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs | Multilingual ophthalmological benchmark for low- and middle-income countries | [paper] |
