A curated list of multilingual and non-English benchmarks for Large Language Models (LLMs) and, more broadly, for NLP models and tools.
These benchmarks focus on core linguistic competencies for individual languages, testing syntax, semantics, natural language inference, sentiment analysis, named entity recognition, and other fundamental language processing capabilities.
| Language | Date | Title | Tasks | Links |
|---|---|---|---|---|
| Amharic 🇪🇹 | 2025-06 | Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval | Passage retrieval for Amharic (Ethiopian) | [paper] |
| Arabic 🇸🇦 | 2021-04 | ALUE: Arabic Language Understanding Evaluation | SA, NLI, STS, Dialect ID, Toxicity/Offensive, QA | [paper] |
| Arabic 🇸🇦 | 2022-12 | ORCA: A Challenging Benchmark for Arabic Language Understanding | SA, NLI, QA (MRC/MC), NER, Topic/News Clf, Paraphrase | [paper] [ACL Anthology] |
| Basque 🇪🇸🇫🇷 | 2022-06 | BasqueGLUE: A Natural Language Understanding Benchmark for Basque | NER, Intent Classification, Slot Filling, Topic Classification, Sentiment Analysis, Stance Detection, QA/NLI, WiC, Coreference Resolution | [paper] [data] |
| Bengali 🇧🇩 | 2021-01 | BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla | SA, NLI, NER, Span-QA | [paper] [code] |
| Bengali 🇧🇩 | 2025-10 | Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles | Political stance detection (Government-leaning, Critique, Neutral) with 200 news articles | [paper] |
| Belarusian 🇧🇾 | 2025-06 | BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian | SA, NLI, QA, NER, Morphology, Topic Classification | [paper] |
| Bulgarian 🇧🇬 | 2023-07 | bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark | NER, POS Tagging, Sentiment, Check-Worthiness, Humor Detection, NLI, Multi-Choice QA, Factuality Classification | [paper] [code] [data] |
| Catalan 🇪🇸 | 2021-12 | The Catalan Language CLUB | NER, POS Tagging, NLI, Document Classification, QA, STS | [paper] [data] |
| Chinese 🇨🇳 | 2020-04 | CLUE: A Chinese Language Understanding Evaluation Benchmark | Short/Long Text Classification, Coreference Resolution, Semantic Similarity, Keyword Recognition, NLI, Machine Reading Comprehension | [paper] |
| Danish 🇩🇰 | 2024-05 | Towards a Danish Semantic Reasoning Benchmark | Inference, Entailment, Synonymy, Similarity, Relatedness, Word Sense Disambiguation (WiC) | [paper] |
| Dutch 🇳🇱 | 2023-12 | DUMB: A Benchmark for Smart Evaluation of Dutch Models | POS Tagging, NER, Word Sense Disambiguation, Pronoun Resolution, Causal Reasoning, NLI, Sentiment Analysis, Document Classification, Question Answering | [paper] |
| Dutch 🇳🇱 | 2025-01 | BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language | Zero-shot information retrieval for Dutch | [paper] |
| Finnish 🇫🇮 | 2020-10 | Towards Fully Bilingual Deep Language Modeling | POS Tagging, NER, Dependency Parsing, Document Classification | [paper] |
| French 🇫🇷 | 2019-12 | FLUE: French Language Understanding Evaluation | Text Classification, Paraphrase, NLI, Parsing, POS, WSD | [web] [paper] |
| French 🇫🇷 | 2025-10 | COLE: a Comprehensive Benchmark for French Language Understanding Evaluation | 23 NLU tasks: sentiment analysis, paraphrase detection, QA, NLI, grammar, definition matching, WSD, pronoun resolution | [paper] |
| German 🇩🇪 | 2024-06 | SuperGLEBer: German Language Understanding Evaluation Benchmark | NER, Document Classification, STS, QA | [paper] |
| Hindi 🇮🇳 | 2025-04 | Benchmarking and Building Zero-shot Hindi Retrieval Model with Hindi-BEIR and NLLB-E5 | Zero-shot information retrieval for Hindi | [paper] |
| Hungarian 🇭🇺 | 2024-05 | HuLU: Hungarian Language Understanding Benchmark Kit | CoPA, RTE, SST, WNLI, CommitmentBank, ReCoRD QA | [paper] |
| Indonesian 🇮🇩 | 2020-12 | IndoNLU | SA, Aspect-SA, Emotion, POS, NER, NLI, Span-QA/KE | [paper] [ACL Anthology] |
| Indonesian 🇮🇩 | 2020-11 | IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP | Morpho-syntax (POS), Semantics, Discourse (7 tasks) | [paper] [ACL Anthology] |
| Italian 🇮🇹 | 2023-07 | UINAUIL: A Unified Benchmark for Italian Natural Language Understanding | Textual Entailment, Event Detection & Classification (EVENTI), Factuality Classification (FactA), Sentiment Analysis (SENTIPOLC), Irony Detection (IronITA), Hate Speech Detection (HaSpeeDe) | [paper] |
| Italian 🇮🇹 | 2025-02 | Evalita-LLM: Benchmarking Large Language Models on Italian | 10 native Italian tasks: WSD, textual entailment, sentiment analysis, hate speech, QA, NER, relation extraction, summarization | [paper] |
| Japanese 🇯🇵 | 2022-07 | JGLUE: Japanese General Language Understanding Evaluation | SA (MARC-ja), NLI (JNLI), STS (JSTS), QA (JSQuAD/JCommonsenseQA), Acceptability (JCoLA) | [paper] [ACL Anthology] |
| Norwegian 🇳🇴 | 2023-05 | NorBench -- A Benchmark for Norwegian Language Models | Morpho-syntactic tasks (POS Tagging, Lemmatization, Dependency Parsing), NER, Sentiment Analysis (Document-level, Sentence-level, Targeted), Linguistic Acceptability, Question Answering, Machine Translation, Diagnostics of Harmful Predictions (Gender Bias, Harmfulness) | [paper] [code] |
| Persian 🇮🇷 | 2020-12 | ParsiNLU: A Suite of Language Understanding Challenges for Persian | Reading Comprehension, Textual Entailment, Sentiment Analysis, Question Paraphrasing, Machine Translation, Query Paraphrasing | [paper] [ACL Anthology] |
| Polish 🇵🇱 | 2020-05 | KLEJ: Comprehensive Benchmark for Polish Language Understanding | NER, Sentence Relatedness, Textual Entailment, Cyberbullying Detection, Sentiment Analysis (In-Domain & Out-of-Domain), Question Answering, Paraphrase Detection, Sentiment Analysis (Allegro Reviews) | [paper] |
| Polish 🇵🇱 | 2022-12 | This is the way: Designing and Compiling LEPISZCZE, a Comprehensive NLP Benchmark for Polish | Sentiment Analysis, Abusive Clauses Detection, Political Advertising Detection, NLI, NER, POS Tagging, Paraphrase Classification, Punctuation Restoration, Dialogue Acts Classification | [paper] |
| Portuguese 🇵🇹🇧🇷 | 2024-04 | PORTULAN ExtraGLUE Datasets and Models | SST-2, MRPC, STS-B, MNLI, QNLI, RTE, WNLI, BoolQ, MultiRC, CoPA | [paper] |
| Romanian 🇷🇴 | 2021-12 | LiRo: Benchmark and Leaderboard for Romanian Language Tasks | Document Classification, NER, Machine Translation, Sentiment Analysis, POS Tagging, Dependency Parsing, Language Modeling, QA, STS, Gender Debiasing | [paper] [web] |
| Russian 🇷🇺 | 2020-10 | RussianSuperGLUE | Commonsense/COPA-like, RTE/NLI, QA, WSC-like, Paraphrase | [paper] [ACL Anthology] |
| Slovak 🇸🇰 | 2025-06 | skLEP: A Slovak General Language Understanding Benchmark | Sentiment Analysis, NER, Text Classification, Paraphrase Detection, Word Sense Disambiguation | [paper] |
| Slovenian 🇸🇮 | 2022-02 | Slovene SuperGLUE Benchmark: Translation and Evaluation | BoolQ, CB, COPA, MultiRC, RTE, WSC | [paper] |
| Swedish 🇸🇪 | 2023-12 | Superlim: A Swedish Language Understanding Evaluation Benchmark | Absabank-Imm, Argumentation Sentences, DaLAJ-GED, SweParaphrase, SweDN, SweFAQ, SweNLI, SweWiC, SweWinograd, SuperSim, Swedish Analogy, SweSAT, SweDiagnostics, SweWinogender | [paper] |
| Vietnamese 🇻🇳 | 2024-06 | ViGLUE: A Vietnamese General Language Understanding Benchmark and Analysis of Vietnamese Language Models | MNLI, QNLI, RTE, VNRTE, WNLI, SST2, VSFC, VSMEC, MRPC, QQP, CoLA, VToC | [paper] [code] [data] |
| Yoruba Dialects 🇳🇬 | 2024-06 | Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Regional dialect evaluation across four Yoruba language regions | [paper] |
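Most of the classification-style tasks in the table above (sentiment analysis, NLI, topic classification) are scored with plain accuracy or macro-F1. A minimal, self-contained sketch of both metrics — the `preds` and `golds` lists below are illustrative stand-ins, not data from any listed benchmark:

```python
# Sketch of scoring a GLUE-style classification task with accuracy and macro-F1.
# `preds` would come from a model; `golds` from the benchmark's labels.

def accuracy(preds, golds):
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def macro_f1(preds, golds):
    # unweighted mean of per-label F1, so rare labels count as much as common ones
    labels = set(golds) | set(preds)
    f1s = []
    for lab in labels:
        tp = sum(1 for p, g in zip(preds, golds) if p == lab and g == lab)
        fp = sum(1 for p, g in zip(preds, golds) if p == lab and g != lab)
        fn = sum(1 for p, g in zip(preds, golds) if p != lab and g == lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

golds = ["pos", "neg", "neg", "pos"]
preds = ["pos", "neg", "pos", "pos"]
print(accuracy(preds, golds))            # 0.75
print(round(macro_f1(preds, golds), 3))  # 0.733
```

Macro-F1 is the headline metric for several of the suites above precisely because it is not dominated by the majority class the way accuracy is.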
These benchmarks cover multiple languages and focus on core linguistic competencies such as syntax, semantics, natural language inference, sentiment analysis, named entity recognition, and other fundamental language processing capabilities across different language families.
| Languages | Date | Title | Tasks | Links |
|---|---|---|---|---|
| African Languages 🌍 | 2023-11 | AfroBench: How Good are Large Language Models on African Languages? | Sentiment Analysis, Topic Classification, Named Entity Recognition, Question Answering, Language Identification (64 languages, 15 tasks) | [paper] |
| Cross-lingual SEA 🌏 | 2023-09 | BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models | Sentiment Analysis, Named Entity Recognition, Natural Language Inference, Part-of-Speech Tagging, Dependency Parsing | [paper] |
| Indic Languages 🇮🇳 | 2024-11 | MILU: A Multi-task Indic Language Understanding Benchmark | Text Classification, Natural Language Inference, Named Entity Recognition, Part-of-Speech Tagging, Sentiment Analysis | [paper] |
| Turkic 🌐 | 2024-03 | Kardeş-NLU (Azeri, Kazakh, Kyrgyz, Uzbek, Uyghur) | NLI, STS, COPA | [paper] [data] |
| Nordic/Scandinavian 🌍 | 2023-04 | ScandEval: A Benchmark for Scandinavian Natural Language Processing | Sentiment analysis, linguistic acceptability, NER, QA (Danish, Swedish, Norwegian, Icelandic, Faroese) | [paper] |
| Indigenous Americas 🌎 | 2022-12 | AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas | Natural language inference (Asháninka, Aymara, Bribri, Guarani, Nahuatl, Otomí, Quechua, Rarámuri, Shipibo-Konibo, Wixarika) | [paper] |
| Language Varieties (281) 🌐 | 2024-03 | DialectBench: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages | Comprehensive benchmark covering 281 language varieties and dialects worldwide | [paper] |
| Indic Languages 🇮🇳 | 2025-04 | INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages | Question answering across multiple Indic languages | [paper] |
| Text Embeddings 🌐 | 2025-02 | MMTEB: Massive Multilingual Text Embedding Benchmark | Massive multilingual text embedding evaluation across many languages | [paper] |
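The retrieval-oriented entries here and above (BEIR-NL, Hindi-BEIR, MMTEB) are typically reported with ranking metrics such as nDCG@10 rather than accuracy. A self-contained sketch of that metric — the toy relevance list is illustrative, not drawn from any of these benchmarks:

```python
import math

def dcg(rels):
    # discounted cumulative gain over a ranked list of relevance grades
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(ranked_rels, k):
    # normalize by the DCG of the ideal (relevance-sorted) ranking
    ideal = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal if ideal else 0.0

# toy ranking: graded relevance of the top 5 retrieved passages for one query
print(round(ndcg_at_k([1, 0, 1, 0, 0], k=5), 3))  # 0.92
```

A benchmark score is then the mean of this value over all queries; the log discount is what rewards placing relevant passages near the top of the ranking.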
These benchmarks focus on factual knowledge, domain expertise, curriculum-based assessments, and subject-matter competency. They test models' ability to recall and apply knowledge from specific academic domains, cultural contexts, or educational curricula.
| Language | Date | Title | Tasks | Links |
|---|---|---|---|---|
| American Sign Language 🇺🇸 | 2025-05 | EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language | 200 ASL videos with sentiment and emotion labels for multimodal emotion recognition | [paper] |
| Ancient Chinese 🇨🇳 | 2023-10 | Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE | Polysemy resolution, homographic character resolution, NER, sentence segmentation, couplet prediction, poetry context/quality/appreciation/sentiment, reading comprehension, basic ancient Chinese, traditional culture, ancient medicine/literature/phonetics | [paper] |
| Ancient Chinese 🇨🇳 | 2025-03 | Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | 21 tasks: ancient Chinese/idiom/TCM RC, loan character/allegorical saying/TCM QA, book author/dynasty/collection, poetry/famous quote/idiom source tracing, inverse poetry translation, poetry/ancient Chinese translation & appreciation, idiom/prescription explanation, couplet/ci/poetry generation | [paper] |
| Arabic 🇸🇦 | 2024-02 | ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | 40 tasks covering STEM, humanities, social sciences from school exams across educational levels in North Africa, Levant, and Gulf regions | [paper] |
| Azerbaijani 🇦🇿 | 2024-06 | Open foundation models for Azerbaijani language | Azerbaijani language model evaluation across NLU tasks, text generation, and domain-specific applications | [paper] |
| Basque 🇪🇸 | 2025-01 | BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque | Social bias assessment across 8 domains: age, disability, gender, nationality, physical appearance, race/ethnicity, religion, socioeconomic status | [paper] |
| Bengali 🇧🇩 | 2025-05 | BnMMLU: Measuring Massive Multitask Language Understanding in Bengali | 41 domains across STEM, humanities, social sciences, general knowledge; includes BnMMLU-HARD subset for difficult questions | [paper] |
| Bengali 🇧🇩 | 2025-06 | BenNumEval: A Benchmark to Assess LLMs' Numerical Reasoning Capabilities in Bengali | Numerical reasoning evaluation for Bengali | [paper] |
| Cantonese 🇭🇰🇨🇳 | 2024-08 | How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Yue-TruthfulQA, Yue-GSM8K, Yue-ARC-C, Yue-MMLU, Yue-TRANS | [paper] |
| Cantonese 🇭🇰 | 2025-03 | HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs | Cantonese-specific tasks testing Hong Kong cultural knowledge, colloquial expressions, code-switching, sentiment analysis | [paper] |
| Chinese 🇨🇳 | 2023-06 | CMMLU: Measuring massive multitask language understanding in Chinese | 67 subjects across STEM (17 tasks: physics, chemistry, math, CS, engineering), humanities (13 tasks: literature, history, philosophy), social sciences (22 tasks: law, psychology, education), other (15 tasks: Chinese food culture, driving rules, etc.) | [paper] |
| Chinese 🇨🇳 | 2024-01 | CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark | College-level multimodal QA across art & design, business, science, health & medicine, humanities & social science, tech & engineering requiring Chinese cultural context | [paper] |
| Chinese 🇨🇳 | 2024-09 | CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data | Multi-Choice QA, Bool QA, Fill-in-the Blank QA, Analysis QA | [paper] |
| Chinese 🇨🇳 | 2024-03 | LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models | Comprehensive knowledge evaluation across STEM, humanities, social sciences with both objective (multiple-choice) and subjective (open-ended) questions | [paper] |
| Chinese 🇨🇳 | 2025-06 | MCS-Bench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in Chinese Classical Studies | Multimodal tasks for classical Chinese: poetry appreciation, calligraphy analysis, historical artifact recognition, classical painting understanding | [paper] |
| Chinese 🇨🇳 | 2025-06 | MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Evaluation across 6 minority languages (Mongolian, Tibetan, Uyghur, Yi, Zhuang, Kazakh) covering reading comprehension, translation, cultural knowledge, linguistic understanding | [paper] |
| Classical Chinese 🇨🇳 | 2024-05 | C³Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models | 50,000 text pairs for classification, retrieval, NER, punctuation, translation across 10 domains | [paper] |
| Czech 🇨🇿 | 2024-12 | BenCzechMark: A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism | 50 tasks across 8 categories: Czech math reasoning, language modeling, NER, reading comprehension, factual knowledge, Czech language understanding, sentiment analysis, NLI | [paper] |
| Dutch 🇳🇱 | 2024-12 | Fietje: An open, efficient LLM for Dutch | Evaluation suite with reasoning, sentiment analysis, world knowledge, linguistic acceptability, WSD | [paper] |
| Filipino 🇵🇭 | 2025-02 | Batayan: A Filipino NLP benchmark for evaluating Large Language Models | 8 tasks: paraphrase identification, question answering, sentiment analysis, toxicity detection, commonsense reasoning (COPA), NLI, abstractive summarization, machine translation (Tagalog/English) | [paper] |
| Finnish 🇫🇮 | 2025-12 | FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models | Reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, alignment evaluation | [paper] |
| French 🇫🇷 | 2024-02 | DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain | Biomedical language understanding evaluation for French | [paper] |
| Hebrew 🇮🇱 | 2024-06 | HeSum: A Novel Dataset for Abstractive Text Summarization in Hebrew | Hebrew summarization benchmark with evaluation metrics | [paper] |
| Hindi 🇮🇳 | 2025-07 | Multilingual LLMs are not Multilingual Thinkers: Evidence from Hindi Analogy Evaluation | 405 multiple-choice analogical reasoning questions from authentic Indian government exams | [paper] |
| Hong Kong (Cantonese/Chinese) 🇭🇰 | 2025-05 | Measuring Hong Kong Massive Multi-Task Language Understanding | MMLU-style benchmark across STEM, social sciences, humanities, other subjects + Cantonese-Mandarin translation tasks (both directions) | [paper] |
| Indic Languages (9) 🇮🇳 | 2025-01 | IndicMMLU-Pro: Benchmarking Indic Large Language Models | MMLU-Pro style complex reasoning across 9 languages (Hindi, Bengali, Telugu, Marathi, Tamil, Gujarati, Urdu, Kannada, Punjabi) covering STEM, humanities, social sciences, business, health | [paper] |
| Indonesian 🇮🇩 | 2025-06 | NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts | Indigenous script recognition and understanding (multimodal) | [paper] |
| Indonesian 🇮🇩 | 2025-12 | Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation | Multimodal emotion recognition (text, audio, visual) with 1,944 video segments, 7 emotion categories | [paper] [code] |
| Italian 🇮🇹 | 2024-06 | Disce aut Deficere: Evaluating LLMs Proficiency on the Invalsi Italian Benchmark | Locate and Identify Information, Reconstruct Meaning, Reflect on Content/Form, Word Formation, Lexicon and Semantics, Morphology, Spelling, Syntax, Textuality and Pragmatics, Cloze (Fill-in-the-Blank), Multiple Choice (MC), Multiple Complex Choice (MCC), Unique Response (RU), Short Response (RB) | [paper] |
| Japanese 🇯🇵 | 2024-10 | JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation | 1,320 expert-level multimodal questions across 9 subjects: Japanese art (150q), Japanese heritage (150q), Japanese history (150q), world history (150q), art & psychology (90q), business (150q), science (120q), health & medicine (150q), tech & engineering (210q) | [paper] |
| Kazakh 🇰🇿 | 2025-02 | KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge | 23,000 questions covering STEM, humanities, social sciences with bilingual Kazakh-Russian context | [paper] |
| Korean 🇰🇷 | 2024-02 | KMMLU: Measuring Massive Multitask Language Understanding in Korean | 45 subjects across STEM (science, technology, engineering, mathematics), humanities & social sciences (HUMSS: history, psychology, etc.), applied sciences (aviation, gas technology, nondestructive testing), other professional-level knowledge | [paper] |
| Latvian/Giriama 🇱🇻🇰🇪 | 2025-03 | LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama | Culturally relevant MMLU subset for Latvian and Giriama languages | [paper] |
| Malaysian/Malay 🇲🇾 | 2024-01 | Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding | Tatabahasa (Malay grammar) evaluation and language understanding | [paper] |
| Maltese 🇲🇹 | 2025-06 | MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | 11 tasks: sentiment, topic classification, reading comprehension, machine translation, summarization, data-to-text | [paper] |
| Norwegian 🇳🇴 | 2025-04 | NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark | 24 datasets: commonsense reasoning, reading comprehension, sentiment analysis, Norwegian language knowledge, machine translation, truthfulness, text summarization, instruction following, Norwegian-specific & world knowledge (Bokmål & Nynorsk) | [paper] |
| Polish 🇵🇱 | 2025-01 | LLMzSzŁ: a comprehensive LLM benchmark for Polish | 19K questions from Polish national exams across 154 domains, 4 exam types: middle school (math, sciences, biology, physics, Polish language), high school, professional (arts, mechanics/mining/metallurgy, agriculture/forestry) | [paper] |
| Polish 🇵🇱 | 2026-01 | Bielik 11B v3: Multilingual Large Language Model for European Languages | Evaluated on: sentiment analysis, NER, QA, belebele, flores, Polish EQ-bench, complex understanding, medical leaderboard tasks (multilingual model with Polish focus) | [paper] |
| Russian 🇷🇺 | 2024-01 | MERA: A Comprehensive LLM Evaluation in Russian | MathLogicQA, MultiQ, PARus, RCB, ruModAr, ruMultiAr, ruOpenBookQA, ruTiE, ruWorldTree, RWSD, SimpleAr, BPS, CheGeKa, LCS, ruHumanEval, ruMMLU, USE, ruDetox, ruEthics, ruHateSpeech, ruHHH | [paper] [web] |
| Sanskrit 🇮🇳 | 2024-09 | One Model is All You Need: ByT5-Sanskrit, A Unified Model for Sanskrit NLP Tasks | Unified Sanskrit benchmark for word segmentation, lemmatization, morphosyntactic tagging | [paper] |
| Swahili 🇰🇪🇹🇿 | 2024-07 | SwahBERT: Language Model of Swahili | Swahili NLP benchmark with multiple evaluation tasks | [paper] |
| Tibetan 🇨🇳 | 2025-03 | TLUE: A Tibetan Language Understanding Evaluation Benchmark | Ti-MMLU (67 subdomains across 5 domains: STEM, humanities, social sciences, China-specific, other) + Ti-SafetyBench (offensive content, physical/mental harm, illegal activities, ethics, privacy) | [paper] |
| Turkish 🇹🇷 | 2025-01 | Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Turkish adaptation of MMLU covering STEM, humanities, social sciences across elementary to professional levels | [paper] |
| Turkish 🇹🇷 | 2026-01 | TurkBench: A Benchmark for Evaluating Turkish Large Language Models | 21 subtasks: general knowledge, MMLU, reading comprehension, NLI, summarization, STS, reasoning, sentiment, NER, POS, instruction following | [paper] |
| Vietnamese 🇻🇳 | 2025-06 | VMLU Benchmarks: A Comprehensive Benchmark Toolkit for Vietnamese LLMs | Vi-MQA (multiple-choice QA), Vi-SQuAD, Vi-DROP, Vi-Dialog + subjects across elementary (math, science), middle school (biology, chemistry, math, physics), high school, university (clinical pharmacology, etc.) | [paper] |
| Yoruba 🇳🇬 | 2024-06 | Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Yoruba language understanding benchmark with cultural context | [paper] |
These multilingual benchmarks assess knowledge, reasoning, safety, truthfulness, and multimodal understanding across many languages at once, often through exam-style or culturally grounded questions.
| Languages | Date | Title | Tasks | Links |
|---|---|---|---|---|
| 29 Languages 🌐 | 2025-03 | MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | Comprehensive multitask language understanding across EN, ZH, JA, KO, FR, DE, ES, PT, AR, TH, HI, BN, SW, and 16 other languages | [paper] [web] |
| 17 Languages 🌐 | 2025-02 | BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models | Simple understanding tasks, instruction following, reasoning, long context understanding, code generation | [paper] |
| SEA Languages 🌏 | 2025-02 | SEA-HELM: Southeast Asian Holistic Evaluation of Language Models | NLP Classics, LLM-specifics, SEA Linguistics, SEA Culture, Safety (Filipino, Indonesian, Tamil, Thai, Vietnamese) | [paper] |
| SEA Languages 🌏 | 2025-02 | SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia | STEM, Humanities, Social Sciences, Other subjects (Indonesian, Thai, Vietnamese) | [paper] |
| Global/42 Languages 🌐 | 2024-12 | Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | Improved MMLU across 42 languages with cultural bias mitigation | [paper] |
| Turkic Languages 🌐 | 2025-02 | TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages | Comprehensive multilingual benchmark for Turkic language family | [paper] |
| EU20 🇪🇺 | 2024-10 | EU20-MMLU: A Benchmark for Evaluating Large Language Models in European Languages | Multilingual evaluation across 20 European languages | [paper] |
| Iberian Languages 🇪🇸🇵🇹 | 2025-04 | IberBench: LLM Evaluation on Iberian Languages | 101 datasets across 22 task types (Spanish, Portuguese, Catalan, Basque, Galician) | [paper] |
| African Languages 🌍 | 2024-12 | Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Scientific question answering and truthfulness evaluation for low-resource African languages | [paper] |
| African Languages 🌍 | 2024-06 | IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models | Natural Language Inference (AfriXNLI), Mathematical Reasoning (AfriMGSM), Multi-choice QA (AfriMMLU) for 17 African languages including Yoruba, Igbo, Hausa, Nigerian Pidgin | [paper] |
| Indic Languages 🇮🇳 | 2026-01 | INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects | Dialect translation, dialect detection, multi-task evaluation for Indian dialects | [paper] |
| Multimodal 11 Languages 🌐 | 2024-03 | EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark | 20,932 multiple-choice questions across 11 languages from 7 language families with images, tables, diagrams | [paper] |
| 39 Languages Multimodal 🌐 | 2024-10 | Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages | Holistic evaluation spanning 14 datasets in 47 languages with multimodal capabilities | [paper] |
| Sign Languages 🤟 | 2024-08 | FLEURS-ASL: Including American Sign Language in Massively Multilingual Multitask Evaluation | American Sign Language extension of FLORES benchmark for multimodal evaluation | [paper] |
| Code-Switching 10 Languages 🌐 | 2024-06 | Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Safety evaluation with code-switching queries combining up to 10 languages | [paper] |
| Cultural Knowledge 🌐 | 2025-06 | GIMMICK: Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking | Multimodal multitask cultural knowledge evaluation across many languages | [paper] |
| Cultural Knowledge 🌐 | 2025-06 | CulturalBench: A Robust, Diverse and Challenging Benchmark for Measuring LMs' Cultural Knowledge Through Human-AI Red-Teaming | Cultural knowledge through human-AI red-teaming | [paper] |
| Hallucination Detection 🌐 | 2025-06 | CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection | Cross-lingual and cross-modal hallucination detection | [paper] |
| Knowledge Editing 🌐 | 2025-01 | MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | Knowledge editing across multiple languages | [paper] |
| Truthfulness 🌐 | 2025-01 | VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability | Truthfulness assessment with multilingual transferability | [paper] |
| Safety & Bias 🇸🇬 | 2025-07 | RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Safety classification across Singlish, Chinese, Malay, Tamil with 5,000+ examples | [paper] |
| Figurative Language 🌐 | 2025-11 | FLUID QA: A Multilingual Benchmark for Figurative Language Usage in Dialogue | Figurative language usage (sarcasm, metaphor, idiom) in English, Korean, Chinese dialogue | [paper] |
| Factuality 🌐 | 2025-08 | CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation | Cross-lingual and cross-modal (speech/text) factuality in 8 languages | [paper] [code] |
| Hallucination 🌐 | 2025-05 | MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations | KG-based multilingual hallucination detection with 25.9k curated paths | [paper] |
| Hallucination 🌐 | 2025-10 | Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection | Scientific hallucination in English, French, Hindi, Italian, Spanish, Bengali, Gujarati, Malayalam, Telugu | [paper] |
| Bias & Stereotypes 🌐 | 2025-11 | Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs | Debate-style bias evaluation in 7 languages (English, Chinese, Swahili, Nigerian Pidgin, +3) with 8,400 prompts | [paper] |
| Text Detoxification 🌐 | 2025-07 | Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification | Text detoxification in English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, Amharic | [paper] |
| Multilingual Financial (5) 🌐 | 2025-06 | MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application | First multilingual financial benchmark (English, Chinese, Japanese, Spanish, Greek) across modalities | [paper] |
| Multilingual Medical 🌐 | 2024-04 | MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | First multilingual medical benchmark with reference gold explanations written by medical doctors | [paper] |
| Multilingual Medical 🌐 | 2024-12 | Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs | Multilingual ophthalmological benchmark for low and middle-income countries | [paper] |
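Many of the MMLU-style suites listed above score a model by extracting a choice letter from its free-form reply and comparing it to the answer key. A hedged sketch of that common pattern — the items, regex, and `extract_choice` helper are illustrative assumptions, not the scoring code of any specific benchmark:

```python
import re

def extract_choice(output, choices="ABCD"):
    # take the first standalone choice letter in the model's reply
    m = re.search(rf"\b([{choices}])\b", output.strip())
    return m.group(1) if m else None

# toy multiple-choice items: gold key plus a raw model reply
items = [
    {"answer": "B", "model_output": "The correct answer is B."},
    {"answer": "D", "model_output": "D"},
    {"answer": "A", "model_output": "I think the answer is C."},
]

score = sum(extract_choice(it["model_output"]) == it["answer"] for it in items) / len(items)
print(round(score, 2))  # 0.67
```

Real harnesses are usually stricter (constrained decoding or per-choice log-likelihoods rather than regex extraction), but the accuracy computation at the end is the same.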