OpenAI releases multilingual AI dataset on Hugging Face supporting 14 languages

OpenAI raises the bar for multilingual AI capabilities by adding a variety of languages to its new multilingual evaluation
An undated image. — iStock

OpenAI has been a driving force and a trendsetter in the realm of artificial intelligence (AI) since it stepped into the limelight a few years back. 

Despite the influx of countless AI tools backed by multifaceted large language models (LLMs), language coverage remains a largely neglected area in this space, creating a divide that widens over time.

To bridge this language divide, OpenAI has released a massive multilingual dataset that evaluates the performance of language models across various languages.

Read more: OpenAI expands o1 AI models to ChatGPT Enterprise, ChatGPT Edu users

The OpenAI dataset can assess the performance of language models across 14 languages, including Arabic, German, Swahili, Bengali and Yoruba.

The Multilingual Massive Multitask Language Understanding (MMMLU) dataset is available on the open data platform Hugging Face. This multilingual approach builds on the existing Massive Multitask Language Understanding (MMLU) benchmark, which tested, in English, an AI system's knowledge across 57 disciplines ranging from mathematics to law and computer science.
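To illustrate how an MMLU-style benchmark works, the sketch below scores a model on multiple-choice questions: accuracy is simply the fraction of questions where the model's chosen letter matches the answer key. The field names ("Question", "A" through "D", "Answer") mirror the MMLU layout, but the sample items and the scoring helper here are illustrative assumptions, not the actual dataset or OpenAI's evaluation code.

```python
def score_mmlu_items(items, predictions):
    """Return accuracy: fraction of items where the predicted letter
    matches the item's answer key."""
    correct = sum(
        1 for item, pred in zip(items, predictions)
        if pred == item["Answer"]
    )
    return correct / len(items)


# Made-up sample items in an MMLU-like shape (not real benchmark data).
sample_items = [
    {"Question": "2 + 2 = ?",
     "A": "3", "B": "4", "C": "5", "D": "6", "Answer": "B"},
    {"Question": "Capital of France?",
     "A": "Paris", "B": "Rome", "C": "Oslo", "D": "Bern", "Answer": "A"},
]

# Hypothetical model outputs: right on the first item, wrong on the second.
model_predictions = ["B", "C"]

print(score_mmlu_items(sample_items, model_predictions))  # 0.5
```

In a multilingual benchmark like MMMLU, the same scoring would be repeated per language, so that accuracy can be compared across, say, Arabic, Swahili and Yoruba versions of the same questions.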

By adding a variety of languages to the new multilingual evaluation, the ChatGPT maker has raised the bar for multilingual AI capabilities, even though some of these languages have limited resources for training AI models.

Subsequently, it will widen access to the technology in a more transparent way than before, while addressing the long-standing criticism that the AI industry has failed to build language models that grasp the languages spoken by people worldwide.