When the Nigerian government announced plans in April to develop a multilingual AI tool to boost digital inclusion across the West African nation, 28-year-old computer science student Lwasinam Lenham Dilli was thrilled. Dilli said he had struggled to scrape datasets from the internet to build a large language model (LLM) in Hausa, the kind of model that underpins AI chatbots, as part of his final-year project at university.
“I needed texts in English and their corresponding translation in Hausa, but I couldn’t get anything online, there was no clean data,” Dilli told Context.
“(Creating local language LLMs) is a way to ensure that our local dialects and languages will not be forgotten or left out of the AI ecosystem,” he added.
AI mania has swept the world, particularly around tools like OpenAI’s ChatGPT, Meta’s Llama 2, and Mistral AI, which have won critical acclaim and are used by millions of people worldwide to generate human-like text.
For many tech-savvy Africans, however, the excitement is tempered by a frustrating reality: feed Hausa, Amharic, or Kinyarwanda into the chat, and many of these advanced systems stumble, often responding with gibberish.
Technology experts have warned that the absence of LLMs in African languages risks excluding millions of people across the continent, widening both the digital and economic divides.
The multilingual LLM initiative led by the Nigerian government aims to level the playing field.
It will be trained on five low-resource languages and accented English to ensure stronger language representation in the development of artificial intelligence solutions, Nigeria’s Digital Economy Minister Bosun Tijani said in April.
The project is a collaboration with Nigerian AI startups, with data collected by volunteers fluent in any of the five languages: Yoruba, Hausa, Igbo, Ibibio, and Pidgin, the West African lingua franca.
It will also harness the skills of more than 7,000 fellows from the country’s tech talent programme, a government scheme that aims to train three million Nigerians in skills such as coding and programming, to build the model.
Silas Adekunle, co-founder of Awarri, an AI startup involved in the initiative, said building a sophisticated AI tool that can truly understand Nigeria’s languages and cultural nuances has been full of hurdles.
“We have so many different accents and languages; this will enable many people and developers to build products that leverage AI but are for the Nigerian market,” he said.
“The scale of the project, with particularly limited resources, has meant we had to be creative in how we train the model, gather data, compute and label what we got.”
Bridging the AI language gap
More than 2,000 languages are spoken across 54 African countries, according to the United Nations Educational, Scientific and Cultural Organization (UNESCO).
Yet most African languages remain under-represented on the internet, where English dominates, accounting for about half of all content, followed by Spanish, German, Japanese, and French.
Beyond the Nigerian government’s initiative, a small but growing number of African startups are rising to the challenge of developing AI tools in languages like Swahili, Amharic, Zulu, and Sesotho.
Jacaranda Health, a health tech company, has launched the first LLM to run in Swahili to improve maternal health in East Africa.
The new tool, UlizaLlama (AskLlama), uses Meta’s Llama 3 system and will enhance Jacaranda Health’s SMS service for low-income expectant mothers who speak Swahili, responding to their questions on diet, foetal movement, and exercise during pregnancy.
Currently, the platform offers pre-written automated responses, but once UlizaLlama is integrated by the end of June, it will offer responses tailored to each mother’s needs, with greater detail on pregnancy advice and emergency support.
“A lot of these expecting mothers can’t just Google something. The goal of UlizaLlama is to ensure that we get them the correct answers in the fastest time possible,” Jay Patel, director of technology at Jacaranda Health, told Context.
“We’re looking at probably about 85% to start out with. Also, in terms of response time, it’s a matter of minutes right now, and we hope to have that down to less than a minute shortly.”
In South Africa, the Masakhane initiative translates African languages using open-source machine learning.
Lelapa AI, a South African AI research laboratory, has developed VulaVula, a for-profit language processing tool that can translate, transcribe, and analyse English, Afrikaans, Zulu, and Sesotho.
Scarcity of data, ethical concerns
But developing LLMs in African languages comes with bigger challenges, ranging from the availability of data to pressing ethical concerns over consent and compensation, AI experts say. Unlike high-resource languages such as English or French, most African languages are low-resource, lacking the data needed to train such models effectively.
Michael Michie, co-founder of Everse Technology Africa, an AI startup building AI into data protection and privacy, said the collection of data to train LLMs raises ethical questions. Oral tradition reigns supreme in many African communities, he said, and some communities may not want to share their language to train LLMs, a choice that has to be respected.
“There are currently no regulations or laws in African countries that address issues related to consent, privacy, and compensation for communities when collecting data to train AI tools. This needs to be addressed,” Michie said.
“There are questions of who owns the language and who benefits. There needs to be guidelines to prevent exploitation and ensure the development of these LLMs benefits the people they are meant to serve,” he added.
Open-source initiatives like Creative Commons, which allow creators to legally share their work with specified conditions like ensuring attribution and non-commercial use, are also not a panacea, said some AI experts.
“At the moment there’s this push of saying everything should just be under Creative Commons,” said Vukosi Marivate, associate professor of computer science at the University of Pretoria and co-founder of Lelapa AI. It’s harder to properly pay and acknowledge the people who made those language models in the first place if everything is open source, he said.
“A lot of people are working on LLMs now because of the prestige, that’s where the money is, but we need to be really making sure that our languages are taken care of.”