Today, Google and a consortium of leading African research institutions announced the launch of WAXAL, a large-scale, openly accessible speech dataset designed to catalyze research and build more inclusive AI technologies. The dataset bridges a critical digital divide for over 100 million speakers by providing foundational data for 21 Sub-Saharan African languages, including Hausa, Luganda, Yoruba, and Acholi.
While voice-enabled technologies have become common in much of the world, a profound scarcity of high-quality speech data has prevented their development for most of Africa’s 2,000+ languages. This has excluded hundreds of millions of people from accessing technology in their native tongues.
The WAXAL dataset was created to directly address this gap. Developed over three years with funding from Google, the project features 1,250 hours of transcribed, natural speech and over 20 hours of high-quality studio recordings designed for building high-fidelity synthetic voices.
“The ultimate impact of WAXAL is the empowerment of people in Africa,” said Aisha Walcott-Bryant, Head of Google Research Africa. “This dataset provides the critical foundation for students, researchers, and entrepreneurs to build technology on their own terms, in their own languages, finally reaching over 100 million people. We look forward to seeing African innovators use this data to create everything from new educational tools to voice-enabled services that create tangible economic opportunities across the continent.”
A central principle of the project was to ensure it was built by and for the community. African academic and community organizations, including Makerere University (Uganda), the University of Ghana, and Digital Umuganda (Rwanda), led the data collection with guidance from Google experts. These partner institutions retain full ownership of the data, establishing a new framework for equitable, partnership-led AI development.
The dataset covers the following languages: Acholi, Akan, Dagaare, Dagbani, Dholuo, Ewe, Fante, Fulani (Fula), Hausa, Igbo, Ikposo (Kposo), Kikuyu, Lingala, Luganda, Malagasy, Masaaba, Nyankole, Rukiga, Shona, Soga (Lusoga), Swahili, and Yoruba.
The WAXAL dataset is available starting today. For more information, visit the Google Africa blog at goo.gle/IntroducingWaxal.
For decades, Africa’s vast linguistic diversity has remained largely invisible in the digital world, leaving many communities unheard as artificial intelligence systems evolve. Today, that narrative is beginning to change. Through collaborative efforts that place local voices at the centre of innovation, African researchers are building the foundations for AI that understands not just words, but culture, context and identity.
“For AI to have a real impact in Africa, it must speak our languages and understand our contexts. The WAXAL dataset gives our researchers the high-quality data they need to build speech technologies that reflect our unique communities. In Uganda, it has already strengthened our local research capacity and supported new student and faculty-led projects,” said Joyce Nakatumba-Nabende, Senior Lecturer at Makerere University’s School of Computing and Information Technology.
At the University of Ghana, the ripple effects are already being felt across disciplines and communities.
“For us at the University of Ghana, WAXAL’s impact goes beyond the data itself. It has empowered us to build our own language resources and train a new generation of AI researchers. Over 7,000 volunteers joined us because they wanted their voices and languages to belong in the digital future. Today, that collective effort has sparked an ecosystem of innovation in fields like health, education, and agriculture. This proves that when the data exists, possibility expands everywhere,” said Prof Isaac Wiafe, Associate Professor at the University of Ghana.

