Mozilla Unveils Free AI Voice Training Data in 180 Languages

30.11.2024 23:46

eWeek

Since its inception in 2017, Mozilla’s Common Voice project has been on a mission to make AI more inclusive and accessible. By gathering more than 30,000 hours of spoken language recordings from contributors worldwide, the project has created one of the largest free AI voice datasets for training voice recognition software. Its purpose is clear: to empower developers and companies of all sizes with publicly available data to improve and build voice-enabled AI tools.

What sets Common Voice apart is its emphasis on volunteer consent and on making sure contributors understand how their recordings will be used. The datasets cover more than 180 languages and are released under the Creative Commons CC0 license. They are available for download from Mozilla and on the Hugging Face AI development platform.

A Lifeline for Endangered Languages

With approximately 3,000 languages at risk of extinction, Common Voice is also a powerful tool for language preservation. Many endangered languages are vanishing as younger generations stop learning them and native speakers dwindle.

Most of these languages are excluded from smartphones, apps, and AI tools, leaving them on the fringes of technological advancement. This exclusion motivates volunteers to contribute so that their languages are represented in AI. Their efforts help create more accurate and inclusive AI tools while safeguarding their cultural heritage.

As of June 2024, Common Voice had added five new languages as part of the project’s initiative to prioritize African languages: Xhosa, Kalenjin, Kidaw’ida, Dholuo, and Setswana. The additions are a significant milestone in Mozilla’s push to include native languages from the continent, which are absent from flagship AI voice assistants like Amazon Alexa, Google Home, and Apple’s Siri. Including underrepresented languages underscores Mozilla’s effort to dismantle linguistic barriers in AI.

Making a Real-World Impact

Mozilla’s high-quality, free AI voice datasets empower developers to create AI solutions for people from diverse backgrounds. These datasets are driving innovations such as legal advice AI chatbots, enhanced screen readers, and improved communication tools for people with disabilities.

By focusing on inclusivity, Mozilla addresses the gaps in AI technology that leave many communities underrepresented. Through its Common Voice project, Mozilla is advancing voice recognition and preserving the linguistic heritage of smaller cultures, ensuring they have a voice in today’s AI. This free access helps support a future where technology is inclusive, adaptable, and reflective of the needs of global communities.
