Three Years of Building Language Technology for Tamazight
Earlier this month, I presented technical findings from the Awal project at the International Conference on Information and Communication Technologies for Amazigh (TICAM) 2025 in Rabat. Awal is a project coordinated by CIEMEN, with participation from the Tamazight diaspora in Catalonia, where I served as technical consultant as part of Col·lectivaT.
The presentation took place at IRCAM - the Royal Institute of Amazigh Culture. Being invited to present at IRCAM was an honor. This is the institution responsible for Tamazight language policy in Morocco, working with a team of about 100 people to handle standardization, educational material development, teacher training, and research coordination. The scope of work relative to resources makes it clear why community-powered approaches like Awal matter for closing the digital gap for Tamazight.
During my time in Rabat, I noticed that Tamazight writing is far more visible than it was during my earlier visits to Morocco. Street signs, official documents, shop fronts - the script appears in public spaces in ways that would have been unimaginable two decades ago. This visibility represents real progress, though conversations with linguists at IRCAM and elsewhere revealed the ongoing challenges - dialectal tensions between regions, gaps in educational infrastructure, the constant negotiation around standardization decisions. There’s still a long way to go in terms of social acceptance and daily use.
The Awal project
Awal began in 2022 through a collaboration between CIEMEN, Casa Amaziga de Catalunya and Col·lectivaT. My role as technical lead evolved across three phases of work: first collecting translation data manually, then developing crowdsourced infrastructure through the awaldigital.org website, and finally overseeing the development of translation datasets and machine translation (MT) models.
"Do you speak Tamazight? Join us!"
What we collected and built
The awaldigital.org platform we built provides two main functions. First, contributors can translate between Tamazight and five other languages (Catalan, Spanish, French, English and Arabic), with random sentence loading and an auto-translation feature that generates starting points for correction. Second, the platform includes a review system where experienced contributors validate translations from others. Contributors mark dialectal variants when relevant, and the platform includes gamification elements through points and leaderboards to encourage sustained participation. We also integrated with Mozilla’s Common Voice initiative, allowing participants to contribute voice recordings for speech technology development.
As of 2025, the initiative has gathered up to 7,500 translation pairs and three hours of speech, creating the largest open translation and speech datasets available for Tamazight. The infrastructure continues to operate, albeit passively, since future funding is not secured.
Our publications
For our research presented at TICAM 2025, we examined the community-powered approach itself - analyzing participation patterns, conducting interviews with contributors, and documenting the sociolinguistic challenges that affect data collection. The research revealed five key learnings: that low writing confidence is a major barrier to participation by the general public, that translation can motivate content creation by language learners, that academic and activist communities become the core contributors, that dialect diversity creates both opportunities and tensions, and that Catalan-focused outreach limited our geographic reach. The full findings are available in our TICAM 2025 paper.
For our research presented at the WMT OLDI workshop, we focused on improving existing open-source MT models. We corrected the standard MT datasets FLORES and NLLB-Seed, enhanced the NLLB model through fine-tuning, and compared our results with large language models. Thanks to this work, we achieved a 33% improvement in translation quality in the English-to-Tamazight direction and a 9% improvement in the Tamazight-to-English direction. These improvements are visible to users - Awal translate now handles common phrases, cultural expressions, and dialectal variations much better than before. The corrected reference datasets now provide reliable foundations for any researcher wanting to develop or evaluate machine translation systems for Tamazight. The technical details are available in our OLDI paper.
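For readers wondering how a figure like "33% improvement" is typically derived: assuming it is a relative gain of the fine-tuned model's score over the baseline model's score on an automatic metric, the arithmetic can be sketched as below. The function name and both scores are hypothetical placeholders chosen for illustration, not results from the project; the metric and actual numbers are in the OLDI paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the percentage gain of `improved` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical metric scores, used only to illustrate the arithmetic.
# They are NOT results from the Awal project.
baseline_score = 20.0
fine_tuned_score = 26.6

gain = relative_improvement(baseline_score, fine_tuned_score)
print(f"Relative improvement: {gain:.0f}%")
```

A relative figure depends on the baseline: the same absolute gain looks larger when the starting score is low, which is worth keeping in mind when comparing the two translation directions.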
Where to find everything
All project outputs remain openly available:
- Awal Platform: awaldigital.org
- Datasets: Available through Hugging Face
- Models: Published and maintained by Tamazight NLP on Hugging Face
- Research papers: TICAM 2025 paper, Catalan translation, WMT-OLDI paper
For a broader overview of the project’s goals and community approach, the piece I wrote for Nationalia provides more context and details on lessons learned.
Where to from here
I’m closing this chapter of technical consulting on Awal as Col·lectivaT has concluded operations. The infrastructure we built continues to function, and the datasets remain available for researchers and language technology developers working with Tamazight. The learnings from this work have already informed my ongoing work with other minoritized languages.
The real question isn’t whether community-powered approaches can work for Tamazight - we’ve proven they can. The question is whether these initiatives can secure the sustained institutional support and long-term commitment they need to move beyond prototype stages. Looking at successful models like NaijaVoices for Nigerian languages or comparable initiatives for Catalan, the pattern is clear: concentrated, sustained effort with adequate resources makes the difference. The data and models we created through Awal provide a foundation, but realizing the full potential of these tools requires coordination between language communities, academic institutions, and technology organizations - work that extends well beyond what any single project can accomplish.
Tanmmirt / ⵜⴰⵏⵎⵎⵉⵔⵜ / Thanks!
This work was made possible through CIEMEN’s vision in creating and financing the project over four years as part of their SomPart initiative. My partners at Col·lectivaT were essential - Özgür Güneş Öztürk for kicking off the community engagement work, Clara Basiana for coordination and communication design during the project, and Pelin Doğan for translating the TICAM paper. Deep gratitude to the community champions: Farida Boudichat and Ghizlan Baryala for their community empowerment work; Yuxuan Peng for developing the Awal platform; Naceur Jabouja for sharing his insights and Tamazight translations; Brahim Essaidi and Yassine Aït-El-Mouden for their support from the very beginning and their contributions to localizing the Awal digital platform and Common Voice; and finally, Mohamed Aymane Farhi from Tamazight NLP for providing crucial technical support and companionship throughout this journey. Tanmmirt!