publications

Correcting the Tamazight Portions of FLORES+ and OLDI Seed Datasets

Tamazight

October 24, 2025
Awal – Community-Powered Language Technology for Tamazight

Tamazight

October 23, 2025
Nós-TTS: a Web User Interface for Galician Text-to-Speech

Galician

March 15, 2024
BibleTTS - a large, high-fidelity, multilingual, and uniquely African speech corpus

Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, Yoruba

September 15, 2022
Preparing an endangered language for the digital age - The Case of Judeo-Spanish

Ladino

June 21, 2022
Corpora compilation for prosody-informed speech processing

English, Spanish

September 4, 2021
Congolese Swahili Machine Translation for Humanitarian Response

Swahili Congo, Coastal Swahili, French

April 19, 2021
TICO-19 – The Translation Initiative for COvid-19

Amharic, Arabic (Modern Standard), Bengali, Chinese (Simplified), Dari, Dinka, Farsi, French (European), Hausa, Hindi, Indonesian, Kanuri, Khmer (Central), Kinyarwanda, Kurdish Kurmanji, Kurdish Sorani, Lingala, Luganda, Malay, Marathi, Myanmar, Nepali, Nigerian Fulfulde, Nuer, Oromo, Pashto, Portuguese (Brazilian), Russian, Somali, Spanish (Latin American), Swahili, Congolese Swahili, Tagalog, Tamil, Tigrinya, Urdu, Zulu

November 19, 2020
Participatory Research for Low-resourced Machine Translation - A Case Study in African Languages

African Languages

November 16, 2020
Gamayun – Language Technology for Humanitarian Response

Kanuri, Hausa, Swahili, Lingala, Nande, Rohingya, Tigrinya, English, French

November 1, 2020
CATOTRON – A Neural Text-to-Speech System in Catalan

Catalan

October 16, 2020
Masakhane – Machine Translation For Africa

African Languages

April 26, 2020
Tigrinya Neural Machine Translation with Transfer Learning for Humanitarian Response

Tigrinya

April 26, 2020
Prosodic phrase alignment for machine dubbing

Spanish, English

September 24, 2019
Building an open source automatic speech recognition system for Catalan

Catalan

November 22, 2018
Bilingual prosodic dataset compilation for spoken language translation

English, Spanish

November 21, 2018
Visualizing punctuation restoration in speech transcripts with Prosograph

September 2, 2018
Attentional parallel RNNs for generating punctuation in transcribed speech

English

October 23, 2017
Revising the METU-Sabancı Turkish treebank: An exercise in surface-syntactic annotation of agglutinative languages

Turkish

September 18, 2017
Prosograph: A tool for prosody visualisation of large speech corpora

August 20, 2017
Automatic extraction of parallel speech corpora from dubbed movies

English, Spanish

July 30, 2017
From raw data to semantically enriched hyperlinking: Recent advances in the LinkedTV analysis workflow

German

October 28, 2013
Processing the manuscripts of Atatürk

Turkish

May 22, 2010