Supported Datasets
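Several entries below include a "Keyword arguments" line such as `(subsets = Any["train", "test"], formantsdir = nothing, audio_fmt = "SPHERE")`. These read as Julia named tuples of default values. The sketch below shows one way to override a single default with `merge` from Base Julia and splat the result into a function call. The function `describe_request` is a hypothetical stand-in written only for this illustration; it is not part of the package's API, and the real loading function is not shown on this page.

```julia
# Defaults copied from the TIMIT entry on this page.
defaults = (subsets = Any["train", "test"], formantsdir = nothing, audio_fmt = "SPHERE")

# `merge` returns a new named tuple; fields from the right-hand tuple take precedence.
custom = merge(defaults, (subsets = ["test"],))

# Hypothetical stand-in for a loader (an assumption, not the package's API).
# It only demonstrates how the named tuple maps onto keyword arguments.
function describe_request(; subsets, formantsdir, audio_fmt)
    formants = formantsdir === nothing ? "none" : formantsdir
    return "subsets=" * join(subsets, ",") * " audio_fmt=" * audio_fmt * " formants=" * formants
end

# Splat the named tuple as keyword arguments.
println(describe_request(; custom...))
```

Because `merge` builds a new tuple, `defaults` is left unchanged and can be reused for other calls.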
Radio Haïti collection (hat subset)
Audio recordings from Radio Haiti-Inter, documenting Haitian politics and culture from 1957 to 2003 (bulk 1972-2003). Under the leadership of station directors Jean Dominique and Michèle Montas, Radio Haiti was a voice of social change and democracy, speaking out against oppression and impunity while advocating for human rights and celebrating Haitian culture and heritage. This subset contains only the Haitian Creole recordings, which were annotated automatically.
Languages
Subsets
train, val, test
License
Authors
Duke University, various contributors, William N. Havard, Renauld Govain, Benjamin Lecouteux, Emmanuel Schang
Keyword arguments
(subsets = Any["train", "val", "test"],)
Pangloss Narua
The Na language (also called Narua and Mosuo) is spoken on the border between the Chinese provinces of Yunnan and Sichuan, around Lugu Lake. It belongs to the Naish group of the Sino-Tibetan family, which also includes Naxi and Laze. Contains EGG recordings.
Languages
License
Authors
Alexis Michaud, Pascale-Marie Milan, Maxime Fily
Multilingual LibriSpeech
The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.
Languages
Subsets
train, dev, test
License
Authors
Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert
Keyword arguments
(lang = "eng", subsets = Any["train", "dev", "test"])
Mauritian 2024 Multilingual
About six hours of speech recordings in Mauritian Creole, French, and English (about 5 minutes per language for each recording, on three microphones), collected in December 2024 in Mauritius. Manually annotated and anonymized.
Languages
License
Authors
William N. Havard, Shrita Hassamal, Muhsina Alleesaib, Guilhem Florigny, Guillaume Fon Sing, Anne Abeillé, Benjamin Lecouteux, Emmanuel Schang
Pangloss
The Pangloss Collection provides free access to recordings of "rare" or under-resourced languages. Its goal is to contribute to the documentation and study of a precious human heritage: the world's languages. The documents mostly consist of narratives ("spontaneous speech"), recorded in their cultural context and transcribed in consultation with native speakers. The collection also contains elicitation sessions and word lists. These documents were recorded and annotated by a number of researchers, including members of the LACITO-CNRS research centre, whose team also manages the collection.
Languages
License
Authors
Laboratoire De Langues Et Civilisations à Tradition Orale (LACITO)
Faetar ASR challenge 2025
Data for the 2025 Faetar Low-Resource ASR Challenge.
Languages
Subsets
train, test, dev, unlab
License
Authors
Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar
Keyword arguments
(subsets = Any["train", "test", "dev", "unlab"],)
TIMIT
The TIMIT corpus of read speech was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.
Languages
Subsets
train, test
License
Authors
John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, Victor Zue
Keyword arguments
(subsets = Any["train", "test"], formantsdir = nothing, audio_fmt = "SPHERE")
INA diachrony
Voice recordings and transcriptions sorted by time period, sex and speaker.
Languages
License
Keyword arguments
(ina_csv_dir = nothing,)
Speech2Tex
Recordings of equations read aloud, with literal transcriptions and LaTeX transcriptions.
Languages
License
Authors
Lorenzo Brucato
TATIANA
EGG and audio recordings of texts read with various emotions.
Languages
Subsets
colere, joie, neutre, peur, sensualite, surprise, tristesse
License
Authors
Albert Rilliard, Marc Evrard
Keyword arguments
(subsets = Any["colere", "joie", "neutre", "peur", "sensualite", "surprise", "tristesse"],)
Pangloss Mường
Mường is currently the fourth-largest language of Vietnam. Contains EGG recordings.
Languages
License
Authors
Minh-Châu Nguyễn, Michel Ferlus, Trần Trí Dõi
Synthetic Vowel Dataset
Synthetic vowel dataset generated from formant tables.
Languages
License
Authors
Simon Devauchelle, Lucas Ondel Yang, Albert Rilliard, David Doukhan
Mini LibriSpeech
Subset of the LibriSpeech corpus intended for regression testing.
Languages
Subsets
train, dev
License
Authors
Vassil Panayotov, Daniel Povey
Keyword arguments
(subsets = Any["train", "dev"],)
Expression et Perception des Identités dans la Voix
Languages
License
Authors
Rémi Uro, Lucas Ondel Yang, Albert Rilliard
PFC LISN
Phonologie du Français Contemporain (Phonology of Contemporary French), LISN version.
Languages
Subsets
g, l, m, t
License
Authors
Jacques Durand, Bernard Laks, Chantal Lyche
AVID
The Aalto Vocal Intensity Database (AVID) includes speech and EGG recordings from 50 speakers (25 male, 25 female) who varied their vocal intensity across four categories (soft, normal, loud, and very loud).
Languages
License
Authors
Manila Kodali, Paavo Alku, Sudarsana Reddy Kadiri