Supported Datasets

Radio Haïti collection (hat subset)

Audio recordings from Radio Haiti-Inter, documenting Haitian politics and culture from 1957 to 2003 (bulk 1972-2003). Under the leadership of station directors Jean Dominique and Michèle Montas, Radio Haiti was a voice of social change and democracy, speaking out against oppression and impunity while advocating for human rights and celebrating Haitian culture and heritage. This subset contains only the Haitian Creole recordings, which were annotated automatically.

Languages

Haitian Creole

Subsets

train, val, test

License

License

Authors

Duke University, various contributors, William N. Havard, Renauld Govain, Benjamin Lecouteux, Emmanuel Schang

Source

Keyword arguments

(subsets = Any["train", "val", "test"],)
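The tuple above reads as a Julia NamedTuple of default keyword arguments. The following is a minimal sketch of how such defaults can be combined with user overrides. It assumes nothing about the package's actual loading function: `load_stub` and `overrides` are hypothetical names introduced only for illustration.

```julia
# Default keyword arguments, copied from the entry above (a Julia NamedTuple).
defaults = (subsets = Any["train", "val", "test"],)

# Hypothetical user override: keep only the training and test splits.
overrides = (subsets = ["train", "test"],)

# `merge` keeps every key of `defaults` and replaces those also present in `overrides`.
kwargs = merge(defaults, overrides)

# Hypothetical stand-in for a dataset loader; NOT part of any real package.
# It simply returns the subset names it was asked for.
load_stub(; subsets) = collect(String, subsets)

# Splat the merged NamedTuple as keyword arguments.
selected = load_stub(; kwargs...)
```

The same pattern applies to the other entries that list keyword arguments, since each one is shown in the same NamedTuple form.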

Pangloss Narua

The Na language (also called Narua and Mosuo) is spoken on the border between the Chinese provinces of Yunnan and Sichuan, around Lugu Lake. It belongs to the Naish group of the Sino-Tibetan family, which also includes Naxi and Laze. Contains EGG recordings.

Languages

Language

License

License

Authors

Alexis Michaud, Pascale-Marie Milan, Maxime Fily

Source


Multilingual LibriSpeech

The Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. It is derived from read audiobooks from LibriVox and covers 8 languages: English, German, Dutch, Spanish, French, Italian, Portuguese, and Polish.

Languages

English, German, Dutch, Spanish, French, Italian, Portuguese, Polish

Subsets

train, dev, test

License

License

Authors

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert

Source

Keyword arguments

(lang = "eng", subsets = Any["train", "dev", "test"])

Mauritian 2024 Multilingual

About six hours of speech recordings in Mauritian Creole, French, and English (about 5 minutes per language for each recording, on three microphones), collected in December 2024 in Mauritius. Manually annotated and anonymized.

Languages

Mauritian Creole, French, English

License

License

Authors

William N. Havard, Shrita Hassamal, Muhsina Alleesaib, Guilhem Florigny, Guillaume Fon Sing, Anne Abeillé, Benjamin Lecouteux, Emmanuel Schang


Pangloss

The Pangloss Collection provides free access to recordings of "rare" or under-resourced languages. Its goal is to contribute to the documentation and study of a precious human heritage: the world's languages. The documents mostly consist of narratives ("spontaneous speech"), recorded in their cultural context and transcribed in consultation with native speakers. The Pangloss Collection also contains elicitation sessions and word lists. These documents were recorded and annotated by a number of researchers, including members of the LACITO-CNRS research centre, whose team also manages the collection.

Languages

Language Language

License

License

Authors

Laboratoire De Langues Et Civilisations à Tradition Orale (LACITO)

Source


Faetar ASR challenge 2025

Data for the 2025 Faetar Low-Resource ASR Challenge.

Languages

Language

Subsets

train, test, dev, unlab

License

License

Authors

Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar

Source

Keyword arguments

(subsets = Any["train", "test", "dev", "unlab"],)
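Because the `subsets` keyword takes a list of names, a caller may want to check a selection against the available subsets before loading anything. The sketch below shows one way to do that. `check_subsets` is a hypothetical helper written for this illustration and is not claimed to exist in the package.

```julia
# Subset names available for this dataset, copied from the entry above.
available = ["train", "test", "dev", "unlab"]

# Hypothetical helper (NOT part of any package): returns the requested
# subsets in their original order, or throws if any name is unavailable.
function check_subsets(requested, available)
    unknown = setdiff(requested, available)
    isempty(unknown) || throw(ArgumentError("unknown subsets: " * join(unknown, ", ")))
    return collect(String, requested)
end

# Select only the labeled training and development data.
labeled = check_subsets(["train", "dev"], available)
```

Failing early with an `ArgumentError` keeps a misspelled subset name from surfacing later as a missing-file error.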

TIMIT

The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.

Languages

Language

Subsets

train, test

License

License

Authors

John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, Victor Zue

Source

Keyword arguments

(subsets = Any["train", "test"], formantsdir = nothing, audio_fmt = "SPHERE")

INA diachrony

Voice recordings and transcriptions sorted by time period, sex and speaker.

Languages

Language

License

License

Keyword arguments

(ina_csv_dir = nothing,)

Speech2Tex

Recordings of read equations, with literal transcriptions and LaTeX transcriptions.

Languages

Language

License

License

Authors

Lorenzo Brucato


TATIANA

EGG and audio recordings of texts read with various emotions.

Languages

Language

Subsets

colere, joie, neutre, peur, sensualite, surprise, tristesse

License

License

Authors

Albert Rilliard, Marc Evrard

Keyword arguments

(subsets = Any["colere", "joie", "neutre", "peur", "sensualite", "surprise", "tristesse"],)

Pangloss Mường

The Mường language is currently the fourth largest language of Vietnam. Contains EGG recordings.

Languages

Language

License

License

Authors

Minh-Châu Nguyễn, Michel Ferlus, Trần Trí Dõi

Source


Synthetic Vowel Dataset

Synthetic vowel dataset generated from formant tables.

Languages

Language Language

License

License

Authors

Simon Devauchelle, Lucas Ondel Yang, Albert Rilliard, David Doukhan

Source


Mini LibriSpeech

Subset of the LibriSpeech corpus, intended for regression testing.

Languages

Language

Subsets

train, dev

License

License

Authors

Vassil Panayotov, Daniel Povey

Source

Keyword arguments

(subsets = Any["train", "dev"],)

Expression et Perception des Identités dans la Voix

Languages

Language

License

License

Authors

Rémi Uro, Lucas Ondel Yang, Albert Rilliard


PFC LISN

Phonologie du Français Contemporain (PFC), LISN version.

Languages

Language

Subsets

g, l, m, t

License

License

Authors

Jacques Durand, Bernard Laks, Chantal Lyche

Source


AVID

The Aalto Vocal Intensity Database (AVID) includes speech and EGG signals produced by 50 speakers (25 male, 25 female) who varied their vocal intensity across four categories (soft, normal, loud, and very loud).

Languages

Language

License

License

Authors

Manila Kodali, Paavo Alku, Sudarsana Reddy Kadiri

Source