Add a new dataset

For developers

  1. Add the dataset's metadata to src/corpora/corpora.json

    Example:

     {
         "name": "TIMIT",
         "lang": "eng",
         "license": "LDC User Agreement for Non-Members",
         "source": "https://catalog.ldc.upenn.edu/LDC93S1",
         "authors": ["John S. Garofolo", "Lori F. Lamel", "William M. Fisher", "Jonathan G. Fiscus", "David S. Pallett", "Nancy L. Dahlgren", "Victor Zue"],
         "description": "The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.",
         "subsets": ["train", "dev", "test"]
     },

  2. Create a new directory in src/corpora with the same name as defined in the metadata (in capital letters, with words separated by underscores)

  3. Create a prepare.jl file in this directory (and/or a lexicon.jl, or other)

  4. If your dataset is downloadable, you can implement the following method (don't forget to replace :NAME):

     Base.download(::Val{:NAME}, dir::AbstractString)

  5. You must implement the prepare() function as follows (don't forget to replace :NAME):

     prepare(::Val{:NAME}, inputdir, outputdir; <keyword arguments>)

    You can add any keyword argument. This function must create the following files in outputdir:

    • recordings.jsonl
    • annotations.jsonl or annotations-<subset>.jsonl for each subset
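
Steps 4 and 5 can be sketched as follows, assuming a hypothetical dataset registered as MY_CORPUS whose audio files end in .wav. So that the snippet runs without the package, prepare is defined locally here; in the package you would add the method to SpeechDatasets.prepare instead. The JSON field names (id, path) and the subsets keyword are placeholders, since the manifest schema is not described on this page.

```julia
# Hypothetical downloader: a real one would fetch and unpack the corpus into `dir`.
function Base.download(::Val{:MY_CORPUS}, dir::AbstractString)
    mkpath(dir)
    return dir
end

# Minimal escaping for the two characters that would break a JSON string here.
json_escape(s) = replace(replace(s, "\\" => "\\\\"), "\"" => "\\\"")

# Local stand-in for SpeechDatasets.prepare (assumption: see the text above).
function prepare(::Val{:MY_CORPUS}, inputdir, outputdir; subsets=["train", "test"])
    mkpath(outputdir)

    # Collect every .wav file found under inputdir.
    wavs = String[]
    for (root, _, files) in walkdir(inputdir), f in files
        endswith(f, ".wav") && push!(wavs, joinpath(root, f))
    end
    sort!(wavs)

    # One JSON object per line, one line per recording (placeholder fields).
    open(joinpath(outputdir, "recordings.jsonl"), "w") do io
        for path in wavs
            id = first(splitext(basename(path)))
            println(io, "{\"id\": \"", json_escape(id), "\", \"path\": \"", json_escape(path), "\"}")
        end
    end

    # One annotation manifest per subset; left empty in this sketch.
    for subset in subsets
        open(joinpath(outputdir, "annotations-$subset.jsonl"), "w") do io
            # a real implementation would write one JSON object per annotation
        end
    end
    return outputdir
end

# Usage on a throwaway directory containing a single empty file.
inputdir = download(Val(:MY_CORPUS), mktempdir())
touch(joinpath(inputdir, "utt1.wav"))
outputdir = prepare(Val(:MY_CORPUS), inputdir, mktempdir(); subsets=["train"])
```

Dispatching on Val{:MY_CORPUS} lets each dataset supply its own method and its own keyword arguments under the same function name.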

That's it. You can now use:

dataset(:NAME, inputdir, outputdir; <keyword arguments>)

Functions details

SpeechDatasets.prepare (Function)
prepare(::Val{key}, inputdir, outputdir; <keyword arguments>)

Create manifest files in outputdir from the dataset in inputdir.

Each dataset has its own implementation and can accept optional keyword arguments. These can be listed with get_dataset_kwargs.

Implementing this function is mandatory for a dataset to be compatible with dataset().

SpeechDatasets.get_kwargs (Function)
get_kwargs(func_name::Function, args_types::Tuple)

Return a NamedTuple containing each supported kwarg and its default value for a given method.

Arguments

  • func_name is the function itself (a Function object, for example prepare)
  • args_types is a tuple of argument types for the desired method (no kwargs)
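
How get_kwargs retrieves default values is not described here. As a rough illustration of what method-level keyword introspection looks like in plain Julia, the sketch below selects a method from its positional argument types and lists its keyword names. It relies on Base.kwarg_decl, which is internal to Julia and may change between versions, and it recovers names only, not defaults. The function prepare_demo and its keywords are hypothetical.

```julia
# Toy stand-in for a dataset's prepare method (hypothetical keywords).
prepare_demo(::Val{:DEMO}, inputdir, outputdir; sample_rate=16000, subsets=["train"]) = outputdir

# Select the method that matches the positional argument types (no kwargs).
m = which(prepare_demo, Tuple{Val{:DEMO}, String, String})

# Internal API: returns the keyword argument names declared by the method.
kwnames = Base.kwarg_decl(m)
println(kwnames)
```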

SpeechDatasets.get_dataset_kwargs (Function)
get_dataset_kwargs(name::Symbol)
get_dataset_kwargs(ds::SpeechDataset)

Return a NamedTuple containing each keyword argument supported by a dataset's prepare method, along with its default value.
