Add a new dataset

  1. Add the dataset's metadata to src/corpora/corpora.json

    Example:

     {
         "name": "TIMIT",
         "lang": "eng",
         "license": "LDC User Agreement for Non-Members",
         "source": "https://catalog.ldc.upenn.edu/LDC93S1",
         "authors": ["John S. Garofolo", "Lori F. Lamel", "William M. Fisher", "Jonathan G. Fiscus", "David S. Pallett", "Nancy L. Dahlgren", "Victor Zue"],
         "description": "The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.",
         "subsets": ["train", "dev", "test"]
     },
  2. Create a new .jl file in src/corpora

  3. Add the following line at the beginning of the file:

     const <idname> = get_nametype(<dataset name>)
    • Replace <idname> with an identifier of your dataset (for example, timit_id).
    • Replace <dataset name> with a string containing the name of the dataset (the same name used in corpora.json).
  4. If your dataset is downloadable, you can implement

     Base.download(::DatasetBuilder{<idname>}, dir::AbstractString)
  5. You must implement the prepare() function with the following signature:

     prepare(::DatasetBuilder{<idname>}, inputdir, outputdir; <keyword arguments>)

    You can add any keyword arguments you need. This function must create the following files in outputdir:

    • recordings.jsonl
    • annotations.jsonl or annotations-<subset>.jsonl for each subset
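
As a rough illustration of steps 3 to 5, the following self-contained sketch shows a prepare() method that writes the two required files. Everything in it is an assumption made for the example: DatasetBuilder is redefined locally as a stand-in for the package type, mycorpus_id is a plain Symbol standing in for whatever get_nametype returns, the audio_ext keyword and the jsonstr helper are hypothetical, and the JSON field names are placeholders rather than the manifest schema the package expects.

```julia
# Stand-in for SpeechDatasets.DatasetBuilder so that this sketch runs on its own.
struct DatasetBuilder{name}
    kwargs::NamedTuple
end

# Hypothetical identifier; in the package it would come from get_nametype("MYCORPUS").
const mycorpus_id = :MYCORPUS

# Quote a string as a JSON string literal. Only backslashes and double quotes are
# escaped here; a JSON package is the safer choice in real code.
jsonstr(s) = "\"" * replace(s, "\\" => "\\\\", "\"" => "\\\"") * "\""

function prepare(::DatasetBuilder{mycorpus_id}, inputdir, outputdir; audio_ext = ".wav")
    mkpath(outputdir)

    # Collect every audio file found under inputdir, in a stable order.
    paths = String[]
    for (root, _, files) in walkdir(inputdir)
        for file in files
            endswith(file, audio_ext) && push!(paths, joinpath(root, file))
        end
    end
    sort!(paths)

    # One JSON object per line (placeholder fields).
    open(joinpath(outputdir, "recordings.jsonl"), "w") do io
        for path in paths
            id = first(splitext(basename(path)))
            println(io, "{\"id\": ", jsonstr(id), ", \"path\": ", jsonstr(path), "}")
        end
    end

    # A single annotations.jsonl is written here; a dataset with subsets would
    # write one annotations-<subset>.jsonl file per subset instead.
    open(joinpath(outputdir, "annotations.jsonl"), "w") do io
        for path in paths
            id = first(splitext(basename(path)))
            println(io, "{\"recording_id\": ", jsonstr(id), "}")
        end
    end

    return outputdir
end
```

With these stand-ins, the method could be called as prepare(DatasetBuilder{mycorpus_id}(NamedTuple()), inputdir, outputdir; audio_ext = ".flac").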

That's it. You can now use

dataset("name", inputdir, outputdir; <keyword arguments>)

DatasetBuilder and utilities

SpeechDatasets.DatasetBuilder - Type
struct DatasetBuilder{name}

Allows the main dataset functions (download(), prepare()) to be dispatched on the dataset.

Parameter

  • name Dataset identifier

Fields

  • kwargs::NamedTuple Keyword arguments supported by the dataset associated with name
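
The dispatch mechanism can be sketched in isolation as follows. This is a minimal example under stated assumptions, not the package's code: DatasetBuilder is redefined locally so the block runs on its own, Symbols are used as dataset identifiers, and the describe function and the sample_rate keyword are hypothetical.

```julia
# Stand-in for the package type, redefined so the sketch is self-contained.
struct DatasetBuilder{name}
    kwargs::NamedTuple
end

# Method selected only for builders whose type parameter is :TIMIT.
describe(::DatasetBuilder{:TIMIT}) = "TIMIT-specific method"

# Fallback for every other dataset identifier.
describe(::DatasetBuilder{name}) where {name} = "no specific method for $(name)"

timit = DatasetBuilder{:TIMIT}((sample_rate = 16000,))
other = DatasetBuilder{:OTHER}(NamedTuple())

describe(timit)  # "TIMIT-specific method"
describe(other)  # "no specific method for OTHER"
```

Because the identifier lives in the type parameter, Julia picks the right method from the argument's type alone, without any lookup table.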
SpeechDatasets.declareBuilder - Method
declareBuilder(name::Symbol)

Declare a functor for a DatasetBuilder of type name.

A DatasetBuilder{name} object can then be created; it holds the supported kwargs for the corresponding dataset.

SpeechDatasets.get_kwargs - Method
get_kwargs(func_name::Function, args_types::Tuple)

Return a NamedTuple containing each supported kwarg and its default value for a given method.

Arguments

  • func_name is the name of the function
  • args_types is a tuple of argument types for the desired method
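
To give an idea of what selecting a method by argument types and inspecting its keywords involves, here is a small sketch in plain Julia. It is not how get_kwargs is implemented: it relies on Base.kwarg_decl, an internal, undocumented Julia helper that returns only the keyword names (not their default values), and the resample function is hypothetical.

```julia
# Hypothetical function with two keyword arguments.
resample(path::AbstractString; rate = 16000, mono = true) = (path, rate, mono)

# Select the method that applies to the given tuple of argument types,
# mirroring the (func_name, args_types) pair taken by get_kwargs.
method = which(resample, (String,))

# Base.kwarg_decl is internal to Julia and may change between versions;
# it returns the keyword argument names of the method as Symbols.
kwarg_names = Base.kwarg_decl(method)
```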
Base.download - Function
Base.download(builder::DatasetBuilder{name}, dir::AbstractString)

Download the dataset identified by name into dir.

Each dataset has its own implementation if download is supported (for example, a proprietary dataset might not implement download).

SpeechDatasets.prepare - Function
prepare(::DatasetBuilder{name}, inputdir, outputdir; <keyword arguments>)

Create manifest files in outputdir from the dataset in inputdir.

Each dataset has its own implementation and may accept optional keyword arguments, which can be listed with get_dataset_kwargs(name::String).

Implementing this function is mandatory for a dataset to be compatible with dataset().
