Add a new dataset

1. Add the dataset's metadata to src/corpora/corpora.json. Example:

    {
        "name": "TIMIT",
        "lang": "eng",
        "license": "LDC User Agreement for Non-Members",
        "source": "https://catalog.ldc.upenn.edu/LDC93S1",
        "authors": ["John S. Garofolo", "Lori F. Lamel", "William M. Fisher", "Jonathan G. Fiscus", "David S. Pallett", "Nancy L. Dahlgren", "Victor Zue"],
        "description": "The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.",
        "subsets": ["train", "dev", "test"]
    },

2. Create a new directory in src/corpora with the same name as defined in the metadata (all capital letters, words separated by underscores).

3. Create a prepare.jl file in this directory (and/or a lexicon.jl, or other files).

4. If your dataset is downloadable, you can implement the following method (don't forget to replace :NAME):

    Base.download(::Val{:NAME}, dir::AbstractString)

5. It is mandatory to implement the prepare() function as follows (don't forget to replace :NAME):

    prepare(::Val{:NAME}, inputdir, outputdir; <keyword arguments>)

   You can add any keyword arguments. This function must create the following files in outputdir:

   - recordings.jsonl
   - annotations.jsonl, or annotations-<subset>.jsonl for each subset
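As a rough illustration of the prepare() requirement above, here is a minimal sketch for a hypothetical dataset named :MY_CORPUS. The dataset name, the ext keyword, the jsonstr helper, and the JSON field names (id, path, recording_id) are all assumptions made for this example; the manifest schema the package actually expects is not specified here. The sketch defines a local prepare function so it runs on its own; inside the package you would add a method to SpeechDatasets.prepare instead.

```julia
# Local stand-in for the package's generic function (assumption for this sketch).
function prepare end

# Minimal JSON string escaping, enough for simple ids and file paths.
jsonstr(s) = "\"" * replace(replace(String(s), "\\" => "\\\\"), "\"" => "\\\"") * "\""

# Hypothetical dataset :MY_CORPUS with one optional keyword argument.
function prepare(::Val{:MY_CORPUS}, inputdir, outputdir; ext=".wav")
    mkpath(outputdir)
    # Keep only files with the requested extension, in a stable order.
    files = sort(filter(f -> endswith(f, ext), readdir(inputdir)))
    open(joinpath(outputdir, "recordings.jsonl"), "w") do rec
        open(joinpath(outputdir, "annotations.jsonl"), "w") do ann
            for f in files
                id = first(splitext(f))
                # Field names below are illustrative, not the package's schema.
                println(rec, "{\"id\": ", jsonstr(id), ", \"path\": ",
                        jsonstr(joinpath(inputdir, f)), "}")
                println(ann, "{\"recording_id\": ", jsonstr(id), "}")
            end
        end
    end
    return outputdir
end
```

Calling prepare(Val(:MY_CORPUS), "raw", "manifests") on a directory of .wav files would then write one line per recording to each of the two manifest files.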
That's it. You can now use:

    dataset(:NAME, inputdir, outputdir; <keyword arguments>)

Function details
SpeechDatasets.prepare — Function
    prepare(::Val{key}, inputdir, outputdir; <keyword arguments>)

Create manifest files in outputdir from the dataset in inputdir.

Each dataset has its own implementation and can accept optional keyword arguments, which can be listed with get_dataset_kwargs.

Implementing this function is mandatory for a dataset to be compatible with dataset().
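The Val{key} argument is what lets each dataset have its own method. The sketch below shows the general dispatch pattern with two hypothetical datasets (:TOY_A and :TOY_B) and a hypothetical front end called run_prepare. It is not the package's actual dataset() implementation, only an illustration of how a Symbol can select a method.

```julia
# Local stand-in for the package's generic function (assumption for this sketch).
function prepare end

# Two hypothetical datasets, each with its own method and keyword arguments.
prepare(::Val{:TOY_A}, inputdir, outputdir; lowercase=true) =
    "TOY_A: $inputdir -> $outputdir (lowercase=$lowercase)"
prepare(::Val{:TOY_B}, inputdir, outputdir) =
    "TOY_B: $inputdir -> $outputdir"

# Hypothetical front end: wrapping the symbol in Val lets dispatch pick the method.
run_prepare(name::Symbol, inputdir, outputdir; kwargs...) =
    prepare(Val(name), inputdir, outputdir; kwargs...)

println(run_prepare(:TOY_A, "raw", "manifests"; lowercase=false))
println(run_prepare(:TOY_B, "raw", "manifests"))
```

With this pattern, calling the front end with a symbol that has no matching method raises a MethodError, which is why a dataset without a prepare method cannot be used.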
SpeechDatasets.get_kwargs — Function
    get_kwargs(func_name::Function, args_types::Tuple)

Return a NamedTuple containing each supported kwarg and its default value for a given method.

Arguments

- func_name is the name of the function
- args_types is a tuple of argument types for the desired method (no kwargs)
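To make the role of args_types concrete, the sketch below looks up a method by its positional argument types and lists the names of its keyword arguments. It relies on Base.kwarg_decl, an internal and unexported Julia function that may change between versions, and it returns only the names, not the default values. The toy_prepare function and the kwarg_names helper are hypothetical, and this is not how get_kwargs itself is implemented.

```julia
# Hypothetical method with two keyword arguments.
toy_prepare(::Val{:TOY}, inputdir, outputdir; ext=".wav", verbose=false) = nothing

# Find the method matching the positional argument types, then list its
# keyword names. Base.kwarg_decl is internal to Julia and may change.
function kwarg_names(f::Function, args_types::Tuple)
    m = first(methods(f, args_types))
    return Symbol.(Base.kwarg_decl(m))
end

kws = kwarg_names(toy_prepare, (Val{:TOY}, String, String))
println(kws)
```

Note that the tuple contains only positional argument types, matching the "no kwargs" remark in the argument description.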
SpeechDatasets.get_dataset_kwargs — Function
    get_dataset_kwargs(name::Symbol)
    get_dataset_kwargs(ds::SpeechDataset)

Return a NamedTuple containing each supported kwarg and its default value for a dataset (prepare method).
SpeechDatasets.get_download_kwargs — Function
    get_download_kwargs(name::Symbol)

Return a NamedTuple containing each supported kwarg and its default value for the download method of a dataset.