Add a new dataset
Add metadata in src/corpora/corpora.json. Example:
    {
      "name": "TIMIT",
      "lang": "eng",
      "license": "LDC User Agreement for Non-Members",
      "source": "https://catalog.ldc.upenn.edu/LDC93S1",
      "authors": ["John S. Garofolo", "Lori F. Lamel", "William M. Fisher", "Jonathan G. Fiscus", "David S. Pallett", "Nancy L. Dahlgren", "Victor Zue"],
      "description": "The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.",
      "subsets": ["train", "dev", "test"]
    },

Create a new .jl file in src/corpora. Add the following line at the beginning of the file:

    const <idname> = get_nametype(<dataset name>)

- Replace <idname> with an identifier for your dataset (for example, timit_id).
- Replace <dataset name> with a string containing the name of the dataset (the same name as in corpora.json).
If your dataset is downloadable, you can implement

    Base.download(::DatasetBuilder{<idname>}, dir::AbstractString)

Implementing the prepare() function is mandatory. It must have the following signature:

    prepare(::DatasetBuilder{<idname>}, inputdir, outputdir; <keyword arguments>)

You can add any keyword arguments. This function must create the following files in outputdir:
- recordings.jsonl
- annotations.jsonl, or annotations-<subset>.jsonl for each subset
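The steps above can be sketched as a minimal corpus file. This is only an illustration under stated assumptions: the dataset name MYCORPUS, the helper write_jsonl, the subsets keyword and the manifest fields (id, path, recording_id, text) are all hypothetical, and DatasetBuilder and get_nametype are replaced by local stand-ins so that the snippet runs without the package.

```julia
# Stand-ins so the sketch runs on its own; inside the package these come from SpeechDatasets.
struct DatasetBuilder{name} end
get_nametype(name::String) = Symbol(name)  # assumption: the real function may map names differently

# First line of the hypothetical file src/corpora/mycorpus.jl
const mycorpus_id = get_nametype("MYCORPUS")

# Hypothetical helper: write one JSON object per line (hand-rolled to avoid a JSON dependency).
function write_jsonl(path, lines)
    open(path, "w") do io
        foreach(line -> println(io, line), lines)
    end
end

# A real implementation would scan inputdir; this sketch writes one dummy entry per subset.
function prepare(::DatasetBuilder{mycorpus_id}, inputdir, outputdir; subsets=["train", "test"])
    mkpath(outputdir)
    # Hypothetical fields: the actual manifest schema is defined by SpeechDatasets.
    recordings = ["{\"id\": \"$(s)-utt1\", \"path\": \"$(s)/utt1.wav\"}" for s in subsets]
    write_jsonl(joinpath(outputdir, "recordings.jsonl"), recordings)
    for s in subsets
        annotations = ["{\"recording_id\": \"$(s)-utt1\", \"text\": \"hello\"}"]
        write_jsonl(joinpath(outputdir, "annotations-$(s).jsonl"), annotations)
    end
    return outputdir
end
```

Because the method is attached to DatasetBuilder{mycorpus_id}, it is selected only for this dataset, and the subsets keyword is an example of a dataset-specific keyword argument.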
That's it, you can now use

    dataset("name", inputdir, outputdir; <keyword arguments>)

DatasetBuilder and utilities
SpeechDatasets.DatasetBuilder — Type

    struct DatasetBuilder{name}

Allows the main dataset functions (download(), prepare()) to be dispatched on the dataset.

Parameter
- name: dataset identifier

Fields
- kwargs::NamedTuple: keyword arguments supported by the dataset associated with name

SpeechDatasets.DatasetBuilder — Method

    DatasetBuilder(name::Symbol)

Construct a DatasetBuilder for a given name. The implementation for each name is created by calling declareBuilder(name) (done automatically for each supported name).

SpeechDatasets.declareBuilder — Method

    declareBuilder(name::Symbol)

Declare a functor for a DatasetBuilder of type name.
A DatasetBuilder{name} object can now be created, and will hold the supported kwargs for the corresponding dataset.
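
The dispatch mechanism described here relies on a Symbol used as a type parameter. The sketch below illustrates the idea with a stand-in type; SketchBuilder and describe are hypothetical names chosen for this example and are not part of SpeechDatasets, whose actual implementation may differ.

```julia
# Stand-in type for illustration; not the package's actual definition.
struct SketchBuilder{name}
    kwargs::NamedTuple
end

# Turn a runtime Symbol into a type parameter so that methods can dispatch on it.
SketchBuilder(name::Symbol, kwargs::NamedTuple=NamedTuple()) = SketchBuilder{name}(kwargs)

# Fallback for names without a dedicated method.
describe(::SketchBuilder{name}) where {name} = "no dedicated method for $name"

# Dedicated method, selected only for the :TIMIT identifier.
describe(b::SketchBuilder{:TIMIT}) = "TIMIT builder, kwargs: $(keys(b.kwargs))"

describe(SketchBuilder(:TIMIT, (subset="train",)))  # uses the dedicated method
describe(SketchBuilder(:OTHER))                     # uses the fallback
```

Julia picks the most specific method, so adding a dataset amounts to adding methods for a new type parameter, without touching existing ones.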

SpeechDatasets.get_kwargs — Method

    get_kwargs(func_name::Function, args_types::Tuple)

Return a NamedTuple containing each supported kwarg and its default value for a given method.

Arguments
- func_name is the name of the function
- args_types is a tuple of argument types for the desired method

SpeechDatasets.get_nametype — Method

    get_nametype(name::String)

Return a symbol corresponding to the name. This symbol is used to identify the dataset.

Base.download — Function

    Base.download(builder::DatasetBuilder{name}, dir::AbstractString)

Download the dataset identified by name into dir.
Each dataset has its own implementation if download is supported (for example, a proprietary dataset might not implement download).
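
A dataset-specific download method could look like the following sketch. It assumes a freely downloadable archive at a placeholder URL; MYCORPUS_URL, mycorpus_id and the skip-if-present check are hypothetical, and DatasetBuilder is a local stand-in so that the snippet runs without the package.

```julia
using Downloads  # standard library, Julia 1.6 and later

# Stand-ins for illustration; inside the package these come from SpeechDatasets.
struct DatasetBuilder{name} end
const mycorpus_id = :MYCORPUS

# Placeholder URL, not a real dataset location.
const MYCORPUS_URL = "https://example.org/mycorpus.tar.gz"

function Base.download(::DatasetBuilder{mycorpus_id}, dir::AbstractString)
    mkpath(dir)
    archive = joinpath(dir, basename(MYCORPUS_URL))
    # Skip the network transfer when the archive is already present.
    isfile(archive) || Downloads.download(MYCORPUS_URL, archive)
    return archive
end
```

Returning the archive path is a design choice of this sketch; the documentation above does not specify a return value.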

SpeechDatasets.prepare — Function

    prepare(::DatasetBuilder{name}, inputdir, outputdir; <keyword arguments>)

Create manifest files in outputdir from the dataset in inputdir.
Each dataset has its own implementation and may accept optional keyword arguments, which can be accessed with get_dataset_kwargs(name::String).
Implementing this function is mandatory for a dataset to be compatible with dataset().