Add a new dataset
Add the dataset's metadata to
src/corpora/corpora.json
Example:
{
  "name": "TIMIT",
  "lang": "eng",
  "license": "LDC User Agreement for Non-Members",
  "source": "https://catalog.ldc.upenn.edu/LDC93S1",
  "authors": ["John S. Garofolo", "Lori F. Lamel", "William M. Fisher", "Jonathan G. Fiscus", "David S. Pallett", "Nancy L. Dahlgren", "Victor Zue"],
  "description": "The TIMIT corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems.",
  "subsets": ["train", "dev", "test"]
},
Create a new
.jl
file in src/corpora.
Add the following line at the beginning of the file:
const <idname> = get_nametype(<dataset name>)
- Replace <idname> with an identifier for your dataset (for example, timit_id).
- Replace <dataset name> with a string containing the name of the dataset (the same name as in corpora.json).
If your dataset is downloadable, you can implement
Base.download(::DatasetBuilder{<idname>}, dir::AbstractString)
You must implement the
prepare()
function with the following signature:
prepare(::DatasetBuilder{<idname>}, inputdir, outputdir; <keyword arguments>)
You can add any keyword argument. This function must create the following files in outputdir:
- recordings.jsonl
- annotations.jsonl, or annotations-<subset>.jsonl for each subset
That's it. You can now call
dataset("name", inputdir, outputdir; <keyword arguments>)
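The steps above can be sketched end to end as follows. This is a minimal, self-contained illustration rather than the package's actual code: DatasetBuilder is redefined as a stand-in so that the snippet runs without SpeechDatasets installed, MYCORPUS and mycorpus_id are hypothetical names, Symbol(...) stands in for get_nametype, and the JSON field names are placeholders, not the manifest schema that dataset() expects.

```julia
# Stand-in for SpeechDatasets.DatasetBuilder so the sketch runs on its own.
struct DatasetBuilder{name} end

# In a real corpus file this would be: const mycorpus_id = get_nametype("MYCORPUS")
# Assumption: the identifier is a Symbol built from the dataset name.
const mycorpus_id = Symbol("MYCORPUS")

# Hypothetical prepare(): lists audio files and writes the two manifest files.
function prepare(::DatasetBuilder{mycorpus_id}, inputdir, outputdir; ext = ".wav")
    mkpath(outputdir)
    files = sort(filter(f -> endswith(f, ext), readdir(inputdir)))
    open(joinpath(outputdir, "recordings.jsonl"), "w") do rec
        open(joinpath(outputdir, "annotations.jsonl"), "w") do ann
            for f in files
                id = first(splitext(f))
                # Placeholder fields; assumes file names need no JSON escaping.
                println(rec, "{\"id\": \"$id\", \"file\": \"$f\"}")
                println(ann, "{\"id\": \"$id\", \"recording_id\": \"$id\"}")
            end
        end
    end
    return outputdir
end
```

Calling prepare(DatasetBuilder{mycorpus_id}(), inputdir, outputdir) on a directory that contains utt1.wav would then write one line to each manifest file. The ext keyword is only an example of a dataset-specific keyword argument.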
DatasetBuilder and utilities
SpeechDatasets.DatasetBuilder — Type
struct DatasetBuilder{name}
Allows dispatching the main dataset functions (download(), prepare()) on the dataset name.
Parameter
name
Dataset identifier
Fields
kwargs::NamedTuple
Keyword arguments supported by the dataset associated with name
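To illustrate how a type parameter can drive dispatch, here is a small stand-alone sketch. It mirrors the documented shape of the type (a name parameter and a kwargs field) but is not the package's definition; summary_line and the :OTHER identifier are hypothetical and exist only for this example.

```julia
# Stand-in type mirroring the documented shape: a name parameter and a kwargs field.
struct DatasetBuilder{name}
    kwargs::NamedTuple
end

# Hypothetical function with one method for a specific dataset identifier.
summary_line(::DatasetBuilder{:TIMIT}) = "TIMIT-specific method"
# Generic fallback, selected for any identifier without a dedicated method.
summary_line(::DatasetBuilder{name}) where {name} = "no dedicated method for $name"

timit = DatasetBuilder{:TIMIT}((subsets = ["train", "dev", "test"],))
other = DatasetBuilder{:OTHER}(NamedTuple())

println(summary_line(timit))
println(summary_line(other))
println(keys(timit.kwargs))
```

Because the identifier lives in the type, Julia selects the method at dispatch time and no lookup table of dataset names is needed.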
SpeechDatasets.DatasetBuilder — Method
DatasetBuilder(name::Symbol)
Construct a DatasetBuilder for the given name. The constructor for each name is defined by calling declareBuilder(name), which is done automatically for each supported name.
SpeechDatasets.declareBuilder — Method
declareBuilder(name::Symbol)
Declare a functor for a DatasetBuilder of type name.
Once declared, a DatasetBuilder{name} object can be created; it holds the supported kwargs for the corresponding dataset.
SpeechDatasets.get_kwargs — Method
get_kwargs(func_name::Function, args_types::Tuple)
Return a NamedTuple
containing each supported kwarg and its default value for a given method.
Arguments
- func_name is the function whose method is queried
- args_types is a tuple of argument types for the desired method
SpeechDatasets.get_nametype — Method
get_nametype(name::String)
Return a symbol corresponding to the name. This symbol is used to identify the dataset.
Base.download — Function
Base.download(builder::DatasetBuilder{name}, dir::AbstractString)
Download the dataset identified by name into dir.
Each dataset provides its own implementation if downloading is supported (for example, a proprietary dataset might not implement download).
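As a rough sketch of what a per-dataset download method could look like, the snippet below uses a stand-in DatasetBuilder type so that it runs without the package. The :MYCORPUS identifier and the archive URL are placeholders, and the generic fallback that throws an error is an assumption made for this example, not documented package behaviour.

```julia
using Downloads

# Stand-in for SpeechDatasets.DatasetBuilder so the sketch runs on its own.
struct DatasetBuilder{name} end

# Placeholder URL for a hypothetical, freely downloadable corpus.
const ARCHIVE_URL = "https://example.org/mycorpus.tar.gz"

# Dataset-specific method: fetch the archive into dir and return the local path.
function Base.download(::DatasetBuilder{:MYCORPUS}, dir::AbstractString)
    mkpath(dir)
    return Downloads.download(ARCHIVE_URL, joinpath(dir, basename(ARCHIVE_URL)))
end

# Assumed fallback for datasets that cannot be fetched automatically.
function Base.download(::DatasetBuilder{name}, dir::AbstractString) where {name}
    throw(ArgumentError("dataset $name cannot be downloaded automatically; obtain it manually"))
end
```

With these definitions, download(DatasetBuilder{:MYCORPUS}(), dir) would attempt the transfer, while any other identifier reaches the fallback and raises an ArgumentError.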
SpeechDatasets.prepare — Function
prepare(::DatasetBuilder{name}, inputdir, outputdir; <keyword arguments>)
Create manifest files in outputdir from the dataset in inputdir.
Each dataset has its own implementation and may accept optional keyword arguments, which can be listed with get_dataset_kwargs(name::String).
Implementing this function is mandatory for a dataset to be compatible with dataset().