API
Load a Dataset
To get data from a supported dataset, you only need one function:
SpeechDatasets.dataset — Method
dataset(name::AbstractString, inputdir::AbstractString, outputdir::AbstractString; <keyword arguments>)
Extract recordings and annotations for the desired dataset.
Return a SpeechDataset object.
Create the outputdir folder, with:
recordings.jsonl: each audio file path and associated metadata
annotations-<subset>.jsonl: each annotation and associated metadata
Arguments
name: Name of the dataset. Supported names are ["AVID", "INA Diachrony", "Mini LibriSpeech", "Multilingual LibriSpeech", "TIMIT", "Speech2Tex"].
inputdir: Path of the dataset directory. If the directory does not exist, it is created and the data is downloaded if possible. Not all datasets can be downloaded; proprietary datasets, for example, do not implement a download function.
outputdir: Output directory for the manifest files.
Keyword Arguments
Common kwargs are:
subset: Part of the dataset to load (for example "train" or "test").
lang: ISO 639-3 code of the language.
Other kwargs may be available depending on the dataset; they can be listed with get_dataset_kwargs(name::String).
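A minimal usage sketch of the call described above, assuming the package is installed and that the "Mini LibriSpeech" data can be downloaded. The directory names are placeholders, and the subset value "train" is an assumption that should be checked against the kwargs reported for the dataset.

```julia
using SpeechDatasets

# Placeholder paths (hypothetical): raw corpus and manifest locations.
inputdir = joinpath("data", "mini_librispeech")
outputdir = joinpath("manifests", "mini_librispeech")

# List the keyword arguments this dataset accepts, with their defaults.
kwargs = SpeechDatasets.get_dataset_kwargs("Mini LibriSpeech")
println(kwargs)

# Write the manifests to outputdir and load one subset.
# "train" is an assumed subset name; see `kwargs` for the supported values.
ds = SpeechDatasets.dataset("Mini LibriSpeech", inputdir, outputdir; subset = "train")

# Display information about the loaded dataset.
summary(ds)
```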
Base.summary — Method
Base.summary(dataset::SpeechDataset)
Display information about the given SpeechDataset.
SpeechDatasets.get_dataset_kwargs — Method
get_dataset_kwargs(name::String)
Return a NamedTuple containing each supported kwarg and its default value for the dataset identified by name.
Types
SpeechDataset
SpeechDatasets.SpeechDatasetInfos — Type
struct SpeechDatasetInfos
Store metadata about a dataset.
Fields
name: Dataset official name
lang: Language or list of languages (ISO 639-3 code)
license: License name
source: URL to the dataset publication or content
authors: List of authors
description: A few sentences describing the content or main purpose
subsets: List of available subsets (for example ["train", "test"])
SpeechDatasets.SpeechDatasetInfos — Method
SpeechDatasetInfos(name::AbstractString)
Construct a SpeechDatasetInfos from the dataset name.
SpeechDatasets.SpeechDataset — Type
struct SpeechDataset <: MLUtils.AbstractDataContainer
Store all dataset recordings and annotations.
It can be iterated, yielding a Tuple{Recording, Annotation} for each entry. Elements can be indexed by integer position or by id.
Fields
infos::SpeechDatasetInfos
idxs::Vector{AbstractString}: ids used as indexes to access elements
annotations::Dict{AbstractString, Annotation}: Annotation for each index
recordings::Dict{AbstractString, Recording}: Recording for each index
SpeechDatasets.SpeechDataset — Method
SpeechDataset(infos::SpeechDatasetInfos, manifestroot::AbstractString, subset::AbstractString)
Create a SpeechDataset from manifest files and subset.
Access a single element with integer or id indexing
# ds::SpeechDataset
ds[1]
ds["1988-147956-0027"]
Access several elements by providing a list
ds[[1,4,7]]
ds[[8, 2, "777-126732-0015"]]
Get all annotations
ds.annotations
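The indexing behaviour shown above can be pictured with a small self-contained stand-in. This is only a sketch under assumptions: TinyDataset is a hypothetical type, not the package's SpeechDataset, and it merely mirrors the documented idea of an ordered id list (idxs) next to id-keyed dictionaries.

```julia
# Hypothetical stand-in, not the package's SpeechDataset.
struct TinyDataset
    idxs::Vector{String}
    recordings::Dict{String, String}    # id => recording (here just a path)
    annotations::Dict{String, String}   # id => annotation (here just a text)
end

# Id indexing: look both items up directly by key.
Base.getindex(d::TinyDataset, id::AbstractString) =
    (d.recordings[id], d.annotations[id])

# Integer indexing: translate the position to an id first.
Base.getindex(d::TinyDataset, i::Integer) = d[d.idxs[i]]

# List indexing: accept a mix of integers and ids.
Base.getindex(d::TinyDataset, keys::AbstractVector) = [d[k] for k in keys]

Base.length(d::TinyDataset) = length(d.idxs)

# Iteration yields one (recording, annotation) tuple per entry.
Base.iterate(d::TinyDataset, state = 1) =
    state > length(d) ? nothing : (d[state], state + 1)

ids = ["utt-a", "utt-b", "utt-c"]
ds = TinyDataset(
    ids,
    Dict(id => id * ".wav" for id in ids),
    Dict(id => "text of " * id for id in ids),
)

first_pair = ds[1]          # by position
same_pair = ds["utt-a"]     # by id, same entry as above
mixed = ds[[3, "utt-b"]]    # list mixing a position and an id
```

Keeping the ordered id list separate from the dictionaries is what lets integer positions and ids address the same entries.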
Manifest items
SpeechDatasets.ManifestItem — Type
abstract type ManifestItem end
Base type for all manifest items. Every manifest item should have an id attribute.
SpeechDatasets.Recording — Type
struct Recording{Ts<:AbstractAudioSource} <: ManifestItem
id::AbstractString
source::Ts
channels::Vector{Int}
samplerate::Int
end
A recording is an audio source associated with an id.
Constructors
Recording(id, source, channels, samplerate)
Recording(id, source[; channels = missing, samplerate = missing])
If the channels or the sample rate are not provided, they are read from source.
When preparing a large corpus, omitting the channels and/or the sample rate can drastically slow down processing, because each source then has to be read.
SpeechDatasets.Annotation — Type
struct Annotation <: ManifestItem
id::AbstractString
recording_id::AbstractString
start::Float64
duration::Float64
channel::Union{Vector, Colon}
data::Dict
end
An "annotation" defines a segment of a recording on a single channel. The data field is an arbitrary dictionary describing the nature of the annotation. start and duration (in seconds) define where the segment is located within the recording recording_id.
Constructors
Annotation(id, recording_id, start, duration, channel, data)
Annotation(id, recording_id[; channel = missing, start = -1, duration = -1, data = missing])
If start and/or duration are negative, the segment is considered to span the whole recording.
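The start/duration convention can be illustrated with a short sketch. The helper sample_range below is hypothetical and not part of the package; it only shows one way a segment given in seconds could map to sample indices, assuming a known sample rate and recording length.

```julia
# Hypothetical helper, not part of the package: turn (start, duration) in
# seconds into a range of 1-based sample indices.
function sample_range(start::Real, duration::Real, samplerate::Integer, nsamples::Integer)
    # A negative start and/or duration means "the whole recording".
    if start < 0 || duration < 0
        return 1:nsamples
    end
    lo = floor(Int, start * samplerate) + 1
    hi = min(nsamples, floor(Int, (start + duration) * samplerate))
    return lo:hi
end

sr = 16_000
n = 5 * sr                              # a 5-second recording
whole = sample_range(-1, -1, sr, n)     # defaults: every sample
part = sample_range(1.5, 2.0, sr, n)    # 2 seconds starting at 1.5 s
```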
AudioSources.load — Method
load(recording::Recording [; start = -1, duration = -1, channels = recording.channels])
load(recording, annotation)
Load the signal from a recording. start and duration are given in seconds.
The function returns a tuple (x, sr), where x is an $N×C$ array ($N$ is the length of the signal and $C$ is the number of channels) and sr is the sampling rate of the signal.
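To make the shape of the returned tuple concrete, here is a sketch that builds a synthetic (x, sr) pair instead of reading audio. The values are made up for illustration; only the samples-by-channels layout comes from the description above.

```julia
# Synthetic stand-in for the (x, sr) tuple; no audio file is read here.
sr = 16_000
N, C = 2 * sr, 2                 # 2 seconds of stereo audio
x = zeros(Float32, N, C)

nsamples, nchannels = size(x)    # rows are samples, columns are channels
left = x[:, 1]                   # one channel as a vector of length N
seconds = nsamples / sr          # signal duration in seconds
```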
AudioSources.load — Method
load(r::Recording, a::Annotation)
load(t::Tuple{Recording, Annotation})
Load only a segment of the recording referenced in the annotation.
SpeechDatasets.load_manifest — Method
load_manifest(Annotation, path)
load_manifest(Recording, path)
Load a Recording or Annotation manifest from path.
Lexicons
SpeechDatasets.CMUDICT — Method
CMUDICT(path)
Return the pronunciation dictionary loaded from the CMU Sphinx dictionary. The CMU dictionary is downloaded and stored at path. Subsequent calls only read the file at path without downloading the data again.
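The download-once behaviour described above follows a common caching pattern, sketched below. The helper cached_read and the fake downloader are hypothetical and are not the package's code; the sketch only shows the "fetch if missing, then read" logic on a temporary file.

```julia
# Hypothetical helper, not the package's code: call `download_to(path)` only
# when `path` does not exist yet, then read the file.
function cached_read(download_to::Function, path::AbstractString)
    if !isfile(path)
        mkpath(dirname(path))    # make sure the parent folder exists
        download_to(path)        # expected to write the file at `path`
    end
    return read(path, String)
end

# Fake "download" that writes a one-entry lexicon and counts its calls.
ncalls = Ref(0)
fake_download(path) = (ncalls[] += 1; write(path, "HELLO HH AH L OW\n"))

path = joinpath(mktempdir(), "lexicon", "cmudict.txt")
text1 = cached_read(fake_download, path)   # file is missing: download runs
text2 = cached_read(fake_download, path)   # file exists: download is skipped
```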
SpeechDatasets.TIMITDICT — Method
TIMITDICT(timitdir)
Return the pronunciation dictionary provided by the TIMIT corpus (located in timitdir).
SpeechDatasets.MFAFRDICT — Method
MFAFRDICT(path)
Return the French pronunciation dictionary provided by MFA (french_mfa v2.0.0a).
Index
SpeechDatasets.Annotation
SpeechDatasets.DatasetBuilder
SpeechDatasets.DatasetBuilder
SpeechDatasets.ManifestItem
SpeechDatasets.Recording
SpeechDatasets.SpeechDataset
SpeechDatasets.SpeechDataset
SpeechDatasets.SpeechDatasetInfos
SpeechDatasets.SpeechDatasetInfos
AudioSources.load
AudioSources.load
Base.download
Base.summary
SpeechDatasets.CMUDICT
SpeechDatasets.MFAFRDICT
SpeechDatasets.TIMITDICT
SpeechDatasets.dataset
SpeechDatasets.declareBuilder
SpeechDatasets.get_dataset_kwargs
SpeechDatasets.get_kwargs
SpeechDatasets.get_nametype
SpeechDatasets.load_manifest
SpeechDatasets.prepare