API

Load a Dataset

To get data from a supported dataset, you only need one function:

SpeechDatasets.dataset - Method
dataset(name::AbstractString, inputdir::AbstractString, outputdir::AbstractString; <keyword arguments>)

Extract recordings and annotations for the desired dataset.

Return a SpeechDataset object.

Create the outputdir folder, which contains:

  • recordings.jsonl containing each audio file path and associated metadata
  • annotations-<subset>.jsonl containing each annotation and associated metadata

Arguments

  • name Name of the dataset. Supported names are ["AVID", "INA Diachrony", "Mini LibriSpeech", "Multilingual LibriSpeech", "TIMIT", "Speech2Tex"].
  • inputdir Path to the dataset directory. If the directory does not exist, it is created and the data is downloaded when possible. Not all datasets can be downloaded: proprietary datasets, for example, do not implement a download function.
  • outputdir Output directory for the manifest files.

Keyword Arguments

Common kwargs are

  • subset Part of the dataset to load (for example "train" or "test").
  • lang ISO 639-3 code of the language.

Other kwargs may be available depending on the dataset. They can be listed with get_dataset_kwargs(name::String).
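As a rough illustration of the call described above, the sketch below loads one subset and prints its summary. The directory names are placeholders invented for this example, and "train" is only the example subset name used on this page; the names actually accepted depend on the dataset. The functions are called with the SpeechDatasets. prefix because this page does not say whether they are exported.

```julia
using SpeechDatasets

# Placeholder directories chosen for this example.
inputdir = joinpath("data", "mini_librispeech")        # raw corpus, downloaded if missing
outputdir = joinpath("manifests", "mini_librispeech")  # receives the .jsonl manifest files

# Inspect the dataset-specific keyword arguments before loading.
println(SpeechDatasets.get_dataset_kwargs("Mini LibriSpeech"))

# "train" is an assumed subset name; check the subsets the dataset provides.
ds = SpeechDatasets.dataset("Mini LibriSpeech", inputdir, outputdir; subset = "train")

# Display information about the loaded dataset.
summary(ds)
```

After this call, outputdir should contain recordings.jsonl and one annotations-<subset>.jsonl file for the requested subset.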

Base.summary - Method
Base.summary(dataset::SpeechDataset)

Display information about the given SpeechDataset.


Types

SpeechDataset

SpeechDatasets.SpeechDatasetInfos - Type
struct SpeechDatasetInfos

Store metadata about a dataset.

Fields

  • name Dataset official name
  • lang Language or list of languages (ISO 639-3 code)
  • license License name
  • source URL to the dataset publication or content
  • authors List of authors
  • description A few sentences describing the content or main purpose
  • subsets List of available subsets (for example ["train", "test"])

SpeechDatasets.SpeechDataset - Type
struct SpeechDataset <: MLUtils.AbstractDataContainer

Store all dataset recordings and annotations.

It can be iterated, yielding a Tuple{Recording, Annotation} for each entry. Elements can be indexed by integer position or by id.

Fields

  • infos::SpeechDatasetInfos
  • idxs::Vector{AbstractString} id indexes to access elements
  • annotations::Dict{AbstractString, Annotation} Annotation for each index
  • recordings::Dict{AbstractString, Recording} Recording for each index

SpeechDatasets.SpeechDataset - Method
SpeechDataset(infos::SpeechDatasetInfos, manifestroot::AbstractString, subset::AbstractString)

Create a SpeechDataset from manifest files and subset.


Access a single element with integer or id indexing

# ds::SpeechDataset
ds[1]
ds["1988-147956-0027"]

Access several elements by providing a list

ds[[1,4,7]]
ds[[8, 2, "777-126732-0015"]]

Get all annotations

ds.annotations
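To make the mixed integer/id indexing above more concrete, here is a standalone mock that mirrors only the idxs, annotations and recordings fields described earlier. ToyDataset and its string-valued entries are invented for this sketch; it is not the package's implementation, and the real SpeechDataset may return a different container for list indexing.

```julia
# Standalone mock; ToyDataset is hypothetical and not part of SpeechDatasets.
struct ToyDataset
    idxs::Vector{String}              # ids in positional order
    annotations::Dict{String,String}  # id => annotation (a plain string here)
    recordings::Dict{String,String}   # id => recording (a plain string here)
end

# Integer index: translate the position to an id, then reuse the id lookup.
Base.getindex(d::ToyDataset, i::Integer) = d[d.idxs[i]]
# Id index: return the (recording, annotation) pair stored under that id.
Base.getindex(d::ToyDataset, id::AbstractString) = (d.recordings[id], d.annotations[id])
# List index: resolve each entry on its own, so integers and ids can be mixed.
Base.getindex(d::ToyDataset, keys::AbstractVector) = [d[k] for k in keys]

ids = ["utt-a", "utt-b", "utt-c"]
toy = ToyDataset(ids,
                 Dict(id => "text of $id" for id in ids),
                 Dict(id => "$id.wav" for id in ids))

first_pair = toy[1]         # lookup by position
same_pair = toy["utt-a"]    # lookup by id, same entry
mixed = toy[[3, "utt-b"]]   # positions and ids in one list
```

Routing the integer method through the id method keeps a single place where recordings and annotations are paired.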

Manifest items

SpeechDatasets.Recording - Type
struct Recording{Ts<:AbstractAudioSource} <: ManifestItem
    id::AbstractString
    source::Ts
    channels::Vector{Int}
    samplerate::Int
end

A recording is an audio source associated with an id.

Constructors

Recording(id, source, channels, samplerate)
Recording(id, source[; channels = missing, samplerate = missing])

If the channels or the sample rate are not provided then they will be read from source.

Warning

When preparing a large corpus, omitting the channels and/or the sample rate can drastically slow down preparation, because each source must then be read to obtain them.

SpeechDatasets.Annotation - Type
struct Annotation <: ManifestItem
    id::AbstractString
    recording_id::AbstractString
    start::Float64
    duration::Float64
    channel::Union{Vector, Colon}
    data::Dict
end

An "annotation" defines a segment of a recording on a single channel. The data field is an arbitrary dictionary holding the nature of the annotation. start and duration (in seconds) define where the segment is located within the recording recording_id.

Constructors

Annotation(id, recording_id, start, duration, channel, data)
Annotation(id, recording_id[; channel = missing, start = -1, duration = -1, data = missing])

If start and/or duration are negative, the segment is considered to be the whole sequence length of the recording.
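The rule above can be sketched as a conversion from seconds to sample indices. The helper sample_range is hypothetical and written only for this illustration; it is not part of SpeechDatasets, and the package may round segment boundaries differently.

```julia
# Hypothetical helper, not part of SpeechDatasets.
# Convert a segment given in seconds into a 1-based range of sample indices.
function sample_range(start::Real, duration::Real, samplerate::Integer, nsamples::Integer)
    # A negative start or duration selects the whole recording.
    (start < 0 || duration < 0) && return 1:nsamples
    first_sample = floor(Int, start * samplerate) + 1
    # Clamp the end of the segment to the end of the recording.
    last_sample = min(nsamples, floor(Int, (start + duration) * samplerate))
    return first_sample:last_sample
end

# A 3-second recording sampled at 16 kHz has 48000 samples.
whole = sample_range(-1, -1, 16_000, 48_000)      # default values: whole recording
segment = sample_range(0.5, 1.0, 16_000, 48_000)  # one second starting at 0.5 s
```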

AudioSources.load - Method
load(recording::Recording [; start = -1, duration = -1, channels = recording.channels])
load(recording, annotation)

Load the signal from a recording. start and duration (in seconds) delimit the portion of the signal to load.

The function returns a tuple (x, sr) where:

  • x is an $N \times C$ array, with $N$ the length of the signal in samples and $C$ the number of channels
  • sr is the sampling rate of the signal
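To illustrate the layout of the returned tuple, the sketch below builds a synthetic stereo signal with the same shape convention (one row per sample, one column per channel). It does not call load; the signal and the variable names are invented for the example.

```julia
sr = 16_000                        # sampling rate in Hz
t = (0:sr-1) ./ sr                 # time stamps for one second of audio
# Synthetic stereo signal: one column per channel, one row per sample.
x = hcat(sin.(2π * 440 .* t), sin.(2π * 220 .* t))

nsamples, nchannels = size(x)      # N and C
duration = nsamples / sr           # length of the signal in seconds
left = x[:, 1]                     # all samples of the first channel
```

With this layout, selecting a channel is a column slice and selecting a time span is a row slice.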

AudioSources.load - Method
load(r::Recording, a::Annotation)
load(t::Tuple{Recording, Annotation})

Load only a segment of the recording referenced in the annotation.


Lexicons

SpeechDatasets.CMUDICT - Method
CMUDICT(path)

Return the pronunciation dictionary loaded from the CMU Sphinx dictionary. The CMU dictionary is downloaded and stored in path. Subsequent calls only read the file at path without downloading the data again.

