API

Load a Dataset

To get data from a supported dataset, you only need one function:

SpeechDatasets.dataset - Function
dataset(dataset, inputdir::AbstractString, outputdir::AbstractString; kwargs...)

Create a SpeechDataset object for dataset. inputdir is the directory containing the raw data. If inputdir does not exist and the data is freely available, the data is downloaded automatically and placed in inputdir. outputdir is the directory where summary files will be stored. kwargs... are dataset-specific arguments passed to the dataset.


View the dataset metadata with:

Base.summary - Method
summary(ds::SpeechDataset)
summary(key::String)
summary(key::Symbol)

Display the dataset metadata, adapting the output to the current MIME type (HTML or plain text).


Access the citation in BibTeX format with:

SpeechDatasets.cite - Method
cite(ds::SpeechDataset)
cite(key::String)
cite(key::Symbol)

Get citation for a given dataset in BibTeX format, if available. The output is a multiline string that can directly be appended to a .bib file. If you want a different format you can then use the external package Bibliography.jl or its components BibInternal.jl and BibParser.jl. For example, you can parse the bib string to a BibInternal.Entry object, and then convert it to JSON:

using BibParser, JSON
parsed = BibParser.parse_entry(cite(ds))  # parse the BibTeX string
JSON.json(parsed[ds.citekey], omit_empty=true)
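Appending the citation to a .bib file needs only Base functions. The following is a minimal sketch: the `bibentry` string is a made-up placeholder standing in for the output of `cite(ds)`, and the file is created in a temporary directory so the example has no side effects.

```julia
# Placeholder standing in for the string returned by cite(ds); not a real reference.
bibentry = """
@misc{placeholder_key,
  title = {Placeholder title}
}
"""

# Open the .bib file in append mode so existing entries are preserved.
bibfile = joinpath(mktempdir(), "references.bib")
open(bibfile, "a") do io
    println(io, bibentry)
end

# Read the file back to check that the entry was written.
contents = read(bibfile, String)
```

With a real dataset, `bibentry` would be replaced by `cite(ds)` and `bibfile` by the path of your own bibliography.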

Types

SpeechDataset

SpeechDataset objects are iterable. You can also access a single element by indexing with its id:

# ds::SpeechDataset
recording, annotation = ds["msmr0_si1405"]

Since SpeechDataset is a subtype of AbstractDict, you can use the following functions:

length(ds)
keys(ds)
values(ds)
get(ds, "key", defaultValue)
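The dictionary functions listed above behave as they do for any AbstractDict. The sketch below uses a plain `Dict` as a stand-in for a SpeechDataset, since building a real one requires the corpus data. The ids and the string values are made up for illustration, whereas a real dataset holds (recording, annotation) pairs.

```julia
# A plain Dict used as a stand-in for a SpeechDataset; ids and values are made up.
ds = Dict(
    "utt1" => ("recording 1", "annotation 1"),
    "utt2" => ("recording 2", "annotation 2"),
)

n = length(ds)                             # number of entries
ids = sort(collect(keys(ds)))              # sorted, because Dict order is unspecified
rec, ann = ds["utt1"]                      # index by id, then destructure the pair
fallback = get(ds, "missing_id", nothing)  # default returned for an unknown id

# Walk over all (recording, annotation) pairs.
for (recording, annotation) in values(ds)
    println(recording, " / ", annotation)
end
```

Using `get` with a default avoids the `KeyError` that plain indexing throws for an unknown id.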

Manifest items

SpeechDatasets.Recording - Type
struct Recording{Ts<:AbstractAudioSource} <: ManifestItem
    id::AbstractString
    source::Ts
    channels::Vector{Int}
    samplerate::Int
end

A recording is an audio source associated with an id.

Constructors

Recording(id, source, channels, samplerate)
Recording(id, source[; channels = missing, samplerate = missing])

If the channels or the sample rate are not provided then they will be read from source.

Warning

When preparing a large corpus, omitting the channels and/or the sample rate can drastically reduce speed, because source then has to be read to obtain them.

SpeechDatasets.Annotation - Type
struct Annotation <: ManifestItem
    id::AbstractString
    recording_id::AbstractString
    start::Float64
    duration::Float64
    channel::Union{Vector, Colon}
    data::Dict
end

An "annotation" defines a segment of a recording on a single channel. The data field is an arbitrary dictionary holding the nature of the annotation. start and duration (in seconds) define where the segment is located within the recording recording_id.

Constructors

Annotation(id, recording_id, start, duration, channel, data)
Annotation(id, recording_id[; channel = missing, start = -1, duration = -1, data = missing])

If start and/or duration are negative, the segment is considered to be the whole sequence length of the recording.
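The rule for negative values can be pictured as follows. This is only a sketch: `resolve_segment` is a hypothetical helper written for this illustration, not a function of SpeechDatasets, and the package's actual handling may differ.

```julia
# Hypothetical helper, not part of SpeechDatasets: resolve the (start, duration)
# of an annotation against the total duration of its recording, all in seconds.
function resolve_segment(start::Real, duration::Real, total::Real)
    # A negative start and/or duration selects the whole recording.
    if start < 0 || duration < 0
        return (0.0, float(total))
    end
    return (float(start), float(duration))
end

whole = resolve_segment(-1, -1, 3.5)    # default values: whole recording
part = resolve_segment(0.5, 1.25, 3.5)  # explicit segment, kept as given
```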

AudioSources.load - Method
load(recording::Recording [; start = -1, duration = -1, channels = recording.channels])
load(recording, annotation)

Load the signal from a recording. start and duration are given in seconds.

The function returns a tuple (x, sr) where:

  • x is an $N×C$ array, with $N$ the length of the signal and $C$ the number of channels,
  • sr is the sampling rate of the signal.
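The relation between the array shape and the sampling rate can be sketched with a synthetic signal. Here `x` and `sr` are stand-ins built by hand rather than obtained from `load`. The 16 kHz rate, the `Float64` element type and the single channel are assumptions made for the example.

```julia
# Synthetic stand-in for the output of load: 2 seconds of silence, 1 channel, 16 kHz.
sr = 16000
x = zeros(Float64, 2 * sr, 1)

N, C = size(x)           # N samples (rows), C channels (columns)
seconds = N / sr         # duration of the signal in seconds
first_channel = x[:, 1]  # each column holds one channel
```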
AudioSources.load - Method
load(r::Recording, a::Annotation)
load(t::Tuple{Recording, Annotation})

Load only a segment of the recording referenced in the annotation.


Lexicons

Datasets with lexicons or other language files (units, wordcount, topology) should provide a lang/ directory as an artifact. It is loaded when the dataset is instantiated.
