API

Load a Dataset

To get data from a supported dataset, you only need one function:

SpeechDatasets.dataset — Function

dataset(dataset, inputdir::AbstractString, outputdir::AbstractString; kwargs...)

Create a SpeechDataset object for dataset. inputdir is the directory containing the raw data. If inputdir does not exist and the data is freely available, the data is downloaded automatically and placed in inputdir. outputdir is the directory where the summary files are stored. kwargs... are dataset-specific keyword arguments.

See metadata with

Base.summary — Method

summary(ds::SpeechDataset)
summary(key::String)
summary(key::Symbol)

Display dataset metadata, adapting the output to the current MIME type (HTML or plain text).

Access citation in BibTeX format with

SpeechDatasets.cite — Method

cite(ds::SpeechDataset)
cite(key::String)
cite(key::Symbol)

Get citation for a given dataset in BibTeX format, if available. The output is a multiline string that can directly be appended to a .bib file. If you want a different format you can then use the external package Bibliography.jl or its components BibInternal.jl and BibParser.jl. For example, you can parse the bib string to a BibInternal.Entry object, and then convert it to JSON:

using BibParser, JSON
# Parse the BibTeX string; the result is indexed by citation key below.
parsed = BibParser.parse_entry(cite(ds))
JSON.json(parsed[ds.citekey], omit_empty=true)

Types

SpeechDataset

SpeechDatasets.SpeechDataset — Type

SpeechDataset

Store metadata about a speech dataset.

SpeechDataset objects are iterable. You can also access a single element by indexing with its id:

# ds::SpeechDataset
recording, annotation = ds["msmr0_si1405"]

Because SpeechDataset is a subtype of AbstractDict, you can use the following functions:

length(ds)
keys(ds)
values(ds)
get(ds, "key", defaultValue)
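
The functions above follow the generic AbstractDict interface. The sketch below uses a plain Dict as a stand-in for a SpeechDataset, so it runs without any corpus on disk. The id "utt_0002" and the placeholder string values are hypothetical, and the assumption that iterating a SpeechDataset yields id => value pairs, as a plain Dict does, is not confirmed by this page.

```julia
# Stand-in for a SpeechDataset: a plain Dict mapping ids to
# (recording, annotation) pairs. The values are placeholder strings,
# not real Recording/Annotation objects.
ds = Dict(
    "msmr0_si1405" => ("recording A", "annotation A"),
    "utt_0002" => ("recording B", "annotation B"),  # hypothetical id
)

n = length(ds)                                # number of entries
ids = sort(collect(keys(ds)))                 # ids, sorted for a stable order
recording, annotation = ds["msmr0_si1405"]    # id indexing with destructuring
fallback = get(ds, "unknown_id", nothing)     # default value instead of a KeyError

# Iterating a Dict yields id => value pairs.
for (id, (rec, ann)) in ds
    println(id, " => ", rec, ", ", ann)
end
```

Using get with a default is handy when ids come from an external list that may contain entries missing from the dataset.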

Manifest items

SpeechDatasets.ManifestItem — Type

abstract type ManifestItem end

Abstract base type for all manifest items. Every manifest item should have an id attribute.

SpeechDatasets.Recording — Type

struct Recording{Ts<:AbstractAudioSource} <: ManifestItem
    id::AbstractString
    source::Ts
    channels::Vector{Int}
    samplerate::Int
end

A recording is an audio source associated with an id.

Constructors

Recording(id, source, channels, samplerate)
Recording(id, source[; channels = missing, samplerate = missing])

If the channels or the sample rate are not provided then they will be read from source.

Warning

When preparing a large corpus, omitting the channels and/or the sample rate can drastically slow down preparation, because each source must then be read to obtain them.

SpeechDatasets.Annotation — Type

struct Annotation <: ManifestItem
    id::AbstractString
    recording_id::AbstractString
    start::Float64
    duration::Float64
    channel::Union{Vector, Colon}
    data::Dict
end

An "annotation" defines a segment of a recording on a single channel. The data field is an arbitrary dictionary describing the nature of the annotation. start and duration (in seconds) define where the segment is located within the recording recording_id.

Constructors

Annotation(id, recording_id, start, duration, channel, data)
Annotation(id, recording_id[; channel = missing, start = -1, duration = -1, data = missing])

If start and/or duration is negative, the segment is taken to span the whole recording.
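
To make the negative-value convention concrete, here is a minimal sketch of how start and duration (in seconds) could be mapped to a range of sample indices. The helper segment_range is hypothetical and not part of SpeechDatasets, and the rounding and clamping choices are assumptions rather than the package's actual behavior.

```julia
# Hypothetical helper: map (start, duration) in seconds to a 1-based sample range.
function segment_range(start::Real, duration::Real, samplerate::Integer, nsamples::Integer)
    # A negative start and/or duration means "the whole recording".
    if start < 0 || duration < 0
        return 1:nsamples
    end
    first_sample = floor(Int, start * samplerate) + 1
    # Clamp the end of the segment to the end of the recording (assumed convention).
    last_sample = min(nsamples, first_sample + round(Int, duration * samplerate) - 1)
    return first_sample:last_sample
end

# A 3-second recording at 16 kHz has 48000 samples.
whole = segment_range(-1, -1, 16_000, 48_000)   # default values: whole recording
part = segment_range(0.5, 1.0, 16_000, 48_000)  # one second starting at 0.5 s
```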

AudioSources.load — Method

load(recording::Recording [; start = -1, duration = -1, channels = recording.channels])
load(recording, annotation)

Load the signal from a recording. start and duration are given in seconds.

The function returns a tuple (x, sr), where x is an $N \times C$ array, $N$ is the length of the signal in samples, $C$ is the number of channels, and sr is the sampling rate of the signal.
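
As an illustration of the returned layout, the sketch below builds a synthetic (x, sr) pair with the same shape convention instead of calling load, so it runs without any audio file. The 440 Hz tone and the 16 kHz sampling rate are arbitrary choices made for the example.

```julia
# Synthetic stand-in for the output of load: 2 seconds of stereo audio at 16 kHz.
sr = 16_000
t = (0:2sr-1) ./ sr                               # time axis in seconds
x = hcat(sin.(2π * 440 .* t), zeros(length(t)))   # N×C array with C = 2

N, C = size(x)        # rows are samples, columns are channels
duration = N / sr     # signal length in seconds
left = x[:, 1]        # all samples of the first channel
```

Because samples run along the first dimension, selecting a channel is a column slice and the duration follows directly from the number of rows and sr.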

AudioSources.load — Method

load(r::Recording, a::Annotation)
load(t::Tuple{Recording, Annotation})

Load only a segment of the recording referenced in the annotation.

SpeechDatasets.load_manifest — Method

load_manifest(Annotation, path)
load_manifest(Recording, path)

Load Recording/Annotation manifest from path.

Lexicons

Datasets with lexicons or other language files (units, wordcount, topology) should provide a lang/ directory as an artifact. It is loaded when the dataset is instantiated.

Index

SpeechDatasets.Annotation
SpeechDatasets.ManifestItem
SpeechDatasets.Recording
SpeechDatasets.SpeechDataset
AudioSources.load
AudioSources.load
Base.summary
SpeechDatasets.cite
SpeechDatasets.dataset
SpeechDatasets.get_artifact
SpeechDatasets.get_dataset_kwargs
SpeechDatasets.get_download_kwargs
SpeechDatasets.get_kwargs
SpeechDatasets.load_manifest
SpeechDatasets.prepare