pydrobert.kaldi.io.corpus

Submodule for corpus iterators

class pydrobert.kaldi.io.corpus.Data(table, *additional_tables, **kwargs)[source]

Bases: Iterable, Sized

Abstract base class for data iterables

A template for providing iterators over kaldi tables. They can be used like this

>>> data = DataSubclass(
...     'scp:feats.scp', 'scp:labels.scp', batch_size=10)
>>> for feat_batch, label_batch in data:
...     pass  # do something
>>> for feat_batch, label_batch in data:
...     pass  # do something again

Here DataSubclass is some subclass of this abstract class. Calling iter() on an instance (which occurs implicitly in for-loops) will generate a new iterator over the entire data set.

The class takes an arbitrary positive number of positional arguments on initialization, each a table to open. Each argument is one of:

  1. An rspecifier (ideally for a script file). Assumed to be of type KaldiDataType.BaseMatrix

  2. A sequence of length 2: the first element is the rspecifier, the second the rspecifier’s KaldiDataType

  3. A sequence of length 3: the first element is the rspecifier, the second the rspecifier’s KaldiDataType, and the third is a dictionary to be passed as keyword arguments to the pydrobert.kaldi.io.open() function

All tables are assumed to index data using the same keys.
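The three specifier forms above can be sketched in plain Python. This is a standalone illustration: the rspecifiers, dtype strings, and the open() keyword shown are hypothetical examples, not files or options this documentation guarantees.

```python
# Each positional table argument takes one of three forms.
tables = [
    'scp:feats.scp',                       # form 1: bare rspecifier (assumed BaseMatrix)
    ('scp:labels.scp', 'iv'),              # form 2: (rspecifier, kaldi_dtype)
    ('scp:aux.scp', 'bv', {'mode': 'r'}),  # form 3: (rspecifier, kaldi_dtype, open() kwargs)
]
# A Data subclass would receive these as positional arguments: DataSubclass(*tables)
num_tables = len(tables)
```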

If batch_size is set, data are stacked in batches along a new axis. The keyword arguments batch_axis, batch_pad_mode, and any remaining keywords are sent to this module’s batch_data() function. If batch_size is None or 0, samples are returned one-by-one. Data are always cast to numpy arrays before being returned. Consult that function for more information on batching.

If only one table is specified and neither axis_lengths nor add_key is specified, iterators yield batches of that table’s data directly. Otherwise, iterators yield “batches” of tuples containing “sub-batches” from each respective data source. Sub-batches belonging to the same batch share the same subset of ordered keys.

If add_key is True, a sub-batch of referent keys is added as the first element of a batch tuple.

For batched sequence-to-sequence tasks, it is often important to know the original length of data before padding. Setting axis_lengths adds one or more sub-batches to the end of a batch tuple with this information. These sub-batches are filled with signed 32-bit integers. axis_lengths can be one of:

  1. An integer specifying an axis from the first table to get the lengths of.

  2. A pair of integers. The first element is the table index, the second is the axis index in that table.

  3. A sequence of pairs of integers. Sub-batches will be appended to the batch tuple in that order

Note that axes in axis_lengths index the axes of individual samples, not the batch. For instance, if batch_axis == 0 and axis_lengths == 0, then the last sub-batch will contain the pre-padding lengths along sub-batch 0’s axis 1 (batch[0].shape[1]).
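That axis bookkeeping can be illustrated with plain numpy. This is a standalone sketch of the padding and stacking, not a call into this class:

```python
import numpy as np

# Two samples of shape (length, num_feats) with different lengths
samples = [np.ones((3, 4), dtype=np.float32), np.ones((5, 4), dtype=np.float32)]

# Pre-padding lengths along each sample's axis 0 (what axis_lengths == 0 records)
lens = np.array([s.shape[0] for s in samples], dtype=np.int32)

# Pad on the right to the longest length, then stack with batch_axis == 0
max_len = max(s.shape[0] for s in samples)
padded = [np.pad(s, ((0, max_len - s.shape[0]), (0, 0))) for s in samples]
batch = np.stack(padded, axis=0)

# Each sample's axis 0 now sits at batch axis 1, hence batch.shape[1]
assert batch.shape == (2, 5, 4)
assert lens.tolist() == [3, 5]
```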

The length of this object is the number of batches it serves per epoch.

Parameters
  • table – The first table specifier

  • additional_tables – Table specifiers past the first. If not empty, will iterate over tuples of sub-batches

  • add_key – If True, will insert a sub-sample at index 0 of each sample tuple specifying the key the sample was indexed by. Defaults to False

  • axis_lengths – If set, sub-batches of axis lengths will be appended to the end of a batch tuple

  • batch_axis – The axis or axes (in the case of multiple tables) along which samples are stacked in (sub-)batches. batch_axis should take into account axis length and key sub-batches when applicable. Defaults to 0

  • batch_cast_to_array – A numpy type or sequence of types to cast each (sub-)batch to. None values indicate no casting should occur. batch_cast_to_array should take into account axis length and key sub-batches when applicable

  • batch_kwargs – Additional keyword arguments to pass to batch_data

  • batch_pad_mode – If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length

  • batch_size – The number of samples per (sub-)batch. Defaults to None, which means samples are served without batching

  • ignore_missing – If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError. Defaults to False

table_specifiers

A tuple of triples indicating (rspecifier, kaldi_dtype, open_kwargs) for each table

add_key

Whether a sub-batch of table keys has been prepended to existing sub-batches

axis_lengths

A tuple of pairs for each axis-length sub-batch requested. Each pair is (sub_batch_idx, axis).

batch_axis

A tuple of length num_sub indicating which axis (sub-)samples will be arrayed along in a given (sub-)batch when all (sub-)samples are (or are cast to) fixed length numpy arrays of the same type

batch_cast_to_array

A tuple of length num_sub indicating what numpy types, if any, (sub-)samples should be cast to. Values of None indicate no casting should be done on that (sub-)sample

batch_kwargs

Additional keyword arguments to pass to batch_data

batch_pad_mode

If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length

batch_size

The number of samples per (sub-)batch

ignore_missing

If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError

num_sub

The number of sub-batches per batch. If > 1, batches are yielded as tuples of sub-batches. This number accounts for key, table, and axis-length sub-batches
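The description above implies the count is additive over key, table, and axis-length sub-batches. A minimal sketch of that bookkeeping in plain Python, under hypothetical settings:

```python
num_tables = 2           # e.g. features and labels
add_key = True           # one key sub-batch, prepended
axis_lengths = [(0, 0)]  # one axis-length sub-batch, appended

# num_sub counts key, table, and axis-length sub-batches
num_sub = int(add_key) + num_tables + len(axis_lengths)
assert num_sub == 4  # batches would be 4-tuples: (keys, feats, labels, lens)
```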

batch_generator(repeat=False)[source]

A generator which yields batches of data

Parameters

repeat (bool) – Whether to stop generating after one epoch (False) or restart and continue generating indefinitely (True)

Yields

batch (np.array or tuple) – A batch if self.num_sub == 1, otherwise a tuple of sub-batches. If self.batch_size does not divide an epoch’s worth of data evenly, the last batch of every epoch will be smaller

property num_batches

the number of batches yielded per epoch

This number takes into account the number of terms missing if self.ignore_missing == True

Type

int

abstract property num_samples

the number of samples yielded per epoch

This number takes into account the number of terms missing if self.ignore_missing == True

Type

int

sample_generator(repeat=False)[source]

A generator which yields individual samples from data

Parameters

repeat (bool) – Whether to stop generating after one epoch (False) or restart and continue generating indefinitely (True)

Yields

sample (np.array or tuple) – A sample if self.num_sub == 1, otherwise a tuple of sub-samples

abstract sample_generator_for_epoch()[source]

A generator which yields individual samples from data for an epoch

An epoch means one pass through the data from start to finish. Equivalent to sample_generator(False).

Yields

sample (np.array or tuple) – A sample if self.num_sub == 1, otherwise a tuple of sub-samples

class pydrobert.kaldi.io.corpus.SequentialData(table, *additional_tables, **kwargs)[source]

Bases: Data

Provides iterators to read data sequentially

Tables are always assumed to be sorted so reading can proceed in lock-step.

Warning

Each time an iterator is requested, new sequential readers are opened. Be careful with stdin!

Parameters
  • table – The first table specifier

  • additional_tables – Table specifiers past the first. If not empty, will iterate over tuples of sub-batches

  • add_key – If True, will insert a sub-sample at index 0 of each sample tuple specifying the key the sample was indexed by. Defaults to False

  • axis_lengths – If set, sub-batches of axis lengths will be appended to the end of a batch tuple

  • batch_axis – The axis or axes (in the case of multiple tables) along which samples are stacked in (sub-)batches. batch_axis should take into account axis length and key sub-batches when applicable. Defaults to 0

  • batch_cast_to_array – A numpy type or sequence of types to cast each (sub-)batch to. None values indicate no casting should occur. batch_cast_to_array should take into account axis length and key sub-batches when applicable

  • batch_kwargs – Additional keyword arguments to pass to batch_data

  • batch_pad_mode – If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length

  • batch_size – The number of samples per (sub-)batch. Defaults to None, which means samples are served without batching

  • ignore_missing – If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError. Defaults to False

table_specifiers

A tuple of triples indicating (rspecifier, kaldi_dtype, open_kwargs) for each table

add_key

Whether a sub-batch of table keys has been prepended to existing sub-batches

axis_lengths

A tuple of pairs for each axis-length sub-batch requested. Each pair is (sub_batch_idx, axis).

batch_axis

A tuple of length num_sub indicating which axis (sub-)samples will be arrayed along in a given (sub-)batch when all (sub-)samples are (or are cast to) fixed length numpy arrays of the same type

batch_cast_to_array

A tuple of length num_sub indicating what numpy types, if any, (sub-)samples should be cast to. Values of None indicate no casting should be done on that (sub-)sample

batch_kwargs

Additional keyword arguments to pass to batch_data

batch_pad_mode

If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length

batch_size

The number of samples per (sub-)batch

ignore_missing

If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError

num_sub

The number of sub-batches per batch. If > 1, batches are yielded as tuples of sub-batches. This number accounts for key, table, and axis-length sub-batches

property num_samples

the number of samples yielded per epoch

This number takes into account the number of terms missing if self.ignore_missing == True

Type

int

sample_generator_for_epoch()[source]

A generator which yields individual samples from data for an epoch

An epoch means one pass through the data from start to finish. Equivalent to sample_generator(False).

Yields

sample (np.array or tuple) – A sample if self.num_sub == 1, otherwise a tuple of sub-samples

class pydrobert.kaldi.io.corpus.ShuffledData(table, *additional_tables, **kwargs)[source]

Bases: Data

Provides iterators over shuffled data

A master list of keys is either provided by keyword argument or inferred from the first table. Every new iterator requested shuffles that list of keys and returns batches in that order. Appropriate for training data.

Notes

For efficiency, it is highly recommended to use scripts to access tables rather than archives.
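The role of rng can be pictured with plain numpy. This is a sketch of the shuffling behaviour, not this class’s internals, and the keys are hypothetical:

```python
import numpy as np

key_list = ['utt1', 'utt2', 'utt3', 'utt4']
rng = np.random.RandomState(1234)  # a bare int seed is also accepted for rng

# Each requested iterator reshuffles the master key list before reading
epoch_order = list(key_list)
rng.shuffle(epoch_order)

# Same keys, new order; every key is still visited exactly once per epoch
assert sorted(epoch_order) == sorted(key_list)
```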

Parameters
  • table – The first table specifier

  • additional_tables – Table specifiers past the first. If not empty, will iterate over tuples of sub-batches

  • add_key – If True, will insert a sub-sample at index 0 of each sample tuple specifying the key the sample was indexed by. Defaults to False

  • axis_lengths – If set, sub-batches of axis lengths will be appended to the end of a batch tuple

  • batch_axis – The axis or axes (in the case of multiple tables) along which samples are stacked in (sub-)batches. batch_axis should take into account axis length and key sub-batches when applicable. Defaults to 0

  • batch_cast_to_array – A numpy type or sequence of types to cast each (sub-)batch to. None values indicate no casting should occur. batch_cast_to_array should take into account axis length and key sub-batches when applicable

  • batch_kwargs – Additional keyword arguments to pass to batch_data

  • batch_pad_mode – If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length

  • batch_size – The number of samples per (sub-)batch. Defaults to None, which means samples are served without batching

  • ignore_missing – If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError. Defaults to False

  • key_list – A master list of keys. No other keys will be queried. If not specified, the key list will be inferred by passing through the first table once

  • rng – Either a numpy.random.RandomState object or a seed to create one. It will be used to shuffle the list of keys

table_specifiers

A tuple of triples indicating (rspecifier, kaldi_dtype, open_kwargs) for each table

add_key

Whether a sub-batch of table keys has been prepended to existing sub-batches

axis_lengths

A tuple of pairs for each axis-length sub-batch requested. Each pair is (sub_batch_idx, axis).

batch_axis

A tuple of length num_sub indicating which axis (sub-)samples will be arrayed along in a given (sub-)batch when all (sub-)samples are (or are cast to) fixed length numpy arrays of the same type

batch_cast_to_array

A tuple of length num_sub indicating what numpy types, if any, (sub-)samples should be cast to. Values of None indicate no casting should be done on that (sub-)sample

batch_kwargs

Additional keyword arguments to pass to batch_data

batch_pad_mode

If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length

batch_size

The number of samples per (sub-)batch

ignore_missing

If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError

num_sub

The number of sub-batches per batch. If > 1, batches are yielded as tuples of sub-batches. This number accounts for key, table, and axis-length sub-batches

key_list

The master list of keys

rng

Used to shuffle the list of keys every epoch

table_holders

A tuple of table readers opened in random access mode

property num_samples

the number of samples yielded per epoch

This number takes into account the number of terms missing if self.ignore_missing == True

Type

int

sample_generator_for_epoch()[source]

A generator which yields individual samples from data for an epoch

An epoch means one pass through the data from start to finish. Equivalent to sample_generator(False).

Yields

sample (np.array or tuple) – A sample if self.num_sub == 1, otherwise a tuple of sub-samples

pydrobert.kaldi.io.corpus.batch_data(input_iter, subsamples=True, batch_size=None, axis=0, cast_to_array=None, pad_mode=None, **pad_kwargs)[source]

Generate batched data from an input generator

Takes some fixed number of samples from input_iter, encapsulates them, and yields them.

If subsamples is True, data from input_iter are expected to be encapsulated in fixed-length sequences (e.g. (feat, label, len)). Each sample will be batched separately into a sub-batch and returned in a tuple (e.g. (feat_batch, label_batch, len_batch)).

The format of a (sub-)batch depends on the properties of its samples:

  1. If cast_to_array applies to this sub-batch, cast it to a numpy array of the target type.

  2. If all samples in the (sub-)batch are numpy arrays of the same type and shape, samples are stacked in a bigger numpy array along the axis specified by axis (see Parameters).

  3. If all samples are numpy arrays of the same type but variable length and pad_mode is specified, pad all sample arrays on the right so they share the same (supremum) shape, then proceed as in 2.

  4. Otherwise, simply return a list of samples as-is (ignoring axis).
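Rules 2–4 can be mimicked with plain numpy. This is a standalone sketch of the batching logic, not a call to batch_data(), and sketch_batch is a hypothetical helper:

```python
import numpy as np

def sketch_batch(samples, axis=0, pad=False):
    """Mimic rules 2-4 for a single (sub-)batch of numpy samples."""
    shapes = {s.shape for s in samples}
    if len(shapes) == 1:
        return np.stack(samples, axis=axis)  # rule 2: stack directly
    if pad:                                  # rule 3: pad to supremum, then rule 2
        sup = tuple(max(dims) for dims in zip(*(s.shape for s in samples)))
        padded = [
            np.pad(s, [(0, t - d) for d, t in zip(s.shape, sup)])
            for s in samples
        ]
        return np.stack(padded, axis=axis)
    return list(samples)                     # rule 4: fall back to a plain list

same = [np.zeros(3), np.ones(3)]
ragged = [np.zeros(3), np.ones(5)]
assert sketch_batch(same).shape == (2, 3)
assert sketch_batch(ragged, pad=True).shape == (2, 5)
assert isinstance(sketch_batch(ragged), list)
```

(Rule 1, casting via cast_to_array, would simply convert each sample with np.asarray beforehand.)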

Parameters
  • input_iter (Iterator) – An iterator over samples

  • subsamples (bool) – input_iter yields tuples to be divided into different sub-batches if True

  • batch_size (Optional[int]) – The size of batches, except perhaps the last one. If not set or 0, will yield samples (casting and encapsulating in tuples when necessary)

  • axis (int) – Where to insert the batch index/indices into the shape/shapes of the inputs. If a sequence, subsamples must be True and input_iter should yield samples of the same length as axis. If an int and subsamples is True, the same axis will be used for all sub-samples.

  • cast_to_array (Union[dtype, Sequence, None]) – Dictates whether data should be cast to numpy arrays and of what type. If a sequence, subsamples must be True and input_iter should yield samples of the same length as cast_to_array. If a single value and subsamples is True, the same value will be used for all sub-samples. Value(s) of None indicate no casting should be done for this (sub-)sample. Other values will be used to cast (sub-)samples to numpy arrays

  • pad_mode (Union[str, Callable, None]) – If set, inputs within a batch will be padded on the end to match the largest shapes in the batch. How the inputs are padded matches the argument to numpy.pad(). If not set, will raise a ValueError if they don’t all have the same shape

  • pad_kwargs – Additional keyword arguments are passed along to numpy.pad() if padding.

See also

numpy.pad

For different pad modes and options