pydrobert.kaldi.io.corpus
Submodule for corpus iterators
- class pydrobert.kaldi.io.corpus.Data(table, *additional_tables, **kwargs)[source]
-
Metaclass for data iterables
A template for providing iterators over kaldi tables. They can be used like this
>>> data = DataSubclass(
...     'scp:feats.scp', 'scp:labels.scp', batch_size=10)
>>> for feat_batch, label_batch in data:
>>>     pass  # do something
>>> for feat_batch, label_batch in data:
>>>     pass  # do something again
Where DataSubclass is some subclass of this virtual class. Calling
iter()
on an instance (which occurs implicitly in for-loops) will generate a new iterator over the entire data set. The class takes an arbitrary positive number of positional arguments on initialization, each a table to open. Each argument is one of:
An rspecifier (ideally for a script file). Assumed to be of type
KaldiDataType.BaseMatrix
A sequence of length 2: the first element is the rspecifier, the second the rspecifier’s
KaldiDataType
A sequence of length 3: the first element is the rspecifier, the second the rspecifier’s
KaldiDataType
, and the third is a dictionary to be passed as keyword arguments to the pydrobert.kaldi.io.open() function
All tables are assumed to index data using the same keys.
If batch_size is set, data are stacked in batches along a new axis. The keyword arguments batch_axis, batch_pad_mode, and any remaining keywords are sent to this module’s
batch_data()
function. If batch_size is None or 0, samples are returned one-by-one. Data are always cast to numpy arrays before being returned. Consult that function for more information on batching.
If only one table is specified and neither axis_lengths nor add_key is specified, iterators will yield batches of the table's data directly. Otherwise, iterators yield "batches" of tuples containing "sub-batches" from each respective data source. Sub-batches belonging to the same batch share the same subset of ordered keys.
If add_key is True, a sub-batch of referent keys is added as the first element of a batch tuple.
For batched sequence-to-sequence tasks, it is often important to know the original length of data before padding. Setting axis_lengths adds one or more sub-batches to the end of a batch tuple with this information. These sub-batches are filled with signed 32-bit integers. axis_lengths can be one of:
An integer specifying an axis from the first table to get the lengths of.
A pair of integers. The first element is the table index, the second is the axis index in that table.
A sequence of pairs of integers. Sub-batches will be appended to the batch tuple in that order
Note that axes in axis_lengths index the axes in individual samples, not the batch. For instance, if batch_axis == 0 and axis_lengths == 0, then the last sub-batch will refer to the pre-padded value of sub-batch 0's axis 1 (batch[0].shape[1]).
The length of this object is the number of batches it serves per epoch.
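As a pure-numpy illustration of this axis mapping (a sketch with made-up sample shapes, not code from this library): when variable-length samples of shape (length, num_feats) are stacked along batch_axis == 0, a sample's axis 0 becomes axis 1 of the batch, and the length sub-batch records its pre-padded sizes:

```python
import numpy as np

# Hypothetical samples of shape (length, num_feats); lengths differ
samples = [np.ones((3, 5)), np.ones((7, 5))]
# Pre-padded lengths along each sample's axis 0, as signed 32-bit integers
lengths = np.array([s.shape[0] for s in samples], dtype=np.int32)
# Pad samples on the right to a common length, then stack along batch axis 0
max_len = max(s.shape[0] for s in samples)
padded = [np.pad(s, ((0, max_len - s.shape[0]), (0, 0))) for s in samples]
feat_batch = np.stack(padded, axis=0)
# Each sample's axis 0 is now the batch's axis 1
assert feat_batch.shape == (2, 7, 5)
assert lengths.tolist() == [3, 7]
```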
- Parameters
table – The first table specifier
additional_tables – Table specifiers past the first. If not empty, will iterate over tuples of sub-batches
add_key – If True, will insert sub-samples into the 0th index of each sample sequence that specify the key that this sample was indexed by. Defaults to False
axis_lengths – If set, sub-batches of axis lengths will be appended to the end of a batch tuple
batch_axis – The axis or axes (in the case of multiple tables) along which samples are stacked in (sub-)batches. batch_axis should take into account axis length and key sub-batches when applicable. Defaults to 0
batch_cast_to_array – A numpy type or sequence of types to cast each (sub-)batch to. None values indicate no casting should occur. batch_cast_to_array should take into account axis length and key sub-batches when applicable
batch_kwargs – Additional keyword arguments to pass to batch_data
batch_pad_mode – If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length
batch_size – The number of samples per (sub-)batch. Defaults to None, which means samples are served without batching
ignore_missing – If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError. Defaults to False
- table_specifiers
A tuple of triples indicating
(rspecifier, kaldi_dtype, open_kwargs)
for each table
- add_key
Whether a sub-batch of table keys has been prepended to existing sub-batches
- axis_lengths
A tuple of pairs for each axis-length sub-batch requested. Each pair is
(sub_batch_idx, axis)
.
- batch_axis
A tuple of length num_sub indicating which axis (sub-)samples will be arrayed along in a given (sub-)batch when all (sub-)samples are (or are cast to) fixed length numpy arrays of the same type
- batch_cast_to_array
A tuple of length num_sub indicating what numpy types, if any (sub-)samples should be cast to. Values of
None
indicate no casting should be done on that (sub-)sample
- batch_kwargs
Additional keyword arguments to pass to
batch_data
- batch_pad_mode
If set, pads samples in (sub-)batches according to this
numpy.pad()
strategy when samples do not have the same length
- batch_size
The number of samples per (sub-)batch
- ignore_missing
If
True
and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError
- num_sub
The number of sub-batches per batch. If > 1, batches are yielded as tuples of sub-batches. This number accounts for key, table, and axis-length sub-batches
- batch_generator(repeat=False)[source]
A generator which yields batches of data
- Parameters
repeat (bool) – Whether to stop generating after one epoch (False) or to restart and continue generating indefinitely (True)
- Yields
batch (np.array or tuple) – A batch if self.num_sub == 1, otherwise a tuple of sub-batches. If self.batch_size does not divide an epoch's worth of data evenly, the last batch of every epoch will be smaller
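The repeat semantics can be pictured with a toy generator (a sketch of the described behaviour, not the library's implementation): one pass per epoch when repeat is False, with a smaller final batch when batch_size does not divide the epoch evenly:

```python
def toy_batch_generator(samples, batch_size, repeat=False):
    """Yield lists of samples; the last batch of an epoch may be smaller."""
    while True:
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]
        if not repeat:
            break  # stop after one epoch when repeat is False

batches = list(toy_batch_generator(list(range(5)), batch_size=2))
# -> [[0, 1], [2, 3], [4]]
```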
- property num_batches
the number of batches yielded per epoch
This number takes into account the number of terms missing if
self.ignore_missing == True
- Type
int
- abstract property num_samples
the number of samples yielded per epoch
This number takes into account the number of terms missing if
self.ignore_missing == True
- Type
int
- abstract sample_generator_for_epoch()[source]
A generator which yields individual samples from data for an epoch
An epoch means one pass through the data from start to finish. Equivalent to sample_generator(False).
- Yields
sample (np.array or tuple) – A sample if self.num_sub == 1, otherwise a tuple of sub-samples
- class pydrobert.kaldi.io.corpus.SequentialData(table, *additional_tables, **kwargs)[source]
Bases:
Data
Provides iterators to read data sequentially
Tables are always assumed to be sorted so reading can proceed in lock-step.
Warning
Each time an iterator is requested, new sequential readers are opened. Be careful with stdin!
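Lock-step reading over sorted tables can be sketched with plain iterators (hypothetical keys and values, not this library's reader objects): because both streams are sorted by key, pairing them entry-by-entry keeps the keys aligned:

```python
# Two hypothetical sorted (key, value) streams, as sequential readers
# would yield them
feats = iter([("utt1", [0.1]), ("utt2", [0.2])])
labels = iter([("utt1", 3), ("utt2", 4)])

samples = []
for (feat_key, feat), (label_key, label) in zip(feats, labels):
    # Sorted tables stay aligned when read in lock-step
    assert feat_key == label_key
    samples.append((feat, label))
```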
- Parameters
table – The first table specifier
additional_tables – Table specifiers past the first. If not empty, will iterate over tuples of sub-batches
add_key – If True, will insert sub-samples into the 0th index of each sample sequence that specify the key that this sample was indexed by. Defaults to False
axis_lengths – If set, sub-batches of axis lengths will be appended to the end of a batch tuple
batch_axis – The axis or axes (in the case of multiple tables) along which samples are stacked in (sub-)batches. batch_axis should take into account axis length and key sub-batches when applicable. Defaults to 0
batch_cast_to_array – A numpy type or sequence of types to cast each (sub-)batch to. None values indicate no casting should occur. batch_cast_to_array should take into account axis length and key sub-batches when applicable
batch_kwargs – Additional keyword arguments to pass to batch_data
batch_pad_mode – If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length
batch_size – The number of samples per (sub-)batch. Defaults to None, which means samples are served without batching
ignore_missing – If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError. Defaults to False
- table_specifiers
A tuple of triples indicating
(rspecifier, kaldi_dtype, open_kwargs)
for each table
- add_key
Whether a sub-batch of table keys has been prepended to existing sub-batches
- axis_lengths
A tuple of pairs for each axis-length sub-batch requested. Each pair is
(sub_batch_idx, axis)
.
- batch_axis
A tuple of length num_sub indicating which axis (sub-)samples will be arrayed along in a given (sub-)batch when all (sub-)samples are (or are cast to) fixed length numpy arrays of the same type
- batch_cast_to_array
A tuple of length num_sub indicating what numpy types, if any (sub-)samples should be cast to. Values of
None
indicate no casting should be done on that (sub-)sample
- batch_kwargs
Additional keyword arguments to pass to
batch_data
- batch_pad_mode
If set, pads samples in (sub-)batches according to this
numpy.pad()
strategy when samples do not have the same length
- batch_size
The number of samples per (sub-)batch
- ignore_missing
If
True
and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError
- num_sub
The number of sub-batches per batch. If > 1, batches are yielded as tuples of sub-batches. This number accounts for key, table, and axis-length sub-batches
- property num_samples
the number of samples yielded per epoch
This number takes into account the number of terms missing if
self.ignore_missing == True
- Type
int
- sample_generator_for_epoch()[source]
A generator which yields individual samples from data for an epoch
An epoch means one pass through the data from start to finish. Equivalent to sample_generator(False).
- Yields
sample (np.array or tuple) – A sample if self.num_sub == 1, otherwise a tuple of sub-samples
- class pydrobert.kaldi.io.corpus.ShuffledData(table, *additional_tables, **kwargs)[source]
Bases:
Data
Provides iterators over shuffled data
A master list of keys is either provided by keyword argument or inferred from the first table. Every new iterator requested shuffles that list of keys and returns batches in that order. Appropriate for training data.
Notes
For efficiency, it is highly recommended to use scripts to access tables rather than archives.
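The per-epoch key shuffling can be sketched with numpy (hypothetical keys; as documented below, rng accepts a numpy.random.RandomState or a seed used to create one):

```python
import numpy as np

# Hypothetical master key list, as inferred from the first table
key_list = ["utt1", "utt2", "utt3", "utt4"]
rng = np.random.RandomState(1234)  # a bare seed would be wrapped like this

# Each requested iterator shuffles a copy of the master key list, then the
# tables are queried by random access in the shuffled order
epoch_keys = list(key_list)
rng.shuffle(epoch_keys)
assert sorted(epoch_keys) == sorted(key_list)  # same keys, new order
```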
- Parameters
table – The first table specifier
additional_tables – Table specifiers past the first. If not empty, will iterate over tuples of sub-batches
add_key – If True, will insert sub-samples into the 0th index of each sample sequence that specify the key that this sample was indexed by. Defaults to False
axis_lengths – If set, sub-batches of axis lengths will be appended to the end of a batch tuple
batch_axis – The axis or axes (in the case of multiple tables) along which samples are stacked in (sub-)batches. batch_axis should take into account axis length and key sub-batches when applicable. Defaults to 0
batch_cast_to_array – A numpy type or sequence of types to cast each (sub-)batch to. None values indicate no casting should occur. batch_cast_to_array should take into account axis length and key sub-batches when applicable
batch_kwargs – Additional keyword arguments to pass to batch_data
batch_pad_mode – If set, pads samples in (sub-)batches according to this numpy.pad() strategy when samples do not have the same length
batch_size – The number of samples per (sub-)batch. Defaults to None, which means samples are served without batching
ignore_missing – If True and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError. Defaults to False
key_list – A master list of keys. No other keys will be queried. If not specified, the key list will be inferred by passing through the first table once
rng – Either a
numpy.random.RandomState
object or a seed to create one. It will be used to shuffle the list of keys
- table_specifiers
A tuple of triples indicating
(rspecifier, kaldi_dtype, open_kwargs)
for each table
- add_key
Whether a sub-batch of table keys has been prepended to existing sub-batches
- axis_lengths
A tuple of pairs for each axis-length sub-batch requested. Each pair is
(sub_batch_idx, axis)
.
- batch_axis
A tuple of length num_sub indicating which axis (sub-)samples will be arrayed along in a given (sub-)batch when all (sub-)samples are (or are cast to) fixed length numpy arrays of the same type
- batch_cast_to_array
A tuple of length num_sub indicating what numpy types, if any (sub-)samples should be cast to. Values of
None
indicate no casting should be done on that (sub-)sample
- batch_kwargs
Additional keyword arguments to pass to
batch_data
- batch_pad_mode
If set, pads samples in (sub-)batches according to this
numpy.pad()
strategy when samples do not have the same length
- batch_size
The number of samples per (sub-)batch
- ignore_missing
If
True
and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError
- num_sub
The number of sub-batches per batch. If > 1, batches are yielded as tuples of sub-batches. This number accounts for key, table, and axis-length sub-batches
- key_list
The master list of keys
- rng
Used to shuffle the list of keys every epoch
- table_holders
A tuple of table readers opened in random access mode
- property num_samples
the number of samples yielded per epoch
This number takes into account the number of terms missing if
self.ignore_missing == True
- Type
int
- pydrobert.kaldi.io.corpus.batch_data(input_iter, subsamples=True, batch_size=None, axis=0, cast_to_array=None, pad_mode=None, **pad_kwargs)[source]
Generate batched data from an input generator
Takes some fixed number of samples from input_iter, encapsulates them, and yields them.
If subsamples is True, data from input_iter are expected to be encapsulated in fixed-length sequences (e.g. (feat, label, len)). Each sample will be batched separately into a sub-batch and returned in a tuple (e.g. (feat_batch, label_batch, len_batch)).
The format of a (sub-)batch depends on the properties of its samples:
1. If cast_to_array applies to this sub-batch, cast it to a numpy array of the target type.
2. If all samples in the (sub-)batch are numpy arrays of the same type and shape, samples are stacked in a bigger numpy array along the axis specified by axis (see Parameters).
3. If all samples are numpy arrays of the same type but variable length and pad_mode is specified, pad all sample arrays to the right such that they all have the same (supremum) shape, then perform step 2.
4. Otherwise, simply return a list of samples as-is (ignoring axis).
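The stack-or-pad decision above can be sketched in numpy (an illustration of the described rules for same-type arrays, omitting the cast and list fall-through cases; not the function's actual implementation):

```python
import numpy as np

def form_batch(samples, axis=0, pad_mode=None, **pad_kwargs):
    """Stack same-shape arrays; pad then stack if pad_mode is set; else raise."""
    shapes = {s.shape for s in samples}
    if len(shapes) == 1:
        return np.stack(samples, axis=axis)  # rule 2: uniform shapes
    if pad_mode is not None:
        # rule 3: pad each array on the right up to the supremum shape
        sup = tuple(max(dims) for dims in zip(*(s.shape for s in samples)))
        padded = [
            np.pad(s, [(0, t - d) for d, t in zip(s.shape, sup)],
                   mode=pad_mode, **pad_kwargs)
            for s in samples
        ]
        return np.stack(padded, axis=axis)  # then rule 2 on the padded arrays
    raise ValueError("samples have different shapes and pad_mode is not set")

batch = form_batch([np.ones((2, 3)), np.ones((4, 3))], pad_mode="constant")
assert batch.shape == (2, 4, 3)
```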
- Parameters
input_iter (Iterator) – An iterator over samples
subsamples (bool) – input_iter yields tuples to be divided into different sub-batches if True
batch_size (Optional[int]) – The size of batches, except perhaps the last one. If not set or 0, will yield samples (casting and encapsulating in tuples when necessary)
axis (int) – Where to insert the batch index/indices into the shape/shapes of the inputs. If a sequence, subsamples must be True and input_iter should yield samples of the same length as axis. If an int and subsamples is True, the same axis will be used for all sub-samples.
cast_to_array (Union[dtype, Sequence, None]) – Dictates whether data should be cast to numpy arrays and of what type. If a sequence, subsamples must be True and input_iter should yield samples of the same length as cast_to_array. If a single value and subsamples is True, the same value will be used for all sub-samples. Value(s) of None indicate no casting should be done for this (sub-)sample. Other values will be used to cast (sub-)samples to numpy arrays
pad_mode (Union[str, Callable, None]) – If set, inputs within a batch will be padded on the end to match the largest shapes in the batch. How the inputs are padded matches the argument to numpy.pad(). If not set, will raise a ValueError if they don't all have the same shape
pad_kwargs – Additional keyword arguments are passed along to numpy.pad() if padding.
See also
numpy.pad
For different pad modes and options