pydrobert-kaldi
Some Kaldi bindings for Python. I started this project because I wanted to seamlessly incorporate Kaldi's I/O mechanism into the gamut of Python-based data science packages (e.g. Theano, Tensorflow, CNTK, PyTorch, etc.). The code base is expanding to wrap more of Kaldi's feature processing and mathematical functions, and I eventually plan on adding hooks for Kaldi audio features and pre-/post-processing. However, I have no plans on porting any code involving modelling or decoding.
This is student-driven code, so don’t expect a stable API. I’ll try to use semantic versioning, but the best way to keep functionality stable is by forking.
Documentation
Input/Output
Most I/O can be performed with the `pydrobert.kaldi.io.open` function:

```python
from pydrobert.kaldi import io
with io.open('scp:foo.scp', 'bm') as f:
    for matrix in f:
        ...
```
`open` is a factory function that determines the appropriate underlying stream to open, much like Python's built-in `open`. The data types we can read (here, a `BaseMatrix`) are listed in `pydrobert.kaldi.io.enums.KaldiDataType`. Big data types, like matrices and vectors, are piped into Numpy arrays. Passing an extended filename (e.g. paths to files on disk, `'-'` for stdin/stdout, `'gzip -c a.ark.gz |'`, etc.) opens a stream from which data types can be read one-by-one and in the order they were written. Alternatively, prepending the extended filename with `'ark[,option_a[,option_b...]]:'` or `'scp[,...]:'` and specifying a data type allows one to open a Kaldi table for iterator-like sequential reading (`mode='r'`), dict-like random access reading (`mode='r+'`), or writing (`mode='w'`). For more information on the `open` function, consult its docstring.
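As a quick sketch of the table modes, here's writing a table and reading it back with random access (the archive name and keys are placeholders):

```python
import numpy as np
from pydrobert.kaldi import io

# write a table of base vectors ('bv'), one entry per utterance key
with io.open('ark:vecs.ark', 'bv', mode='w') as writer:
    writer.write('utt1', np.arange(3, dtype=np.float32))
    writer.write('utt2', np.zeros(5, dtype=np.float32))

# read it back, dict-style
with io.open('ark:vecs.ark', 'bv', mode='r+') as reader:
    print(reader['utt2'])
```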
The submodule `pydrobert.kaldi.io.corpus` contains useful wrappers around Kaldi I/O to serve up batches of data to, say, a neural network:

```python
from pydrobert.kaldi.io.corpus import ShuffledData

train = ShuffledData('scp:feats.scp', 'scp:labels.scp', batch_size=10)
for feat_batch, label_batch in train:
    ...
```
Logging and CLI
By default, Kaldi error, warning, and critical messages are piped to standard error. The `pydrobert.kaldi.logging` submodule provides hooks into Python's native logging interface: the `logging` module. The `KaldiLogger` can handle stack traces from Kaldi C++ code, and there are a variety of decorators to finagle the Kaldi logging patterns to Python logging patterns, or vice versa.
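For instance, here's a minimal sketch of routing Kaldi's messages to your own logger via `register_logger_for_kaldi`:

```python
import logging
from pydrobert.kaldi.logging import register_logger_for_kaldi

logger = logging.getLogger('my_app')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

# once at least one logger is registered, Kaldi messages are no longer
# dumped to stderr; they go through the logging module instead
register_logger_for_kaldi(logger)
```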
You'd likely want to explicitly handle logging when creating new Kaldi-style commands for the command line. `pydrobert.kaldi.io.argparse` provides `KaldiParser`, an `ArgumentParser` tailored to Kaldi inputs/outputs. It is used by a few command-line entry points added by this package. See the Command-Line Interface section for details.
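A minimal sketch of a hypothetical entry point built on `KaldiParser` (the argument names are illustrative):

```python
from pydrobert.kaldi.io.argparse import KaldiParser

parser = KaldiParser(description='Do something Kaldi-flavoured')
parser.add_argument('rspecifier', type='kaldi_rspecifier', help='table to read')
parser.add_argument('wspecifier', type='kaldi_wspecifier', help='table to write')
args = parser.parse_args(['ark:in.ark', 'ark:out.ark'])
```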
Installation
Prepackaged binaries of tagged versions of `pydrobert-kaldi` are available for most 64-bit platforms (Windows, Glibc Linux, OSX) and most active Python versions (3.7-3.11) on both conda and PyPI.

To install via conda-forge:

```sh
conda install -c conda-forge pydrobert-kaldi
```

If you only want to rely on Anaconda dependencies, you can install from the `sdrobert` channel instead. There is not yet a 3.11 build there.

To install via PyPI:

```sh
pip install pydrobert-kaldi
```

You can also try building the cutting-edge version. To do so, you'll need to first install SWIG 4.0 and an appropriate C++ compiler, then

```sh
pip install git+https://github.com/sdrobert/pydrobert-kaldi.git
```

The current version does not require a BLAS install, though it likely will in the future as more is wrapped.
License
This code is licensed under Apache 2.0.

Code found under the `src/` directory has been primarily copied from Kaldi. `setup.py` is also strongly influenced by Kaldi's build configuration. Kaldi is also covered by the Apache 2.0 license; its specific license file was copied into `src/COPYING_Kaldi_Project` to live among its fellows.

How to Cite
Please see the pydrobert page for more details.
Command-Line Interface
write-table-to-pickle

```
write-table-to-pickle -h
usage: write-table-to-pickle [-h] [-v VERBOSE] [--config CONFIG] [--print-args PRINT_ARGS] [-i IN_TYPE] [-o OUT_TYPE] rspecifier value_out [key_out]

Write a kaldi table to pickle file(s)

The inverse is write-pickle-to-table

positional arguments:
  rspecifier            The table to read
  value_out             A path to write (key,value) pairs to, or just values if key_out was set. If it ends in ".gz", the file will be gzipped
  key_out               A path to write keys to. If it ends in ".gz", the file will be gzipped

optional arguments:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose VERBOSE
                        Verbose level (higher->more logging)
  --config CONFIG
  --print-args PRINT_ARGS
  -i IN_TYPE, --in-type IN_TYPE
                        The type of kaldi data type to read. Defaults to base matrix
  -o OUT_TYPE, --out-type OUT_TYPE
                        The numpy data type to cast values to. The default is dependent on the input type. String types will be written as (tuples of) strings
```
write-pickle-to-table

```
write-pickle-to-table -h
usage: write-pickle-to-table [-h] [-v VERBOSE] [--config CONFIG] [--print-args PRINT_ARGS] [-o OUT_TYPE] value_in [key_in] wspecifier

Write pickle file(s) contents to a table

The inverse is write-table-to-pickle

positional arguments:
  value_in              A path to read (key,value) pairs from, or just values if key_in was set. If it ends in ".gz", the file is assumed to be gzipped
  key_in                A path to read keys from. If it ends in ".gz", the file is assumed to be gzipped
  wspecifier            The table to write to

optional arguments:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose VERBOSE
                        Verbose level (higher->more logging)
  --config CONFIG
  --print-args PRINT_ARGS
  -o OUT_TYPE, --out-type OUT_TYPE
                        The type of kaldi data type to write. Defaults to base matrix
```
compute-error-rate

```
compute-error-rate -h
usage: compute-error-rate [-h] [-v VERBOSE] [--config CONFIG] [--print-args PRINT_ARGS] [--print-tables PRINT_TABLES] [--strict STRICT] [--insertion-cost INSERTION_COST]
                          [--deletion-cost DELETION_COST] [--substitution-cost SUBSTITUTION_COST] [--include-inserts-in-cost INCLUDE_INSERTS_IN_COST]
                          [--report-accuracy REPORT_ACCURACY]
                          ref_rspecifier hyp_rspecifier [out_path]

Compute error rates between reference and hypothesis token vectors

Two common error rates in speech are the word (WER) and phone (PER), though the
computation is the same. Given a reference and hypothesis sequence, the error rate
is

  error_rate = 100 * (substitutions + insertions + deletions) / ref_tokens

where the numbers of substitutions (e.g. "A B C -> A D C"), deletions (e.g. "A B C ->
A C"), and insertions (e.g. "A B C -> A D B C") are determined by the Levenshtein
distance.

positional arguments:
  ref_rspecifier        Rspecifier pointing to reference (gold standard) transcriptions
  hyp_rspecifier        Rspecifier pointing to hypothesis transcriptions
  out_path              Path to print results to. Default is stdout.

optional arguments:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose VERBOSE
                        Verbose level (higher->more logging)
  --config CONFIG
  --print-args PRINT_ARGS
  --print-tables PRINT_TABLES
                        If set, will print breakdown of insertions, deletions, and subs to out_path
  --strict STRICT       If set, missing utterances will cause an error
  --insertion-cost INSERTION_COST
                        Cost (in terms of edit distance) to perform an insertion
  --deletion-cost DELETION_COST
                        Cost (in terms of edit distance) to perform a deletion
  --substitution-cost SUBSTITUTION_COST
                        Cost (in terms of edit distance) to perform a substitution
  --include-inserts-in-cost INCLUDE_INSERTS_IN_COST
                        Whether to include insertions in error rate calculations
  --report-accuracy REPORT_ACCURACY
                        Whether to report accuracy (1 - error_rate) instead of the error rate
```
normalize-feat-lens

```
normalize-feat-lens -h
usage: normalize-feat-lens [-h] [-v VERBOSE] [--config CONFIG] [--print-args PRINT_ARGS] [--type TYPE] [--tolerance TOLERANCE] [--strict STRICT]
                           [--pad-mode {zero,constant,edge,symmetric,mean}] [--side {left,right,center}]
                           feats_in_rspecifier len_in_rspecifier feats_out_wspecifier

Ensure features match some reference lengths

Incoming features are either clipped or padded to match reference lengths (stored as
an int32 table), if they are within tolerance.

positional arguments:
  feats_in_rspecifier   The features to be normalized
  len_in_rspecifier     The reference lengths (int32 table)
  feats_out_wspecifier  The output features

optional arguments:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose VERBOSE
                        Verbose level (higher->more logging)
  --config CONFIG
  --print-args PRINT_ARGS
  --type TYPE           The kaldi type of the input/output features
  --tolerance TOLERANCE
                        How many frames deviation from reference to tolerate before error. The default is to be infinitely tolerant (a feat I'm sure we all desire)
  --strict STRICT       Whether missing keys in len_in and lengths beyond the threshold cause an error (true) or are skipped with a warning (false)
  --pad-mode {zero,constant,edge,symmetric,mean}
                        If frames are being padded to the features, specify how they should be padded. zero=zero pad, edge=pad with rightmost frame, symmetric=pad with
                        reverse of frame edges, mean=pad with mean feature values
  --side {left,right,center}
                        If an utterance needs to be padded or truncated, specify what side of the utterance to do this on. left=beginning, right=end, center=distribute
                        evenly on either side
```
write-table-to-torch-dir

```
write-table-to-torch-dir -h
usage: write-table-to-torch-dir [-h] [-v VERBOSE] [--config CONFIG] [--print-args PRINT_ARGS] [-i IN_TYPE] [-o {float,double,half,byte,char,short,int,long}]
                                [--file-prefix FILE_PREFIX] [--file-suffix FILE_SUFFIX]
                                rspecifier dir

Write a Kaldi table to a series of PyTorch data files in a directory

Writes to a folder in the format:

    folder/
        <file_prefix><key_1><file_suffix>
        <file_prefix><key_2><file_suffix>
        ...

The contents of the file "<file_prefix><key_1><file_suffix>" will be a PyTorch
tensor corresponding to the entry in the table for "<key_1>"

positional arguments:
  rspecifier            The table to read
  dir                   The folder to write files to

optional arguments:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose VERBOSE
                        Verbose level (higher->more logging)
  --config CONFIG
  --print-args PRINT_ARGS
  -i IN_TYPE, --in-type IN_TYPE
                        The type of table to read
  -o {float,double,half,byte,char,short,int,long}, --out-type {float,double,half,byte,char,short,int,long}
                        The type of torch tensor to write. If unset, it is inferred from the input type
  --file-prefix FILE_PREFIX
                        The file prefix indicating a torch data file
  --file-suffix FILE_SUFFIX
                        The file suffix indicating a torch data file
```
write-torch-dir-to-table

```
write-torch-dir-to-table -h
usage: write-torch-dir-to-table [-h] [-v VERBOSE] [--config CONFIG] [--print-args PRINT_ARGS] [-o OUT_TYPE] [--file-prefix FILE_PREFIX] [--file-suffix FILE_SUFFIX]
                                dir wspecifier

Write a data directory containing PyTorch data files to a Kaldi table

Reads from a folder in the format:

    folder/
        <file_prefix><key_1><file_suffix>
        <file_prefix><key_2><file_suffix>
        ...

Where each file contains a PyTorch tensor. The contents of the file
"<file_prefix><key_1><file_suffix>" will be written as a value in a Kaldi table with
key "<key_1>"

positional arguments:
  dir                   The folder to read files from
  wspecifier            The table to write to

optional arguments:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose VERBOSE
                        Verbose level (higher->more logging)
  --config CONFIG
  --print-args PRINT_ARGS
  -o OUT_TYPE, --out-type OUT_TYPE
                        The type of table to write to
  --file-prefix FILE_PREFIX
                        The file prefix indicating a torch data file
  --file-suffix FILE_SUFFIX
                        The file suffix indicating a torch data file
```
Locale and Kaldi
After v0.6.0, `pydrobert.kaldi.io` no longer issues a `KaldiLocaleWarning` when the system locale doesn't match the POSIX standard. The long story short is that locale shouldn't matter much to what pydrobert-kaldi does, so I no longer bug you about it. If you're hunting an error, however, read on.
Most Kaldi shell scripts presume

```sh
export LC_ALL=C
```

has been called some time prior to running the current script. This sets the locale to POSIX-style, which is going to ensure your various shell commands sort stuff like C does. The Kaldi codebase is written in C++, so it's definitely going to sort this way. Here's an example of some weirdness involving the `"s"` flag in the table rspecifier. It basically tells Kaldi that table entries are in sorted order, which allows Kaldi to take some shortcuts to save on read/write costs.
```sh
# I've previously installed the German and Russian locales on Ubuntu:
# sudo locale-gen de_DE
# sudo locale-gen ru_RU
export LC_ALL=C
python -c "print('f\xe4n a'); print('foo b')" | \
    sort | \
    python -c "
from pydrobert.kaldi.io import open as kopen
with kopen('ark,s:-', 't', 'r+') as f:
    print(f['foo'])
"
# outputs: b
# sort sorts C-style ("foo" first), kaldi sorts C-style

python -c "print('f\xe4n a'); print('foo b')" | \
    LC_ALL=de_DE sort | \
    python -c "
from pydrobert.kaldi.io import open as kopen
with kopen('ark,s:-', 't', 'r+') as f:
    print(f['foo'])
"
# KeyError: 'foo'
# sort sorts German ("fän" first), kaldi sorts C-style

python -c "print('f\xe4n a'); print('foo b')" | \
    sort | \
    LC_ALL=de_DE python -c "
from pydrobert.kaldi.io import open as kopen
with kopen('ark,s:-', 't', 'r+') as f:
    print(f['foo'])
"
# outputs: b
# sort sorts C-style, kaldi ignores German encoding and sorts C-style
```
These examples will lead to exceptions which can be caught and debugged. One can come up with more insidious errors which don’t fail, mind you.
For the most part, however, this is a non-issue, at least for pydrobert-kaldi. The only situation the library might mess up in that I know of involves sorting table keys, and the table keys are (as far as I can tell) exclusively ASCII. Also as far as I can tell, even locales which contain characters visually identical to those in the Latin alphabet are nonetheless encoded outside of the ASCII range. For example:
```sh
export LC_ALL=C
echo $'M\nC' | LC_ALL=ru_RU sort
# outputs: C, M
# these are the ASCII characters
echo $'М\nС' | LC_ALL=ru_RU sort
# outputs: М, С
# these are the Cyrillic characters 'U+041C' and 'U+0421', respectively
```
Besides UTF-8, ISO-8859-1 also maintains a contiguous ASCII range. Technically there's no guarantee that this will be the case for all encodings, though any such encoding would probably break all sorts of legacy code. If you have a counterexample of a Kaldi recipe that does otherwise, please let me know and I'll mention it here.
Other than that, the library is quite agnostic to locale. An error involving locales is, more likely than not, something that occurred before or after the library was called.
pydrobert.kaldi API
Python access to kaldi
pydrobert.kaldi.eval
Tools related to evaluating models
pydrobert.kaldi.eval.util
Utilities for evaluation
- pydrobert.kaldi.eval.util.edit_distance(ref, hyp, insertion_cost=1, deletion_cost=1, substitution_cost=1, return_tables=False)[source]

  Levenshtein (edit) distance

  Parameters:
  - ref (Sequence) – Sequence of tokens of reference text (source)
  - hyp (Sequence) – Sequence of tokens of hypothesis text (target)
  - insertion_cost (int) – Penalty for hyp inserting a token to ref
  - deletion_cost (int) – Penalty for hyp deleting a token from ref
  - substitution_cost (int) – Penalty for hyp swapping tokens in ref
  - return_tables (bool) – See below

  Returns:
  - distances (int or (int, dict, dict, dict, dict)) – Returns the edit distance of hyp from ref. If return_tables is True, this returns a tuple of the edit distance, a dict of insertion counts, a dict of deletion counts, a dict of substitution counts per ref token, and a dict of counts of ref tokens. Any tokens with count 0 are excluded from the dictionaries.
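For illustration, a small example (the exact keying of the count tables is my reading of the docstring above):

```python
from pydrobert.kaldi.eval.util import edit_distance

ref = 'A B C'.split()
hyp = 'A D B C'.split()  # one insertion ('D')
print(edit_distance(ref, hyp))  # 1
dist, ins, dels, subs, totals = edit_distance(ref, hyp, return_tables=True)
```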
pydrobert.kaldi.io
Interfaces for Kaldi's readers and writers

This subpackage contains a factory function, `open()`, which is intended to behave similarly to Python's built-in `open()` factory. `open()` gives the specifics behind Kaldi's different read/write styles. Here, they are described in a general way.

Kaldi's streams can be very exotic, including regular files, file offsets, stdin/out, and pipes.

Data can be read/written from a binary or text stream in the usual way: specific data types have specific encodings, and data are packed/unpacked in that fashion. While this is an appropriate style for a fixed sequence of data, variable sequences of data are encoded using the table analogy.

Kaldi uses the table analogy to store and retrieve indexed data. In a nutshell, Kaldi uses archive ("ark") files to store binary or text data, and script files ("scp") to point into archives. Both use whitespace-free strings as keys. Scripts and archives do not have any built-in type checking, so it is necessary to specify the input/output type when the files are opened.

A full account of Kaldi I/O can be found on Kaldi's website under Kaldi I/O Mechanisms.

See also
- pydrobert.kaldi.io.enums.KaldiDataType – For more information on the types of streams that can be read or written
- class pydrobert.kaldi.io.KaldiIOBase(path)[source]

  Bases: object

  IOBase for kaldi readers and writers

  Similar to `io.IOBase`, but without a lot of the assumed functionality.

  Parameters:
  - path (str) – The path passed to `pydrobert.kaldi.io.open()`. One of an rspecifier, wspecifier, rxfilename, or wxfilename

  Attributes:
  - path – The opened path
  - table_type – The type of table that's being read/written (or `NotATable`)
  - xfilenames – The extended file names being read/written. For tables, this excludes the `'ark:'` and `'scp:'` prefixes from path. Usually there will be only one extended file name, unless the path uses the special `'ark,scp:'` format to write both an archive and script at the same time
  - xtypes – The type of extended file name opened. Usually there will be only one extended file name, unless the path uses the special `'ark,scp:'` format to write both an archive and script at the same time
  - binary – Whether this stream encodes binary data (or text)
  - closed – Whether this stream is closed
  - permissive – Whether invalid values will be treated as non-existent (tables only)
  - once – Whether each entry will only be read once (readable tables only)
  - sorted – Whether keys are sorted (readable tables only)
  - called_sorted – Whether entries will be read in sorted order (readable tables only)
  - background – Whether reading is not being performed on the main thread (readable tables only)
  - flush – Whether the stream is flushed after each write operation (writable tables only)
- pydrobert.kaldi.io.open(path, kaldi_dtype=None, mode='r', error_on_str=True, utt2spk='', value_style='b', header=True, cache=False)[source]

  Factory function for initializing and opening kaldi streams

  This function provides a general interface for opening kaldi streams. Kaldi streams are either simple input/output of kaldi objects (the basic/duck stream) or key-value readers and writers (tables).

  When path starts with `'ark:'` or `'scp:'` (possibly with modifiers before the colon), a table is opened. Otherwise, a basic stream is opened.

  See also
  - pydrobert.kaldi.io.table_streams.open_table_stream – For information on opening tables
  - pydrobert.kaldi.io.duck_streams.open_duck_stream – For information on opening basic streams
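A minimal sketch of the basic (duck) stream side of `open()`; it assumes string data type values (e.g. `'b'`, `'iv'`) are accepted the same way tables accept them:

```python
from pydrobert.kaldi import io

# no 'ark:' or 'scp:' prefix, so this opens a basic (duck) stream
with io.open('objects.bin', mode='w') as writer:
    writer.write(3.14, 'b')        # a single base float
    writer.write([1, 2, 3], 'iv')  # an int32 vector

# objects come back one at a time, in the order they were written
with io.open('objects.bin') as reader:
    x = reader.read('b')
    v = reader.read('iv')
```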
pydrobert.kaldi.io.argparse
Contains a custom ArgumentParser, KaldiParser, and a number of arg types
- class pydrobert.kaldi.io.argparse.KaldiParser(prog=None, usage=None, description=None, epilog=None, parents=(), formatter_class=<class 'argparse.HelpFormatter'>, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True, add_verbose=True, add_config=True, update_formatters=True, add_print_args=True, logger=None, version=None)[source]

  Bases: ArgumentParser

  Kaldi-compatible wrapper for argument parsing

  KaldiParser intends to make command-line entry points in python more compatible with kaldi command-line scripts. It makes the following changes to `argparse.ArgumentParser`:

  - Creates a `logging.Formatter` instance that formats messages similarly to kaldi using the prog keyword as the program name
  - Sets the default help and usage locations to `sys.stderr` (instead of `sys.stdout`)
  - Registers `'kaldi_bool'`, `'kaldi_rspecifier'`, `'kaldi_wspecifier'`, `'kaldi_wxfilename'`, `'kaldi_rxfilename'`, `'kaldi_config'`, `'kaldi_dtype'`, and `'numpy_dtype'` as argument types
  - Registers `'kaldi_verbose'` as an action
  - Adds logger, update_formatters, add_config, and add_verbose parameters to initialization (see below)
  - Wraps parse_args and parse_known_args with `kaldi_vlog_level_cmd_decorator` (so loggers use the right level names on error)

  KaldiParser differs from kaldi's command line parsing in a few key ways. First, though '=' syntax is supported, the parser will also group using command-line splitting (on unquoted whitespace). For the `KaldiParser`, `--foo bar` and `--foo=bar` are equivalent (assuming foo takes one optional argument), whereas, in Kaldi, `--foo bar` would be parsed as the boolean flag `--foo` followed by a positional with value `bar`. This ambiguity is the source of the next difference: boolean flags. Because kaldi command-line parsing splits around `=`, it can use `--foo=true` and `--foo` interchangeably. To avoid gobbling up a positional argument, `KaldiParser` allows for only one type of boolean flag syntax. For the former, use `action='store_true'` in add_argument. For the latter, use `type='kaldi_bool'`.

  Parameters:
  - prog (Optional[str]) – Name of the program. Defaults to `sys.argv[0]`
  - usage (Optional[str]) – A usage message. Default: auto-generated from arguments
  - description (Optional[str]) – A description of what the program does
  - epilog (Optional[str]) – Text following the argument descriptions
  - parents (Sequence[ArgumentParser]) – Parsers whose arguments should be copied into this one
  - formatter_class (type) – Class for printing help messages
  - prefix_chars (str) – Characters that prefix optional arguments
  - fromfile_prefix_chars (Optional[str]) – Characters that prefix files containing additional arguments
  - argument_default (Optional[Any]) – The default value for all arguments
  - conflict_handler (str) – String indicating how to handle conflicts
  - add_help (bool) – Add a `-h/--help` option
  - add_verbose (bool) – Add a `-v/--verbose` option. The option requires an integer argument specifying a verbosity level at the same degrees as Kaldi. The level will be converted to the appropriate python level when parsed
  - add_config (bool) – Whether to add the standard `--config` option to the parser. If `True`, a first pass will extract all config file options and put them at the beginning of the argument string to be re-parsed
  - add_print_args (bool) – Whether to add the standard `--print-args` option to the parser. If `True`, a first pass will search for the value of `--print-args` and, if `True`, will print the arguments to stderr (only on parse_args, not parse_known_args)
  - update_formatters (bool) – If logger is set, the logger's handlers' formatters will be set to a kaldi-style formatter
  - logger (Optional[Logger]) – Errors will be written to this logger when parse_args fails. If add_verbose has been set to `True`, the logger will be set to the appropriate python level if verbose is set (note: the logger will be set to the default level - `INFO` - on initialization)
  - version (Optional[str]) – A version string to use for logs. If not set, `pydrobert.kaldi.__version__` will be used by default

  Attributes:
  - logger – The logger this parser prints out to
  - formatter – A log formatter that formats with kaldi-style headers
  - add_config – Whether this parser has a `--config` flag
  - add_print_args – Whether this parser has a `--print-args` flag
  - version – Version string used by this parser and logger

  - error(message)[source]

    Prints a usage message incorporating the message to stderr and exits.

    If you override this in a subclass, it should not return – it should either exit or raise an exception.

  - parse_known_args(**kwargs)
- class pydrobert.kaldi.io.argparse.KaldiVerbosityAction(option_strings, dest, default=20, required=False, help='Verbose level (higher->more logging)', metavar=None)[source]

  Bases: Action

  Read kaldi-style verbosity levels, setting logger to python level

  Kaldi verbosities tend to range from [-3, 9]. This action takes in a kaldi verbosity level and converts it to python logging levels with `pydrobert.kaldi.logging.kaldi_lvl_to_logging_lvl()`.

  If the parser has a logger attribute, the logger will be set to the new level.
- pydrobert.kaldi.io.argparse.kaldi_bool_arg_type(string)[source]

  Argument type for bool strings of "true", "t", "false", or "f"

- pydrobert.kaldi.io.argparse.kaldi_config_arg_type(string)[source]

  Encapsulate parse_kaldi_config_file as an argument type

- pydrobert.kaldi.io.argparse.kaldi_dtype_arg_type(string)[source]

  Argument type for string representations of KaldiDataType

- pydrobert.kaldi.io.argparse.kaldi_rspecifier_arg_type(string)[source]

  Argument type to make sure string is a valid rspecifier

- pydrobert.kaldi.io.argparse.kaldi_rxfilename_arg_type(string)[source]

  Argument type to make sure string is a valid extended readable file

- pydrobert.kaldi.io.argparse.kaldi_wspecifier_arg_type(string)[source]

  Argument type to make sure string is a valid wspecifier

- pydrobert.kaldi.io.argparse.kaldi_wxfilename_arg_type(string)[source]

  Argument type to make sure string is a valid extended writable file

- pydrobert.kaldi.io.argparse.numpy_dtype_arg_type(string)[source]

  Argument type for string representations of numpy dtypes

- pydrobert.kaldi.io.argparse.parse_kaldi_config_file(file_path, allow_space=True)[source]

  Return a list of arguments from a kaldi config file

  Parameters:
  - file_path (str) – Points to the config file in question
  - allow_space (bool, optional) – If `True`, treat the first space on a line as splitting key and value if no equals sign exists on the line. If `False`, a line with no equals sign is treated as one whole chunk (as if a boolean flag). Kaldi does not split on spaces, but python does. Note that allow_space does not split the entire line on spaces, unlike shell arguments.
pydrobert.kaldi.io.corpus
Submodule for corpus iterators
- class pydrobert.kaldi.io.corpus.Data(table, *additional_tables, **kwargs)[source]

  Metaclass for data iterables

  A template for providing iterators over kaldi tables. They can be used like this:

  >>> data = DataSubclass(
  ...     'scp:feats.scp', 'scp:labels.scp', batch_size=10)
  >>> for feat_batch, label_batch in data:
  >>>     pass  # do something
  >>> for feat_batch, label_batch in data:
  >>>     pass  # do something again

  Where DataSubclass is some subclass of this virtual class. Calling `iter()` on an instance (which occurs implicitly in for-loops) will generate a new iterator over the entire data set.

  The class takes an arbitrary positive number of positional arguments on initialization, each a table to open. Each argument is one of:

  - An rspecifier (ideally for a script file). Assumed to be of type `KaldiDataType.BaseMatrix`
  - A sequence of length 2: the first element is the rspecifier, the second the rspecifier's `KaldiDataType`
  - A sequence of length 3: the first element is the rspecifier, the second the rspecifier's `KaldiDataType`, and the third is a dictionary to be passed as keyword arguments to the `pydrobert.kaldi.io.open()` function

  All tables are assumed to index data using the same keys.

  If batch_size is set, data are stacked in batches along a new axis. The keyword arguments batch_axis, batch_pad_mode, and any remaining keywords are sent to this module's `batch_data()` function. If batch_size is `None` or `0`, samples are returned one-by-one. Data are always cast to numpy arrays before being returned. Consult that function for more information on batching.

  If only one table is specified and neither axis_lengths nor add_key is specified, iterators will be of a batch of the table's data directly. Otherwise, iterators yield "batches" of tuples containing "sub-batches" from each respective data source. Sub-batches belonging to the same batch share the same subset of ordered keys.

  If add_key is `True`, a sub-batch of referent keys is added as the first element of a batch tuple.

  For batched sequence-to-sequence tasks, it is often important to know the original length of data before padding. Setting axis_lengths adds one or more sub-batches to the end of a batch tuple with this information. These sub-batches are filled with signed 32-bit integers. axis_lengths can be one of:

  - An integer specifying an axis from the first table to get the lengths of
  - A pair of integers. The first element is the table index, the second is the axis index in that table
  - A sequence of pairs of integers. Sub-batches will be appended to the batch tuple in that order

  Note that axes in axis_lengths index the axes in individual samples, not the batch. For instance, if `batch_axis == 0` and `axis_lengths == 0`, then the last sub-batch will refer to the pre-padded value of sub-batch 0's axis 1 (`batch[0].shape[1]`).

  The length of this object is the number of batches it serves per epoch.

  Parameters:
  - table – The first table specifier
  - additional_tables – Table specifiers past the first. If not empty, will iterate over tuples of sub-batches
  - add_key – If `True`, will insert sub-samples into the 0th index of each sample sequence that specify the key that this sample was indexed by. Defaults to `False`
  - axis_lengths – If set, sub-batches of axis lengths will be appended to the end of a batch tuple
  - batch_axis – The axis or axes (in the case of multiple tables) along which samples are stacked in (sub-)batches. batch_axis should take into account axis length and key sub-batches when applicable. Defaults to `0`
  - batch_cast_to_array – A numpy type or sequence of types to cast each (sub-)batch to. `None` values indicate no casting should occur. batch_cast_to_array should take into account axis length and key sub-batches when applicable
  - batch_kwargs – Additional keyword arguments to pass to `batch_data`
  - batch_pad_mode – If set, pads samples in (sub-)batches according to this `numpy.pad()` strategy when samples do not have the same length
  - batch_size – The number of samples per (sub-)batch. Defaults to `None`, which means samples are served without batching
  - ignore_missing – If `True` and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError. Defaults to `False`

  Attributes:
  - table_specifiers – A tuple of triples indicating `(rspecifier, kaldi_dtype, open_kwargs)` for each table
  - add_key – Whether a sub-batch of table keys has been prepended to existing sub-batches
  - axis_lengths – A tuple of pairs for each axis-length sub-batch requested. Each pair is `(sub_batch_idx, axis)`
  - batch_axis – A tuple of length num_sub indicating which axis (sub-)samples will be arrayed along in a given (sub-)batch when all (sub-)samples are (or are cast to) fixed-length numpy arrays of the same type
  - batch_cast_to_array – A tuple of length num_sub indicating what numpy types, if any, (sub-)samples should be cast to. Values of `None` indicate no casting should be done on that (sub-)sample
  - batch_kwargs – Additional keyword arguments to pass to `batch_data`
  - batch_pad_mode – If set, pads samples in (sub-)batches according to this `numpy.pad()` strategy when samples do not have the same length
  - batch_size – The number of samples per (sub-)batch
  - ignore_missing – If `True` and some provided table does not have some key, that key will simply be ignored. Otherwise, a missing key raises a ValueError
  - num_sub – The number of sub-batches per batch. If > 1, batches are yielded as tuples of sub-batches. This number accounts for key, table, and axis-length sub-batches

  - batch_generator(repeat=False)[source]

    A generator which yields batches of data

    Parameters:
    - repeat (bool) – Whether to stop generating after one epoch (`False`) or restart and continue generating indefinitely (`True`)

    Yields:
    - batch (np.array or tuple) – A batch if `self.num_sub == 1`, otherwise a tuple of sub-batches. If self.batch_size does not divide an epoch's worth of data evenly, the last batch of every epoch will be smaller

  - property num_batches : int

    The number of batches yielded per epoch

    This number takes into account the number of terms missing if `self.ignore_missing == True`

  - abstract property num_samples : int

    The number of samples yielded per epoch

    This number takes into account the number of terms missing if `self.ignore_missing == True`

  - abstract sample_generator_for_epoch()[source]

    A generator which yields individual samples from data for an epoch

    An epoch means one pass through the data from start to finish. Equivalent to `sample_generator(False)`.

    Yields:
    - sample (np.array or tuple) – A sample if `self.num_sub == 1`, otherwise a tuple of sub-samples
- class pydrobert.kaldi.io.corpus.SequentialData(table, *additional_tables, **kwargs)[source]

  Bases: Data

  Provides iterators to read data sequentially

  Tables are always assumed to be sorted so reading can proceed in lock-step.

  Warning: each time an iterator is requested, new sequential readers are opened. Be careful with stdin!

  Parameters and attributes are the same as for `Data`, above.

  - property num_samples : int

    The number of samples yielded per epoch

    This number takes into account the number of terms missing if `self.ignore_missing == True`

  - sample_generator_for_epoch()[source]

    A generator which yields individual samples from data for an epoch

    An epoch means one pass through the data from start to finish. Equivalent to `sample_generator(False)`.

    Yields:
    - sample (np.array or tuple) – A sample if `self.num_sub == 1`, otherwise a tuple of sub-samples
- class pydrobert.kaldi.io.corpus.ShuffledData(table, *additional_tables, **kwargs)[source]

  Bases: Data

  Provides iterators over shuffled data

  A master list of keys is either provided by keyword argument or inferred from the first table. Every new iterator requested shuffles that list of keys and returns batches in that order. Appropriate for training data.

  Note: for efficiency, it is highly recommended to use scripts to access tables rather than archives.

  Parameters and attributes are the same as for `Data`, above, with the following additions.

  Parameters:
  - key_list – A master list of keys. No other keys will be queried. If not specified, the key list will be inferred by passing through the first table once
  - rng – Either a `numpy.random.RandomState` object or a seed to create one. It will be used to shuffle the list of keys

  Attributes:
  - key_list – The master list of keys
  - rng – Used to shuffle the list of keys every epoch
  - table_holders – A tuple of table readers opened in random access mode

  - property num_samples : int

    The number of samples yielded per epoch

    This number takes into account the number of terms missing if `self.ignore_missing == True`
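A sketch of a `ShuffledData` pipeline with keys and pre-padding lengths (the script files are placeholders):

```python
from pydrobert.kaldi.io.corpus import ShuffledData

# batches of 10: prepend a key sub-batch, append axis-0 lengths of the
# first table, and pad variable-length features when stacking
train = ShuffledData(
    'scp:feats.scp',
    ('scp:ali.scp', 'iv'),
    batch_size=10,
    add_key=True,
    axis_lengths=0,
    batch_pad_mode='constant',
    rng=1234,  # seed for reproducible shuffling
)
for keys, feats, alis, feat_lens in train:
    pass  # one shuffled epoch
```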
- pydrobert.kaldi.io.corpus.batch_data(input_iter, subsamples=True, batch_size=None, axis=0, cast_to_array=None, pad_mode=None, **pad_kwargs)[source]

  Generate batched data from an input generator

  Takes some fixed number of samples from input_iter, encapsulates them, and yields them.

  If subsamples is `True`, data from input_iter are expected to be encapsulated in fixed-length sequences (e.g. `(feat, label, len)`). Each sample will be batched separately into a sub-batch and returned in a tuple (e.g. `(feat_batch, label_batch, len_batch)`).

  The format of a (sub-)batch depends on the properties of its samples:

  1. If cast_to_array applies to this sub-batch, cast it to a numpy array of the target type.
  2. If all samples in the (sub-)batch are numpy arrays of the same type and shape, samples are stacked in a bigger numpy array along the axis specified by axis (see Parameters).
  3. If all samples are numpy arrays of the same type but variable length and pad_mode is specified, pad all sample arrays to the right such that they all have the same (supremum) shape, then perform 2.
  4. Otherwise, simply return a list of samples as-is (ignoring axis).

  Parameters:
  - input_iter (Iterator) – An iterator over samples
  - subsamples (bool) – input_iter yields tuples to be divided into different sub-batches if `True`
  - batch_size (Optional[int]) – The size of batches, except perhaps the last one. If not set or `0`, will yield samples (casting and encapsulating in tuples when necessary)
  - axis (int) – Where to insert the batch index/indices into the shape/shapes of the inputs. If a sequence, subsamples must be `True` and input_iter should yield samples of the same length as axis. If an `int` and subsamples is `True`, the same axis will be used for all sub-samples.
  - cast_to_array (Union[dtype, Sequence, None]) – Dictates whether data should be cast to numpy arrays and of what type. If a sequence, subsamples must be `True` and input_iter should yield samples of the same length as cast_to_array. If a single value and subsamples is `True`, the same value will be used for all sub-samples. Value(s) of `None` indicate no casting should be done for this (sub-)sample. Other values will be used to cast (sub-)samples to numpy arrays
  - pad_mode (Union[str, Callable, None]) – If set, inputs within a batch will be padded on the end to match the largest shapes in the batch. How the inputs are padded matches the argument to `numpy.pad()`. If not set, will raise a `ValueError` if they don't all have the same shape
  - pad_kwargs – Additional keyword arguments are passed along to `numpy.pad()` if padding.

  See also
  - numpy.pad – For different pad modes and options
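To make rules 2 and 3 concrete, here is a small sketch with variable-length samples:

```python
import numpy as np
from pydrobert.kaldi.io.corpus import batch_data

# (features, label) pairs with different numbers of frames
samples = [
    (np.ones((2, 4), np.float32), 0),
    (np.ones((3, 4), np.float32), 1),
    (np.ones((1, 4), np.float32), 2),
]
batches = batch_data(
    iter(samples), subsamples=True, batch_size=2, axis=0, pad_mode='constant')
feats, labels = next(batches)
print(feats.shape)  # (2, 3, 4): padded to the longest sample, then stacked
print(labels)       # [0, 1]: plain ints are returned as a list (rule 4)
```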
pydrobert.kaldi.io.duck_streams
Submodule for reading and writing one-by-one, like (un)packing c structs
- class pydrobert.kaldi.io.duck_streams.KaldiInput(path, header=True)[source]

  Bases: KaldiIOBase

  A kaldi input stream from which objects can be read one at a time

  Parameters:
  - path (str) – The extended file name to be opened
  - header (bool) – Whether to check for a binary header when opening

  - close()[source]

    Close and flush the underlying IO object

    This method has no effect if the file is already closed

  - read(kaldi_dtype, value_style='b', read_binary=None)[source]

    Read in one object from the stream

    Parameters:
    - kaldi_dtype (KaldiDataType) – The type of object to read
    - value_style (Literal['b', 's', 'd']) – `'wm'` readers can provide not only the audio buffer (`'b'`) of a wave file, but its sampling rate (`'s'`), and/or duration (in sec, `'d'`). Setting value_style to some combination of `'b'`, `'s'`, and/or `'d'` will cause the reader to return a tuple of that information. If value_style is only one character, the result will not be contained in a tuple
    - read_binary (bool, optional) – If set, the object will be read as either binary (`True`) or text (`False`). The default behaviour is to read according to the binary attribute. Ignored if there's only one way to read the data
- class pydrobert.kaldi.io.duck_streams.KaldiOutput(path, header=True)[source]

  Bases: KaldiIOBase

  A kaldi output stream to which objects can be written one at a time

  Parameters:
  - path (str) – The extended file name to be opened
  - header (bool) – Whether to write a binary header when opening

  - write(obj, kaldi_dtype=None, error_on_str=True, write_binary=True)[source]

    Write one object to the stream

    Parameters:
    - obj (Any) – The object to write
    - kaldi_dtype (Optional[KaldiDataType]) – The type of object to write
    - error_on_str (bool) – Token vectors (`'tv'`) accept sequences of whitespace-free ASCII/UTF strings. A `str` is also a sequence of characters, which may satisfy the token requirements. If error_on_str is `True`, a `ValueError` is raised when writing a `str` as a token vector. Otherwise a `str` can be written
    - write_binary (bool) – The object will be written as binary (`True`) or text (`False`)

    Raises:
    - ValueError – If unable to determine a proper data type

    See also
    - pydrobert.kaldi.io.util.infer_kaldi_data_type – Illustrates how different inputs are mapped to data types
- pydrobert.kaldi.io.duck_streams.open_duck_stream(path, mode='r', header=True)[source]

  Open a "duck" stream

  "Duck" streams provide an interface for reading or writing kaldi objects, one at a time. Essentially: remember the order things go in, then pull them out in the same order.

  Duck streams can read/write binary or text data. It is mostly up to the user how to read or write data, though the following rules establish the default:

  1. An input stream that does not look for a 'binary header' is binary
  2. An input stream that looks for and finds a binary header when opening is binary
  3. An input stream that looks for but does not find a binary header when opening is a text stream
  4. An output stream is always binary. However, the user may choose not to write a binary header. The resulting input stream will be considered a text stream when 3. is satisfied

  Parameters:
  - path (str) – The extended file name to be opened. This can be quite exotic. More details can be found on the Kaldi website.
  - mode (Literal['r', 'r+', 'w']) – Whether to open the stream for input (`'r'`) or output (`'w'`). `'r+'` is equivalent to `'r'`
  - header (bool) – Setting this to `True` will either check for a 'binary header' in an input stream, or write a binary header for an output stream. If `False`, no check/write is performed
pydrobert.kaldi.io.enums
Kaldi enumerations, including data types and xspecifier types
- class pydrobert.kaldi.io.enums.KaldiDataType(value)[source]

  Bases: Enum

  Enumerates the data types stored and retrieved by Kaldi I/O

  This enumerable lists the types of data written and read to various readers and writers. It is used in the factory method `pydrobert.kaldi.io.open()` to dictate the subclass created.

  Note: the "base float" mentioned in this documentation is the same type as `kaldi::BaseFloat`, which was determined when Kaldi was built. The easiest way to determine whether this is a double (64-bit) or a float (32-bit) is by checking the value of `KaldiDataType.BaseVector.is_double()`.
- Base = 'b'
Inputs/outputs are single base floats
- BaseMatrix = 'bm'
Inputs/outputs are 2D numpy arrays of the base float
- BasePairVector = 'bpv'
Inputs/outputs are tuples of pairs of the base float
- BaseVector = 'bv'
Inputs/outputs are 1D numpy arrays of the base float
- Bool = 'B'
Inputs/outputs are single booleans
- Double = 'd'
Inputs/outputs are single 64-bit floats
- DoubleMatrix = 'dm'
Inputs/outputs are 2D numpy arrays of 64-bit floats
- DoubleVector = 'dv'
Inputs/outputs are 1D numpy arrays of 64-bit floats
- FloatMatrix = 'fm'
Inputs/outputs are 2D numpy arrays of 32-bit floats
- FloatVector = 'fv'
Inputs/outputs are 1D numpy arrays of 32-bit floats
- Int32 = 'i'
Inputs/outputs are single 32-bit ints
- Int32PairVector = 'ipv'
Inputs/outputs are tuples of pairs of 32-bit ints
- Int32Vector = 'iv'
Inputs/outputs are tuples of 32-bit ints
- Int32VectorVector = 'ivv'
Inputs/outputs are tuples of tuples of 32-bit ints
- Token = 't'
Inputs/outputs are individual whitespace-free ASCII or unicode words
- TokenVector = 'tv'
Inputs/outputs are tuples of tokens
- WaveMatrix = 'wm'

  Inputs/outputs are wave file data, cast to base float 2D arrays

  Wave matrices have the shape `(n_channels, n_samples)`. Kaldi will read PCM wave files, but will always convert the samples to base floats.

  Though Kaldi can read wave files of different types and sample rates, Kaldi will only write wave files as PCM16 sampled at 16k.
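For instance, reading audio with the extra value information (the script file is a placeholder):

```python
from pydrobert.kaldi import io

# 'bsd' asks for (buffer, sampling rate, duration) triples
with io.open('scp:wav.scp', 'wm', value_style='bsd') as reader:
    for buff, rate, dur in reader:
        print(buff.shape, rate, dur)  # (n_channels, n_samples), Hz, seconds
```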
- class pydrobert.kaldi.io.enums.RxfilenameType(value)[source]

  Bases: Enum

  The type of stream to read, based on an extended filename
- FileInput = 1
Input is from a file on disk with no offset
- InvalidInput = 0
An invalid stream
- OffsetFileInput = 4
Input is from a file on disk, read from a specific offset
- PipedInput = 3
Input is being piped from a command
- StandardInput = 2
Input is being piped from stdin
- class pydrobert.kaldi.io.enums.TableType(value)[source]

  Bases: Enum

  The type of table a stream points to
- ArchiveTable = 1
The stream points to an archive (keys and values)
- BothTables = 3
The stream points simultaneously to a script and archive
This is a special pattern for writing. The archive stores keys and values; the script stores keys and points to the locations in the archive
- NotATable = 0
The stream is not a table
- ScriptTable = 2
The stream points to a script (keys and extended file names)
- class pydrobert.kaldi.io.enums.WxfilenameType(value)[source]

  Bases: Enum

  The type of stream to write, based on an extended filename
- FileOutput = 1
Output to a file on disk
- InvalidOutput = 0
An invalid stream
- PipedOutput = 3
Output is being piped to some command
- StandardOutput = 2
Output is being piped to stdout
pydrobert.kaldi.io.table_streams
Submodule containing table readers and writers
- class pydrobert.kaldi.io.table_streams.KaldiRandomAccessReader(path, kaldi_dtype, utt2spk='')[source]

  Bases: KaldiTable, Container

  Read-only access to values of table by key

  `KaldiRandomAccessReader` objects can access values of a table through either the `get()` method or square bracket access (e.g. `a[key]`). The presence of a key can be checked with "in" syntax (e.g. `key in a`). Unlike a `dict`, the extent of a `KaldiRandomAccessReader` is not known beforehand, so neither iterators nor length methods are implemented.

  Parameters:
  - path (str) – An rspecifier to read tables from
  - kaldi_dtype (KaldiDataType) – The data type to read
  - utt2spk (str) – If set, the reader uses utt2spk as a map from utterance ids to speaker ids. The data in path, which are assumed to be referenced by speaker ids, can then be referenced by utterance. If utt2spk is unspecified, the keys in path are used to query for data
- class pydrobert.kaldi.io.table_streams.KaldiSequentialReader(path, kaldi_dtype)[source]

  Bases: KaldiTable, Iterator

  Abstract class for iterating over table entries

  `KaldiSequentialReader` iterates over key-value pairs. The default behaviour (i.e. that in a for-loop) is to iterate over the values in order of access. Similar to `dict` instances, `items()`, `values()`, and `keys()` return iterators over their respective domains. Alternatively, the `move()` method moves to the next pair, at which point the `key()` and `value()` methods can be queried.

  Though it is possible to mix and match access patterns, all methods refer to the same underlying iterator (the `KaldiSequentialReader`).

  Parameters:
  - path (str) – An rspecifier to read the table from
  - kaldi_dtype (KaldiDataType) – The data type to read

  Yields:
  - object or (str, object) – Values or key, value pairs
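For instance, iterating over key-value pairs with `items()` (the archive is a placeholder):

```python
from pydrobert.kaldi import io

with io.open('ark:mats.ark', 'bm') as reader:
    for key, mat in reader.items():
        print(key, mat.shape)
```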
- class pydrobert.kaldi.io.table_streams.KaldiTable(path, kaldi_dtype)[source]

  Bases: KaldiIOBase

  Base class for interacting with tables

  All table readers and writers are subclasses of `KaldiTable`. Tables must specify the type of data being read ahead of time.

  Parameters:
  - path (str) – An rspecifier or wspecifier
  - kaldi_dtype (KaldiDataType) – The type of data this table contains

  Attributes:
  - kaldi_dtype (KaldiDataType) – The table's data type

  Raises:
  - IOError – If unable to open table
- class pydrobert.kaldi.io.table_streams.KaldiWriter(path, kaldi_dtype)[source]

  Bases: KaldiTable

  Write key-value pairs to tables

  Parameters:
  - path (str) – A wspecifier to write the table to
  - kaldi_dtype (pydrobert.kaldi.io.enums.KaldiDataType) – The data type to write
- pydrobert.kaldi.io.table_streams.open_table_stream(path, kaldi_dtype, mode='r', error_on_str=True, utt2spk='', value_style='b', cache=False)[source]

  Factory function to open a kaldi table

  This function finds the correct `KaldiTable` according to the args kaldi_dtype and mode. Specific combinations allow for optional parameters outlined by the table below:

  | mode | kaldi_dtype | additional kwargs   |
  |------|-------------|---------------------|
  | 'r'  | 'wm'        | value_style='b'     |
  | 'r+' | (any)       | utt2spk=''          |
  | 'r+' | 'wm'        | value_style='b'     |
  | 'w'  | 'tv'        | error_on_str=True   |

  Parameters:
  - path (str) – The specifier used by kaldi to open the script. Generally these will take the form `'{ark|scp}:<path_to_file>'`, though they can take much more interesting forms (like pipes). More information can be found on the Kaldi website
  - kaldi_dtype (KaldiDataType) – The type of data the table is expected to handle
  - mode (Literal['r', 'r+', 'w']) – Specifies the type of access to be performed: read sequential, read random, or write. They are implemented by subclasses of `KaldiSequentialReader`, `KaldiRandomAccessReader`, or `KaldiWriter`, resp.
  - error_on_str (bool) – Token vectors (`'tv'`) accept sequences of whitespace-free ASCII/UTF strings. A `str` is also a sequence of characters, which may satisfy the token requirements. If error_on_str is `True`, a `ValueError` is raised when writing a `str` as a token vector. Otherwise a `str` can be written
  - utt2spk (str) – If set, the reader uses utt2spk as a map from utterance ids to speaker ids. The data in path, which are assumed to be referenced by speaker ids, can then be referenced by utterance. If utt2spk is unspecified, the keys in path are used to query for data
  - value_style (str) – Wave readers can provide not only the audio buffer (`'b'`) of a wave file, but its sampling rate (`'s'`), and/or duration (in sec, `'d'`). Setting value_style to some combination of `'b'`, `'s'`, and/or `'d'` will cause the reader to return a tuple of that information. If value_style is only one character, the result will not be contained in a tuple.
  - cache (bool) – Whether to cache all values in a dict as they are retrieved. Only applicable to random access readers. This can be very expensive for large tables and redundant if reading from an archive directly (as opposed to a script).

  Returns:
  - table (KaldiTable) – A table, opened

  Raises:
  - IOError – On failure to open
pydrobert.kaldi.io.util
Kaldi I/O utilities
- pydrobert.kaldi.io.util.infer_kaldi_data_type(obj)[source]

  Infer the appropriate kaldi data type for this object

  The following map is used (in order):

  | Object                                 | KaldiDataType     |
  |----------------------------------------|-------------------|
  | an int                                 | Int32             |
  | a boolean                              | Bool              |
  | a float*                               | Base              |
  | str                                    | Token             |
  | 2-dim numpy array of float32           | FloatMatrix       |
  | 1-dim numpy array of float32           | FloatVector       |
  | 2-dim numpy array of float64           | DoubleMatrix      |
  | 1-dim numpy array of float64           | DoubleVector      |
  | 1-dim numpy array of int32             | Int32Vector       |
  | 2-dim numpy array of int32*            | Int32VectorVector |
  | (matrix-like, float or int)            | WaveMatrix**      |
  | an empty container                     | BaseMatrix        |
  | container of str                       | TokenVector       |
  | 1-dim py container of ints             | Int32Vector       |
  | 2-dim py container of ints*            | Int32VectorVector |
  | 2-dim py container of pairs of floats  | BasePairVector    |
  | matrix-like python container           | DoubleMatrix      |
  | vector-like python container           | DoubleVector      |

  *The same data types could represent a `Double` or an `Int32PairVector`, respectively. Care should be taken in these cases.

  **The first element is the wave data, the second its sample frequency. The wave data can be a 2d numpy float array of the same precision as `KaldiDataType.BaseMatrix`, or a matrix-like python container of floats and/or ints.

  Returns:
  - The inferred `KaldiDataType`, or `None` if no appropriate type could be determined
- pydrobert.kaldi.io.util.parse_kaldi_input_path(path)[source]

  Determine the characteristics of an input stream by its path

  Returns a 4-tuple of the following information:

  1. If path is not an rspecifier (`TableType.NotATable`):
     - Classify path as an rxfilename
     - return a tuple of `(TableType, path, RxfilenameType, dict())`
  2. else:
     - Put all rspecifier options (once, sorted, called_sorted, permissive, background) into a dictionary
     - Extract the embedded rxfilename and classify it
     - return a tuple of `(TableType, rxfilename, RxfilenameType, options)`

  Parameters:
  - path (str) – A string that would be passed to `pydrobert.kaldi.io.open()`
- pydrobert.kaldi.io.util.parse_kaldi_output_path(path)[source]

  Determine the characteristics of an output stream by its path

  Returns a 4-tuple of the following information:

  1. If path is not a wspecifier (`TableType.NotATable`):
     - Classify path as a wxfilename
     - return a tuple of `(TableType, path, WxfilenameType, dict())`
  2. If path is an archive or script:
     - Put all wspecifier options (binary, flush, permissive) into a dictionary
     - Extract the embedded wxfilename and classify it
     - return a tuple of `(TableType, wxfilename, WxfilenameType, options)`
  3. If path contains both an archive and a script (`TableType.BothTables`):
     - Put all wspecifier options (binary, flush, permissive) into a dictionary
     - Extract both embedded wxfilenames and classify them
     - return a tuple of `(TableType, (arch_wxfilename, script_wxfilename), (arch_WxfilenameType, script_WxfilenameType), options)`

  Parameters:
  - path (str) – A string that would be passed to `pydrobert.kaldi.io.open()`
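A sketch of both parsers on typical paths (the exact contents of the options dict follow the lists above):

```python
from pydrobert.kaldi.io.util import (
    parse_kaldi_input_path, parse_kaldi_output_path)

print(parse_kaldi_input_path('ark,s:gunzip -c feats.ark.gz |'))
# (TableType.ArchiveTable, 'gunzip -c feats.ark.gz |',
#  RxfilenameType.PipedInput, {... 'sorted': True ...})
print(parse_kaldi_output_path('foo.txt'))
# (TableType.NotATable, 'foo.txt', WxfilenameType.FileOutput, {})
```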
pydrobert.kaldi.logging
Tie Kaldi's logging into python's builtin logging module

By default, Kaldi's warning, error, and critical messages are all piped directly to stderr. Any `logging.Logger` instance can register with `register_logger_for_kaldi` to receive Kaldi messages. If some logger is registered to receive Kaldi messages, messages will no longer be sent to stderr by default. Kaldi codes are converted to logging codes according to the following chart:

| logging        | kaldi      |
|----------------|------------|
| CRITICAL(50+)  | -3+        |
| ERROR(40-49)   | -2         |
| WARNING(30-39) | -1         |
| INFO(20-29)    | 0          |
| DEBUG(10-19)   | 1          |
| 9 down to 1    | 2 up to 10 |
- class pydrobert.kaldi.logging.KaldiLogger(name, level=0)[source]

  Bases: Logger

  Logger subclass that overwrites log info with kaldi's

  Setting the `Logger` class of the python module `logging` (through `logging.setLoggerClass`) to `KaldiLogger` will allow new loggers to intercept messages from Kaldi and inject Kaldi's trace information into the record. With this injection, the logger will point to the location in Kaldi's source that the message originated from. Without it, the logger will point to a location within this submodule (`pydrobert.kaldi.logging`).

  - makeRecord(name, lvl, fn, lno, msg, args, exc_info, func=None, extra=None, sinfo=None)[source]

    Build a log record, overwriting the call-site information (file name, line number, function) with Kaldi's trace information when a Kaldi message is being handled.
- pydrobert.kaldi.logging.deregister_logger_for_kaldi(name)[source]

  Deregister a logger previously registered with `register_logger_for_kaldi`

- pydrobert.kaldi.logging.kaldi_vlog_level_cmd_decorator(func)[source]

  Decorator to rename, then revert, level names according to Kaldi [1]

  See `pydrobert.kaldi.logging` for the conversion chart. After the return of the function, the level names before the call are reverted. This function is insensitive to renaming while the function executes.

  References

  [1] Povey, D., et al. (2011). The Kaldi Speech Recognition Toolkit. ASRU

- pydrobert.kaldi.logging.register_logger_for_kaldi(logger)[source]

  Register logger to receive Kaldi's messages

  See the module docstring for more info.

  Parameters:
  - logger (str or logging.Logger) – Either the logger or its name. When a new message comes along from Kaldi, the callback will send a message to the logger