inference-engine/samples/speech_sample/README.md

# Automatic Speech Recognition C++ Sample

This topic shows how to run the speech sample application, which
demonstrates acoustic model inference based on Kaldi\* neural networks
and speech feature vectors.

## How It Works

On startup, the application reads command line parameters
and loads a Kaldi-trained neural network along with a Kaldi ARK speech
feature vector file into the Inference Engine plugin. It then performs
inference on all speech utterances stored in the input ARK
file. Context-windowed speech frames are processed in batches of 1-8
frames according to the `-bs` parameter.  Batching across utterances is
not supported by this sample.  When inference is done, the application
creates an output ARK file.  If the `-r` option is given, error
statistics are provided for each speech utterance, as shown in the
Sample Output section.

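
The per-utterance batching described above can be pictured with a small C++ sketch. It is an illustration only, not code from the sample: `splitIntoBatches` is a hypothetical helper, and it assumes that the last batch of an utterance simply carries the leftover frames.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the sample): split the frames of ONE
// utterance into consecutive batches of at most `batchSize` frames.
// Each entry is (index of first frame, number of frames in the batch).
// Batches never span two utterances, so call this once per utterance.
std::vector<std::pair<std::size_t, std::size_t>>
splitIntoBatches(std::size_t numFrames, std::size_t batchSize) {
    std::vector<std::pair<std::size_t, std::size_t>> batches;
    if (batchSize == 0) {
        return batches;  // no valid batching; the sample accepts 1-8
    }
    for (std::size_t first = 0; first < numFrames; first += batchSize) {
        // Assumption: the final batch holds whatever frames remain.
        batches.emplace_back(first, std::min(batchSize, numFrames - first));
    }
    return batches;
}
```

For instance, a 10-frame utterance with a batch size of 4 yields batches of 4, 4, and 2 frames under this assumption.
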
### GNA-specific details

#### Quantization

If the GNA device is selected (for example, using the `-d` flag with a
GNA device such as `GNA_AUTO`),
the GNA Inference Engine plugin quantizes the model and input feature
vector sequence to integer representation before performing inference.
Several parameters control neural network quantization.  The `-q` flag
determines the quantization mode.  Three modes are supported: static,
dynamic, and user-defined.  In static quantization mode, the first
utterance in the input ARK file is scanned for dynamic range.  The
scale factor (floating point scalar multiplier) required to scale the
maximum input value of the first utterance to 16384 (15 bits) is used
for all subsequent inputs.  The neural network is quantized to
accommodate the scaled input dynamic range.  In user-defined
quantization mode, the user may specify a scale factor via the `-sf`
flag that will be used for static quantization.  In dynamic
quantization mode, the scale factor for each input batch is computed
just before inference on that batch.  The input and network are
(re)quantized on the fly using an efficient procedure.

The `-qb` flag provides a hint to the GNA plugin regarding the preferred
target weight resolution for all layers.  For example, when `-qb 8` is
specified, the plugin will use 8-bit weights wherever possible in the
network.  Note that it is not always possible to use 8-bit weights due
to GNA hardware limitations.  For example, on GNA hardware versions 1
and 2, convolutional layers always use 16-bit weights.  This limitation
will be removed in GNA hardware version 3 and higher.

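
As a rough illustration of the static-mode scale factor described in this section, the following C++ sketch computes the multiplier that maps the largest input magnitude of the first utterance to 16384 and then quantizes a value with it. This is not the plugin's implementation: the function names are hypothetical, "maximum input value" is assumed to mean the maximum absolute value, and the fallback of 1.0 for an all-zero utterance as well as the 16-bit saturation are assumptions made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Target magnitude named in the text: the largest input maps to 16384.
constexpr float kTargetMax = 16384.0f;

// Hypothetical helper: scan the first utterance for its dynamic range and
// return the scale factor to reuse for all subsequent inputs.
float staticScaleFactor(const std::vector<float>& firstUtterance) {
    float maxAbs = 0.0f;
    for (float v : firstUtterance) {
        maxAbs = std::max(maxAbs, std::fabs(v));
    }
    // Assumption: fall back to 1.0 when the utterance is empty or all zeros.
    return maxAbs > 0.0f ? kTargetMax / maxAbs : 1.0f;
}

// Hypothetical helper: scale, round, and saturate one value to 16 bits.
std::int16_t quantizeToInt16(float value, float scaleFactor) {
    float scaled = std::round(value * scaleFactor);
    scaled = std::min(32767.0f, std::max(-32768.0f, scaled));
    return static_cast<std::int16_t>(scaled);
}
```

With a first utterance whose largest magnitude is 2.0, the sketch returns a scale factor of 8192. Later inputs that exceed the first utterance's range saturate, which is the trade-off that dynamic mode avoids by recomputing the factor per batch.
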
#### Execution Modes

Several execution modes are supported via the `-d` flag:

- `CPU` (with the GNA plugin selected): the GNA device is emulated in
  fast-but-not-bit-exact mode.
- `GNA_AUTO`: the GNA hardware is used if it is available and the driver
  is installed.  Otherwise, the GNA device is emulated in
  fast-but-not-bit-exact mode.
- `GNA_HW`: the GNA hardware is used if it is available and the driver
  is installed.  Otherwise, an error occurs.
- `GNA_SW`: the GNA device is emulated in fast-but-not-bit-exact mode.
- `GNA_SW_EXACT`: the GNA device is emulated in bit-exact mode.

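
The fallback rules above can be summarized as a small decision function. The C++ sketch below only restates the behavior described in this section; `GnaExecution` and `resolveExecution` are hypothetical names, not part of the plugin or the sample, and `hardwareUsable` stands for "GNA hardware is present and the driver is installed".

```cpp
#include <stdexcept>
#include <string>

// Hypothetical names for the three ways the GNA device can end up running.
enum class GnaExecution { Hardware, SoftwareFast, SoftwareExact };

// Map a `-d` value to the resulting execution mode, as described above.
GnaExecution resolveExecution(const std::string& device, bool hardwareUsable) {
    if (device == "GNA_AUTO") {
        // Use hardware when possible, otherwise fall back to fast emulation.
        return hardwareUsable ? GnaExecution::Hardware
                              : GnaExecution::SoftwareFast;
    }
    if (device == "GNA_HW") {
        if (!hardwareUsable) {
            throw std::runtime_error("GNA hardware or driver not available");
        }
        return GnaExecution::Hardware;
    }
    // "CPU" here means CPU with the GNA plugin selected.
    if (device == "GNA_SW" || device == "CPU") {
        return GnaExecution::SoftwareFast;
    }
    if (device == "GNA_SW_EXACT") {
        return GnaExecution::SoftwareExact;
    }
    throw std::invalid_argument("unsupported device: " + device);
}
```
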
#### Loading and Saving Models

The GNA plugin supports loading and saving of the GNA-optimized model
(non-IR) via the `-rg` and `-wg` flags.  This makes it possible to avoid
the cost of full model quantization at run time. The GNA plugin also
supports export of firmware-compatible embedded model images for the
Intel® Speech Enabling Developer Kit and Amazon Alexa* Premium
Far-Field Voice Development Kit via the `-we` flag (save only).

In addition to performing inference directly from a GNA model file, these options make it possible to:

- Convert from IR format to GNA format model file (`-m`, `-wg`)
- Convert from IR format to embedded format model file (`-m`, `-we`)
- Convert from GNA format to embedded format model file (`-rg`, `-we`)

## Running

Running the application with the `-h` option yields the following
usage message:

```sh
$ ./speech_sample -h
InferenceEngine:
    API version ............ <version>
    Build .................. <number>

speech_sample [OPTION]
Options:

    -h                      Print a usage message.
    -i "<path>"             Required. Path to an .ark file.
    -m "<path>"             Required. Path to an .xml file with a trained model (required if -rg is missing).
    -o "<path>"             Optional. Output file name (default name is "scores.ark").
    -l "<absolute_path>"    Required for CPU custom layers. Absolute path to a shared library with the kernel implementations.
    -d "<device>"           Optional. Specify a target device to infer on. CPU, GPU, GNA_AUTO, GNA_HW, GNA_SW, GNA_SW_EXACT and HETERO with combination of GNA as the primary device and CPU as a secondary (e.g. HETERO:GNA,CPU) are supported. The sample will look for a suitable plugin for device specified.
    -p                      Optional. Plugin name. For example, GPU. If this parameter is set, the sample will look for this plugin only
    -pp                     Optional. Path to a plugin folder.
    -pc                     Optional. Enables performance report
    -q "<mode>"             Optional. Input quantization mode:  "static" (default), "dynamic", or "user" (use with -sf).
    -qb "<integer>"         Optional. Weight bits for quantization:  8 or 16 (default)
    -sf "<double>"          Optional. Input scale factor for quantization (use with -q user).
    -bs "<integer>"         Optional. Batch size 1-8 (default 1)
    -r "<path>"             Optional. Read reference score .ark file and compare scores.
    -rg "<path>"            Optional. Read GNA model from file using path/filename provided (required if -m is missing).
    -wg "<path>"            Optional. Write GNA model to file using path/filename provided.
    -we "<path>"            Optional. Write GNA embedded model to file using path/filename provided.
    -nthreads "<integer>"   Optional. Number of threads to use for concurrent async inference requests on the GNA.
    -cw "<integer>"         Optional. Number of frames for context windows (default is 0). Works only with context window networks. If you use the cw flag, the batch size and nthreads arguments are ignored.

```

Running the application with an empty list of options yields the
usage message given above and an error message.

### Model Preparation

You can use the following Model Optimizer command to convert a Kaldi
nnet1 or nnet2 neural network to Intel IR format:

```sh
$ python3 mo.py --framework kaldi --input_model wsj_dnn5b_smbr.nnet --counts wsj_dnn5b_smbr.counts --remove_output_softmax
```

Assuming that the Model Optimizer (`mo.py`), the Kaldi-trained neural
network (`wsj_dnn5b_smbr.nnet`), and the Kaldi class counts file
(`wsj_dnn5b_smbr.counts`) are in the working directory, this produces
the Intel IR network consisting of `wsj_dnn5b_smbr.xml` and
`wsj_dnn5b_smbr.bin`.

The following pre-trained models are available:

* wsj\_dnn5b\_smbr
* rm\_lstm4f
* rm\_cnn4a\_smbr

All of them can be downloaded from [https://download.01.org/openvinotoolkit/models_contrib/speech/kaldi](https://download.01.org/openvinotoolkit/models_contrib/speech/kaldi) or with the OpenVINO [Model Downloader](https://github.com/opencv/open_model_zoo/tree/2018/model_downloader).

### Speech Inference

Once the IR is created, you can use the following command to do
inference on Intel® processors with the GNA co-processor (or
emulation library):

```sh
$ ./speech_sample -d GNA_AUTO -bs 2 -i wsj_dnn5b_smbr_dev93_10.ark -m wsj_dnn5b_smbr_fp32.xml -o scores.ark -r wsj_dnn5b_smbr_dev93_scores_10.ark
```

Here, the floating point Kaldi-generated reference neural network
scores (`wsj_dnn5b_smbr_dev93_scores_10.ark`) corresponding to the input
feature file (`wsj_dnn5b_smbr_dev93_10.ark`) are assumed to be available
for comparison.

> **NOTE**: Before running the sample with a trained model, make sure the model is converted to the Inference Engine format (\*.xml + \*.bin) using the [Model Optimizer tool](./docs/MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md).

## Sample Output

The acoustic log likelihood sequences for all utterances are stored in
the Kaldi ARK file, `scores.ark`.  If the `-r` option is used, a report on
the statistical score error is generated for each utterance, such as
the following:

```sh
Utterance 0: 4k0c0301
   Average inference time per frame: 6.26867 ms
         max error: 0.0667191
         avg error: 0.00473641
     avg rms error: 0.00602212
       stdev error: 0.00393488
```

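To give a feel for what such a report measures, here is a minimal C++ sketch that compares computed scores against reference scores. The definitions are assumptions, since this README does not spell out the sample's formulas: errors are taken as element-wise absolute differences, and the RMS is computed over all elements at once, whereas the report's "avg rms error" may be averaged per frame. `ScoreError` and `compareScores` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of the error between computed and reference scores.
struct ScoreError {
    double maxError;
    double avgError;
    double rmsError;
    double stdevError;
};

// Compare two score sequences element by element (assumed definitions).
ScoreError compareScores(const std::vector<float>& scores,
                         const std::vector<float>& reference) {
    ScoreError e{0.0, 0.0, 0.0, 0.0};
    const std::size_t n = std::min(scores.size(), reference.size());
    if (n == 0) {
        return e;  // nothing to compare
    }
    double sum = 0.0;
    double sumSquared = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double err = std::fabs(static_cast<double>(scores[i]) -
                                     static_cast<double>(reference[i]));
        sum += err;
        sumSquared += err * err;
        e.maxError = std::max(e.maxError, err);
    }
    e.avgError = sum / n;
    e.rmsError = std::sqrt(sumSquared / n);
    // Population standard deviation of the absolute error, clamped at zero
    // to guard against tiny negative values from rounding.
    e.stdevError = std::sqrt(
        std::max(0.0, sumSquared / n - e.avgError * e.avgError));
    return e;
}
```
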
## Use of Sample in Kaldi* Speech Recognition Pipeline

The Wall Street Journal DNN model used in this example was prepared
using the Kaldi s5 recipe and the Kaldi Nnet (nnet1) framework.  It is
possible to recognize speech by substituting the `speech_sample` for
Kaldi's `nnet-forward` command.  Since the `speech_sample` does not yet
use pipes, it is necessary to use temporary files for
speaker-transformed feature vectors and scores when running the Kaldi speech
recognition pipeline.  The following operations assume that feature
extraction was already performed according to the `s5` recipe and that
the working directory within the Kaldi source tree is `egs/wsj/s5`.

1. Prepare a speaker-transformed feature set given the feature transform specified
  in `final.feature_transform` and the feature files specified in `feats.scp`:
```sh
nnet-forward --use-gpu=no final.feature_transform "ark,s,cs:copy-feats scp:feats.scp ark:- |" ark:feat.ark
```
2. Score the feature set using the `speech_sample`:
```sh
./speech_sample -d GNA_AUTO -bs 8 -i feat.ark -m wsj_dnn5b_smbr_fp32.xml -o scores.ark
```
3. Run the Kaldi decoder to produce n-best text hypotheses and select the most likely text given the WFST (`HCLG.fst`), vocabulary (`words.txt`), and TID/PID mapping (`final.mdl`):
```sh
latgen-faster-mapped --max-active=7000 --max-mem=50000000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=0.0833 --allow-partial=true --word-symbol-table=words.txt final.mdl HCLG.fst ark:scores.ark ark:- | lattice-scale --inv-acoustic-scale=13 ark:- ark:- | lattice-best-path --word-symbol-table=words.txt ark:- ark,t:- > out.txt &
```
4. Run the word error rate tool to check accuracy given the vocabulary (`words.txt`) and reference transcript (`test_filt.txt`):
```sh
cat out.txt | utils/int2sym.pl -f 2- words.txt | sed s:\<UNK\>::g | compute-wer --text --mode=present ark:test_filt.txt ark,p:-
```

## See Also
* [Using Inference Engine Samples](./docs/IE_DG/Samples_Overview.md)
* [Model Optimizer](./docs/MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md)
* [Model Downloader](https://github.com/opencv/open_model_zoo/tree/2018/model_downloader)