---
title: LeNet MNIST Tutorial
description: Train and test "LeNet" on the MNIST handwritten digit data.
category: example
include_in_docs: true
priority: 1
---

# Training LeNet on MNIST with Caffe

We assume that you have compiled Caffe successfully. If not, please refer to the [Installation page](/installation.html). Throughout this tutorial, we assume that your Caffe installation is located at `CAFFE_ROOT`.

## Prepare Datasets

You will first need to download the data from the MNIST website and convert it to the format that Caffe reads. To do this, run the following commands:

    cd $CAFFE_ROOT
    ./data/mnist/get_mnist.sh
    ./examples/mnist/create_mnist.sh

If a script complains that `wget` or `gunzip` is not installed, install the missing tool and run the script again. After running the scripts there should be two datasets, `mnist_train_lmdb` and `mnist_test_lmdb`.

## LeNet: the MNIST Classification Model

Before we run the training program, let's explain what will happen. We will use the [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) network, which is known to work well on digit classification tasks. Our version differs slightly from the original LeNet implementation: the sigmoid activations are replaced with Rectified Linear Unit (ReLU) activations.

The design of LeNet contains the essence of the CNNs that are still used in larger models, such as those trained on ImageNet. It consists of a convolutional layer followed by a pooling layer, another convolutional layer followed by a pooling layer, and then two fully connected layers similar to those of a conventional multilayer perceptron. The layers are defined in `$CAFFE_ROOT/examples/mnist/lenet_train_test.prototxt`.

## Define the MNIST Network

This section explains the `lenet_train_test.prototxt` model definition that specifies the LeNet model for MNIST handwritten digit classification. We assume that you are familiar with [Google Protobuf](https://developers.google.com/protocol-buffers/docs/overview) and that you have read the protobuf definitions used by Caffe, which can be found at `$CAFFE_ROOT/src/caffe/proto/caffe.proto`.

Specifically, we will write a `caffe::NetParameter` (or in Python, `caffe.proto.caffe_pb2.NetParameter`) protobuf. We start by giving the network a name:

    name: "LeNet"

### Writing the Data Layer

We will read the MNIST data from the lmdb we created earlier in the demo. This is defined by a data layer:

    layer {
      name: "mnist"
      type: "Data"
      data_param {
        source: "mnist_train_lmdb"
        backend: LMDB
        batch_size: 64
        scale: 0.00390625
      }
      top: "data"
      top: "label"
    }

Specifically, this layer has name `mnist` and type `Data`, and it reads the data from the given lmdb source. We use a batch size of 64 and scale the incoming pixels so that they are in the range \[0,1\). Why 0.00390625? It is 1 divided by 256. Finally, this layer produces two blobs: the `data` blob and the `label` blob.
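
The scaling arithmetic can be checked with a few lines of Python. This is only an illustrative sketch, not Caffe code: `scale_pixel` is a hypothetical helper, and we assume raw pixel intensities are 8-bit values from 0 to 255.

```python
# Illustrative check of the data layer's `scale` value (not part of Caffe).
SCALE = 0.00390625  # value taken from the data layer definition


def scale_pixel(value):
    """Map a raw 8-bit pixel intensity (assumed 0..255) to its scaled value."""
    return value * SCALE


print(SCALE == 1 / 256)   # the scale is exactly 1/256
print(scale_pixel(0))     # smallest scaled value
print(scale_pixel(255))   # largest scaled value, strictly below 1
```

Because the largest raw value is 255 rather than 256, the scaled range is half-open: it includes 0 but never reaches 1.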

### Writing the Convolution Layer

Let's define the first convolution layer:

    layer {
      name: "conv1"
      type: "Convolution"
      param { lr_mult: 1 }
      param { lr_mult: 2 }
      convolution_param {
        num_output: 20
        kernel_size: 5
        stride: 1
        weight_filler {
          type: "xavier"
        }
        bias_filler {
          type: "constant"
        }
      }
      bottom: "data"
      top: "conv1"
    }

This layer takes the `data` blob (provided by the data layer) and produces the `conv1` blob. It produces 20 output channels, using a convolution kernel of size 5 applied with stride 1.

The fillers allow us to randomly initialize the value of the weights and bias. For the weight filler, we will use the `xavier` algorithm that automatically determines the scale of initialization based on the number of input and output neurons. For the bias filler, we will simply initialize it as constant, with the default filling value 0.

`lr_mult`s are the learning rate adjustments for the layer's learnable parameters. In this case, we set the weight learning rate to be the same as the learning rate given by the solver at runtime, and the bias learning rate to be twice as large; this usually leads to better convergence rates.
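
To make these numbers concrete, here is a small Python sketch of the arithmetic behind `conv1`. It is not Caffe code: `conv_output_size` is a hypothetical helper using the standard convolution size formula, the 28x28 input size is a property of MNIST that is not stated above, and `base_lr` is the value from `lenet_solver.prototxt`.

```python
def conv_output_size(input_size, kernel_size, stride=1, pad=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


# Assumption: MNIST images are 28x28 pixels.
conv1_size = conv_output_size(28, kernel_size=5, stride=1)

# Effective learning rates: the solver's rate times each lr_mult.
base_lr = 0.01            # base_lr from lenet_solver.prototxt
weight_lr = base_lr * 1   # param { lr_mult: 1 }
bias_lr = base_lr * 2     # param { lr_mult: 2 }

print(conv1_size, weight_lr, bias_lr)
```

With 20 output channels, `conv1` should therefore have shape 20 x 24 x 24 per image.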

### Writing the Pooling Layer

Phew. Pooling layers are actually much easier to define:

    layer {
      name: "pool1"
      type: "Pooling"
      pooling_param {
        kernel_size: 2
        stride: 2
        pool: MAX
      }
      bottom: "conv1"
      top: "pool1"
    }

This says we will perform max pooling with a pool kernel size of 2 and a stride of 2 (so neighboring pooling regions do not overlap).

Similarly, you can write the second convolution and pooling layers. Check `$CAFFE_ROOT/examples/mnist/lenet_train_test.prototxt` for details.
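
As a rough sketch of what `pool1` does to the blob size, the following Python computes the pooled size along one dimension. `pool_output_size` is a hypothetical helper, not a Caffe function. It rounds partial windows up, which is our understanding of Caffe's behavior; for the sizes used here the rounding makes no difference. The input size of 24 assumes 28x28 MNIST images passed through `conv1`.

```python
import math


def pool_output_size(input_size, kernel_size, stride):
    """Pooled size along one dimension, without padding.

    Partial windows at the border are rounded up (an assumption about
    Caffe's pooling; irrelevant when the sizes divide evenly).
    """
    return math.ceil((input_size - kernel_size) / stride) + 1


# conv1 is assumed to produce 24x24 maps.
pool1_size = pool_output_size(24, kernel_size=2, stride=2)

# Start offsets of the pooling windows: consecutive windows begin exactly
# one kernel width apart, so they never overlap.
window_starts = [i * 2 for i in range(pool1_size)]

print(pool1_size, window_starts[:3])
```

The same two helpers' formulas can be applied to the second convolution and pooling layers once you have read their parameters from the prototxt file.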

### Writing the Fully Connected Layer

Writing a fully connected layer is also simple:

    layer {
      name: "ip1"
      type: "InnerProduct"
      param { lr_mult: 1 }
      param { lr_mult: 2 }
      inner_product_param {
        num_output: 500
        weight_filler {
          type: "xavier"
        }
        bias_filler {
          type: "constant"
        }
      }
      bottom: "pool2"
      top: "ip1"
    }

This defines a fully connected layer (known in Caffe as an `InnerProduct` layer) with 500 outputs. All other lines look familiar, right?

### Writing the ReLU Layer

A ReLU layer is also simple:

    layer {
      name: "relu1"
      type: "ReLU"
      bottom: "ip1"
      top: "ip1"
    }

Since ReLU is an element-wise operation, we can do *in-place* operations to save some memory. This is achieved by simply giving the same name to the bottom and top blobs. Of course, do NOT use duplicated blob names for other layer types!
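
The idea of an in-place element-wise operation can be sketched in plain Python. Here a list stands in for a blob, and `relu_in_place` is a hypothetical helper rather than anything Caffe provides: each element depends only on its own old value, so the result can safely overwrite the input buffer.

```python
def relu_in_place(blob):
    """Apply max(x, 0) element-wise, overwriting the input buffer."""
    for i, x in enumerate(blob):
        blob[i] = max(x, 0.0)
    return blob


ip1 = [-1.5, 0.0, 2.0, -0.25]   # stand-in for the `ip1` blob
out = relu_in_place(ip1)

print(out is ip1)  # True: no second buffer was allocated
print(ip1)         # [0.0, 0.0, 2.0, 0.0]
```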

After the ReLU layer, we will write another `InnerProduct` layer:

    layer {
      name: "ip2"
      type: "InnerProduct"
      param { lr_mult: 1 }
      param { lr_mult: 2 }
      inner_product_param {
        num_output: 10
        weight_filler {
          type: "xavier"
        }
        bias_filler {
          type: "constant"
        }
      }
      bottom: "ip1"
      top: "ip2"
    }

### Writing the Loss Layer

Finally, we will write the loss!

    layer {
      name: "loss"
      type: "SoftmaxWithLoss"
      bottom: "ip2"
      bottom: "label"
    }

The `SoftmaxWithLoss` layer implements both the softmax and the multinomial logistic loss (which saves time and improves numerical stability). It takes two blobs: the first is the prediction and the second is the `label` provided by the data layer (remember it?). It does not produce any outputs; all it does is compute the loss function value, report it when backpropagation starts, and initiate the gradient with respect to `ip2`. This is where all the magic starts.
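
To illustrate why fusing the two steps helps numerical stability, here is a minimal pure-Python sketch for a single example. It is our own illustration, not Caffe's implementation: `softmax_with_loss` is a hypothetical function that shifts the scores by their maximum and computes the log-probability directly, so it never takes the logarithm of a probability that has underflowed to zero.

```python
import math


def softmax_with_loss(scores, label):
    """Return (probabilities, loss, gradient w.r.t. scores) for one example."""
    shift = max(scores)                        # avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Negative log-probability of the true class, computed in log space.
    loss = -(scores[label] - shift - math.log(total))
    # Gradient of the loss with respect to the scores: probs - one_hot(label).
    grad = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
    return probs, loss, grad


# Ten equal scores: every class gets probability 0.1, so the loss is ln(10).
probs, loss, grad = softmax_with_loss([0.0] * 10, label=3)
print(loss)

# Very large scores do not overflow thanks to the shift.
_, big_loss, _ = softmax_with_loss([1000.0, 0.0], label=0)
print(big_loss)
```

The gradient returned here plays the role of the quantity the loss layer hands back to `ip2` at the start of backpropagation.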

### Additional Notes: Writing Layer Rules

Layer definitions can include rules for whether and when they are included in the network definition, like the one below:

    layer {
      # ...layer definition...
      include: { phase: TRAIN }
    }

This is a rule that controls whether the layer is included in the network, based on the current network's state.
You can refer to `$CAFFE_ROOT/src/caffe/proto/caffe.proto` for more information about layer rules and model schema.

In the above example, the layer is included only in the `TRAIN` phase.
If we change `TRAIN` to `TEST`, the layer is used only in the test phase.
By default, that is, without layer rules, a layer is always included in the network.
Thus, `lenet_train_test.prototxt` has two `Data` layers defined (with different `batch_size`), one for the training phase and one for the testing phase.
There is also an `Accuracy` layer, included only in the `TEST` phase, which reports the model accuracy every 500 iterations, as defined in `lenet_solver.prototxt`.
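
The effect of phase rules can be sketched with plain Python dictionaries. This is a simplified illustration only: the layer names and the `layers_for_phase` helper are hypothetical, and Caffe's real rule matching (see `caffe.proto`) supports more than a single phase.

```python
def layers_for_phase(layers, phase):
    """Names of the layers kept when the net is built for `phase`."""
    selected = []
    for layer in layers:
        include = layer.get("include")   # None means "no rule"
        if include is None or include == phase:
            selected.append(layer["name"])
    return selected


# Hypothetical, heavily abbreviated view of a train/test network.
layers = [
    {"name": "mnist_train", "type": "Data", "include": "TRAIN"},
    {"name": "mnist_test", "type": "Data", "include": "TEST"},
    {"name": "conv1", "type": "Convolution"},
    {"name": "accuracy", "type": "Accuracy", "include": "TEST"},
    {"name": "loss", "type": "SoftmaxWithLoss"},
]

print(layers_for_phase(layers, "TRAIN"))
print(layers_for_phase(layers, "TEST"))
```

Layers without a rule appear in both lists, which is how one file can describe both the training and the testing network.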

## Define the MNIST Solver

Check out the comments explaining each line in the prototxt `$CAFFE_ROOT/examples/mnist/lenet_solver.prototxt`:

    # The train/test net protocol buffer definition
    net: "examples/mnist/lenet_train_test.prototxt"
    # test_iter specifies how many forward passes the test should carry out.
    # In the case of MNIST, we have test batch size 100 and 100 test iterations,
    # covering the full 10,000 testing images.
    test_iter: 100
    # Carry out testing every 500 training iterations.
    test_interval: 500
    # The base learning rate, momentum and the weight decay of the network.
    base_lr: 0.01
    momentum: 0.9
    weight_decay: 0.0005
    # The learning rate policy
    lr_policy: "inv"
    gamma: 0.0001
    power: 0.75
    # Display every 100 iterations
    display: 100
    # The maximum number of iterations
    max_iter: 10000
    # snapshot intermediate results
    snapshot: 5000
    snapshot_prefix: "examples/mnist/lenet"
    # solver mode: CPU or GPU
    solver_mode: GPU
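
As a quick sanity check of these settings, the `inv` policy can be reproduced in a few lines of Python. The formula `base_lr * (1 + gamma * iter) ^ (-power)` is the one documented for `inv` in `caffe.proto`; the function name `inv_lr` is our own.

```python
def inv_lr(iteration, base_lr=0.01, gamma=0.0001, power=0.75):
    """Learning rate under the "inv" policy with the solver values above."""
    return base_lr * (1 + gamma * iteration) ** (-power)


# Learning rate at a few points during training.
for it in (0, 100, 5000, 10000):
    print(it, inv_lr(it))

# Test coverage: test_iter forward passes times the test batch size.
test_images = 100 * 100
print(test_images)
```

The value at iteration 100 agrees with the `lr = 0.00992565` that the training log reports, and the rate decays smoothly from there.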

## Training and Testing the Model

Training the model is simple after you have written the network definition protobuf and solver protobuf files. Simply run `train_lenet.sh`, or the following command directly:

    cd $CAFFE_ROOT
    ./examples/mnist/train_lenet.sh

`train_lenet.sh` is a simple script, but here is a quick explanation: the main tool for training is `caffe` with action `train` and the solver protobuf text file as its argument.

When you run the code, you will see a lot of messages flying by like this:

    I1203 net.cpp:66] Creating Layer conv1
    I1203 net.cpp:76] conv1 <- data
    I1203 net.cpp:101] conv1 -> conv1
    I1203 net.cpp:116] Top shape: 20 24 24
    I1203 net.cpp:127] conv1 needs backward computation.

These messages tell you the details about each layer, its connections and its output shape, which may be helpful in debugging. After the initialization, the training will start:

    I1203 net.cpp:142] Network initialization done.
    I1203 solver.cpp:36] Solver scaffolding done.
    I1203 solver.cpp:44] Solving LeNet

Based on the solver setting, we will print the training loss function every 100 iterations, and test the network every 500 iterations. You will see messages like this:

    I1203 solver.cpp:204] Iteration 100, lr = 0.00992565
    I1203 solver.cpp:66] Iteration 100, loss = 0.26044
    ...
    I1203 solver.cpp:84] Testing net
    I1203 solver.cpp:111] Test score #0: 0.9785
    I1203 solver.cpp:111] Test score #1: 0.0606671

For each training iteration, `lr` is the learning rate of that iteration, and `loss` is the training loss function. For the output of the testing phase, score 0 is the accuracy, and score 1 is the testing loss function.
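
If you want to plot these values, a small script can pull them out of the log. The sketch below is an assumption-laden example: the regular expressions are written against the sample lines shown above, and log formats differ between Caffe versions, so treat `parse_log` as a hypothetical starting point.

```python
import re

# Sample lines copied from the log excerpt above.
LOG = """\
I1203 solver.cpp:204] Iteration 100, lr = 0.00992565
I1203 solver.cpp:66] Iteration 100, loss = 0.26044
I1203 solver.cpp:84] Testing net
I1203 solver.cpp:111] Test score #0: 0.9785
I1203 solver.cpp:111] Test score #1: 0.0606671
"""

TRAIN_RE = re.compile(r"Iteration (\d+), (lr|loss) = ([0-9.eE+-]+)")
TEST_RE = re.compile(r"Test score #(\d+): ([0-9.eE+-]+)")


def parse_log(text):
    """Return ({iteration: {"lr": ..., "loss": ...}}, {score_index: value})."""
    train, test = {}, {}
    for line in text.splitlines():
        match = TRAIN_RE.search(line)
        if match:
            iteration = int(match.group(1))
            train.setdefault(iteration, {})[match.group(2)] = float(match.group(3))
            continue
        match = TEST_RE.search(line)
        if match:
            test[int(match.group(1))] = float(match.group(2))
    return train, test


train, test = parse_log(LOG)
print(train)
print("accuracy:", test[0], "test loss:", test[1])
```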

And after a few minutes, you are done!

    I1203 solver.cpp:84] Testing net
    I1203 solver.cpp:111] Test score #0: 0.9897
    I1203 solver.cpp:111] Test score #1: 0.0324599
    I1203 solver.cpp:126] Snapshotting to lenet_iter_10000
    I1203 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
    I1203 solver.cpp:78] Optimization Done.

The final model is stored as a binary protobuf file at

    lenet_iter_10000

which you can deploy as a trained model in your application, if you are training on a real-world application dataset.

### Um... How about GPU training?

You just did! All the training was carried out on the GPU. In fact, if you would like to do training on CPU, you can simply change one line in `lenet_solver.prototxt`:

    # solver mode: CPU or GPU
    solver_mode: CPU

and you will be using CPU for training. Isn't that easy?

MNIST is a small dataset, so training on the GPU brings little benefit because of communication overheads. On larger datasets with more complex models, such as ImageNet, the computation speed difference will be more significant.

### How to reduce the learning rate at fixed steps?

Look at `lenet_multistep_solver.prototxt`.
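
For intuition, a step-based schedule of this kind can be sketched in Python. Everything numeric below is hypothetical: the step values and `gamma` are made up for illustration and are not the ones in `lenet_multistep_solver.prototxt`, and `multistep_lr` is our own helper. The sketch assumes the rate is multiplied by `gamma` each time the iteration count reaches one of the listed step values.

```python
def multistep_lr(iteration, base_lr, gamma, step_values):
    """Learning rate after one multiplication by gamma per step value reached."""
    steps_reached = sum(1 for step in step_values if iteration >= step)
    return base_lr * gamma ** steps_reached


# Hypothetical settings, chosen only to make the behavior easy to see.
BASE_LR = 0.01
GAMMA = 0.5
STEP_VALUES = [5000, 8000]

for it in (0, 4999, 5000, 9000):
    print(it, multistep_lr(it, BASE_LR, GAMMA, STEP_VALUES))
```

Unlike the `inv` policy, the rate here stays constant between steps and drops abruptly at each one.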