Fixed fused batch norm performance regression.
author    Reed Wanderman-Milne <reedwm@google.com>
          Fri, 5 Jan 2018 23:07:51 +0000 (15:07 -0800)
committer TensorFlower Gardener <gardener@tensorflow.org>
          Fri, 5 Jan 2018 23:12:41 +0000 (15:12 -0800)
commit    3021eb0bf4d76a26d03c53c2aca7192dbf154a86
tree      ff2867e819d2b1f19ea9979aa7faa2fcecde20ca
parent    63a4f8de9ef2a3dc18f434bf6d599434680587b7

The regression was introduced by 12a4c9b8628b23cc2bf4c89c83c32760aded6124. I suspect the cause was calling cudaMemset, which runs on the default CUDA stream rather than the kernel's stream. Using the SetZeroFunctor (which goes through Eigen) performs this type of initialization on the correct stream for us.
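A minimal sketch of the suspected issue (hypothetical function and buffer names, not the actual kernel code): cudaMemset enqueues on the legacy default stream, which implicitly synchronizes with other streams, while cudaMemsetAsync (or an Eigen fill evaluated on the kernel's GPU device) stays on the compute stream.

```cpp
// Sketch only; assumes a kernel that zero-fills a temporary buffer
// before launching compute work on its own CUDA stream.
#include <cuda_runtime.h>

void ZeroTempBuffer(float* d_buf, size_t n, cudaStream_t compute_stream) {
  // Before: cudaMemset runs on the default stream, which implicitly
  // synchronizes with compute_stream and can stall the whole pipeline.
  // cudaMemset(d_buf, 0, n * sizeof(float));

  // After: enqueue the fill on the kernel's own stream so it can overlap
  // with other work. TensorFlow's SetZeroFunctor reaches the same effect
  // by evaluating a setZero() expression through Eigen on the GPU device,
  // which is bound to the correct stream.
  cudaMemsetAsync(d_buf, 0, n * sizeof(float), compute_stream);
}
```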

Benchmarks run with tf_cnn_benchmarks on a Volta DGX-1, averaged over 3 runs, with arguments: --optimizer=sgd --staged_vars=False --num_gpus=$GPU --variable_update=$VAR_UPDATE --use_fp16=True --batch_size=128 --model=$MODEL

model       gpu  var_update        im/sec after  im/sec before  percent diff
resnet50    1    replicated        680.37333     640.10333      6.29117%
resnet50    8    parameter_server  4046.04000    1282.28667     215.53319%
resnet50    8    replicated        4157.30667    1634.22667     154.38984%
inception3  1    replicated        463.88667     440.94333      5.20324%
inception3  8    parameter_server  2655.55000    902.22333      194.33400%
inception3  8    replicated        3034.81000    1033.43667     193.66192%

PiperOrigin-RevId: 180980799
tensorflow/core/kernels/fused_batch_norm_op.cc