Fixed fused batch norm performance regression.
The regression was introduced by commit
12a4c9b8628b23cc2bf4c89c83c32760aded6124. I suspect the cause is that the commit calls cudaMemset, which is issued on the default stream instead of the op's CUDA stream. Using SetZeroFunctor (or Eigen) handles this type of initialization for us.
Benchmarked with tf_cnn_benchmarks on a Volta DGX1, averaged over 3 iterations, with arguments: --optimizer=sgd --staged_vars=False --num_gpus=$GPU --variable_update=$VAR_UPDATE --use_fp16=True --batch_size=128 --model=$MODEL
model       gpu  var_update        im/sec after  im/sec before  percent diff
resnet50    1    replicated        680.37333     640.10333      6.29117%
resnet50    8    parameter_server  4046.04000    1282.28667     215.53319%
resnet50    8    replicated        4157.30667    1634.22667     154.38984%
inception3  1    replicated        463.88667     440.94333      5.20324%
inception3  8    parameter_server  2655.55000    902.22333      194.33400%
inception3  8    replicated        3034.81000    1033.43667     193.66192%
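
The percent diff column appears to be the relative throughput change, (after - before) / before * 100. A minimal Python sketch for re-deriving it from the im/sec figures above, assuming that definition; the name percent_diff is illustrative and not part of tf_cnn_benchmarks. The listed throughputs are rounded averages, so the last digit may differ slightly.

```python
def percent_diff(after, before):
    """Relative change of `after` over `before`, in percent (assumed definition)."""
    return (after - before) / before * 100.0


# (model, gpus, var_update, im/sec after, im/sec before), copied from the table.
rows = [
    ("resnet50", 1, "replicated", 680.37333, 640.10333),
    ("resnet50", 8, "parameter_server", 4046.04000, 1282.28667),
    ("resnet50", 8, "replicated", 4157.30667, 1634.22667),
    ("inception3", 1, "replicated", 463.88667, 440.94333),
    ("inception3", 8, "parameter_server", 2655.55000, 902.22333),
    ("inception3", 8, "replicated", 3034.81000, 1033.43667),
]

for model, gpus, var_update, after, before in rows:
    # Print one line per configuration with the recomputed percentage.
    print(f"{model:<11} {gpus:>2} {var_update:<17} {percent_diff(after, before):.5f}%")
```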
PiperOrigin-RevId: 180980799