[batchnorm] Optimize batch norm layer
This patch optimizes the batch norm layer by sharing the
calculations performed in calcGradient and calcDerivative.
- reuse the dbeta and dgamma calculations (see the first sketch below)
- reduce the number of required temporary variables
- create all the required tensor variables with the context
- add support for checking whether the layer is trainable via the run
context (see the second sketch below)
- support the average operation with a preallocated output tensor
- this patch reduces memory usage as much as possible without
sacrificing speed. Further memory optimization is possible at the
expense of speed but has been omitted for now.
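
For illustration, a self-contained sketch of the sharing idea in plain
C++ follows. This is not the patch's actual code; the types and
signatures below are assumptions for this example. The point is that
the reductions sum(dy) and sum(dy * x_hat) are exactly dbeta and
dgamma, so computing them once in calcGradient lets calcDerivative
reuse them instead of redoing the reductions.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Cached reductions shared between the two backward passes.
    struct BatchNormGradCache {
      double dbeta = 0.0;  // sum(dy): gradient of the shift parameter
      double dgamma = 0.0; // sum(dy * x_hat): gradient of the scale
    };

    // calcGradient: compute the parameter gradients and cache them.
    BatchNormGradCache calcGradient(const std::vector<double> &dy,
                                    const std::vector<double> &x_hat) {
      BatchNormGradCache cache;
      for (std::size_t i = 0; i < dy.size(); ++i) {
        cache.dbeta += dy[i];
        cache.dgamma += dy[i] * x_hat[i];
      }
      return cache;
    }

    // calcDerivative: reuse the cached reductions to finish dx without
    // recomputing sum(dy) or sum(dy * x_hat).
    std::vector<double> calcDerivative(const std::vector<double> &dy,
                                       const std::vector<double> &x_hat,
                                       double gamma, double inv_std,
                                       const BatchNormGradCache &cache) {
      const double n = static_cast<double>(dy.size());
      std::vector<double> dx(dy.size());
      for (std::size_t i = 0; i < dy.size(); ++i)
        dx[i] = gamma * inv_std *
                (dy[i] - cache.dbeta / n - x_hat[i] * cache.dgamma / n);
      return dx;
    }

    int main() {
      std::vector<double> dy = {0.1, -0.2, 0.3};
      std::vector<double> x_hat = {1.0, 0.0, -1.0};
      BatchNormGradCache cache = calcGradient(dy, x_hat);
      std::vector<double> dx = calcDerivative(dy, x_hat, 1.0, 1.0, cache);
      std::cout << "dbeta=" << cache.dbeta
                << " dgamma=" << cache.dgamma << '\n';
    }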
Note: this patch yields a slight performance improvement and adds no
extra operations.
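
The trainable check could look roughly like the following
self-contained sketch; the RunContext type and the getTrainable()
accessor are assumed names for illustration, not necessarily the
project's exact API.

    #include <iostream>

    // Minimal stand-in for the run context carrying a trainable flag.
    struct RunContext {
      bool trainable;
      bool getTrainable() const { return trainable; }
    };

    void calcGradient(const RunContext &context) {
      if (!context.getTrainable()) {
        // Frozen layer: dgamma/dbeta are not needed, skip the work.
        return;
      }
      std::cout << "computing dgamma and dbeta\n"; // placeholder
    }

    int main() {
      calcGradient(RunContext{/*trainable=*/false}); // skipped
      calcGradient(RunContext{/*trainable=*/true});  // computed
    }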
Signed-off-by: Parichay Kapoor <pk.kapoor@samsung.com>