Allow for concurrent quantization in FullyConnectedDNNLowPOp (#16174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16174
Our service creates a new caffe2 workspace for the same underlying network on multiple threads concurrently at service startup time (these workspaces are later reused for sequential requests). This results in concurrent quantization: each FullyConnectedDNNLowPOp calls GetOrCreateFbgemmPackBMatrix(), and the lazily performed quantizations during the first inference in each workspace are all funneled through GetOrCreateFbgemmPackBMatrix()'s cache_mutex. Quantization is therefore serialized, so at service startup only a single CPU core is used for around a minute until the serial quantization is done.
A better solution would be to avoid quantizing the same weight matrix in the operator copies across different net copies in the first place, but this is the simpler fix for our current problem.
Reviewed By: jspark1105
Differential Revision: D13708785
fbshipit-source-id: 537519896b3b939c552d67f400bafc8a69ce11eb