| :--- | :--- | :--- | :--- |
| KEY_CPU_THREADS_NUM | non-negative integer values| 0 | Specifies the number of threads that the CPU plugin should use for inference. Zero (the default) means using all (logical) cores.|
| KEY_CPU_BIND_THREAD | YES/NUMA/NO | YES | Binds inference threads to CPU cores. The 'YES' (default) option maps threads to cores, which works best for static/synthetic scenarios like benchmarks. The 'NUMA' binding is more relaxed: inference threads are bound only to NUMA nodes, and further scheduling to specific cores is left to the OS. This option might perform better in real-life/contended scenarios. Note that for latency-oriented cases (single execution stream, see below), both the YES and NUMA options limit the number of inference threads to the number of hardware cores (ignoring hyper-threading) on multi-socket machines. |
-| KEY_CPU_THROUGHPUT_STREAMS | KEY_CPU_THROUGHPUT_NUMA, KEY_CPU_THROUGHPUT_AUTO, or positive integer values| 1 | Specifies number of CPU "execution" streams for the throughput mode. Upper bound for the number of inference requests that can be executed simultaneously. All available CPU cores are evenly distributed between the streams. The default value is 1, which implies latency-oriented behavior with all available cores processing requests one by one.<br>KEY_CPU_THROUGHPUT_NUMA creates as many streams as needed to accommodate NUMA and avoid associated penalties.<br>KEY_CPU_THROUGHPUT_AUTO creates bare minimum of streams to improve the performance; this is the most portable option if you don't know how many cores your target machine has (and what would be the optimal number of streams). Note that your application should provide enough parallel slack (for example, run many inference requests) to leverage the throughput mode. <br> A positive integer value creates the requested number of streams. |
+| KEY_CPU_THROUGHPUT_STREAMS | KEY_CPU_THROUGHPUT_NUMA, KEY_CPU_THROUGHPUT_AUTO, or non-negative integer values| 1 | Specifies the number of CPU "execution" streams for the throughput mode. This is an upper bound for the number of inference requests that can be executed simultaneously. All available CPU cores are evenly distributed between the streams. The default value is 1, which implies latency-oriented behavior with all available cores processing requests one by one.<br>KEY_CPU_THROUGHPUT_NUMA creates as many streams as needed to accommodate NUMA and avoid the associated penalties.<br>KEY_CPU_THROUGHPUT_AUTO creates the bare minimum of streams that improves performance; this is the most portable option if you do not know how many cores your target machine has (and what the optimal number of streams would be). Note that your application should provide enough parallel slack (for example, run many inference requests) to leverage the throughput mode.<br>A non-negative integer value creates the requested number of streams. If the number of streams is 0, no internal streams are created and the user's threads are treated as stream master threads.|
| KEY_ENFORCE_BF16 | YES/NO| YES | Enables execution in bfloat16 precision whenever it is possible. This option lets the plugin know to downscale the precision where it sees performance benefits from bfloat16 execution. It does not guarantee the accuracy of the network; you need to verify the accuracy in this mode separately, based on your performance and accuracy results. The decision whether to use this option is yours. |
+> **NOTE**: To disable all internal threading, use the following set of configuration parameters: `KEY_CPU_THROUGHPUT_STREAMS=0`, `KEY_CPU_THREADS_NUM=1`, `KEY_CPU_BIND_THREAD=NO`.
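+
+As a minimal sketch (not part of the plugin itself), an application could apply these keys through the Inference Engine `Core` API before loading a network; the model path below is hypothetical:
+
+```cpp
+#include <inference_engine.hpp>
+
+int main() {
+    InferenceEngine::Core core;
+    // Disable all internal threading: zero streams, one thread, no binding
+    core.SetConfig({{CONFIG_KEY(CPU_THROUGHPUT_STREAMS), "0"},
+                    {CONFIG_KEY(CPU_THREADS_NUM), "1"},
+                    {CONFIG_KEY(CPU_BIND_THREAD), CONFIG_VALUE(NO)}},
+                   "CPU");
+    auto network = core.ReadNetwork("model.xml");  // hypothetical model path
+    auto executableNetwork = core.LoadNetwork(network, "CPU");
+    // With zero streams, inference work is executed in the caller's thread
+    return 0;
+}
+```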
+
## See Also
* [Supported Devices](Supported_Devices.md)
_impl->_streamIdQueue.pop();
}
}
- _numaNodeId = _impl->_usedNumaNodes.at(
- (_streamId % _impl->_config._streams)/
- ((_impl->_config._streams + _impl->_usedNumaNodes.size() - 1)/_impl->_usedNumaNodes.size()));
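+ // _streams == 0 means no internal streams: assign each user ("master") thread a NUMA node round-robin by its stream id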
+ _numaNodeId = _impl->_config._streams
+ ? _impl->_usedNumaNodes.at(
+ (_streamId % _impl->_config._streams)/
+ ((_impl->_config._streams + _impl->_usedNumaNodes.size() - 1)/_impl->_usedNumaNodes.size()))
+ : _impl->_usedNumaNodes.at(_streamId % _impl->_usedNumaNodes.size());
#if IE_THREAD == IE_THREAD_TBB || IE_THREAD == IE_THREAD_TBB_AUTO
auto concurrency = (0 == _impl->_config._threadsPerStream) ? tbb::task_arena::automatic : _impl->_config._threadsPerStream;
if (ThreadBindingType::NUMA == _impl->_config._threadBindingType) {
return std::make_shared<Impl::Stream>(this);
}) {
auto numaNodes = getAvailableNUMANodes();
- std::copy_n(std::begin(numaNodes),
- std::min(std::max(static_cast<std::size_t>(1),
- static_cast<std::size_t>(_config._streams)),
- numaNodes.size()),
- std::back_inserter(_usedNumaNodes));
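+ // with zero streams requested, keep every NUMA node available so user ("master") threads can be spread across them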
+ if (_config._streams != 0) {
+ std::copy_n(std::begin(numaNodes),
+ std::min(static_cast<std::size_t>(_config._streams), numaNodes.size()),
+ std::back_inserter(_usedNumaNodes));
+ } else {
+ _usedNumaNodes = numaNodes;
+ }
for (auto streamId = 0; streamId < _config._streams; ++streamId) {
_threads.emplace_back([this, streamId] {
itt::threadName(_config._name + "_" + std::to_string(streamId));
<< ". Expected only positive numbers (#streams) or "
<< "PluginConfigParams::CPU_THROUGHPUT_NUMA/CPU_THROUGHPUT_AUTO";
}
- if (val_i > 0)
- _streams = val_i;
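+ // zero is now valid: no internal streams are created and user threads act as stream masters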
+ if (val_i < 0) {
+ THROW_IE_EXCEPTION << "Wrong value for property key " << CONFIG_KEY(CPU_THROUGHPUT_STREAMS)
+ << ". Expected only positive numbers (#streams)";
+ }
+ _streams = val_i;
}
} else if (key == CONFIG_KEY(CPU_THREADS_NUM)) {
int val_i;
THROW_IE_EXCEPTION << "Wrong value for property key " << CONFIG_KEY(CPU_THREADS_NUM)
<< ". Expected only positive numbers (#threads)";
}
- if (val_i <= 0) {
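+ // zero is now valid: it keeps the default of using all (logical) cores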
+ if (val_i < 0) {
THROW_IE_EXCEPTION << "Wrong value for property key " << CONFIG_KEY(CPU_THREADS_NUM)
<< ". Expected only positive numbers (#threads)";
}
// special case when all InferRequests are muxed into a single queue
_taskExecutor = ExecutorManager::getInstance()->getExecutor("CPU");
} else {
- const int env_threads = parallel_get_env_threads();
- const auto& numa_nodes = getAvailableNUMANodes();
- const auto numa_nodes_num = numa_nodes.size();
- auto streamExecutorConfig = cfg.streamExecutorConfig;
- // use logical cores only for single-socket targets in throughput mode
- const int hw_cores = streamExecutorConfig._streams > 1 && numa_nodes_num == 1 ? parallel_get_max_threads() : getNumberOfCPUCores();
- const int threads = streamExecutorConfig._threads ? streamExecutorConfig._threads : (env_threads ? env_threads : hw_cores);
- streamExecutorConfig._threadsPerStream = streamExecutorConfig._streams
- ? std::max(1, threads/streamExecutorConfig._streams)
- : threads;
- streamExecutorConfig._name = "CPUStreamsExecutor";
- _taskExecutor = ExecutorManager::getInstance()->getIdleCPUStreamsExecutor(streamExecutorConfig);
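+ // delegate the threads-per-stream computation and other machine defaults to the common MakeDefaultMultiThreaded helper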
+ auto streamsExecutorConfig = InferenceEngine::IStreamsExecutor::Config::MakeDefaultMultiThreaded(_cfg.streamExecutorConfig);
+ streamsExecutorConfig._name = "CPUStreamsExecutor";
+ _taskExecutor = ExecutorManager::getInstance()->getIdleCPUStreamsExecutor(streamsExecutorConfig);
}
if (0 != cfg.streamExecutorConfig._streams) {
_callbackExecutor = ExecutorManager::getInstance()->getIdleCPUStreamsExecutor(