Don't initialize a new `std::vector` in a loop. (#15850)
Summary:
Before this diff, we executed `std::vector<optional<acc_t>> buffer((unsigned)max_threads, optional<acc_t> {});` on every iteration of `foreach_reduced_elt`. Change the code to execute that line only when it is needed, i.e., when we are actually about to parallelize.
This overhead is quite significant when doing many small reductions in single-threaded code.
```
# Run in IPython (%timeit is an IPython magic).
import torch
x = torch.randn((1024, 10, 1024), dtype=torch.float64)
torch.set_num_threads(1)
%timeit x.std(1)
```
Before (with #15845 applied): 708.25 ms
After: 508 ms (about 28% less time)
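The idea behind the change can be illustrated with a minimal sketch. This is not the actual `foreach_reduced_elt` code: `reduce_row`, `parallelize`, and `buffer_allocations` are hypothetical names made up for illustration, the "threads" are simulated sequentially, and the buffer holds plain `acc_t` values rather than `optional<acc_t>` to keep the example small. It only shows the per-thread buffer being constructed on the parallel path and skipped on the serial one.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the accumulator type used by the reduction.
using acc_t = double;

// Illustration only: counts how often the per-thread buffer is constructed.
static int buffer_allocations = 0;

// Sums one row. `parallelize` and `max_threads` are assumed parameters that
// stand in for the real decision logic; `max_threads` must be positive.
acc_t reduce_row(const std::vector<acc_t>& row, bool parallelize, int max_threads) {
  if (!parallelize) {
    // Serial path: no per-thread buffer is needed, so none is constructed.
    acc_t total = 0;
    for (acc_t v : row) total += v;
    return total;
  }

  // Parallel path: construct the per-thread buffer only now.
  const std::size_t threads = static_cast<std::size_t>(max_threads);
  std::vector<acc_t> buffer(threads, acc_t{0});
  ++buffer_allocations;

  // Each simulated "thread" t reduces its own contiguous chunk of the row.
  const std::size_t n = row.size();
  const std::size_t chunk = (n + threads - 1) / threads;
  for (std::size_t t = 0; t < threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(n, begin + chunk);
    for (std::size_t i = begin; i < end; ++i) buffer[t] += row[i];
  }

  // Combine the per-thread partial results.
  acc_t total = 0;
  for (acc_t partial : buffer) total += partial;
  return total;
}
```

With many small reductions and a single thread, every call takes the serial branch, so the vector construction (and its heap allocation) is avoided entirely.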
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15850
Differential Revision: D13612960
Pulled By: umanwizard
fbshipit-source-id: f5e61abfe0027775c97ed81ac09c997fbee741df