Re-apply: [nnc] Support thread level parallelism in fused kernels (#63776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63776
I reverted the original change out of an abundance of caution after some
test failures, but they were all caused by precision issues that are fixed
lower in this stack. Let's try again.
I've rolled the elimination of the allow-parallelism-in-fusions toggle into
this diff, since the two changes are tightly coupled.
ghstack-source-id: 136529847
Test Plan: CI
Reviewed By: huiguoo
Differential Revision: D30484555
fbshipit-source-id: 38fd33520f710585d1130c365a8c60c9ce794a59