Implement reference counting for shared IPC CUDA tensors (#16854)
Summary:
This is to fix #16141 and similar issues.
The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.
ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854
Differential Revision:
D13994490
Pulled By: VitalyFedyunin
fbshipit-source-id:
565148ec3ac4fafb32d37fde0486b325bed6fbd1