[CodeGen][CUDA] Improve CUDA vectorizer (#4736)
- Fix issues that prevented fp16 vectorization. Correct packing and
  unpacking CUDA code is now emitted. Enable more unit tests.
- Do not emit code that reads from an undefined variable when setting
  the first lane, such as:

    int _3;
    _3 = _3 & ~(0x000000ff << 0) | ...

  Emit the following code instead:

    _3 = (((0x000000ff & (_1 >> 0))+(0x000000ff & (_2 >> 0))) << 0);

  Note that nvcc 10.2 is forgiving and emits the same code for both cases.
  A warning appears in test_codegen_cuda.py.
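
  The lane-0 pattern above can be illustrated with a minimal host-side
  sketch in plain C. It is not the code TVM generates: the helper names
  get_lane and add_int8x4 are hypothetical, and masking each sum to 8 bits
  is an assumption made here to keep lanes independent. The first lane
  assigns the result directly, and only later lanes use the
  read-modify-write form, so the result is never read before it is written.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: extract 8-bit lane `lane` (0..3) from a packed word. */
static uint32_t get_lane(uint32_t packed, int lane) {
    assert(lane >= 0 && lane < 4);
    return (packed >> (lane * 8)) & 0xffu;
}

/* Hypothetical helper: add two packed 4 x 8-bit vectors lane by lane. */
static uint32_t add_int8x4(uint32_t a, uint32_t b) {
    /* Lane 0: plain assignment, no read of an uninitialized result. */
    uint32_t r = ((get_lane(a, 0) + get_lane(b, 0)) & 0xffu) << 0;

    for (int lane = 1; lane < 4; ++lane) {
        uint32_t sum = (get_lane(a, lane) + get_lane(b, lane)) & 0xffu;
        /* Later lanes: clear the target byte of the already-defined r,
           then merge the new value into it. */
        r = (r & ~(0xffu << (lane * 8))) | (sum << (lane * 8));
    }
    return r;
}
```

  For example, add_int8x4(0x01020304, 0x10203040) yields 0x11223344, and a
  lane that overflows wraps around within its own byte.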
Signed-off-by: Wei Pan <weip@nvidia.com>