[mlir][Vector] Add 16x16 strategy to vector.transpose lowering.
This adds a `shuffle_16x16` strategy to LowerVectorTranspose and renames `shuffle` to `shuffle_1d`. The idea is similar to the 8x8 case in x86Vector::avx2. The general algorithm is:
```
interleave 32-bit lanes using
8x _mm512_unpacklo_epi32
8x _mm512_unpackhi_epi32
interleave 64-bit lanes using
8x _mm512_unpacklo_epi64
8x _mm512_unpackhi_epi64
permute 128-bit lanes using
16x _mm512_shuffle_i32x4
permute 256-bit lanes, again using
16x _mm512_shuffle_i32x4
```
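The four stages above can be modelled on plain lists to check that they compose into a full transpose. This is a scalar Python sketch, not the MLIR lowering itself: the helper names, the way rows are paired in each stage, and the immediates 0x88 and 0xDD are assumptions chosen to reproduce the row listings in this message, not values taken from the patch.

```python
def unpack(a, b, width, high):
    """Model _mm512_unpack{lo,hi}_epi{32,64} on 16 x 32-bit elements.

    width is 1 for epi32 and 2 for epi64 (counted in 32-bit elements).
    """
    out = []
    for lane in range(4):                      # four 128-bit lanes per row
        base = 4 * lane + (2 if high else 0)   # low or high half of the lane
        for k in range(0, 2, width):
            out += a[base + k:base + k + width] + b[base + k:base + k + width]
    return out


def shuffle_i32x4(a, b, imm):
    """Model _mm512_shuffle_i32x4: two 128-bit lanes from a, then two from b."""
    def lane(v, i):
        return v[4 * i:4 * i + 4]
    return (lane(a, imm & 3) + lane(a, (imm >> 2) & 3)
            + lane(b, (imm >> 4) & 3) + lane(b, (imm >> 6) & 3))


def transpose_16x16(rows):
    # Stage 1: interleave 32-bit elements of adjacent row pairs.
    t = [None] * 16
    for i in range(0, 16, 2):
        t[i] = unpack(rows[i], rows[i + 1], 1, False)
        t[i + 1] = unpack(rows[i], rows[i + 1], 1, True)
    # Stage 2: interleave 64-bit elements within each group of four rows.
    u = [None] * 16
    for j in range(0, 16, 4):
        u[j] = unpack(t[j], t[j + 2], 2, False)
        u[j + 1] = unpack(t[j], t[j + 2], 2, True)
        u[j + 2] = unpack(t[j + 1], t[j + 3], 2, False)
        u[j + 3] = unpack(t[j + 1], t[j + 3], 2, True)
    # Stage 3: permute 128-bit lanes within each group of eight rows.
    v = [None] * 16
    for j in (0, 8):
        for i in range(4):
            v[j + i] = shuffle_i32x4(u[j + i], u[j + i + 4], 0x88)
            v[j + i + 4] = shuffle_i32x4(u[j + i], u[j + i + 4], 0xDD)
    # Stage 4: same shuffle again, now pairing rows eight apart.
    w = [None] * 16
    for i in range(8):
        w[i] = shuffle_i32x4(v[i], v[i + 8], 0x88)
        w[i + 8] = shuffle_i32x4(v[i], v[i + 8], 0xDD)
    return w


matrix = [[16 * r + c for c in range(16)] for r in range(16)]
result = transpose_16x16(matrix)
```

With `matrix` numbered 0 to 255 in row-major order, `result[c]` should hold column `c` of the input.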
With the input elements numbered 0 to 255 in row-major order, the rows after the first stage are:
```
0 16 1 17 4 20 5 21 8 24 9 25 12 28 13 29
2 18 3 19 6 22 7 23 10 26 11 27 14 30 15 31
32 48 33 49 ...
34 50 35 51 ...
64 80 65 81 ...
...
```
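As a small worked example of the 32-bit interleave, the first two rows of that listing can be reproduced from input rows 0 and 1. The helper `unpack_epi32` is a hypothetical scalar stand-in for `_mm512_unpacklo_epi32` / `_mm512_unpackhi_epi32`, written only for illustration.

```python
def unpack_epi32(a, b, high):
    """Interleave the low (or high) two 32-bit elements of each 128-bit lane."""
    out = []
    for lane in range(4):
        base = 4 * lane + (2 if high else 0)
        out += [a[base], b[base], a[base + 1], b[base + 1]]
    return out


row0 = list(range(0, 16))    # input row 0: 0 .. 15
row1 = list(range(16, 32))   # input row 1: 16 .. 31

lo = unpack_epi32(row0, row1, high=False)
hi = unpack_epi32(row0, row1, high=True)
print(lo)  # [0, 16, 1, 17, 4, 20, 5, 21, 8, 24, 9, 25, 12, 28, 13, 29]
print(hi)  # [2, 18, 3, 19, 6, 22, 7, 23, 10, 26, 11, 27, 14, 30, 15, 31]
```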
After the second stage, the rows are:
```
0 16 32 48 ...
1 17 33 49 ...
2 18 34 50 ...
3 19 35 51 ...
64 80 96 112 ...
65 81 97 113 ...
66 82 98 114 ...
67 83 99 115 ...
...
```
After the third stage, the rows are:
```
0 16 32 48 8 24 40 56 64 80 96 112 ...
1 17 33 49 ...
2 18 34 50 ...
3 19 35 51 ...
4 20 36 52 ...
5 21 37 53 ...
6 22 38 54 ...
7 23 39 55 ...
128 144 160 176 ...
129 145 161 177 ...
...
```
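The lane permutation can be illustrated on the two second-stage rows that start with 0 and 64. This is a scalar sketch of how `_mm512_shuffle_i32x4` decodes its immediate; the immediates 0x88 and 0xDD are assumptions picked to match the listing above, and the names `u0` and `u4` are illustrative only.

```python
def shuffle_i32x4(a, b, imm):
    """Pick two 128-bit lanes from a (imm bits 0-3) and two from b (bits 4-7)."""
    def lane(v, i):
        return v[4 * i:4 * i + 4]
    return (lane(a, imm & 3) + lane(a, (imm >> 2) & 3)
            + lane(b, (imm >> 4) & 3) + lane(b, (imm >> 6) & 3))


# Second-stage rows: 0 16 32 48 | 4 20 36 52 | 8 24 40 56 | 12 28 44 60
u0 = [4 * k + 16 * j for k in range(4) for j in range(4)]
# and the row four below it: 64 80 96 112 | 68 84 100 116 | ...
u4 = [x + 64 for x in u0]

even = shuffle_i32x4(u0, u4, 0x88)  # lanes 0 and 2 of each operand
odd = shuffle_i32x4(u0, u4, 0xDD)   # lanes 1 and 3 of each operand
print(even)  # [0, 16, 32, 48, 8, 24, 40, 56, 64, 80, 96, 112, 72, 88, 104, 120]
print(odd)   # [4, 20, 36, 52, 12, 28, 44, 60, 68, 84, 100, 116, 76, 92, 108, 124]
```

Here `even` corresponds to the first row of the third-stage listing and `odd` to the row that starts with 4.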
After the last stage, the matrix is fully transposed:
```
0 16 32 48 64 80 96 112 ... 240
1 17 33 49 65 81 97 113 ... 241
2 18 34 50 66 82 98 114 ... 242
...
15 31 47 63 79 95 111 127 ... 255
```
Reviewed By: dcaballe
Differential Revision: https://reviews.llvm.org/D148685