[AMDGPU] Split unaligned 4 DWORD DS operations
Similarly to 3 DWORD operations it is better for performance
to split unlaligned operations as long a these are at least
DWORD alignmened. Performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-
ds_write_b128 aligned by 16: 4.9 sec
ds_write2_b64 aligned by 16: 5.1 sec
ds_write2_b32 * 2 aligned by 16: 5.5 sec
ds_write_b128 aligned by 1: 8.1 sec
ds_write2_b64 aligned by 1: 8.7 sec
ds_write2_b32 * 2 aligned by 1: 14.0 sec
ds_write_b128 aligned by 2: 8.1 sec
ds_write2_b64 aligned by 2: 8.7 sec
ds_write2_b32 * 2 aligned by 2: 14.0 sec
ds_write_b128 aligned by 4: 5.6 sec
ds_write2_b64 aligned by 4: 8.7 sec
ds_write2_b32 * 2 aligned by 4: 5.6 sec
ds_write_b128 aligned by 8: 5.6 sec
ds_write2_b64 aligned by 8: 5.1 sec
ds_write2_b32 * 2 aligned by 8: 5.6 sec
ds_read_b128 aligned by 16: 3.8 sec
ds_read2_b64 aligned by 16: 3.8 sec
ds_read2_b32 * 2 aligned by 16: 4.0 sec
ds_read_b128 aligned by 1: 4.6 sec
ds_read2_b64 aligned by 1: 8.1 sec
ds_read2_b32 * 2 aligned by 1: 14.0 sec
ds_read_b128 aligned by 2: 4.6 sec
ds_read2_b64 aligned by 2: 8.1 sec
ds_read2_b32 * 2 aligned by 2: 14.0 sec
ds_read_b128 aligned by 4: 4.6 sec
ds_read2_b64 aligned by 4: 8.1 sec
ds_read2_b32 * 2 aligned by 4: 4.0 sec
ds_read_b128 aligned by 8: 4.6 sec
ds_read2_b64 aligned by 8: 3.8 sec
ds_read2_b32 * 2 aligned by 8: 4.0 sec
Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030
ds_write_b128 aligned by 16: 6.2 sec
ds_write2_b64 aligned by 16: 7.1 sec
ds_write2_b32 * 2 aligned by 16: 7.6 sec
ds_write_b128 aligned by 1: 24.1 sec
ds_write2_b64 aligned by 1: 25.2 sec
ds_write2_b32 * 2 aligned by 1: 43.7 sec
ds_write_b128 aligned by 2: 24.1 sec
ds_write2_b64 aligned by 2: 25.1 sec
ds_write2_b32 * 2 aligned by 2: 43.7 sec
ds_write_b128 aligned by 4: 14.4 sec
ds_write2_b64 aligned by 4: 25.1 sec
ds_write2_b32 * 2 aligned by 4: 7.6 sec
ds_write_b128 aligned by 8: 14.4 sec
ds_write2_b64 aligned by 8: 7.1 sec
ds_write2_b32 * 2 aligned by 8: 7.6 sec
ds_read_b128 aligned by 16: 6.2 sec
ds_read2_b64 aligned by 16: 6.3 sec
ds_read2_b32 * 2 aligned by 16: 7.5 sec
ds_read_b128 aligned by 1: 12.5 sec
ds_read2_b64 aligned by 1: 24.0 sec
ds_read2_b32 * 2 aligned by 1: 43.6 sec
ds_read_b128 aligned by 2: 12.5 sec
ds_read2_b64 aligned by 2: 24.0 sec
ds_read2_b32 * 2 aligned by 2: 43.6 sec
ds_read_b128 aligned by 4: 12.5 sec
ds_read2_b64 aligned by 4: 24.0 sec
ds_read2_b32 * 2 aligned by 4: 7.5 sec
ds_read_b128 aligned by 8: 12.5 sec
ds_read2_b64 aligned by 8: 6.3 sec
ds_read2_b32 * 2 aligned by 8: 7.5 sec
```
Differential Revision: https://reviews.llvm.org/D123634