[AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation
This slightly increases the costs of InsertElement instructions that are part
of a vector splat sequence, i.e. a load, InsertElement and a shuffle (load +
dup). The resulting LD1R is a high latency instruction, and this slight
increase in costs avoids SLP vectorisation for a couple of cases where this
isn't profitable.
Fixes: https://github.com/llvm/llvm-project/issues/61047
Differential Revision: https://reviews.llvm.org/D145578