AVX128: vpshufd can be improved

The implementation currently uses the basic VInsElement loop for both the 128-bit and 256-bit forms. This can be improved dramatically.
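For reference, here is a minimal sketch of what vpshufd computes and why a per-element insert loop is more work than needed. Each destination dword in a 128-bit lane is selected by a 2-bit field of the immediate, and the 256-bit form applies the same immediate to each 128-bit lane independently. The names `Lane`, `ShuffleLane`, `ShuffleKind`, and `Classify` are hypothetical and only for illustration; they are not part of the existing code. The idea that common immediates could each be lowered to a single host shuffle (for example an element broadcast or a byte extract) is an assumption about one possible direction, not a description of an agreed fix.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One 128-bit lane viewed as four 32-bit elements (hypothetical helper type).
using Lane = std::array<uint32_t, 4>;

// Reference semantics of (v)pshufd for a single 128-bit lane:
// destination element i takes source element imm[2*i+1 : 2*i].
inline Lane ShuffleLane(const Lane& src, uint8_t imm) {
  Lane dst{};
  for (int i = 0; i < 4; ++i) {
    dst[i] = src[(imm >> (2 * i)) & 0x3];
  }
  return dst;
}

// Hypothetical classification of immediates that would not need a
// per-element insert loop. Anything else falls back to Generic.
enum class ShuffleKind { Identity, Broadcast, SwapHalves, Generic };

inline ShuffleKind Classify(uint8_t imm) {
  const uint8_t s0 = imm & 0x3;
  const uint8_t s1 = (imm >> 2) & 0x3;
  const uint8_t s2 = (imm >> 4) & 0x3;
  const uint8_t s3 = (imm >> 6) & 0x3;

  if (imm == 0xE4) {
    return ShuffleKind::Identity;    // selectors 0,1,2,3: plain move
  }
  if (s0 == s1 && s1 == s2 && s2 == s3) {
    return ShuffleKind::Broadcast;   // all four selectors pick one element
  }
  if (imm == 0x4E) {
    return ShuffleKind::SwapHalves;  // selectors 2,3,0,1: swap 64-bit halves
  }
  return ShuffleKind::Generic;       // needs a general permute
}
```

Because the 256-bit form reuses the same immediate for both lanes, the classification only has to be done once per instruction and the chosen operation applied to each 128-bit half.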