AVX128: VPERM{Q,PD} can be more optimized
We only hand optimize four selectors out of the total 256. Anything that falls out of those four turns in to a zero-register plus four inserts which is pretty bad. We don't have these sitting in instcountci at all right now.

An example of the bad output.
```json
    "vpermpd ymm0, ymm1, 01101011b": {
      "ExpectedInstructionCount": 9,
      "Comment": [
        "Map 3 0b01 0x01 256-bit"
      ],
      "ExpectedArm64ASM": [
        "ldr q2, [x28, #32]",
        "movi v3.2d, #0x0",
        "mov v4.16b, v3.16b",
        "mov v4.d[0], v2.d[1]",
        "mov v16.16b, v4.16b",
        "mov v16.d[1], v2.d[0]",
        "mov v3.d[0], v2.d[0]",
        "mov v3.d[1], v17.d[1]",
        "str q3, [x28, #16]"
      ]
    },
``` 

Theoretically worst case should devolve in to a TBL2 instruction which would be better than that mess of inserts, and of course we should support solving more selectors with handwritten optimizations.

VPERMQ and VPERMPD are aliases of each other so they behave the same.