AVX128: 256-bit vmovmsk can be improved
```json
    "vmovmskps rax, ymm0": {
      "ExpectedInstructionCount": 11,
      "Comment": [
        "Map 1 0b00 0x50 256-bit"
      ],
      "ExpectedArm64ASM": [
        "ldr q2, [x28, #16]",
        "ushr v3.4s, v16.4s, #31",
        "ldr q4, [x28, #2512]",
        "ushl v3.4s, v3.4s, v4.4s",
        "addv s3, v3.4s",
        "mov w20, v3.s[0]",
        "ushr v2.4s, v2.4s, #31",
        "ushl v2.4s, v2.4s, v4.4s",
        "addv s2, v2.4s",
        "mov w21, v2.s[0]",
        "orr x4, x20, x21, lsl #4"
      ]
```
The current implementation just runs the 128-bit algorithm twice, once per half, and then merges the two results at the end.

This can be improved by loading a second shift value for the high 128 bits, which shifts the sign bits into different locations. Then do an add between the two 128-bit halves, followed by a horizontal add and a single FPR->GPR element extract.
This would shave the instruction count from 11 to 10: a horizontal add and an FPR->GPR transfer are removed, and the final GPR `orr` is replaced by a vector add, at the cost of one extra constant load.
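
To sanity check the idea, here is a scalar C model of the proposed sequence. It is only a sketch, not FEX code. It assumes the constant loaded from `[x28, #2512]` is the per-lane shift vector `{0, 1, 2, 3}`. The second constant `{4, 5, 6, 7}` and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics: bit i of the result is the sign bit of 32-bit lane i. */
static uint32_t movmskps256_ref(const uint32_t lanes[8]) {
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (lanes[i] >> 31) << i;
    return mask;
}

/* Proposed sequence, modelled lane by lane. */
static uint32_t movmskps256_merged(const uint32_t lanes[8]) {
    /* Assumed to match the existing constant used by the 128-bit path. */
    static const uint32_t shift_lo[4] = {0, 1, 2, 3};
    /* Hypothetical second constant for the high 128 bits. */
    static const uint32_t shift_hi[4] = {4, 5, 6, 7};

    uint32_t sum[4];
    for (int i = 0; i < 4; i++) {
        uint32_t lo = (lanes[i] >> 31) << shift_lo[i];     /* ushr + ushl, low half  */
        uint32_t hi = (lanes[i + 4] >> 31) << shift_hi[i]; /* ushr + ushl, high half */
        sum[i] = lo + hi;                                  /* vector add of the halves */
    }
    /* One horizontal add (addv), then a single element extract to a GPR. */
    return sum[0] + sum[1] + sum[2] + sum[3];
}

/* Exhaustively compare both versions over all 256 sign patterns. */
static void check_all_sign_patterns(void) {
    for (uint32_t p = 0; p < 256; p++) {
        uint32_t lanes[8];
        for (int i = 0; i < 8; i++)
            lanes[i] = ((p >> i) & 1u) ? (0x80000000u | 0x12345u) : 0x7fffffffu;
        assert(movmskps256_ref(lanes) == p);
        assert(movmskps256_merged(lanes) == movmskps256_ref(lanes));
    }
}
```

Calling `check_all_sign_patterns()` exercises every combination of sign bits. The add between halves is safe because each lane contributes a distinct bit, so no carries occur and the add behaves like an `orr`.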

vmovmskpd is similar in that it also runs two 128-bit algorithms and then extracts. It uses a different algorithm, though, so it would need more noodling.