Optimize PF calculation

The PF calculation is very spicy and isn't producing optimal code.

Current codegen:
```asm
0x00000002a10ed054: and x24, x22, #0xff
0x00000002a10ed058: fmov d0, x24
0x00000002a10ed05c: cnt v0.8b, v0.8b
0x00000002a10ed060: addv b0, v0.8b
0x00000002a10ed064: umov w24, v0.b[0]
0x00000002a10ed068: eor x24, x24, #0x1
0x00000002a10ed06c: ubfx x24, x24, #0, #1
0x00000002a10ed070: strb w24, [x28, #706]
```

What it can be:
```asm
0x00000002a10ed064: movi v1.8b, #1
0x00000002a10ed058: fmov s0, w22
0x00000002a10ed05c: cnt v0.8b, v0.8b
0x00000002a10ed068: bic v0.8b, v1.8b, v0.8b
0x00000002a10ed070: str b0, [x28, #706]
```

Not a significant reduction in instruction count, but it removes an FPR->GPR move, which is quite an improvement. Since `cnt` counts bits per byte, lane 0 already holds the popcount of the low byte, so the `and` mask and the `addv` reduction are no longer needed, and the `bic` against 1 does the work of the `eor` and `ubfx`.

On CPUs with the CSSC extension this could be even better, but even the upcoming X4/A720/A520 don't support it.

llvm-mca results:

Before:
```
Iterations:        100
Instructions:      800
Total Cycles:      315
Total uOps:        1000

Dispatch Width:    10
uOps Per Cycle:    3.17
IPC:               2.54
Block RThroughput: 3.0
```

After:
```
Iterations:        100
Instructions:      500
Total Cycles:      408
Total uOps:        600

Dispatch Width:    10
uOps Per Cycle:    1.47
IPC:               1.23
Block RThroughput: 3.0
```
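For reviewers who want to convince themselves the two sequences agree, here is a minimal scalar sketch in C of what each one computes. It is not the JIT code itself: the function names (`popcount8`, `pf_current`, `pf_proposed`, `check_pf_equivalence`) are hypothetical, and the portable loop stands in for lane 0 of `cnt v0.8b`. The assumption is the usual x86 definition, where PF is 1 when the low byte of the result has an even number of set bits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for lane 0 of `cnt v0.8b`: population count of a single byte. */
uint8_t popcount8(uint8_t v) {
    uint8_t n = 0;
    while (v) {
        n += (uint8_t)(v & 1u);
        v >>= 1;
    }
    return n;
}

/* Shape of the current sequence: and #0xff, cnt/addv, eor #1, ubfx #0, #1. */
uint8_t pf_current(uint64_t result) {
    uint64_t t = result & 0xff;      /* and  x24, x22, #0xff       */
    t = popcount8((uint8_t)t);       /* fmov / cnt / addv / umov   */
    t ^= 1;                          /* eor  x24, x24, #0x1        */
    return (uint8_t)(t & 1);         /* ubfx x24, x24, #0, #1      */
}

/* Shape of the proposed sequence: movi #1, cnt, bic (1 AND NOT count). */
uint8_t pf_proposed(uint64_t result) {
    uint8_t count = popcount8((uint8_t)result); /* lane 0 only sees the low byte */
    return (uint8_t)(~count & 1);               /* bic v0.8b, v1.8b, v0.8b       */
}

/* Check both shapes agree; values above 0xff confirm upper bits are ignored. */
void check_pf_equivalence(void) {
    for (uint64_t v = 0; v < 0x200; ++v) {
        assert(pf_current(v) == pf_proposed(v));
    }
}
```

Calling `check_pf_equivalence()` walks every low-byte value, including inputs with bits set above the low byte, which is why dropping the explicit `and` mask is safe as long as only byte 0 is stored.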