AVX128: Scalar FMA can be optimized with AFP.NEP Currently scalar FMA eats the insert after calculation even when AFP.NEP is supported. This is because I didn't implement the AFP.NEP path. Will cut the scalar FMA implementation by one instruction in those cases, just needs to be implemented.