AVX128: Implement SSE4.1 non-temporal loads with SVE LDNT1B SSE4.1 introduced the 128-bit `MOVNTDQA` instruction This can be implemented with the SVE LDNT1B instruction Additionally AVX2 extends this to a 256-bit version, which can be implemented with ASIMD LDNP. Should be fairly similar to the store non-temporal version, just doesn't need to workaround the scalar and mmx variants since they don't exist.