{V,}PSHUF{L,H}W implementation can be unified Currently the AVX128 implementation does a InsElement dance, the SSE and MMX versions are slightly more efficient in that it understands broadcast and a TBL implementation. These three versions can be unified, which will immediately improve the perf of the AVX128 implementation. Related to #3785