Can we optimize non-locking RMW atomic operations?
Currently we convert all lock RMW ops to acquire-release semantics.

A couple of weird things to investigate here:

1. Basic ALU ops without lock
    - Non-lock ops get turned into load + ALU + store
    - Can potentially be converted into a single atomic memory operation **without** acquire-release semantics.
    - Should only be generated on ARMv8.1+ hardware that supports atomic memory ops
    - Might need hardware TSO support?
2. RMW ops that don't imply LOCK but really should, used without LOCK
    - **CMPXCHG, CMPXCHG8B, CMPXCHG16B, XADD**
    - These instructions don't imply a LOCK prefix, but they are almost universally used with one
    - Linux kernel has some optimization where it backpatches `lock cmpxchg` into `nop cmpxchg` on uniprocessors? Citation needed.
    - These might be able to be converted to operations with... release semantics?
    - Needs investigation.
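
To make the options above concrete, here is a minimal C++ sketch of the memory-ordering variants under discussion. It is not the emulator's actual code: the function names (`add_split`, `add_relaxed`, `add_acq_rel`, `cmpxchg_release`) are hypothetical, and `std::atomic` only stands in for whatever the JIT would emit. On AArch64 with the ARMv8.1 atomics available, compilers typically lower the relaxed `fetch_add` to `LDADD` and the acquire-release one to `LDADDAL`, which is the difference item 1 is asking about.

```cpp
#include <atomic>
#include <cstdint>

// Roughly what a non-LOCK `add [mem], reg` becomes today: a separate load,
// ALU op, and store. The sequence is not atomic as a unit, so another thread
// can write to `mem` between the load and the store.
inline void add_split(std::atomic<uint32_t>& mem, uint32_t value) {
  uint32_t tmp = mem.load(std::memory_order_relaxed);
  mem.store(tmp + value, std::memory_order_relaxed);
}

// Candidate for item 1: a single atomic RMW with no acquire-release ordering.
// Returns the previous value.
inline uint32_t add_relaxed(std::atomic<uint32_t>& mem, uint32_t value) {
  return mem.fetch_add(value, std::memory_order_relaxed);
}

// What LOCK-prefixed RMW ops are converted to today: acquire-release ordering.
inline uint32_t add_acq_rel(std::atomic<uint32_t>& mem, uint32_t value) {
  return mem.fetch_add(value, std::memory_order_acq_rel);
}

// Candidate for item 2 (an assumption, still needs investigation): a non-LOCK
// CMPXCHG modelled as a compare-exchange with release ordering on success.
// On failure, `expected` is updated with the value currently in memory.
inline bool cmpxchg_release(std::atomic<uint32_t>& mem, uint32_t& expected,
                            uint32_t desired) {
  return mem.compare_exchange_strong(expected, desired,
                                     std::memory_order_release,
                                     std::memory_order_relaxed);
}
```

Whether the relaxed form is safe without hardware TSO is still an open question here: x86 gives plain loads and stores ordering guarantees that a relaxed ARM atomic does not provide on its own.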