Can we optimize non-locking RMW atomic operations? Currently we convert all lock RMW ops to acquire-release semantics. Couple weird things to investigate here 1. Basic ALU ops without lock - Non-lock ops get turned in to load + ALU + store - Can potentially convert in to atomic memory operation **without** acquire-release semantics. - Should only generate on ARMv8.1+ if it supports atomic memory ops - Might need hardware TSO support? 2. RMW ops that don't imply LOCK but really should, used without LOCK - **CMPXCHG, CMPXCHG8B, CMPXCHG16B, XADD** - These instructions don't imply LOCK prefixes but they are almost universally used with them - Linux kernel has some optimization where it backpatches `lock cmpxchg` in to `nop cmpxchg` on uniprocessors? Citation needed. - These might be able to be converted to operations with...release? semantics? - Needs investigation.