graphic: 0.919 other: 0.904 KVM: 0.892 device: 0.889 mistranslation: 0.861 semantic: 0.860 vnc: 0.859 assembly: 0.844 instruction: 0.844 socket: 0.831 boot: 0.822 network: 0.816 Deadlock when detaching network interface [Impact] Qemu guests hang indefinitely [Description] When running a Qemu guest with VirtIO network interfaces, detaching an interface that's currently being used can result in a deadlock. The guest instance will hang and become unresponsive to commands, and the only option is to kill -9 the instance. The reason for this is a dealock between a monitor and an RCU thread, which will fight over the BQL (qemu_global_mutex) and the critical RCU section locks. The monitor thread will acquire the BQL for detaching the network interface, and fire up a helper thread to deal with detaching the network adapter. That new thread needs to wait on the RCU thread to complete the deletion, but the RCU thread wants the BQL to commit its transactions. This bug is already fixed upstream (73c6e4013b4c rcu: completely disable pthread_atfork callbacks as soon as possible) and included for other series (see below), so we don't need to backport it to Bionic onwards. Upstream commit: https://git.qemu.org/?p=qemu.git;a=commit;h=73c6e4013b4c $ git describe --contains 73c6e4013b4c v2.10.0-rc2~1^2~8 $ rmadison qemu ===> qemu | 1:2.5+dfsg-5ubuntu10.34 | xenial-updates/universe | amd64, ... qemu | 1:2.11+dfsg-1ubuntu7 | bionic/universe | amd64, ... qemu | 1:2.12+dfsg-3ubuntu8 | cosmic/universe | amd64, ... qemu | 1:3.1+dfsg-2ubuntu2 | disco/universe | amd64, ... [Test Case] Being a racing condition, this is a tricky bug to reproduce consistently. We've had reports of users running into this with OpenStack deployments and Windows Server guests, and the scenario is usually like this: 1) Deploy a 16vCPU Windows Server 2012 R2 guest with a virtio network interface 2) Stress the network interface with e.g. Windows HLK test suite or similar 3) Repeatedly attach/detach the network adapter that's in use It usually takes more than ~4000 attach/detach cycles to trigger the bug. [Regression Potential] Regressions for this might arise from the fact that the fix changes RCU lock code. Since this patch has been upstream and in other series for a while, it's unlikely that it would regressions in RCU code specifically. Other code that makes use of the RCU locks (MMIO and some monitor events) will be thoroughly tested for any regressions with autokpkgtest and scripted Qemu runs. Patch v2: Added missing DEP3 info and corrected pkg version Ok, for such complicated-to-reproduce issues I can understand it might be hard to actually perform verification. In this case besides running the listed test case (which might or might not actually trigger the problem), could you also do some additional smoketesting/dogfooding of the package once it's in -proposed to make sure we didn't regress? (things like running it with different guests and playing around) Thanks! Hello Heitor, or anyone else affected, Accepted qemu into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:2.5+dfsg-5ubuntu10.35 in a few hours, and then in the -proposed repository. Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users. If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed. Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping! N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days. Hi, As this bug is very hard to trigger, I've been running some tests with qemu 1:2.5+dfsg-5ubuntu10.35 to see if we didn't regress or break anything with the new patch. My test setup is as follows: 1) Create new QEMU guest with uvt-kvm or virt-install # uvt-kvm create xenial release=xenial --cpu 8 2) Install iperf and latest stress-ng from git. Upstream stress-ng is desired to have a consistent load on each guest # apt install iperf # git clone git://kernel.ubuntu.com/cking/stress-ng # cd stress-ng # make clean # make 3) Start an iperf server instance on the host root@host:~# iperf -s 4) Stress the guest instance with the iperf-retry.sh script and stress-ng (run those commands in different screen windows) # ./stress-ng --cpu 4 --hdd 4 --io 4 --vm 4 # ./stress-ng --class network --all 2 # ./iperf-retry.sh 5) Attach and detach network adapter using the hotplug.sh script root@host:~# ./hotplug.sh xenial 52:54:00:19:7a:21 I've tested this with different guests, including Xenial, Bionic and Disco. Guest instances performed more than ~2000 hotplug cycles each: root@host:~# ./hotplug.sh xenial 52:54:00:19:7a:21 Detach #1 Interface detached successfully Detach #2 Interface detached successfully ... Detach #2168 Interface detached successfully The QEMU guests work correctly through and after the tests, and it doesn't look like we ran into any regressions due to the RCU patch. Thanks, Heitor Thanks, this should be enough in that case. Releasing. The verification of the Stable Release Update for qemu has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions. This bug was fixed in the package qemu - 1:2.5+dfsg-5ubuntu10.35 --------------- qemu (1:2.5+dfsg-5ubuntu10.35) xenial; urgency=medium * Fix deadlock when detaching network interface (LP: #1818880) Fixed by upstream patch: - d/p/lp-1818880-rcu-disable-atfork.patch: rcu: completely disable pthread_atfork callbacks as soon as possible --