From dee4dcba78baf712cab403d47d9db319ab7f95d6 Mon Sep 17 00:00:00 2001 From: Christian Krinitsin Date: Thu, 3 Jul 2025 19:39:53 +0200 Subject: restructure results --- results/classifier/012/debug/24190340 | 2074 --------------------------------- results/classifier/012/debug/53568181 | 96 -- results/classifier/012/debug/64571620 | 803 ------------- 3 files changed, 2973 deletions(-) delete mode 100644 results/classifier/012/debug/24190340 delete mode 100644 results/classifier/012/debug/53568181 delete mode 100644 results/classifier/012/debug/64571620 (limited to 'results/classifier/012/debug') diff --git a/results/classifier/012/debug/24190340 b/results/classifier/012/debug/24190340 deleted file mode 100644 index 4413b8bc..00000000 --- a/results/classifier/012/debug/24190340 +++ /dev/null @@ -1,2074 +0,0 @@ -debug: 0.876 -risc-v: 0.869 -register: 0.856 -permissions: 0.851 -TCG: 0.846 -device: 0.832 -performance: 0.832 -other: 0.811 -vnc: 0.808 -boot: 0.803 -arm: 0.794 -semantic: 0.793 -assembly: 0.790 -graphic: 0.775 -files: 0.770 -architecture: 0.769 -PID: 0.762 -mistranslation: 0.758 -network: 0.723 -socket: 0.715 -x86: 0.644 -kernel virtual machine: 0.617 - -[BUG, RFC] Block graph deadlock on job-dismiss - -Hi all, - -There's a bug in block layer which leads to block graph deadlock. -Notably, it takes place when blockdev IO is processed within a separate -iothread. - -This was initially caught by our tests, and I was able to reduce it to a -relatively simple reproducer. Such deadlocks are probably supposed to -be covered in iotests/graph-changes-while-io, but this deadlock isn't. - -Basically what the reproducer does is launches QEMU with a drive having -'iothread' option set, creates a chain of 2 snapshots, launches -block-commit job for a snapshot and then dismisses the job, starting -from the lower snapshot. If the guest is issuing IO at the same time, -there's a race in acquiring block graph lock and a potential deadlock. - -Here's how it can be reproduced: - -1. Run QEMU: -> -SRCDIR=/path/to/srcdir -> -> -> -> -> -$SRCDIR/build/qemu-system-x86_64 -enable-kvm \ -> -> --machine q35 -cpu Nehalem \ -> -> --name guest=alma8-vm,debug-threads=on \ -> -> --m 2g -smp 2 \ -> -> --nographic -nodefaults \ -> -> --qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \ -> -> --serial unix:/var/run/alma8-serial.sock,server=on,wait=off \ -> -> --object iothread,id=iothread0 \ -> -> --blockdev -> -node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 -> -\ -> --device virtio-blk-pci,drive=disk,iothread=iothread0 -2. Launch IO (random reads) from within the guest: -> -nc -U /var/run/alma8-serial.sock -> -... -> -[root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k -> ---size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting -> ---rw=randread --iodepth=1 --filename=/testfile -3. Run snapshots creation & removal of lower snapshot operation in a -loop (script attached): -> -while /bin/true ; do ./remove_lower_snap.sh ; done -And then it occasionally hangs. - -Note: I've tried bisecting this, and looks like deadlock occurs starting -from the following commit: - -(BAD) 5bdbaebcce virtio: Re-enable notifications after drain -(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll - -On the latest v10.0.0 it does hang as well. - - -Here's backtrace of the main thread: - -> -#0 0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, -> -timeout=, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:43 -> -#1 0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, -> -timeout=-1) at ../util/qemu-timer.c:329 -> -#2 0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, -> -ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79 -> -#3 0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at -> -../util/aio-posix.c:730 -> -#4 0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, -> -parent=0x0, poll=true) at ../block/io.c:378 -> -#5 0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at -> -../block/io.c:391 -> -#6 0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7682 -> -#7 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7608 -> -#8 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7668 -> -#9 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7608 -> -#10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7668 -> -#11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7608 -> -#12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../blockjob.c:157 -> -#13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7592 -> -#14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7661 -> -#15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx -> -(child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = -> -{...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234 -> -#16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7592 -> -#17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, -> -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -errp=0x0) -> -at ../block.c:7661 -> -#18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, -> -ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715 -> -#19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at -> -../block.c:3317 -> -#20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at -> -../blockjob.c:209 -> -#21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at -> -../blockjob.c:82 -> -#22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at -> -../job.c:474 -> -#23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at -> -../job.c:771 -> -#24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, -> -errp=0x7ffd94b4f488) at ../job.c:783 -> ---Type for more, q to quit, c to continue without paging-- -> -#25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 "commit-snap1", -> -errp=0x7ffd94b4f488) at ../job-qmp.c:138 -> -#26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, -> -ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221 -> -#27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at -> -../qapi/qmp-dispatch.c:128 -> -#28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at -> -../util/async.c:172 -> -#29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at -> -../util/async.c:219 -> -#30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at -> -../util/aio-posix.c:436 -> -#31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, -> -callback=0x0, user_data=0x0) at ../util/async.c:361 -> -#32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at -> -../glib/gmain.c:3364 -> -#33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079 -> -#34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287 -> -#35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at -> -../util/main-loop.c:310 -> -#36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at -> -../util/main-loop.c:589 -> -#37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835 -> -#38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at -> -../system/main.c:50 -> -#39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at -> -../system/main.c:80 -And here's coroutine trying to acquire read lock: - -> -(gdb) qemu coroutine reader_queue->entries.sqh_first -> -#0 0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, -> -to_=0x7fc537fff508, action=COROUTINE_YIELD) at -> -../util/coroutine-ucontext.c:321 -> -#1 0x0000557eb47d4d4a in qemu_coroutine_yield () at -> -../util/qemu-coroutine.c:339 -> -#2 0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 -> -, lock=0x7fc53c57de50, flags=0) at -> -../util/qemu-coroutine-lock.c:60 -> -#3 0x0000557eb461fea7 in bdrv_graph_co_rdlock () at ../block/graph-lock.c:231 -> -#4 0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at -> -/home/root/src/qemu/master/include/block/graph-lock.h:213 -> -#5 0x0000557eb460fa41 in blk_co_do_preadv_part -> -(blk=0x557eb84c0810, offset=6890553344, bytes=4096, qiov=0x7fc530006988, -> -qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1339 -> -#6 0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at -> -../block/block-backend.c:1619 -> -#7 0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) at -> -../util/coroutine-ucontext.c:175 -> -#8 0x00007fc547c2a360 in __start_context () at -> -../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 -> -#9 0x00007ffd94b4ea40 in () -> -#10 0x0000000000000000 in () -So it looks like main thread is processing job-dismiss request and is -holding write lock taken in block_job_remove_all_bdrv() (frame #20 -above). At the same time iothread spawns a coroutine which performs IO -request. Before the coroutine is spawned, blk_aio_prwv() increases -'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is -trying to acquire the read lock. But main thread isn't releasing the -lock as blk_root_drained_poll() returns true since blk->in_flight > 0. -Here's the deadlock. - -Any comments and suggestions on the subject are welcomed. Thanks! - -Andrey -remove_lower_snap.sh -Description: -application/shellscript - -On 4/24/25 8:32 PM, Andrey Drobyshev wrote: -> -Hi all, -> -> -There's a bug in block layer which leads to block graph deadlock. -> -Notably, it takes place when blockdev IO is processed within a separate -> -iothread. -> -> -This was initially caught by our tests, and I was able to reduce it to a -> -relatively simple reproducer. Such deadlocks are probably supposed to -> -be covered in iotests/graph-changes-while-io, but this deadlock isn't. -> -> -Basically what the reproducer does is launches QEMU with a drive having -> -'iothread' option set, creates a chain of 2 snapshots, launches -> -block-commit job for a snapshot and then dismisses the job, starting -> -from the lower snapshot. If the guest is issuing IO at the same time, -> -there's a race in acquiring block graph lock and a potential deadlock. -> -> -Here's how it can be reproduced: -> -> -[...] -> -I took a closer look at iotests/graph-changes-while-io, and have managed -to reproduce the same deadlock in a much simpler setup, without a guest. - -1. Run QSD:> ./build/storage-daemon/qemu-storage-daemon --object -iothread,id=iothread0 \ -> ---blockdev null-co,node-name=node0,read-zeroes=true \ -> -> ---nbd-server addr.type=unix,addr.path=/var/run/qsd_nbd.sock \ -> -> ---export -> -nbd,id=exp0,node-name=node0,iothread=iothread0,fixed-iothread=true,writable=true -> -\ -> ---chardev -> -socket,id=qmp-sock,path=/var/run/qsd_qmp.sock,server=on,wait=off \ -> ---monitor chardev=qmp-sock -2. Launch IO: -> -qemu-img bench -f raw -c 2000000 -> -'nbd+unix:///node0?socket=/var/run/qsd_nbd.sock' -3. Add 2 snapshots and remove lower one (script attached):> while -/bin/true ; do ./rls_qsd.sh ; done - -And then it hangs. - -I'll also send a patch with corresponding test case added directly to -iotests. - -This reproduce seems to be hanging starting from Fiona's commit -67446e605dc ("blockjob: drop AioContext lock before calling -bdrv_graph_wrlock()"). AioContext locks were dropped entirely later on -in Stefan's commit b49f4755c7 ("block: remove AioContext locking"), but -the problem remains. - -Andrey -rls_qsd.sh -Description: -application/shellscript - -From: Andrey Drobyshev - -This case is catching potential deadlock which takes place when job-dismiss -is issued when I/O requests are processed in a separate iothread. - -See -https://mail.gnu.org/archive/html/qemu-devel/2025-04/msg04421.html -Signed-off-by: Andrey Drobyshev ---- - .../qemu-iotests/tests/graph-changes-while-io | 101 ++++++++++++++++-- - .../tests/graph-changes-while-io.out | 4 +- - 2 files changed, 96 insertions(+), 9 deletions(-) - -diff --git a/tests/qemu-iotests/tests/graph-changes-while-io -b/tests/qemu-iotests/tests/graph-changes-while-io -index 194fda500e..e30f823da4 100755 ---- a/tests/qemu-iotests/tests/graph-changes-while-io -+++ b/tests/qemu-iotests/tests/graph-changes-while-io -@@ -27,6 +27,8 @@ from iotests import imgfmt, qemu_img, qemu_img_create, -qemu_io, \ - - - top = os.path.join(iotests.test_dir, 'top.img') -+snap1 = os.path.join(iotests.test_dir, 'snap1.img') -+snap2 = os.path.join(iotests.test_dir, 'snap2.img') - nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock') - - -@@ -58,6 +60,15 @@ class TestGraphChangesWhileIO(QMPTestCase): - def tearDown(self) -> None: - self.qsd.stop() - -+ def _wait_for_blockjob(self, status) -> None: -+ done = False -+ while not done: -+ for event in self.qsd.get_qmp().get_events(wait=10.0): -+ if event['event'] != 'JOB_STATUS_CHANGE': -+ continue -+ if event['data']['status'] == status: -+ done = True -+ - def test_blockdev_add_while_io(self) -> None: - # Run qemu-img bench in the background - bench_thr = Thread(target=do_qemu_img_bench) -@@ -116,13 +127,89 @@ class TestGraphChangesWhileIO(QMPTestCase): - 'device': 'job0', - }) - -- cancelled = False -- while not cancelled: -- for event in self.qsd.get_qmp().get_events(wait=10.0): -- if event['event'] != 'JOB_STATUS_CHANGE': -- continue -- if event['data']['status'] == 'null': -- cancelled = True -+ self._wait_for_blockjob('null') -+ -+ bench_thr.join() -+ -+ def test_remove_lower_snapshot_while_io(self) -> None: -+ # Run qemu-img bench in the background -+ bench_thr = Thread(target=do_qemu_img_bench, args=(100000, )) -+ bench_thr.start() -+ -+ # While I/O is performed on 'node0' node, consequently add 2 snapshots -+ # on top of it, then remove (commit) them starting from lower one. -+ while bench_thr.is_alive(): -+ # Recreate snapshot images on every iteration -+ qemu_img_create('-f', imgfmt, snap1, '1G') -+ qemu_img_create('-f', imgfmt, snap2, '1G') -+ -+ self.qsd.cmd('blockdev-add', { -+ 'driver': imgfmt, -+ 'node-name': 'snap1', -+ 'file': { -+ 'driver': 'file', -+ 'filename': snap1 -+ } -+ }) -+ -+ self.qsd.cmd('blockdev-snapshot', { -+ 'node': 'node0', -+ 'overlay': 'snap1', -+ }) -+ -+ self.qsd.cmd('blockdev-add', { -+ 'driver': imgfmt, -+ 'node-name': 'snap2', -+ 'file': { -+ 'driver': 'file', -+ 'filename': snap2 -+ } -+ }) -+ -+ self.qsd.cmd('blockdev-snapshot', { -+ 'node': 'snap1', -+ 'overlay': 'snap2', -+ }) -+ -+ self.qsd.cmd('block-commit', { -+ 'job-id': 'commit-snap1', -+ 'device': 'snap2', -+ 'top-node': 'snap1', -+ 'base-node': 'node0', -+ 'auto-finalize': True, -+ 'auto-dismiss': False, -+ }) -+ -+ self._wait_for_blockjob('concluded') -+ self.qsd.cmd('job-dismiss', { -+ 'id': 'commit-snap1', -+ }) -+ -+ self.qsd.cmd('block-commit', { -+ 'job-id': 'commit-snap2', -+ 'device': 'snap2', -+ 'top-node': 'snap2', -+ 'base-node': 'node0', -+ 'auto-finalize': True, -+ 'auto-dismiss': False, -+ }) -+ -+ self._wait_for_blockjob('ready') -+ self.qsd.cmd('job-complete', { -+ 'id': 'commit-snap2', -+ }) -+ -+ self._wait_for_blockjob('concluded') -+ self.qsd.cmd('job-dismiss', { -+ 'id': 'commit-snap2', -+ }) -+ -+ self.qsd.cmd('blockdev-del', { -+ 'node-name': 'snap1' -+ }) -+ self.qsd.cmd('blockdev-del', { -+ 'node-name': 'snap2' -+ }) - - bench_thr.join() - -diff --git a/tests/qemu-iotests/tests/graph-changes-while-io.out -b/tests/qemu-iotests/tests/graph-changes-while-io.out -index fbc63e62f8..8d7e996700 100644 ---- a/tests/qemu-iotests/tests/graph-changes-while-io.out -+++ b/tests/qemu-iotests/tests/graph-changes-while-io.out -@@ -1,5 +1,5 @@ --.. -+... - ---------------------------------------------------------------------- --Ran 2 tests -+Ran 3 tests - - OK --- -2.43.5 - -Am 24.04.25 um 19:32 schrieb Andrey Drobyshev: -> -So it looks like main thread is processing job-dismiss request and is -> -holding write lock taken in block_job_remove_all_bdrv() (frame #20 -> -above). At the same time iothread spawns a coroutine which performs IO -> -request. Before the coroutine is spawned, blk_aio_prwv() increases -> -'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is -> -trying to acquire the read lock. But main thread isn't releasing the -> -lock as blk_root_drained_poll() returns true since blk->in_flight > 0. -> -Here's the deadlock. -And for the IO test you provided, it's client->nb_requests that behaves -similarly to blk->in_flight here. - -The issue also reproduces easily when issuing the following QMP command -in a loop while doing IO on a device: - -> -void qmp_block_locked_drain(const char *node_name, Error **errp) -> -{ -> -BlockDriverState *bs; -> -> -bs = bdrv_find_node(node_name); -> -if (!bs) { -> -error_setg(errp, "node not found"); -> -return; -> -} -> -> -bdrv_graph_wrlock(); -> -bdrv_drained_begin(bs); -> -bdrv_drained_end(bs); -> -bdrv_graph_wrunlock(); -> -} -It seems like either it would be necessary to require: -1. not draining inside an exclusively locked section -or -2. making sure that variables used by drained_poll routines are only set -while holding the reader lock -? - -Those seem to require rather involved changes, so a third option might -be to make draining inside an exclusively locked section possible, by -embedding such locked sections in a drained section: - -> -diff --git a/blockjob.c b/blockjob.c -> -index 32007f31a9..9b2f3b3ea9 100644 -> ---- a/blockjob.c -> -+++ b/blockjob.c -> -@@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job) -> -* one to make sure that such a concurrent access does not attempt -> -* to process an already freed BdrvChild. -> -*/ -> -+ bdrv_drain_all_begin(); -> -bdrv_graph_wrlock(); -> -while (job->nodes) { -> -GSList *l = job->nodes; -> -@@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job) -> -g_slist_free_1(l); -> -} -> -bdrv_graph_wrunlock(); -> -+ bdrv_drain_all_end(); -> -} -> -> -bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs) -This seems to fix the issue at hand. I can send a patch if this is -considered an acceptable approach. - -Best Regards, -Fiona - -On 4/30/25 11:47 AM, Fiona Ebner wrote: -> -Am 24.04.25 um 19:32 schrieb Andrey Drobyshev: -> -> So it looks like main thread is processing job-dismiss request and is -> -> holding write lock taken in block_job_remove_all_bdrv() (frame #20 -> -> above). At the same time iothread spawns a coroutine which performs IO -> -> request. Before the coroutine is spawned, blk_aio_prwv() increases -> -> 'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is -> -> trying to acquire the read lock. But main thread isn't releasing the -> -> lock as blk_root_drained_poll() returns true since blk->in_flight > 0. -> -> Here's the deadlock. -> -> -And for the IO test you provided, it's client->nb_requests that behaves -> -similarly to blk->in_flight here. -> -> -The issue also reproduces easily when issuing the following QMP command -> -in a loop while doing IO on a device: -> -> -> void qmp_block_locked_drain(const char *node_name, Error **errp) -> -> { -> -> BlockDriverState *bs; -> -> -> -> bs = bdrv_find_node(node_name); -> -> if (!bs) { -> -> error_setg(errp, "node not found"); -> -> return; -> -> } -> -> -> -> bdrv_graph_wrlock(); -> -> bdrv_drained_begin(bs); -> -> bdrv_drained_end(bs); -> -> bdrv_graph_wrunlock(); -> -> } -> -> -It seems like either it would be necessary to require: -> -1. not draining inside an exclusively locked section -> -or -> -2. making sure that variables used by drained_poll routines are only set -> -while holding the reader lock -> -? -> -> -Those seem to require rather involved changes, so a third option might -> -be to make draining inside an exclusively locked section possible, by -> -embedding such locked sections in a drained section: -> -> -> diff --git a/blockjob.c b/blockjob.c -> -> index 32007f31a9..9b2f3b3ea9 100644 -> -> --- a/blockjob.c -> -> +++ b/blockjob.c -> -> @@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job) -> -> * one to make sure that such a concurrent access does not attempt -> -> * to process an already freed BdrvChild. -> -> */ -> -> + bdrv_drain_all_begin(); -> -> bdrv_graph_wrlock(); -> -> while (job->nodes) { -> -> GSList *l = job->nodes; -> -> @@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job) -> -> g_slist_free_1(l); -> -> } -> -> bdrv_graph_wrunlock(); -> -> + bdrv_drain_all_end(); -> -> } -> -> -> -> bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs) -> -> -This seems to fix the issue at hand. I can send a patch if this is -> -considered an acceptable approach. -> -> -Best Regards, -> -Fiona -> -Hello Fiona, - -Thanks for looking into it. I've tried your 3rd option above and can -confirm it does fix the deadlock, at least I can't reproduce it. Other -iotests also don't seem to be breaking. So I personally am fine with -that patch. Would be nice to hear a word from the maintainers though on -whether there're any caveats with such approach. - -Andrey - -On Wed, Apr 30, 2025 at 10:11 AM Andrey Drobyshev - wrote: -> -> -On 4/30/25 11:47 AM, Fiona Ebner wrote: -> -> Am 24.04.25 um 19:32 schrieb Andrey Drobyshev: -> ->> So it looks like main thread is processing job-dismiss request and is -> ->> holding write lock taken in block_job_remove_all_bdrv() (frame #20 -> ->> above). At the same time iothread spawns a coroutine which performs IO -> ->> request. Before the coroutine is spawned, blk_aio_prwv() increases -> ->> 'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is -> ->> trying to acquire the read lock. But main thread isn't releasing the -> ->> lock as blk_root_drained_poll() returns true since blk->in_flight > 0. -> ->> Here's the deadlock. -> -> -> -> And for the IO test you provided, it's client->nb_requests that behaves -> -> similarly to blk->in_flight here. -> -> -> -> The issue also reproduces easily when issuing the following QMP command -> -> in a loop while doing IO on a device: -> -> -> ->> void qmp_block_locked_drain(const char *node_name, Error **errp) -> ->> { -> ->> BlockDriverState *bs; -> ->> -> ->> bs = bdrv_find_node(node_name); -> ->> if (!bs) { -> ->> error_setg(errp, "node not found"); -> ->> return; -> ->> } -> ->> -> ->> bdrv_graph_wrlock(); -> ->> bdrv_drained_begin(bs); -> ->> bdrv_drained_end(bs); -> ->> bdrv_graph_wrunlock(); -> ->> } -> -> -> -> It seems like either it would be necessary to require: -> -> 1. not draining inside an exclusively locked section -> -> or -> -> 2. making sure that variables used by drained_poll routines are only set -> -> while holding the reader lock -> -> ? -> -> -> -> Those seem to require rather involved changes, so a third option might -> -> be to make draining inside an exclusively locked section possible, by -> -> embedding such locked sections in a drained section: -> -> -> ->> diff --git a/blockjob.c b/blockjob.c -> ->> index 32007f31a9..9b2f3b3ea9 100644 -> ->> --- a/blockjob.c -> ->> +++ b/blockjob.c -> ->> @@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job) -> ->> * one to make sure that such a concurrent access does not attempt -> ->> * to process an already freed BdrvChild. -> ->> */ -> ->> + bdrv_drain_all_begin(); -> ->> bdrv_graph_wrlock(); -> ->> while (job->nodes) { -> ->> GSList *l = job->nodes; -> ->> @@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job) -> ->> g_slist_free_1(l); -> ->> } -> ->> bdrv_graph_wrunlock(); -> ->> + bdrv_drain_all_end(); -> ->> } -> ->> -> ->> bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs) -> -> -> -> This seems to fix the issue at hand. I can send a patch if this is -> -> considered an acceptable approach. -Kevin is aware of this thread but it's a public holiday tomorrow so it -may be a little longer. - -Stefan - -Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben: -> -Hi all, -> -> -There's a bug in block layer which leads to block graph deadlock. -> -Notably, it takes place when blockdev IO is processed within a separate -> -iothread. -> -> -This was initially caught by our tests, and I was able to reduce it to a -> -relatively simple reproducer. Such deadlocks are probably supposed to -> -be covered in iotests/graph-changes-while-io, but this deadlock isn't. -> -> -Basically what the reproducer does is launches QEMU with a drive having -> -'iothread' option set, creates a chain of 2 snapshots, launches -> -block-commit job for a snapshot and then dismisses the job, starting -> -from the lower snapshot. If the guest is issuing IO at the same time, -> -there's a race in acquiring block graph lock and a potential deadlock. -> -> -Here's how it can be reproduced: -> -> -1. Run QEMU: -> -> SRCDIR=/path/to/srcdir -> -> -> -> -> -> -> -> -> -> $SRCDIR/build/qemu-system-x86_64 -enable-kvm \ -> -> -> -> -machine q35 -cpu Nehalem \ -> -> -> -> -name guest=alma8-vm,debug-threads=on \ -> -> -> -> -m 2g -smp 2 \ -> -> -> -> -nographic -nodefaults \ -> -> -> -> -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \ -> -> -> -> -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \ -> -> -> -> -object iothread,id=iothread0 \ -> -> -> -> -blockdev -> -> node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 -> -> \ -> -> -device virtio-blk-pci,drive=disk,iothread=iothread0 -> -> -2. Launch IO (random reads) from within the guest: -> -> nc -U /var/run/alma8-serial.sock -> -> ... -> -> [root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k -> -> --size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting -> -> --rw=randread --iodepth=1 --filename=/testfile -> -> -3. Run snapshots creation & removal of lower snapshot operation in a -> -loop (script attached): -> -> while /bin/true ; do ./remove_lower_snap.sh ; done -> -> -And then it occasionally hangs. -> -> -Note: I've tried bisecting this, and looks like deadlock occurs starting -> -from the following commit: -> -> -(BAD) 5bdbaebcce virtio: Re-enable notifications after drain -> -(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll -> -> -On the latest v10.0.0 it does hang as well. -> -> -> -Here's backtrace of the main thread: -> -> -> #0 0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, -> -> timeout=, sigmask=0x0) at -> -> ../sysdeps/unix/sysv/linux/ppoll.c:43 -> -> #1 0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, -> -> timeout=-1) at ../util/qemu-timer.c:329 -> -> #2 0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, -> -> ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79 -> -> #3 0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at -> -> ../util/aio-posix.c:730 -> -> #4 0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, -> -> parent=0x0, poll=true) at ../block/io.c:378 -> -> #5 0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at -> -> ../block/io.c:391 -> -> #6 0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7682 -> -> #7 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7608 -> -> #8 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7668 -> -> #9 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7608 -> -> #10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7668 -> -> #11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7608 -> -> #12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../blockjob.c:157 -> -> #13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7592 -> -> #14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7661 -> -> #15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx -> -> (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = -> -> {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234 -> -> #16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7592 -> -> #17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, -> -> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -> -> errp=0x0) -> -> at ../block.c:7661 -> -> #18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, -> -> ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715 -> -> #19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at -> -> ../block.c:3317 -> -> #20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at -> -> ../blockjob.c:209 -> -> #21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at -> -> ../blockjob.c:82 -> -> #22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at -> -> ../job.c:474 -> -> #23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at -> -> ../job.c:771 -> -> #24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, -> -> errp=0x7ffd94b4f488) at ../job.c:783 -> -> --Type for more, q to quit, c to continue without paging-- -> -> #25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 -> -> "commit-snap1", errp=0x7ffd94b4f488) at ../job-qmp.c:138 -> -> #26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, -> -> ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221 -> -> #27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at -> -> ../qapi/qmp-dispatch.c:128 -> -> #28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at -> -> ../util/async.c:172 -> -> #29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at -> -> ../util/async.c:219 -> -> #30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at -> -> ../util/aio-posix.c:436 -> -> #31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, -> -> callback=0x0, user_data=0x0) at ../util/async.c:361 -> -> #32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at -> -> ../glib/gmain.c:3364 -> -> #33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079 -> -> #34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287 -> -> #35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at -> -> ../util/main-loop.c:310 -> -> #36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at -> -> ../util/main-loop.c:589 -> -> #37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835 -> -> #38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at -> -> ../system/main.c:50 -> -> #39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at -> -> ../system/main.c:80 -> -> -> -And here's coroutine trying to acquire read lock: -> -> -> (gdb) qemu coroutine reader_queue->entries.sqh_first -> -> #0 0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, -> -> to_=0x7fc537fff508, action=COROUTINE_YIELD) at -> -> ../util/coroutine-ucontext.c:321 -> -> #1 0x0000557eb47d4d4a in qemu_coroutine_yield () at -> -> ../util/qemu-coroutine.c:339 -> -> #2 0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 -> -> , lock=0x7fc53c57de50, flags=0) at -> -> ../util/qemu-coroutine-lock.c:60 -> -> #3 0x0000557eb461fea7 in bdrv_graph_co_rdlock () at -> -> ../block/graph-lock.c:231 -> -> #4 0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at -> -> /home/root/src/qemu/master/include/block/graph-lock.h:213 -> -> #5 0x0000557eb460fa41 in blk_co_do_preadv_part -> -> (blk=0x557eb84c0810, offset=6890553344, bytes=4096, -> -> qiov=0x7fc530006988, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at -> -> ../block/block-backend.c:1339 -> -> #6 0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at -> -> ../block/block-backend.c:1619 -> -> #7 0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) -> -> at ../util/coroutine-ucontext.c:175 -> -> #8 0x00007fc547c2a360 in __start_context () at -> -> ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 -> -> #9 0x00007ffd94b4ea40 in () -> -> #10 0x0000000000000000 in () -> -> -> -So it looks like main thread is processing job-dismiss request and is -> -holding write lock taken in block_job_remove_all_bdrv() (frame #20 -> -above). At the same time iothread spawns a coroutine which performs IO -> -request. Before the coroutine is spawned, blk_aio_prwv() increases -> -'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is -> -trying to acquire the read lock. But main thread isn't releasing the -> -lock as blk_root_drained_poll() returns true since blk->in_flight > 0. -> -Here's the deadlock. -> -> -Any comments and suggestions on the subject are welcomed. Thanks! -I think this is what the blk_wait_while_drained() call was supposed to -address in blk_co_do_preadv_part(). However, with the use of multiple -I/O threads, this is racy. - -Do you think that in your case we hit the small race window between the -checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there -another reason why blk_wait_while_drained() didn't do its job? - -Kevin - -On 5/2/25 19:34, Kevin Wolf wrote: -Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben: -Hi all, - -There's a bug in block layer which leads to block graph deadlock. -Notably, it takes place when blockdev IO is processed within a separate -iothread. - -This was initially caught by our tests, and I was able to reduce it to a -relatively simple reproducer. Such deadlocks are probably supposed to -be covered in iotests/graph-changes-while-io, but this deadlock isn't. - -Basically what the reproducer does is launches QEMU with a drive having -'iothread' option set, creates a chain of 2 snapshots, launches -block-commit job for a snapshot and then dismisses the job, starting -from the lower snapshot. If the guest is issuing IO at the same time, -there's a race in acquiring block graph lock and a potential deadlock. - -Here's how it can be reproduced: - -1. Run QEMU: -SRCDIR=/path/to/srcdir -$SRCDIR/build/qemu-system-x86_64 -enable-kvm \ --machine q35 -cpu Nehalem \ - -name guest=alma8-vm,debug-threads=on \ - -m 2g -smp 2 \ - -nographic -nodefaults \ - -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \ - -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \ - -object iothread,id=iothread0 \ - -blockdev -node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 - \ - -device virtio-blk-pci,drive=disk,iothread=iothread0 -2. Launch IO (random reads) from within the guest: -nc -U /var/run/alma8-serial.sock -... -[root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k ---size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting ---rw=randread --iodepth=1 --filename=/testfile -3. Run snapshots creation & removal of lower snapshot operation in a -loop (script attached): -while /bin/true ; do ./remove_lower_snap.sh ; done -And then it occasionally hangs. - -Note: I've tried bisecting this, and looks like deadlock occurs starting -from the following commit: - -(BAD) 5bdbaebcce virtio: Re-enable notifications after drain -(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll - -On the latest v10.0.0 it does hang as well. - - -Here's backtrace of the main thread: -#0 0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, timeout=, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:43 -#1 0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, timeout=-1) -at ../util/qemu-timer.c:329 -#2 0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, -ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79 -#3 0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at -../util/aio-posix.c:730 -#4 0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, parent=0x0, -poll=true) at ../block/io.c:378 -#5 0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at -../block/io.c:391 -#6 0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7682 -#7 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7608 -#8 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7668 -#9 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7608 -#10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7668 -#11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7608 -#12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../blockjob.c:157 -#13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7592 -#14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7661 -#15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx - (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -tran=0x557eb7a87160, errp=0x0) at ../block.c:1234 -#16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7592 -#17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, -ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, -errp=0x0) - at ../block.c:7661 -#18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, -ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715 -#19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at -../block.c:3317 -#20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at -../blockjob.c:209 -#21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at -../blockjob.c:82 -#22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at ../job.c:474 -#23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at -../job.c:771 -#24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, -errp=0x7ffd94b4f488) at ../job.c:783 ---Type for more, q to quit, c to continue without paging-- -#25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 "commit-snap1", -errp=0x7ffd94b4f488) at ../job-qmp.c:138 -#26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, -ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221 -#27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at -../qapi/qmp-dispatch.c:128 -#28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at ../util/async.c:172 -#29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at -../util/async.c:219 -#30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at -../util/aio-posix.c:436 -#31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, -callback=0x0, user_data=0x0) at ../util/async.c:361 -#32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at -../glib/gmain.c:3364 -#33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079 -#34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287 -#35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at -../util/main-loop.c:310 -#36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at -../util/main-loop.c:589 -#37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835 -#38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at ../system/main.c:50 -#39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at -../system/main.c:80 -And here's coroutine trying to acquire read lock: -(gdb) qemu coroutine reader_queue->entries.sqh_first -#0 0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, -to_=0x7fc537fff508, action=COROUTINE_YIELD) at ../util/coroutine-ucontext.c:321 -#1 0x0000557eb47d4d4a in qemu_coroutine_yield () at -../util/qemu-coroutine.c:339 -#2 0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 -, lock=0x7fc53c57de50, flags=0) at -../util/qemu-coroutine-lock.c:60 -#3 0x0000557eb461fea7 in bdrv_graph_co_rdlock () at ../block/graph-lock.c:231 -#4 0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at -/home/root/src/qemu/master/include/block/graph-lock.h:213 -#5 0x0000557eb460fa41 in blk_co_do_preadv_part - (blk=0x557eb84c0810, offset=6890553344, bytes=4096, qiov=0x7fc530006988, -qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1339 -#6 0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at -../block/block-backend.c:1619 -#7 0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) at -../util/coroutine-ucontext.c:175 -#8 0x00007fc547c2a360 in __start_context () at -../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 -#9 0x00007ffd94b4ea40 in () -#10 0x0000000000000000 in () -So it looks like main thread is processing job-dismiss request and is -holding write lock taken in block_job_remove_all_bdrv() (frame #20 -above). At the same time iothread spawns a coroutine which performs IO -request. Before the coroutine is spawned, blk_aio_prwv() increases -'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is -trying to acquire the read lock. But main thread isn't releasing the -lock as blk_root_drained_poll() returns true since blk->in_flight > 0. -Here's the deadlock. - -Any comments and suggestions on the subject are welcomed. Thanks! -I think this is what the blk_wait_while_drained() call was supposed to -address in blk_co_do_preadv_part(). However, with the use of multiple -I/O threads, this is racy. - -Do you think that in your case we hit the small race window between the -checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there -another reason why blk_wait_while_drained() didn't do its job? - -Kevin -At my opinion there is very big race window. Main thread has -eaten graph write lock. After that another coroutine is stalled -within GRAPH_RDLOCK_GUARD() as there is no drain at the moment and only -after that main thread has started drain. That is why Fiona's idea is -looking working. Though this would mean that normally we should always -do that at the moment when we acquire write lock. May be even inside -this function. Den - -Am 02.05.2025 um 19:52 hat Denis V. Lunev geschrieben: -> -On 5/2/25 19:34, Kevin Wolf wrote: -> -> Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben: -> -> > Hi all, -> -> > -> -> > There's a bug in block layer which leads to block graph deadlock. -> -> > Notably, it takes place when blockdev IO is processed within a separate -> -> > iothread. -> -> > -> -> > This was initially caught by our tests, and I was able to reduce it to a -> -> > relatively simple reproducer. Such deadlocks are probably supposed to -> -> > be covered in iotests/graph-changes-while-io, but this deadlock isn't. -> -> > -> -> > Basically what the reproducer does is launches QEMU with a drive having -> -> > 'iothread' option set, creates a chain of 2 snapshots, launches -> -> > block-commit job for a snapshot and then dismisses the job, starting -> -> > from the lower snapshot. If the guest is issuing IO at the same time, -> -> > there's a race in acquiring block graph lock and a potential deadlock. -> -> > -> -> > Here's how it can be reproduced: -> -> > -> -> > 1. Run QEMU: -> -> > > SRCDIR=/path/to/srcdir -> -> > > $SRCDIR/build/qemu-system-x86_64 -enable-kvm \ -> -> > > -machine q35 -cpu Nehalem \ -> -> > > -name guest=alma8-vm,debug-threads=on \ -> -> > > -m 2g -smp 2 \ -> -> > > -nographic -nodefaults \ -> -> > > -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \ -> -> > > -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \ -> -> > > -object iothread,id=iothread0 \ -> -> > > -blockdev -> -> > > node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 -> -> > > \ -> -> > > -device virtio-blk-pci,drive=disk,iothread=iothread0 -> -> > 2. Launch IO (random reads) from within the guest: -> -> > > nc -U /var/run/alma8-serial.sock -> -> > > ... -> -> > > [root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 -> -> > > --bs=4k --size=1G --numjobs=1 --time_based=1 --runtime=300 -> -> > > --group_reporting --rw=randread --iodepth=1 --filename=/testfile -> -> > 3. Run snapshots creation & removal of lower snapshot operation in a -> -> > loop (script attached): -> -> > > while /bin/true ; do ./remove_lower_snap.sh ; done -> -> > And then it occasionally hangs. -> -> > -> -> > Note: I've tried bisecting this, and looks like deadlock occurs starting -> -> > from the following commit: -> -> > -> -> > (BAD) 5bdbaebcce virtio: Re-enable notifications after drain -> -> > (GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll -> -> > -> -> > On the latest v10.0.0 it does hang as well. -> -> > -> -> > -> -> > Here's backtrace of the main thread: -> -> > -> -> > > #0 0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, -> -> > > timeout=, sigmask=0x0) at -> -> > > ../sysdeps/unix/sysv/linux/ppoll.c:43 -> -> > > #1 0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, -> -> > > timeout=-1) at ../util/qemu-timer.c:329 -> -> > > #2 0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, -> -> > > ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79 -> -> > > #3 0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) -> -> > > at ../util/aio-posix.c:730 -> -> > > #4 0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, -> -> > > parent=0x0, poll=true) at ../block/io.c:378 -> -> > > #5 0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at -> -> > > ../block/io.c:391 -> -> > > #6 0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, -> -> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7682 -> -> > > #7 0x0000557eb45ebf2b in bdrv_child_change_aio_context -> -> > > (c=0x557eb7964250, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7608 -> -> > > #8 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, -> -> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7668 -> -> > > #9 0x0000557eb45ebf2b in bdrv_child_change_aio_context -> -> > > (c=0x557eb7e59110, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7608 -> -> > > #10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, -> -> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7668 -> -> > > #11 0x0000557eb45ebf2b in bdrv_child_change_aio_context -> -> > > (c=0x557eb814ed80, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7608 -> -> > > #12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, -> -> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../blockjob.c:157 -> -> > > #13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context -> -> > > (c=0x557eb7c9d3f0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7592 -> -> > > #14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, -> -> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7661 -> -> > > #15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx -> -> > > (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 -> -> > > = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234 -> -> > > #16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context -> -> > > (c=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7592 -> -> > > #17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, -> -> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, -> -> > > tran=0x557eb7a87160, errp=0x0) -> -> > > at ../block.c:7661 -> -> > > #18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context -> -> > > (bs=0x557eb79575e0, ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at -> -> > > ../block.c:7715 -> -> > > #19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) -> -> > > at ../block.c:3317 -> -> > > #20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv -> -> > > (job=0x557eb7952800) at ../blockjob.c:209 -> -> > > #21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at -> -> > > ../blockjob.c:82 -> -> > > #22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at -> -> > > ../job.c:474 -> -> > > #23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at -> -> > > ../job.c:771 -> -> > > #24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, -> -> > > errp=0x7ffd94b4f488) at ../job.c:783 -> -> > > --Type for more, q to quit, c to continue without paging-- -> -> > > #25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 -> -> > > "commit-snap1", errp=0x7ffd94b4f488) at ../job-qmp.c:138 -> -> > > #26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, -> -> > > ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221 -> -> > > #27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at -> -> > > ../qapi/qmp-dispatch.c:128 -> -> > > #28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at -> -> > > ../util/async.c:172 -> -> > > #29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at -> -> > > ../util/async.c:219 -> -> > > #30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at -> -> > > ../util/aio-posix.c:436 -> -> > > #31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, -> -> > > callback=0x0, user_data=0x0) at ../util/async.c:361 -> -> > > #32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at -> -> > > ../glib/gmain.c:3364 -> -> > > #33 g_main_context_dispatch (context=0x557eb76c6430) at -> -> > > ../glib/gmain.c:4079 -> -> > > #34 0x0000557eb47d3ab1 in glib_pollfds_poll () at -> -> > > ../util/main-loop.c:287 -> -> > > #35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at -> -> > > ../util/main-loop.c:310 -> -> > > #36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at -> -> > > ../util/main-loop.c:589 -> -> > > #37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835 -> -> > > #38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at -> -> > > ../system/main.c:50 -> -> > > #39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at -> -> > > ../system/main.c:80 -> -> > -> -> > And here's coroutine trying to acquire read lock: -> -> > -> -> > > (gdb) qemu coroutine reader_queue->entries.sqh_first -> -> > > #0 0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, -> -> > > to_=0x7fc537fff508, action=COROUTINE_YIELD) at -> -> > > ../util/coroutine-ucontext.c:321 -> -> > > #1 0x0000557eb47d4d4a in qemu_coroutine_yield () at -> -> > > ../util/qemu-coroutine.c:339 -> -> > > #2 0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 -> -> > > , lock=0x7fc53c57de50, flags=0) at -> -> > > ../util/qemu-coroutine-lock.c:60 -> -> > > #3 0x0000557eb461fea7 in bdrv_graph_co_rdlock () at -> -> > > ../block/graph-lock.c:231 -> -> > > #4 0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) -> -> > > at /home/root/src/qemu/master/include/block/graph-lock.h:213 -> -> > > #5 0x0000557eb460fa41 in blk_co_do_preadv_part -> -> > > (blk=0x557eb84c0810, offset=6890553344, bytes=4096, -> -> > > qiov=0x7fc530006988, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at -> -> > > ../block/block-backend.c:1339 -> -> > > #6 0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at -> -> > > ../block/block-backend.c:1619 -> -> > > #7 0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, -> -> > > i1=21886) at ../util/coroutine-ucontext.c:175 -> -> > > #8 0x00007fc547c2a360 in __start_context () at -> -> > > ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 -> -> > > #9 0x00007ffd94b4ea40 in () -> -> > > #10 0x0000000000000000 in () -> -> > -> -> > So it looks like main thread is processing job-dismiss request and is -> -> > holding write lock taken in block_job_remove_all_bdrv() (frame #20 -> -> > above). At the same time iothread spawns a coroutine which performs IO -> -> > request. Before the coroutine is spawned, blk_aio_prwv() increases -> -> > 'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is -> -> > trying to acquire the read lock. But main thread isn't releasing the -> -> > lock as blk_root_drained_poll() returns true since blk->in_flight > 0. -> -> > Here's the deadlock. -> -> > -> -> > Any comments and suggestions on the subject are welcomed. Thanks! -> -> I think this is what the blk_wait_while_drained() call was supposed to -> -> address in blk_co_do_preadv_part(). However, with the use of multiple -> -> I/O threads, this is racy. -> -> -> -> Do you think that in your case we hit the small race window between the -> -> checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there -> -> another reason why blk_wait_while_drained() didn't do its job? -> -> -> -At my opinion there is very big race window. Main thread has -> -eaten graph write lock. After that another coroutine is stalled -> -within GRAPH_RDLOCK_GUARD() as there is no drain at the moment and only -> -after that main thread has started drain. -You're right, I confused taking the write lock with draining there. - -> -That is why Fiona's idea is looking working. Though this would mean -> -that normally we should always do that at the moment when we acquire -> -write lock. May be even inside this function. -I actually see now that not all of my graph locking patches were merged. -At least I did have the thought that bdrv_drained_begin() must be marked -GRAPH_UNLOCKED because it polls. That means that calling it from inside -bdrv_try_change_aio_context() is actually forbidden (and that's the part -I didn't see back then because it doesn't have TSA annotations). - -If you refactor the code to move the drain out to before the lock is -taken, I think you end up with Fiona's patch, except you'll remove the -forbidden inner drain and add more annotations for some functions and -clarify the rules around them. I don't know, but I wouldn't be surprised -if along the process we find other bugs, too. - -So Fiona's drain looks right to me, but we should probably approach it -more systematically. - -Kevin - diff --git a/results/classifier/012/debug/53568181 b/results/classifier/012/debug/53568181 deleted file mode 100644 index c79e6f59..00000000 --- a/results/classifier/012/debug/53568181 +++ /dev/null @@ -1,96 +0,0 @@ -debug: 0.968 -permissions: 0.965 -performance: 0.948 -architecture: 0.945 -semantic: 0.943 -graphic: 0.940 -PID: 0.938 -assembly: 0.936 -device: 0.936 -vnc: 0.935 -x86: 0.932 -register: 0.930 -network: 0.925 -other: 0.921 -TCG: 0.921 -arm: 0.918 -risc-v: 0.907 -files: 0.890 -kernel virtual machine: 0.877 -boot: 0.876 -socket: 0.875 -mistranslation: 0.854 - -[BUG] x86/PAT handling severely crippled AMD-V SVM KVM performance - -Hi, I maintain an out-of-tree 3D APIs pass-through QEMU device models at -https://github.com/kjliew/qemu-3dfx -that provide 3D acceleration for legacy -32-bit Windows guests (Win98SE, WinME, Win2k and WinXP) with the focus on -playing old legacy games from 1996-2003. It currently supports the now-defunct -3Dfx propriety API called Glide and an alternative OpenGL pass-through based on -MESA implementation. - -The basic concept of both implementations create memory-mapped virtual -interfaces consist of host/guest shared memory with guest-push model instead of -a more common host-pull model for typical QEMU device model implementation. -Guest uses shared memory as FIFOs for drawing commands and data to bulk up the -operations until serialization event that flushes the FIFOs into host. This -achieves extremely good performance since virtual CPUs are fast with hardware -acceleration (Intel VT/AMD-V) and reduces the overhead of frequent VMEXITs to -service the device emulation. Both implementations work on Windows 10 with WHPX -and HAXM accelerators as well as KVM in Linux. - -On Windows 10, QEMU WHPX implementation does not sync MSR_IA32_PAT during -host/guest states sync. There is no visibility into the closed-source WHPX on -how things are managed behind the scene, but from measuring performance figures -I can conclude that it didn't handle the MSR_IA32_PAT correctly for both Intel -and AMD. Call this fair enough, if you will, it didn't flag any concerns, in -fact games such as Quake2 and Quake3 were still within playable frame rate of -40~60FPS on Win2k/XP guest. Until the same games were run on Win98/ME guest and -the frame rate blew off the roof (300~500FPS) on the same CPU and GPU. In fact, -the later seemed to be more inlined with runnng the games bare-metal with vsync -off. - -On Linux (at the time of writing kernel 5.6.7/Mesa 20.0), the difference -prevailed. Intel CPUs (and it so happened that I was on laptop with Intel GPU), -the VMX-based kvm_intel got it right while SVM-based kvm_amd did not. -To put this in simple exaggeration, an aging Core i3-4010U/HD Graphics 4400 -(Haswell GT2) exhibited an insane performance in Quake2/Quake3 timedemos that -totally crushed more recent AMD Ryzen 2500U APU/Vega 8 Graphics and AMD -FX8300/NVIDIA GT730 on desktop. Simply unbelievable! - -It turned out that there was something to do with AMD-V NPT. By loading kvm_amd -with npt=0, AMD Ryzen APU and FX8300 regained a huge performance leap. However, -AMD NPT issue with KVM was supposedly fixed in 2017 kernel commits. NPT=0 would -actually incur performance loss for VM due to intervention required by -hypervisors to maintain the shadow page tables. Finally, I was able to find the -pointer that pointed to MSR_IA32_PAT register. By updating the MSR_IA32_PAT to -0x0606xxxx0606xxxxULL, AMD CPUs now regain their rightful performance without -taking the hit of NPT=0 for Linux KVM. Taking the same solution into Windows, -both Intel and AMD CPUs no longer require Win98/ME guest to unleash the full -performance potentials and performance figures based on games measured on WHPX -were not very far behind Linux KVM. - -So I guess the problem lies in host/guest shared memory regions mapped as -uncacheable from virtual CPU perspective. As virtual CPUs now completely execute -in hardware context with x86 hardware virtualiztion extensions, the cacheability -of memory types would severely impact the performance on guests. WHPX didn't -handle it for both Intel EPT and AMD NPT, but KVM seems to do it right for Intel -EPT. I don't have the correct fix for QEMU. But what I can do for my 3D APIs -pass-through device models is to implement host-side hooks to reprogram and -restore MSR_IA32_PAT upon activation/deactivation of the 3D APIs. Perhaps there -is also a better solution of having the proper kernel drivers for virtual -interfaces to manage the memory types of host/guest shared memory in kernel -space, but to do that and the needs of Microsoft tools/DDKs, I will just forget -it. The guest stubs uses the same kernel drivers included in 3Dfx drivers for -memory mapping and the virtual interfaces remain driver-less from Windows OS -perspective. Considering the current state of halting progress for QEMU native -virgil3D to support Windows OS, I am just being pragmatic. I understand that -QEMU virgil3D will eventually bring 3D acceleration for Windows guests, but I do -not expect anything to support legacy 32-bit Windows OSes which have out-grown -their commercial usefulness. - -Regards, -KJ Liew - diff --git a/results/classifier/012/debug/64571620 b/results/classifier/012/debug/64571620 deleted file mode 100644 index 534d8903..00000000 --- a/results/classifier/012/debug/64571620 +++ /dev/null @@ -1,803 +0,0 @@ -debug: 0.927 -other: 0.922 -mistranslation: 0.917 -assembly: 0.913 -arm: 0.910 -risc-v: 0.909 -semantic: 0.903 -permissions: 0.902 -device: 0.899 -performance: 0.897 -graphic: 0.897 -architecture: 0.895 -PID: 0.887 -boot: 0.879 -register: 0.871 -files: 0.855 -socket: 0.855 -network: 0.853 -kernel virtual machine: 0.852 -TCG: 0.843 -vnc: 0.819 -x86: 0.794 - -[BUG] Migration hv_time rollback - -Hi, - -We are experiencing timestamp rollbacks during live-migration of -Windows 10 guests with the following qemu configuration (linux 5.4.46 -and qemu master): -``` -$ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] -``` - -I have tracked the bug to the fact that `kvmclock` is not exposed and -disabled from qemu PoV but is in fact used by `hv-time` (in KVM). - -I think we should enable the `kvmclock` (qemu device) if `hv-time` is -present and add Hyper-V support for the `kvmclock_current_nsec` -function. - -I'm asking for advice because I am unsure this is the _right_ approach -and how to keep migration compatibility between qemu versions. - -Thank you all, - --- -Antoine 'xdbob' Damhet -signature.asc -Description: -PGP signature - -cc'ing in Vitaly who knows about the hv stuff. - -* Antoine Damhet (antoine.damhet@blade-group.com) wrote: -> -Hi, -> -> -We are experiencing timestamp rollbacks during live-migration of -> -Windows 10 guests with the following qemu configuration (linux 5.4.46 -> -and qemu master): -> -``` -> -$ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] -> -``` -How big a jump are you seeing, and how did you notice it in the guest? - -Dave - -> -I have tracked the bug to the fact that `kvmclock` is not exposed and -> -disabled from qemu PoV but is in fact used by `hv-time` (in KVM). -> -> -I think we should enable the `kvmclock` (qemu device) if `hv-time` is -> -present and add Hyper-V support for the `kvmclock_current_nsec` -> -function. -> -> -I'm asking for advice because I am unsure this is the _right_ approach -> -and how to keep migration compatibility between qemu versions. -> -> -Thank you all, -> -> --- -> -Antoine 'xdbob' Damhet --- -Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK - -"Dr. David Alan Gilbert" writes: - -> -cc'ing in Vitaly who knows about the hv stuff. -> -cc'ing Marcelo who knows about clocksources :-) - -> -* Antoine Damhet (antoine.damhet@blade-group.com) wrote: -> -> Hi, -> -> -> -> We are experiencing timestamp rollbacks during live-migration of -> -> Windows 10 guests -Are you migrating to the same hardware (with the same TSC frequency)? Is -TSC used as the clocksource on the host? - -> -> with the following qemu configuration (linux 5.4.46 -> -> and qemu master): -> -> ``` -> -> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] -> -> ``` -Out of pure curiosity, what's the purpose of doing 'kvm=off'? Windows is -not going to check for KVM identification anyway so we pretend we're -Hyper-V. - -Also, have you tried adding more Hyper-V enlightenments? - -> -> -How big a jump are you seeing, and how did you notice it in the guest? -> -> -Dave -> -> -> I have tracked the bug to the fact that `kvmclock` is not exposed and -> -> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). -> -> -> -> I think we should enable the `kvmclock` (qemu device) if `hv-time` is -> -> present and add Hyper-V support for the `kvmclock_current_nsec` -> -> function. -AFAICT kvmclock_current_nsec() checks whether kvmclock was enabled by -the guest: - - if (!(env->system_time_msr & 1ULL)) { - /* KVM clock not active */ - return 0; - } - -and this is (and way) always false for Windows guests. - -> -> -> -> I'm asking for advice because I am unsure this is the _right_ approach -> -> and how to keep migration compatibility between qemu versions. -> -> -> -> Thank you all, -> -> -> -> -- -> -> Antoine 'xdbob' Damhet --- -Vitaly - -On Wed, Sep 16, 2020 at 01:59:43PM +0200, Vitaly Kuznetsov wrote: -> -"Dr. David Alan Gilbert" writes: -> -> -> cc'ing in Vitaly who knows about the hv stuff. -> -> -> -> -cc'ing Marcelo who knows about clocksources :-) -> -> -> * Antoine Damhet (antoine.damhet@blade-group.com) wrote: -> ->> Hi, -> ->> -> ->> We are experiencing timestamp rollbacks during live-migration of -> ->> Windows 10 guests -> -> -Are you migrating to the same hardware (with the same TSC frequency)? Is -> -TSC used as the clocksource on the host? -Yes we are migrating to the exact same hardware. And yes TSC is used as -a clocksource in the host (but the bug is still happening with `hpet` as -a clocksource). - -> -> ->> with the following qemu configuration (linux 5.4.46 -> ->> and qemu master): -> ->> ``` -> ->> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] -> ->> ``` -> -> -Out of pure curiosity, what's the purpose of doing 'kvm=off'? Windows is -> -not going to check for KVM identification anyway so we pretend we're -> -Hyper-V. -Some softwares explicitly checks for the presence of KVM and then crash -if they find it in CPUID :/ - -> -> -Also, have you tried adding more Hyper-V enlightenments? -Yes, I published a stripped-down command-line for a minimal reproducer -but even `hv-frequencies` and `hv-reenlightenment` don't help. - -> -> -> -> -> How big a jump are you seeing, and how did you notice it in the guest? -> -> -> -> Dave -> -> -> ->> I have tracked the bug to the fact that `kvmclock` is not exposed and -> ->> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). -> ->> -> ->> I think we should enable the `kvmclock` (qemu device) if `hv-time` is -> ->> present and add Hyper-V support for the `kvmclock_current_nsec` -> ->> function. -> -> -AFAICT kvmclock_current_nsec() checks whether kvmclock was enabled by -> -the guest: -> -> -if (!(env->system_time_msr & 1ULL)) { -> -/* KVM clock not active */ -> -return 0; -> -} -> -> -and this is (and way) always false for Windows guests. -Hooo, I missed this piece. When is `clock_is_reliable` expected to be -false ? Because if it is I still think we should be able to query at -least `HV_X64_MSR_REFERENCE_TSC` - -> -> ->> -> ->> I'm asking for advice because I am unsure this is the _right_ approach -> ->> and how to keep migration compatibility between qemu versions. -> ->> -> ->> Thank you all, -> ->> -> ->> -- -> ->> Antoine 'xdbob' Damhet -> -> --- -> -Vitaly -> --- -Antoine 'xdbob' Damhet -signature.asc -Description: -PGP signature - -On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote: -> -cc'ing in Vitaly who knows about the hv stuff. -Thanks - -> -> -* Antoine Damhet (antoine.damhet@blade-group.com) wrote: -> -> Hi, -> -> -> -> We are experiencing timestamp rollbacks during live-migration of -> -> Windows 10 guests with the following qemu configuration (linux 5.4.46 -> -> and qemu master): -> -> ``` -> -> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] -> -> ``` -> -> -How big a jump are you seeing, and how did you notice it in the guest? -I'm seeing jumps of about the guest uptime (indicating a reset of the -counter). It's expected because we won't call `KVM_SET_CLOCK` to -restore any value. - -We first noticed it because after some migrations `dwm.exe` crashes with -the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in -the past." error code. - -I can also confirm the following hack makes the behavior disappear: - -``` -diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c -index 64283358f9..f334bdf35f 100644 ---- a/hw/i386/kvm/clock.c -+++ b/hw/i386/kvm/clock.c -@@ -332,11 +332,7 @@ void kvmclock_create(void) - { - X86CPU *cpu = X86_CPU(first_cpu); - -- if (kvm_enabled() && -- cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | -- (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { -- sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); -- } -+ sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); - } - - static void kvmclock_register_types(void) -diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c -index 32b1453e6a..11d980ba85 100644 ---- a/hw/i386/pc_piix.c -+++ b/hw/i386/pc_piix.c -@@ -158,9 +158,7 @@ static void pc_init1(MachineState *machine, - - x86_cpus_init(x86ms, pcmc->default_cpu_version); - -- if (kvm_enabled() && pcmc->kvmclock_enabled) { -- kvmclock_create(); -- } -+ kvmclock_create(); - - if (pcmc->pci_enabled) { - pci_memory = g_new(MemoryRegion, 1); -``` - -> -> -Dave -> -> -> I have tracked the bug to the fact that `kvmclock` is not exposed and -> -> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). -> -> -> -> I think we should enable the `kvmclock` (qemu device) if `hv-time` is -> -> present and add Hyper-V support for the `kvmclock_current_nsec` -> -> function. -> -> -> -> I'm asking for advice because I am unsure this is the _right_ approach -> -> and how to keep migration compatibility between qemu versions. -> -> -> -> Thank you all, -> -> -> -> -- -> -> Antoine 'xdbob' Damhet -> -> -> --- -> -Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK -> --- -Antoine 'xdbob' Damhet -signature.asc -Description: -PGP signature - -Antoine Damhet writes: - -> -On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote: -> -> cc'ing in Vitaly who knows about the hv stuff. -> -> -Thanks -> -> -> -> -> * Antoine Damhet (antoine.damhet@blade-group.com) wrote: -> -> > Hi, -> -> > -> -> > We are experiencing timestamp rollbacks during live-migration of -> -> > Windows 10 guests with the following qemu configuration (linux 5.4.46 -> -> > and qemu master): -> -> > ``` -> -> > $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] -> -> > ``` -> -> -> -> How big a jump are you seeing, and how did you notice it in the guest? -> -> -I'm seeing jumps of about the guest uptime (indicating a reset of the -> -counter). It's expected because we won't call `KVM_SET_CLOCK` to -> -restore any value. -> -> -We first noticed it because after some migrations `dwm.exe` crashes with -> -the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in -> -the past." error code. -> -> -I can also confirm the following hack makes the behavior disappear: -> -> -``` -> -diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c -> -index 64283358f9..f334bdf35f 100644 -> ---- a/hw/i386/kvm/clock.c -> -+++ b/hw/i386/kvm/clock.c -> -@@ -332,11 +332,7 @@ void kvmclock_create(void) -> -{ -> -X86CPU *cpu = X86_CPU(first_cpu); -> -> -- if (kvm_enabled() && -> -- cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | -> -- (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { -> -- sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); -> -- } -> -+ sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); -> -} -> -Oh, I think I see what's going on. When you add 'kvm=off' -cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so -kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on -migration. - -In case we really want to support 'kvm=off' I think we can add Hyper-V -features check here along with KVM, this should do the job. - --- -Vitaly - -Vitaly Kuznetsov writes: - -> -Antoine Damhet writes: -> -> -> On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote: -> ->> cc'ing in Vitaly who knows about the hv stuff. -> -> -> -> Thanks -> -> -> ->> -> ->> * Antoine Damhet (antoine.damhet@blade-group.com) wrote: -> ->> > Hi, -> ->> > -> ->> > We are experiencing timestamp rollbacks during live-migration of -> ->> > Windows 10 guests with the following qemu configuration (linux 5.4.46 -> ->> > and qemu master): -> ->> > ``` -> ->> > $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] -> ->> > ``` -> ->> -> ->> How big a jump are you seeing, and how did you notice it in the guest? -> -> -> -> I'm seeing jumps of about the guest uptime (indicating a reset of the -> -> counter). It's expected because we won't call `KVM_SET_CLOCK` to -> -> restore any value. -> -> -> -> We first noticed it because after some migrations `dwm.exe` crashes with -> -> the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in -> -> the past." error code. -> -> -> -> I can also confirm the following hack makes the behavior disappear: -> -> -> -> ``` -> -> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c -> -> index 64283358f9..f334bdf35f 100644 -> -> --- a/hw/i386/kvm/clock.c -> -> +++ b/hw/i386/kvm/clock.c -> -> @@ -332,11 +332,7 @@ void kvmclock_create(void) -> -> { -> -> X86CPU *cpu = X86_CPU(first_cpu); -> -> -> -> - if (kvm_enabled() && -> -> - cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | -> -> - (1ULL << KVM_FEATURE_CLOCKSOURCE2))) -> -> { -> -> - sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); -> -> - } -> -> + sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); -> -> } -> -> -> -> -> -Oh, I think I see what's going on. When you add 'kvm=off' -> -cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so -> -kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on -> -migration. -> -> -In case we really want to support 'kvm=off' I think we can add Hyper-V -> -features check here along with KVM, this should do the job. -Does the untested - -diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c -index 64283358f91d..e03b2ca6d8f6 100644 ---- a/hw/i386/kvm/clock.c -+++ b/hw/i386/kvm/clock.c -@@ -333,8 +333,9 @@ void kvmclock_create(void) - X86CPU *cpu = X86_CPU(first_cpu); - - if (kvm_enabled() && -- cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | -- (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { -+ ((cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | -+ (1ULL << KVM_FEATURE_CLOCKSOURCE2))) -|| -+ (cpu->env.features[FEAT_HYPERV_EAX] & HV_TIME_REF_COUNT_AVAILABLE))) { - sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); - } - } - -help? - -(I don't think we need to remove all 'if (kvm_enabled())' checks from -machine types as 'kvm=off' should not be related). - --- -Vitaly - -On Wed, Sep 16, 2020 at 02:50:56PM +0200, Vitaly Kuznetsov wrote: -[...] - -> ->> -> -> -> -> -> -> Oh, I think I see what's going on. When you add 'kvm=off' -> -> cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so -> -> kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on -> -> migration. -> -> -> -> In case we really want to support 'kvm=off' I think we can add Hyper-V -> -> features check here along with KVM, this should do the job. -> -> -Does the untested -> -> -diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c -> -index 64283358f91d..e03b2ca6d8f6 100644 -> ---- a/hw/i386/kvm/clock.c -> -+++ b/hw/i386/kvm/clock.c -> -@@ -333,8 +333,9 @@ void kvmclock_create(void) -> -X86CPU *cpu = X86_CPU(first_cpu); -> -> -if (kvm_enabled() && -> -- cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | -> -- (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { -> -+ ((cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | -> -+ (1ULL << -> -KVM_FEATURE_CLOCKSOURCE2))) || -> -+ (cpu->env.features[FEAT_HYPERV_EAX] & -> -HV_TIME_REF_COUNT_AVAILABLE))) { -> -sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); -> -} -> -} -> -> -help? -It appears to work :) - -> -> -(I don't think we need to remove all 'if (kvm_enabled())' checks from -> -machine types as 'kvm=off' should not be related). -Indeed (I didn't look at the macro, it was just quick & dirty). - -> -> --- -> -Vitaly -> -> --- -Antoine 'xdbob' Damhet -signature.asc -Description: -PGP signature - -On 16/09/20 13:29, Dr. David Alan Gilbert wrote: -> -> I have tracked the bug to the fact that `kvmclock` is not exposed and -> -> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). -> -> -> -> I think we should enable the `kvmclock` (qemu device) if `hv-time` is -> -> present and add Hyper-V support for the `kvmclock_current_nsec` -> -> function. -Yes, this seems correct. I would have to check but it may even be -better to _always_ send kvmclock data in the live migration stream. - -Paolo - -Paolo Bonzini writes: - -> -On 16/09/20 13:29, Dr. David Alan Gilbert wrote: -> ->> I have tracked the bug to the fact that `kvmclock` is not exposed and -> ->> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). -> ->> -> ->> I think we should enable the `kvmclock` (qemu device) if `hv-time` is -> ->> present and add Hyper-V support for the `kvmclock_current_nsec` -> ->> function. -> -> -Yes, this seems correct. I would have to check but it may even be -> -better to _always_ send kvmclock data in the live migration stream. -> -The question I have is: with 'kvm=off', do we actually restore TSC -reading on migration? (and I guess the answer is 'no' or Hyper-V TSC -page would 'just work' I guess). So yea, maybe dropping the -'cpu->env.features[FEAT_KVM]' check is the right fix. - --- -Vitaly - -- cgit 1.4.1