diff options
| author | Christian Krinitsin <mail@krinitsin.com> | 2025-07-03 19:39:53 +0200 |
|---|---|---|
| committer | Christian Krinitsin <mail@krinitsin.com> | 2025-07-03 19:39:53 +0200 |
| commit | dee4dcba78baf712cab403d47d9db319ab7f95d6 (patch) | |
| tree | 418478faf06786701a56268672f73d6b0b4eb239 /results/classifier/zero-shot/012/debug | |
| parent | 4d9e26c0333abd39bdbd039dcdb30ed429c475ba (diff) | |
| download | emulator-bug-study-dee4dcba78baf712cab403d47d9db319ab7f95d6.tar.gz emulator-bug-study-dee4dcba78baf712cab403d47d9db319ab7f95d6.zip | |
restructure results
Diffstat (limited to 'results/classifier/zero-shot/012/debug')
| -rw-r--r-- | results/classifier/zero-shot/012/debug/24190340 | 2074 | ||||
| -rw-r--r-- | results/classifier/zero-shot/012/debug/53568181 | 96 | ||||
| -rw-r--r-- | results/classifier/zero-shot/012/debug/64571620 | 803 |
3 files changed, 2973 insertions, 0 deletions
diff --git a/results/classifier/zero-shot/012/debug/24190340 b/results/classifier/zero-shot/012/debug/24190340 new file mode 100644 index 00000000..4413b8bc --- /dev/null +++ b/results/classifier/zero-shot/012/debug/24190340 @@ -0,0 +1,2074 @@ +debug: 0.876 +risc-v: 0.869 +register: 0.856 +permissions: 0.851 +TCG: 0.846 +device: 0.832 +performance: 0.832 +other: 0.811 +vnc: 0.808 +boot: 0.803 +arm: 0.794 +semantic: 0.793 +assembly: 0.790 +graphic: 0.775 +files: 0.770 +architecture: 0.769 +PID: 0.762 +mistranslation: 0.758 +network: 0.723 +socket: 0.715 +x86: 0.644 +kernel virtual machine: 0.617 + +[BUG, RFC] Block graph deadlock on job-dismiss + +Hi all, + +There's a bug in block layer which leads to block graph deadlock. +Notably, it takes place when blockdev IO is processed within a separate +iothread. + +This was initially caught by our tests, and I was able to reduce it to a +relatively simple reproducer. Such deadlocks are probably supposed to +be covered in iotests/graph-changes-while-io, but this deadlock isn't. + +Basically what the reproducer does is launches QEMU with a drive having +'iothread' option set, creates a chain of 2 snapshots, launches +block-commit job for a snapshot and then dismisses the job, starting +from the lower snapshot. If the guest is issuing IO at the same time, +there's a race in acquiring block graph lock and a potential deadlock. + +Here's how it can be reproduced: + +1. Run QEMU: +> +SRCDIR=/path/to/srcdir +> +> +> +> +> +$SRCDIR/build/qemu-system-x86_64 -enable-kvm \ +> +> +-machine q35 -cpu Nehalem \ +> +> +-name guest=alma8-vm,debug-threads=on \ +> +> +-m 2g -smp 2 \ +> +> +-nographic -nodefaults \ +> +> +-qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \ +> +> +-serial unix:/var/run/alma8-serial.sock,server=on,wait=off \ +> +> +-object iothread,id=iothread0 \ +> +> +-blockdev +> +node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 +> +\ +> +-device virtio-blk-pci,drive=disk,iothread=iothread0 +2. Launch IO (random reads) from within the guest: +> +nc -U /var/run/alma8-serial.sock +> +... +> +[root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k +> +--size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting +> +--rw=randread --iodepth=1 --filename=/testfile +3. Run snapshots creation & removal of lower snapshot operation in a +loop (script attached): +> +while /bin/true ; do ./remove_lower_snap.sh ; done +And then it occasionally hangs. + +Note: I've tried bisecting this, and looks like deadlock occurs starting +from the following commit: + +(BAD) 5bdbaebcce virtio: Re-enable notifications after drain +(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll + +On the latest v10.0.0 it does hang as well. + + +Here's backtrace of the main thread: + +> +#0 0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, +> +timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:43 +> +#1 0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, +> +timeout=-1) at ../util/qemu-timer.c:329 +> +#2 0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, +> +ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79 +> +#3 0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at +> +../util/aio-posix.c:730 +> +#4 0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, +> +parent=0x0, poll=true) at ../block/io.c:378 +> +#5 0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at +> +../block/io.c:391 +> +#6 0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7682 +> +#7 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7608 +> +#8 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7668 +> +#9 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7608 +> +#10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7668 +> +#11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7608 +> +#12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../blockjob.c:157 +> +#13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7592 +> +#14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7661 +> +#15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx +> +(child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = +> +{...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234 +> +#16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7592 +> +#17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, +> +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +errp=0x0) +> +at ../block.c:7661 +> +#18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, +> +ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715 +> +#19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at +> +../block.c:3317 +> +#20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at +> +../blockjob.c:209 +> +#21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at +> +../blockjob.c:82 +> +#22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at +> +../job.c:474 +> +#23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at +> +../job.c:771 +> +#24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, +> +errp=0x7ffd94b4f488) at ../job.c:783 +> +--Type <RET> for more, q to quit, c to continue without paging-- +> +#25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 "commit-snap1", +> +errp=0x7ffd94b4f488) at ../job-qmp.c:138 +> +#26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, +> +ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221 +> +#27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at +> +../qapi/qmp-dispatch.c:128 +> +#28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at +> +../util/async.c:172 +> +#29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at +> +../util/async.c:219 +> +#30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at +> +../util/aio-posix.c:436 +> +#31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, +> +callback=0x0, user_data=0x0) at ../util/async.c:361 +> +#32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at +> +../glib/gmain.c:3364 +> +#33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079 +> +#34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287 +> +#35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at +> +../util/main-loop.c:310 +> +#36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at +> +../util/main-loop.c:589 +> +#37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835 +> +#38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at +> +../system/main.c:50 +> +#39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at +> +../system/main.c:80 +And here's coroutine trying to acquire read lock: + +> +(gdb) qemu coroutine reader_queue->entries.sqh_first +> +#0 0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, +> +to_=0x7fc537fff508, action=COROUTINE_YIELD) at +> +../util/coroutine-ucontext.c:321 +> +#1 0x0000557eb47d4d4a in qemu_coroutine_yield () at +> +../util/qemu-coroutine.c:339 +> +#2 0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 +> +<reader_queue>, lock=0x7fc53c57de50, flags=0) at +> +../util/qemu-coroutine-lock.c:60 +> +#3 0x0000557eb461fea7 in bdrv_graph_co_rdlock () at ../block/graph-lock.c:231 +> +#4 0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at +> +/home/root/src/qemu/master/include/block/graph-lock.h:213 +> +#5 0x0000557eb460fa41 in blk_co_do_preadv_part +> +(blk=0x557eb84c0810, offset=6890553344, bytes=4096, qiov=0x7fc530006988, +> +qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1339 +> +#6 0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at +> +../block/block-backend.c:1619 +> +#7 0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) at +> +../util/coroutine-ucontext.c:175 +> +#8 0x00007fc547c2a360 in __start_context () at +> +../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 +> +#9 0x00007ffd94b4ea40 in () +> +#10 0x0000000000000000 in () +So it looks like main thread is processing job-dismiss request and is +holding write lock taken in block_job_remove_all_bdrv() (frame #20 +above). At the same time iothread spawns a coroutine which performs IO +request. Before the coroutine is spawned, blk_aio_prwv() increases +'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is +trying to acquire the read lock. But main thread isn't releasing the +lock as blk_root_drained_poll() returns true since blk->in_flight > 0. +Here's the deadlock. + +Any comments and suggestions on the subject are welcomed. Thanks! + +Andrey +remove_lower_snap.sh +Description: +application/shellscript + +On 4/24/25 8:32 PM, Andrey Drobyshev wrote: +> +Hi all, +> +> +There's a bug in block layer which leads to block graph deadlock. +> +Notably, it takes place when blockdev IO is processed within a separate +> +iothread. +> +> +This was initially caught by our tests, and I was able to reduce it to a +> +relatively simple reproducer. Such deadlocks are probably supposed to +> +be covered in iotests/graph-changes-while-io, but this deadlock isn't. +> +> +Basically what the reproducer does is launches QEMU with a drive having +> +'iothread' option set, creates a chain of 2 snapshots, launches +> +block-commit job for a snapshot and then dismisses the job, starting +> +from the lower snapshot. If the guest is issuing IO at the same time, +> +there's a race in acquiring block graph lock and a potential deadlock. +> +> +Here's how it can be reproduced: +> +> +[...] +> +I took a closer look at iotests/graph-changes-while-io, and have managed +to reproduce the same deadlock in a much simpler setup, without a guest. + +1. Run QSD:> ./build/storage-daemon/qemu-storage-daemon --object +iothread,id=iothread0 \ +> +--blockdev null-co,node-name=node0,read-zeroes=true \ +> +> +--nbd-server addr.type=unix,addr.path=/var/run/qsd_nbd.sock \ +> +> +--export +> +nbd,id=exp0,node-name=node0,iothread=iothread0,fixed-iothread=true,writable=true +> +\ +> +--chardev +> +socket,id=qmp-sock,path=/var/run/qsd_qmp.sock,server=on,wait=off \ +> +--monitor chardev=qmp-sock +2. Launch IO: +> +qemu-img bench -f raw -c 2000000 +> +'nbd+unix:///node0?socket=/var/run/qsd_nbd.sock' +3. Add 2 snapshots and remove lower one (script attached):> while +/bin/true ; do ./rls_qsd.sh ; done + +And then it hangs. + +I'll also send a patch with corresponding test case added directly to +iotests. + +This reproduce seems to be hanging starting from Fiona's commit +67446e605dc ("blockjob: drop AioContext lock before calling +bdrv_graph_wrlock()"). AioContext locks were dropped entirely later on +in Stefan's commit b49f4755c7 ("block: remove AioContext locking"), but +the problem remains. + +Andrey +rls_qsd.sh +Description: +application/shellscript + +From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> + +This case is catching potential deadlock which takes place when job-dismiss +is issued when I/O requests are processed in a separate iothread. + +See +https://mail.gnu.org/archive/html/qemu-devel/2025-04/msg04421.html +Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> +--- + .../qemu-iotests/tests/graph-changes-while-io | 101 ++++++++++++++++-- + .../tests/graph-changes-while-io.out | 4 +- + 2 files changed, 96 insertions(+), 9 deletions(-) + +diff --git a/tests/qemu-iotests/tests/graph-changes-while-io +b/tests/qemu-iotests/tests/graph-changes-while-io +index 194fda500e..e30f823da4 100755 +--- a/tests/qemu-iotests/tests/graph-changes-while-io ++++ b/tests/qemu-iotests/tests/graph-changes-while-io +@@ -27,6 +27,8 @@ from iotests import imgfmt, qemu_img, qemu_img_create, +qemu_io, \ + + + top = os.path.join(iotests.test_dir, 'top.img') ++snap1 = os.path.join(iotests.test_dir, 'snap1.img') ++snap2 = os.path.join(iotests.test_dir, 'snap2.img') + nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock') + + +@@ -58,6 +60,15 @@ class TestGraphChangesWhileIO(QMPTestCase): + def tearDown(self) -> None: + self.qsd.stop() + ++ def _wait_for_blockjob(self, status) -> None: ++ done = False ++ while not done: ++ for event in self.qsd.get_qmp().get_events(wait=10.0): ++ if event['event'] != 'JOB_STATUS_CHANGE': ++ continue ++ if event['data']['status'] == status: ++ done = True ++ + def test_blockdev_add_while_io(self) -> None: + # Run qemu-img bench in the background + bench_thr = Thread(target=do_qemu_img_bench) +@@ -116,13 +127,89 @@ class TestGraphChangesWhileIO(QMPTestCase): + 'device': 'job0', + }) + +- cancelled = False +- while not cancelled: +- for event in self.qsd.get_qmp().get_events(wait=10.0): +- if event['event'] != 'JOB_STATUS_CHANGE': +- continue +- if event['data']['status'] == 'null': +- cancelled = True ++ self._wait_for_blockjob('null') ++ ++ bench_thr.join() ++ ++ def test_remove_lower_snapshot_while_io(self) -> None: ++ # Run qemu-img bench in the background ++ bench_thr = Thread(target=do_qemu_img_bench, args=(100000, )) ++ bench_thr.start() ++ ++ # While I/O is performed on 'node0' node, consequently add 2 snapshots ++ # on top of it, then remove (commit) them starting from lower one. ++ while bench_thr.is_alive(): ++ # Recreate snapshot images on every iteration ++ qemu_img_create('-f', imgfmt, snap1, '1G') ++ qemu_img_create('-f', imgfmt, snap2, '1G') ++ ++ self.qsd.cmd('blockdev-add', { ++ 'driver': imgfmt, ++ 'node-name': 'snap1', ++ 'file': { ++ 'driver': 'file', ++ 'filename': snap1 ++ } ++ }) ++ ++ self.qsd.cmd('blockdev-snapshot', { ++ 'node': 'node0', ++ 'overlay': 'snap1', ++ }) ++ ++ self.qsd.cmd('blockdev-add', { ++ 'driver': imgfmt, ++ 'node-name': 'snap2', ++ 'file': { ++ 'driver': 'file', ++ 'filename': snap2 ++ } ++ }) ++ ++ self.qsd.cmd('blockdev-snapshot', { ++ 'node': 'snap1', ++ 'overlay': 'snap2', ++ }) ++ ++ self.qsd.cmd('block-commit', { ++ 'job-id': 'commit-snap1', ++ 'device': 'snap2', ++ 'top-node': 'snap1', ++ 'base-node': 'node0', ++ 'auto-finalize': True, ++ 'auto-dismiss': False, ++ }) ++ ++ self._wait_for_blockjob('concluded') ++ self.qsd.cmd('job-dismiss', { ++ 'id': 'commit-snap1', ++ }) ++ ++ self.qsd.cmd('block-commit', { ++ 'job-id': 'commit-snap2', ++ 'device': 'snap2', ++ 'top-node': 'snap2', ++ 'base-node': 'node0', ++ 'auto-finalize': True, ++ 'auto-dismiss': False, ++ }) ++ ++ self._wait_for_blockjob('ready') ++ self.qsd.cmd('job-complete', { ++ 'id': 'commit-snap2', ++ }) ++ ++ self._wait_for_blockjob('concluded') ++ self.qsd.cmd('job-dismiss', { ++ 'id': 'commit-snap2', ++ }) ++ ++ self.qsd.cmd('blockdev-del', { ++ 'node-name': 'snap1' ++ }) ++ self.qsd.cmd('blockdev-del', { ++ 'node-name': 'snap2' ++ }) + + bench_thr.join() + +diff --git a/tests/qemu-iotests/tests/graph-changes-while-io.out +b/tests/qemu-iotests/tests/graph-changes-while-io.out +index fbc63e62f8..8d7e996700 100644 +--- a/tests/qemu-iotests/tests/graph-changes-while-io.out ++++ b/tests/qemu-iotests/tests/graph-changes-while-io.out +@@ -1,5 +1,5 @@ +-.. ++... + ---------------------------------------------------------------------- +-Ran 2 tests ++Ran 3 tests + + OK +-- +2.43.5 + +Am 24.04.25 um 19:32 schrieb Andrey Drobyshev: +> +So it looks like main thread is processing job-dismiss request and is +> +holding write lock taken in block_job_remove_all_bdrv() (frame #20 +> +above). At the same time iothread spawns a coroutine which performs IO +> +request. Before the coroutine is spawned, blk_aio_prwv() increases +> +'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is +> +trying to acquire the read lock. But main thread isn't releasing the +> +lock as blk_root_drained_poll() returns true since blk->in_flight > 0. +> +Here's the deadlock. +And for the IO test you provided, it's client->nb_requests that behaves +similarly to blk->in_flight here. + +The issue also reproduces easily when issuing the following QMP command +in a loop while doing IO on a device: + +> +void qmp_block_locked_drain(const char *node_name, Error **errp) +> +{ +> +BlockDriverState *bs; +> +> +bs = bdrv_find_node(node_name); +> +if (!bs) { +> +error_setg(errp, "node not found"); +> +return; +> +} +> +> +bdrv_graph_wrlock(); +> +bdrv_drained_begin(bs); +> +bdrv_drained_end(bs); +> +bdrv_graph_wrunlock(); +> +} +It seems like either it would be necessary to require: +1. not draining inside an exclusively locked section +or +2. making sure that variables used by drained_poll routines are only set +while holding the reader lock +? + +Those seem to require rather involved changes, so a third option might +be to make draining inside an exclusively locked section possible, by +embedding such locked sections in a drained section: + +> +diff --git a/blockjob.c b/blockjob.c +> +index 32007f31a9..9b2f3b3ea9 100644 +> +--- a/blockjob.c +> ++++ b/blockjob.c +> +@@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job) +> +* one to make sure that such a concurrent access does not attempt +> +* to process an already freed BdrvChild. +> +*/ +> ++ bdrv_drain_all_begin(); +> +bdrv_graph_wrlock(); +> +while (job->nodes) { +> +GSList *l = job->nodes; +> +@@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job) +> +g_slist_free_1(l); +> +} +> +bdrv_graph_wrunlock(); +> ++ bdrv_drain_all_end(); +> +} +> +> +bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs) +This seems to fix the issue at hand. I can send a patch if this is +considered an acceptable approach. + +Best Regards, +Fiona + +On 4/30/25 11:47 AM, Fiona Ebner wrote: +> +Am 24.04.25 um 19:32 schrieb Andrey Drobyshev: +> +> So it looks like main thread is processing job-dismiss request and is +> +> holding write lock taken in block_job_remove_all_bdrv() (frame #20 +> +> above). At the same time iothread spawns a coroutine which performs IO +> +> request. Before the coroutine is spawned, blk_aio_prwv() increases +> +> 'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is +> +> trying to acquire the read lock. But main thread isn't releasing the +> +> lock as blk_root_drained_poll() returns true since blk->in_flight > 0. +> +> Here's the deadlock. +> +> +And for the IO test you provided, it's client->nb_requests that behaves +> +similarly to blk->in_flight here. +> +> +The issue also reproduces easily when issuing the following QMP command +> +in a loop while doing IO on a device: +> +> +> void qmp_block_locked_drain(const char *node_name, Error **errp) +> +> { +> +> BlockDriverState *bs; +> +> +> +> bs = bdrv_find_node(node_name); +> +> if (!bs) { +> +> error_setg(errp, "node not found"); +> +> return; +> +> } +> +> +> +> bdrv_graph_wrlock(); +> +> bdrv_drained_begin(bs); +> +> bdrv_drained_end(bs); +> +> bdrv_graph_wrunlock(); +> +> } +> +> +It seems like either it would be necessary to require: +> +1. not draining inside an exclusively locked section +> +or +> +2. making sure that variables used by drained_poll routines are only set +> +while holding the reader lock +> +? +> +> +Those seem to require rather involved changes, so a third option might +> +be to make draining inside an exclusively locked section possible, by +> +embedding such locked sections in a drained section: +> +> +> diff --git a/blockjob.c b/blockjob.c +> +> index 32007f31a9..9b2f3b3ea9 100644 +> +> --- a/blockjob.c +> +> +++ b/blockjob.c +> +> @@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job) +> +> * one to make sure that such a concurrent access does not attempt +> +> * to process an already freed BdrvChild. +> +> */ +> +> + bdrv_drain_all_begin(); +> +> bdrv_graph_wrlock(); +> +> while (job->nodes) { +> +> GSList *l = job->nodes; +> +> @@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job) +> +> g_slist_free_1(l); +> +> } +> +> bdrv_graph_wrunlock(); +> +> + bdrv_drain_all_end(); +> +> } +> +> +> +> bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs) +> +> +This seems to fix the issue at hand. I can send a patch if this is +> +considered an acceptable approach. +> +> +Best Regards, +> +Fiona +> +Hello Fiona, + +Thanks for looking into it. I've tried your 3rd option above and can +confirm it does fix the deadlock, at least I can't reproduce it. Other +iotests also don't seem to be breaking. So I personally am fine with +that patch. Would be nice to hear a word from the maintainers though on +whether there're any caveats with such approach. + +Andrey + +On Wed, Apr 30, 2025 at 10:11â¯AM Andrey Drobyshev +<andrey.drobyshev@virtuozzo.com> wrote: +> +> +On 4/30/25 11:47 AM, Fiona Ebner wrote: +> +> Am 24.04.25 um 19:32 schrieb Andrey Drobyshev: +> +>> So it looks like main thread is processing job-dismiss request and is +> +>> holding write lock taken in block_job_remove_all_bdrv() (frame #20 +> +>> above). At the same time iothread spawns a coroutine which performs IO +> +>> request. Before the coroutine is spawned, blk_aio_prwv() increases +> +>> 'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is +> +>> trying to acquire the read lock. But main thread isn't releasing the +> +>> lock as blk_root_drained_poll() returns true since blk->in_flight > 0. +> +>> Here's the deadlock. +> +> +> +> And for the IO test you provided, it's client->nb_requests that behaves +> +> similarly to blk->in_flight here. +> +> +> +> The issue also reproduces easily when issuing the following QMP command +> +> in a loop while doing IO on a device: +> +> +> +>> void qmp_block_locked_drain(const char *node_name, Error **errp) +> +>> { +> +>> BlockDriverState *bs; +> +>> +> +>> bs = bdrv_find_node(node_name); +> +>> if (!bs) { +> +>> error_setg(errp, "node not found"); +> +>> return; +> +>> } +> +>> +> +>> bdrv_graph_wrlock(); +> +>> bdrv_drained_begin(bs); +> +>> bdrv_drained_end(bs); +> +>> bdrv_graph_wrunlock(); +> +>> } +> +> +> +> It seems like either it would be necessary to require: +> +> 1. not draining inside an exclusively locked section +> +> or +> +> 2. making sure that variables used by drained_poll routines are only set +> +> while holding the reader lock +> +> ? +> +> +> +> Those seem to require rather involved changes, so a third option might +> +> be to make draining inside an exclusively locked section possible, by +> +> embedding such locked sections in a drained section: +> +> +> +>> diff --git a/blockjob.c b/blockjob.c +> +>> index 32007f31a9..9b2f3b3ea9 100644 +> +>> --- a/blockjob.c +> +>> +++ b/blockjob.c +> +>> @@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job) +> +>> * one to make sure that such a concurrent access does not attempt +> +>> * to process an already freed BdrvChild. +> +>> */ +> +>> + bdrv_drain_all_begin(); +> +>> bdrv_graph_wrlock(); +> +>> while (job->nodes) { +> +>> GSList *l = job->nodes; +> +>> @@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job) +> +>> g_slist_free_1(l); +> +>> } +> +>> bdrv_graph_wrunlock(); +> +>> + bdrv_drain_all_end(); +> +>> } +> +>> +> +>> bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs) +> +> +> +> This seems to fix the issue at hand. I can send a patch if this is +> +> considered an acceptable approach. +Kevin is aware of this thread but it's a public holiday tomorrow so it +may be a little longer. + +Stefan + +Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben: +> +Hi all, +> +> +There's a bug in block layer which leads to block graph deadlock. +> +Notably, it takes place when blockdev IO is processed within a separate +> +iothread. +> +> +This was initially caught by our tests, and I was able to reduce it to a +> +relatively simple reproducer. Such deadlocks are probably supposed to +> +be covered in iotests/graph-changes-while-io, but this deadlock isn't. +> +> +Basically what the reproducer does is launches QEMU with a drive having +> +'iothread' option set, creates a chain of 2 snapshots, launches +> +block-commit job for a snapshot and then dismisses the job, starting +> +from the lower snapshot. If the guest is issuing IO at the same time, +> +there's a race in acquiring block graph lock and a potential deadlock. +> +> +Here's how it can be reproduced: +> +> +1. Run QEMU: +> +> SRCDIR=/path/to/srcdir +> +> +> +> +> +> +> +> +> +> $SRCDIR/build/qemu-system-x86_64 -enable-kvm \ +> +> +> +> -machine q35 -cpu Nehalem \ +> +> +> +> -name guest=alma8-vm,debug-threads=on \ +> +> +> +> -m 2g -smp 2 \ +> +> +> +> -nographic -nodefaults \ +> +> +> +> -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \ +> +> +> +> -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \ +> +> +> +> -object iothread,id=iothread0 \ +> +> +> +> -blockdev +> +> node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 +> +> \ +> +> -device virtio-blk-pci,drive=disk,iothread=iothread0 +> +> +2. Launch IO (random reads) from within the guest: +> +> nc -U /var/run/alma8-serial.sock +> +> ... +> +> [root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k +> +> --size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting +> +> --rw=randread --iodepth=1 --filename=/testfile +> +> +3. Run snapshots creation & removal of lower snapshot operation in a +> +loop (script attached): +> +> while /bin/true ; do ./remove_lower_snap.sh ; done +> +> +And then it occasionally hangs. +> +> +Note: I've tried bisecting this, and looks like deadlock occurs starting +> +from the following commit: +> +> +(BAD) 5bdbaebcce virtio: Re-enable notifications after drain +> +(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll +> +> +On the latest v10.0.0 it does hang as well. +> +> +> +Here's backtrace of the main thread: +> +> +> #0 0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, +> +> timeout=<optimized out>, sigmask=0x0) at +> +> ../sysdeps/unix/sysv/linux/ppoll.c:43 +> +> #1 0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, +> +> timeout=-1) at ../util/qemu-timer.c:329 +> +> #2 0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, +> +> ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79 +> +> #3 0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at +> +> ../util/aio-posix.c:730 +> +> #4 0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, +> +> parent=0x0, poll=true) at ../block/io.c:378 +> +> #5 0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at +> +> ../block/io.c:391 +> +> #6 0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7682 +> +> #7 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7608 +> +> #8 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7668 +> +> #9 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7608 +> +> #10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7668 +> +> #11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7608 +> +> #12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../blockjob.c:157 +> +> #13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7592 +> +> #14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7661 +> +> #15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx +> +> (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = +> +> {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234 +> +> #16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7592 +> +> #17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, +> +> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +> +> errp=0x0) +> +> at ../block.c:7661 +> +> #18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, +> +> ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715 +> +> #19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at +> +> ../block.c:3317 +> +> #20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at +> +> ../blockjob.c:209 +> +> #21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at +> +> ../blockjob.c:82 +> +> #22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at +> +> ../job.c:474 +> +> #23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at +> +> ../job.c:771 +> +> #24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, +> +> errp=0x7ffd94b4f488) at ../job.c:783 +> +> --Type <RET> for more, q to quit, c to continue without paging-- +> +> #25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 +> +> "commit-snap1", errp=0x7ffd94b4f488) at ../job-qmp.c:138 +> +> #26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, +> +> ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221 +> +> #27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at +> +> ../qapi/qmp-dispatch.c:128 +> +> #28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at +> +> ../util/async.c:172 +> +> #29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at +> +> ../util/async.c:219 +> +> #30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at +> +> ../util/aio-posix.c:436 +> +> #31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, +> +> callback=0x0, user_data=0x0) at ../util/async.c:361 +> +> #32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at +> +> ../glib/gmain.c:3364 +> +> #33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079 +> +> #34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287 +> +> #35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at +> +> ../util/main-loop.c:310 +> +> #36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at +> +> ../util/main-loop.c:589 +> +> #37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835 +> +> #38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at +> +> ../system/main.c:50 +> +> #39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at +> +> ../system/main.c:80 +> +> +> +And here's coroutine trying to acquire read lock: +> +> +> (gdb) qemu coroutine reader_queue->entries.sqh_first +> +> #0 0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, +> +> to_=0x7fc537fff508, action=COROUTINE_YIELD) at +> +> ../util/coroutine-ucontext.c:321 +> +> #1 0x0000557eb47d4d4a in qemu_coroutine_yield () at +> +> ../util/qemu-coroutine.c:339 +> +> #2 0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 +> +> <reader_queue>, lock=0x7fc53c57de50, flags=0) at +> +> ../util/qemu-coroutine-lock.c:60 +> +> #3 0x0000557eb461fea7 in bdrv_graph_co_rdlock () at +> +> ../block/graph-lock.c:231 +> +> #4 0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at +> +> /home/root/src/qemu/master/include/block/graph-lock.h:213 +> +> #5 0x0000557eb460fa41 in blk_co_do_preadv_part +> +> (blk=0x557eb84c0810, offset=6890553344, bytes=4096, +> +> qiov=0x7fc530006988, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at +> +> ../block/block-backend.c:1339 +> +> #6 0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at +> +> ../block/block-backend.c:1619 +> +> #7 0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) +> +> at ../util/coroutine-ucontext.c:175 +> +> #8 0x00007fc547c2a360 in __start_context () at +> +> ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 +> +> #9 0x00007ffd94b4ea40 in () +> +> #10 0x0000000000000000 in () +> +> +> +So it looks like main thread is processing job-dismiss request and is +> +holding write lock taken in block_job_remove_all_bdrv() (frame #20 +> +above). At the same time iothread spawns a coroutine which performs IO +> +request. Before the coroutine is spawned, blk_aio_prwv() increases +> +'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is +> +trying to acquire the read lock. But main thread isn't releasing the +> +lock as blk_root_drained_poll() returns true since blk->in_flight > 0. +> +Here's the deadlock. +> +> +Any comments and suggestions on the subject are welcomed. Thanks! +I think this is what the blk_wait_while_drained() call was supposed to +address in blk_co_do_preadv_part(). However, with the use of multiple +I/O threads, this is racy. + +Do you think that in your case we hit the small race window between the +checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there +another reason why blk_wait_while_drained() didn't do its job? + +Kevin + +On 5/2/25 19:34, Kevin Wolf wrote: +Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben: +Hi all, + +There's a bug in block layer which leads to block graph deadlock. +Notably, it takes place when blockdev IO is processed within a separate +iothread. + +This was initially caught by our tests, and I was able to reduce it to a +relatively simple reproducer. Such deadlocks are probably supposed to +be covered in iotests/graph-changes-while-io, but this deadlock isn't. + +Basically what the reproducer does is launches QEMU with a drive having +'iothread' option set, creates a chain of 2 snapshots, launches +block-commit job for a snapshot and then dismisses the job, starting +from the lower snapshot. If the guest is issuing IO at the same time, +there's a race in acquiring block graph lock and a potential deadlock. + +Here's how it can be reproduced: + +1. Run QEMU: +SRCDIR=/path/to/srcdir +$SRCDIR/build/qemu-system-x86_64 -enable-kvm \ +-machine q35 -cpu Nehalem \ + -name guest=alma8-vm,debug-threads=on \ + -m 2g -smp 2 \ + -nographic -nodefaults \ + -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \ + -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \ + -object iothread,id=iothread0 \ + -blockdev +node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 + \ + -device virtio-blk-pci,drive=disk,iothread=iothread0 +2. Launch IO (random reads) from within the guest: +nc -U /var/run/alma8-serial.sock +... +[root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k +--size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting +--rw=randread --iodepth=1 --filename=/testfile +3. Run snapshots creation & removal of lower snapshot operation in a +loop (script attached): +while /bin/true ; do ./remove_lower_snap.sh ; done +And then it occasionally hangs. + +Note: I've tried bisecting this, and looks like deadlock occurs starting +from the following commit: + +(BAD) 5bdbaebcce virtio: Re-enable notifications after drain +(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll + +On the latest v10.0.0 it does hang as well. + + +Here's backtrace of the main thread: +#0 0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, timeout=<optimized +out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:43 +#1 0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, timeout=-1) +at ../util/qemu-timer.c:329 +#2 0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, +ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79 +#3 0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at +../util/aio-posix.c:730 +#4 0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, parent=0x0, +poll=true) at ../block/io.c:378 +#5 0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at +../block/io.c:391 +#6 0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7682 +#7 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7608 +#8 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7668 +#9 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7608 +#10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7668 +#11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7608 +#12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../blockjob.c:157 +#13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7592 +#14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7661 +#15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx + (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +tran=0x557eb7a87160, errp=0x0) at ../block.c:1234 +#16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7592 +#17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, +ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, +errp=0x0) + at ../block.c:7661 +#18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, +ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715 +#19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at +../block.c:3317 +#20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at +../blockjob.c:209 +#21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at +../blockjob.c:82 +#22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at ../job.c:474 +#23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at +../job.c:771 +#24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, +errp=0x7ffd94b4f488) at ../job.c:783 +--Type <RET> for more, q to quit, c to continue without paging-- +#25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 "commit-snap1", +errp=0x7ffd94b4f488) at ../job-qmp.c:138 +#26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, +ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221 +#27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at +../qapi/qmp-dispatch.c:128 +#28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at ../util/async.c:172 +#29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at +../util/async.c:219 +#30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at +../util/aio-posix.c:436 +#31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, +callback=0x0, user_data=0x0) at ../util/async.c:361 +#32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at +../glib/gmain.c:3364 +#33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079 +#34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287 +#35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at +../util/main-loop.c:310 +#36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at +../util/main-loop.c:589 +#37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835 +#38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at ../system/main.c:50 +#39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at +../system/main.c:80 +And here's coroutine trying to acquire read lock: +(gdb) qemu coroutine reader_queue->entries.sqh_first +#0 0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, +to_=0x7fc537fff508, action=COROUTINE_YIELD) at ../util/coroutine-ucontext.c:321 +#1 0x0000557eb47d4d4a in qemu_coroutine_yield () at +../util/qemu-coroutine.c:339 +#2 0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 +<reader_queue>, lock=0x7fc53c57de50, flags=0) at +../util/qemu-coroutine-lock.c:60 +#3 0x0000557eb461fea7 in bdrv_graph_co_rdlock () at ../block/graph-lock.c:231 +#4 0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at +/home/root/src/qemu/master/include/block/graph-lock.h:213 +#5 0x0000557eb460fa41 in blk_co_do_preadv_part + (blk=0x557eb84c0810, offset=6890553344, bytes=4096, qiov=0x7fc530006988, +qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1339 +#6 0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at +../block/block-backend.c:1619 +#7 0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) at +../util/coroutine-ucontext.c:175 +#8 0x00007fc547c2a360 in __start_context () at +../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 +#9 0x00007ffd94b4ea40 in () +#10 0x0000000000000000 in () +So it looks like main thread is processing job-dismiss request and is +holding write lock taken in block_job_remove_all_bdrv() (frame #20 +above). At the same time iothread spawns a coroutine which performs IO +request. Before the coroutine is spawned, blk_aio_prwv() increases +'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is +trying to acquire the read lock. But main thread isn't releasing the +lock as blk_root_drained_poll() returns true since blk->in_flight > 0. +Here's the deadlock. + +Any comments and suggestions on the subject are welcomed. Thanks! +I think this is what the blk_wait_while_drained() call was supposed to +address in blk_co_do_preadv_part(). However, with the use of multiple +I/O threads, this is racy. + +Do you think that in your case we hit the small race window between the +checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there +another reason why blk_wait_while_drained() didn't do its job? + +Kevin +At my opinion there is very big race window. Main thread has +eaten graph write lock. After that another coroutine is stalled +within GRAPH_RDLOCK_GUARD() as there is no drain at the moment and only +after that main thread has started drain. That is why Fiona's idea is +looking working. Though this would mean that normally we should always +do that at the moment when we acquire write lock. May be even inside +this function. Den + +Am 02.05.2025 um 19:52 hat Denis V. Lunev geschrieben: +> +On 5/2/25 19:34, Kevin Wolf wrote: +> +> Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben: +> +> > Hi all, +> +> > +> +> > There's a bug in block layer which leads to block graph deadlock. +> +> > Notably, it takes place when blockdev IO is processed within a separate +> +> > iothread. +> +> > +> +> > This was initially caught by our tests, and I was able to reduce it to a +> +> > relatively simple reproducer. Such deadlocks are probably supposed to +> +> > be covered in iotests/graph-changes-while-io, but this deadlock isn't. +> +> > +> +> > Basically what the reproducer does is launches QEMU with a drive having +> +> > 'iothread' option set, creates a chain of 2 snapshots, launches +> +> > block-commit job for a snapshot and then dismisses the job, starting +> +> > from the lower snapshot. If the guest is issuing IO at the same time, +> +> > there's a race in acquiring block graph lock and a potential deadlock. +> +> > +> +> > Here's how it can be reproduced: +> +> > +> +> > 1. Run QEMU: +> +> > > SRCDIR=/path/to/srcdir +> +> > > $SRCDIR/build/qemu-system-x86_64 -enable-kvm \ +> +> > > -machine q35 -cpu Nehalem \ +> +> > > -name guest=alma8-vm,debug-threads=on \ +> +> > > -m 2g -smp 2 \ +> +> > > -nographic -nodefaults \ +> +> > > -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \ +> +> > > -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \ +> +> > > -object iothread,id=iothread0 \ +> +> > > -blockdev +> +> > > node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 +> +> > > \ +> +> > > -device virtio-blk-pci,drive=disk,iothread=iothread0 +> +> > 2. Launch IO (random reads) from within the guest: +> +> > > nc -U /var/run/alma8-serial.sock +> +> > > ... +> +> > > [root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 +> +> > > --bs=4k --size=1G --numjobs=1 --time_based=1 --runtime=300 +> +> > > --group_reporting --rw=randread --iodepth=1 --filename=/testfile +> +> > 3. Run snapshots creation & removal of lower snapshot operation in a +> +> > loop (script attached): +> +> > > while /bin/true ; do ./remove_lower_snap.sh ; done +> +> > And then it occasionally hangs. +> +> > +> +> > Note: I've tried bisecting this, and looks like deadlock occurs starting +> +> > from the following commit: +> +> > +> +> > (BAD) 5bdbaebcce virtio: Re-enable notifications after drain +> +> > (GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll +> +> > +> +> > On the latest v10.0.0 it does hang as well. +> +> > +> +> > +> +> > Here's backtrace of the main thread: +> +> > +> +> > > #0 0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, +> +> > > timeout=<optimized out>, sigmask=0x0) at +> +> > > ../sysdeps/unix/sysv/linux/ppoll.c:43 +> +> > > #1 0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, +> +> > > timeout=-1) at ../util/qemu-timer.c:329 +> +> > > #2 0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, +> +> > > ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79 +> +> > > #3 0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) +> +> > > at ../util/aio-posix.c:730 +> +> > > #4 0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, +> +> > > parent=0x0, poll=true) at ../block/io.c:378 +> +> > > #5 0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at +> +> > > ../block/io.c:391 +> +> > > #6 0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, +> +> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7682 +> +> > > #7 0x0000557eb45ebf2b in bdrv_child_change_aio_context +> +> > > (c=0x557eb7964250, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7608 +> +> > > #8 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, +> +> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7668 +> +> > > #9 0x0000557eb45ebf2b in bdrv_child_change_aio_context +> +> > > (c=0x557eb7e59110, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7608 +> +> > > #10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, +> +> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7668 +> +> > > #11 0x0000557eb45ebf2b in bdrv_child_change_aio_context +> +> > > (c=0x557eb814ed80, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7608 +> +> > > #12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, +> +> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../blockjob.c:157 +> +> > > #13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context +> +> > > (c=0x557eb7c9d3f0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7592 +> +> > > #14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, +> +> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7661 +> +> > > #15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx +> +> > > (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 +> +> > > = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234 +> +> > > #16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context +> +> > > (c=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7592 +> +> > > #17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, +> +> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, +> +> > > tran=0x557eb7a87160, errp=0x0) +> +> > > at ../block.c:7661 +> +> > > #18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context +> +> > > (bs=0x557eb79575e0, ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at +> +> > > ../block.c:7715 +> +> > > #19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) +> +> > > at ../block.c:3317 +> +> > > #20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv +> +> > > (job=0x557eb7952800) at ../blockjob.c:209 +> +> > > #21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at +> +> > > ../blockjob.c:82 +> +> > > #22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at +> +> > > ../job.c:474 +> +> > > #23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at +> +> > > ../job.c:771 +> +> > > #24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, +> +> > > errp=0x7ffd94b4f488) at ../job.c:783 +> +> > > --Type <RET> for more, q to quit, c to continue without paging-- +> +> > > #25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 +> +> > > "commit-snap1", errp=0x7ffd94b4f488) at ../job-qmp.c:138 +> +> > > #26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, +> +> > > ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221 +> +> > > #27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at +> +> > > ../qapi/qmp-dispatch.c:128 +> +> > > #28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at +> +> > > ../util/async.c:172 +> +> > > #29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at +> +> > > ../util/async.c:219 +> +> > > #30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at +> +> > > ../util/aio-posix.c:436 +> +> > > #31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, +> +> > > callback=0x0, user_data=0x0) at ../util/async.c:361 +> +> > > #32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at +> +> > > ../glib/gmain.c:3364 +> +> > > #33 g_main_context_dispatch (context=0x557eb76c6430) at +> +> > > ../glib/gmain.c:4079 +> +> > > #34 0x0000557eb47d3ab1 in glib_pollfds_poll () at +> +> > > ../util/main-loop.c:287 +> +> > > #35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at +> +> > > ../util/main-loop.c:310 +> +> > > #36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at +> +> > > ../util/main-loop.c:589 +> +> > > #37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835 +> +> > > #38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at +> +> > > ../system/main.c:50 +> +> > > #39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at +> +> > > ../system/main.c:80 +> +> > +> +> > And here's coroutine trying to acquire read lock: +> +> > +> +> > > (gdb) qemu coroutine reader_queue->entries.sqh_first +> +> > > #0 0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, +> +> > > to_=0x7fc537fff508, action=COROUTINE_YIELD) at +> +> > > ../util/coroutine-ucontext.c:321 +> +> > > #1 0x0000557eb47d4d4a in qemu_coroutine_yield () at +> +> > > ../util/qemu-coroutine.c:339 +> +> > > #2 0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 +> +> > > <reader_queue>, lock=0x7fc53c57de50, flags=0) at +> +> > > ../util/qemu-coroutine-lock.c:60 +> +> > > #3 0x0000557eb461fea7 in bdrv_graph_co_rdlock () at +> +> > > ../block/graph-lock.c:231 +> +> > > #4 0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) +> +> > > at /home/root/src/qemu/master/include/block/graph-lock.h:213 +> +> > > #5 0x0000557eb460fa41 in blk_co_do_preadv_part +> +> > > (blk=0x557eb84c0810, offset=6890553344, bytes=4096, +> +> > > qiov=0x7fc530006988, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at +> +> > > ../block/block-backend.c:1339 +> +> > > #6 0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at +> +> > > ../block/block-backend.c:1619 +> +> > > #7 0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, +> +> > > i1=21886) at ../util/coroutine-ucontext.c:175 +> +> > > #8 0x00007fc547c2a360 in __start_context () at +> +> > > ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 +> +> > > #9 0x00007ffd94b4ea40 in () +> +> > > #10 0x0000000000000000 in () +> +> > +> +> > So it looks like main thread is processing job-dismiss request and is +> +> > holding write lock taken in block_job_remove_all_bdrv() (frame #20 +> +> > above). At the same time iothread spawns a coroutine which performs IO +> +> > request. Before the coroutine is spawned, blk_aio_prwv() increases +> +> > 'in_flight' counter for Blk. Then blk_co_do_preadv_part() (frame #5) is +> +> > trying to acquire the read lock. But main thread isn't releasing the +> +> > lock as blk_root_drained_poll() returns true since blk->in_flight > 0. +> +> > Here's the deadlock. +> +> > +> +> > Any comments and suggestions on the subject are welcomed. Thanks! +> +> I think this is what the blk_wait_while_drained() call was supposed to +> +> address in blk_co_do_preadv_part(). However, with the use of multiple +> +> I/O threads, this is racy. +> +> +> +> Do you think that in your case we hit the small race window between the +> +> checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there +> +> another reason why blk_wait_while_drained() didn't do its job? +> +> +> +At my opinion there is very big race window. Main thread has +> +eaten graph write lock. After that another coroutine is stalled +> +within GRAPH_RDLOCK_GUARD() as there is no drain at the moment and only +> +after that main thread has started drain. +You're right, I confused taking the write lock with draining there. + +> +That is why Fiona's idea is looking working. Though this would mean +> +that normally we should always do that at the moment when we acquire +> +write lock. May be even inside this function. +I actually see now that not all of my graph locking patches were merged. +At least I did have the thought that bdrv_drained_begin() must be marked +GRAPH_UNLOCKED because it polls. That means that calling it from inside +bdrv_try_change_aio_context() is actually forbidden (and that's the part +I didn't see back then because it doesn't have TSA annotations). + +If you refactor the code to move the drain out to before the lock is +taken, I think you end up with Fiona's patch, except you'll remove the +forbidden inner drain and add more annotations for some functions and +clarify the rules around them. I don't know, but I wouldn't be surprised +if along the process we find other bugs, too. + +So Fiona's drain looks right to me, but we should probably approach it +more systematically. + +Kevin + diff --git a/results/classifier/zero-shot/012/debug/53568181 b/results/classifier/zero-shot/012/debug/53568181 new file mode 100644 index 00000000..c79e6f59 --- /dev/null +++ b/results/classifier/zero-shot/012/debug/53568181 @@ -0,0 +1,96 @@ +debug: 0.968 +permissions: 0.965 +performance: 0.948 +architecture: 0.945 +semantic: 0.943 +graphic: 0.940 +PID: 0.938 +assembly: 0.936 +device: 0.936 +vnc: 0.935 +x86: 0.932 +register: 0.930 +network: 0.925 +other: 0.921 +TCG: 0.921 +arm: 0.918 +risc-v: 0.907 +files: 0.890 +kernel virtual machine: 0.877 +boot: 0.876 +socket: 0.875 +mistranslation: 0.854 + +[BUG] x86/PAT handling severely crippled AMD-V SVM KVM performance + +Hi, I maintain an out-of-tree 3D APIs pass-through QEMU device models at +https://github.com/kjliew/qemu-3dfx +that provide 3D acceleration for legacy +32-bit Windows guests (Win98SE, WinME, Win2k and WinXP) with the focus on +playing old legacy games from 1996-2003. It currently supports the now-defunct +3Dfx propriety API called Glide and an alternative OpenGL pass-through based on +MESA implementation. + +The basic concept of both implementations create memory-mapped virtual +interfaces consist of host/guest shared memory with guest-push model instead of +a more common host-pull model for typical QEMU device model implementation. +Guest uses shared memory as FIFOs for drawing commands and data to bulk up the +operations until serialization event that flushes the FIFOs into host. This +achieves extremely good performance since virtual CPUs are fast with hardware +acceleration (Intel VT/AMD-V) and reduces the overhead of frequent VMEXITs to +service the device emulation. Both implementations work on Windows 10 with WHPX +and HAXM accelerators as well as KVM in Linux. + +On Windows 10, QEMU WHPX implementation does not sync MSR_IA32_PAT during +host/guest states sync. There is no visibility into the closed-source WHPX on +how things are managed behind the scene, but from measuring performance figures +I can conclude that it didn't handle the MSR_IA32_PAT correctly for both Intel +and AMD. Call this fair enough, if you will, it didn't flag any concerns, in +fact games such as Quake2 and Quake3 were still within playable frame rate of +40~60FPS on Win2k/XP guest. Until the same games were run on Win98/ME guest and +the frame rate blew off the roof (300~500FPS) on the same CPU and GPU. In fact, +the later seemed to be more inlined with runnng the games bare-metal with vsync +off. + +On Linux (at the time of writing kernel 5.6.7/Mesa 20.0), the difference +prevailed. Intel CPUs (and it so happened that I was on laptop with Intel GPU), +the VMX-based kvm_intel got it right while SVM-based kvm_amd did not. +To put this in simple exaggeration, an aging Core i3-4010U/HD Graphics 4400 +(Haswell GT2) exhibited an insane performance in Quake2/Quake3 timedemos that +totally crushed more recent AMD Ryzen 2500U APU/Vega 8 Graphics and AMD +FX8300/NVIDIA GT730 on desktop. Simply unbelievable! + +It turned out that there was something to do with AMD-V NPT. By loading kvm_amd +with npt=0, AMD Ryzen APU and FX8300 regained a huge performance leap. However, +AMD NPT issue with KVM was supposedly fixed in 2017 kernel commits. NPT=0 would +actually incur performance loss for VM due to intervention required by +hypervisors to maintain the shadow page tables. Finally, I was able to find the +pointer that pointed to MSR_IA32_PAT register. By updating the MSR_IA32_PAT to +0x0606xxxx0606xxxxULL, AMD CPUs now regain their rightful performance without +taking the hit of NPT=0 for Linux KVM. Taking the same solution into Windows, +both Intel and AMD CPUs no longer require Win98/ME guest to unleash the full +performance potentials and performance figures based on games measured on WHPX +were not very far behind Linux KVM. + +So I guess the problem lies in host/guest shared memory regions mapped as +uncacheable from virtual CPU perspective. As virtual CPUs now completely execute +in hardware context with x86 hardware virtualiztion extensions, the cacheability +of memory types would severely impact the performance on guests. WHPX didn't +handle it for both Intel EPT and AMD NPT, but KVM seems to do it right for Intel +EPT. I don't have the correct fix for QEMU. But what I can do for my 3D APIs +pass-through device models is to implement host-side hooks to reprogram and +restore MSR_IA32_PAT upon activation/deactivation of the 3D APIs. Perhaps there +is also a better solution of having the proper kernel drivers for virtual +interfaces to manage the memory types of host/guest shared memory in kernel +space, but to do that and the needs of Microsoft tools/DDKs, I will just forget +it. The guest stubs uses the same kernel drivers included in 3Dfx drivers for +memory mapping and the virtual interfaces remain driver-less from Windows OS +perspective. Considering the current state of halting progress for QEMU native +virgil3D to support Windows OS, I am just being pragmatic. I understand that +QEMU virgil3D will eventually bring 3D acceleration for Windows guests, but I do +not expect anything to support legacy 32-bit Windows OSes which have out-grown +their commercial usefulness. + +Regards, +KJ Liew + diff --git a/results/classifier/zero-shot/012/debug/64571620 b/results/classifier/zero-shot/012/debug/64571620 new file mode 100644 index 00000000..534d8903 --- /dev/null +++ b/results/classifier/zero-shot/012/debug/64571620 @@ -0,0 +1,803 @@ +debug: 0.927 +other: 0.922 +mistranslation: 0.917 +assembly: 0.913 +arm: 0.910 +risc-v: 0.909 +semantic: 0.903 +permissions: 0.902 +device: 0.899 +performance: 0.897 +graphic: 0.897 +architecture: 0.895 +PID: 0.887 +boot: 0.879 +register: 0.871 +files: 0.855 +socket: 0.855 +network: 0.853 +kernel virtual machine: 0.852 +TCG: 0.843 +vnc: 0.819 +x86: 0.794 + +[BUG] Migration hv_time rollback + +Hi, + +We are experiencing timestamp rollbacks during live-migration of +Windows 10 guests with the following qemu configuration (linux 5.4.46 +and qemu master): +``` +$ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] +``` + +I have tracked the bug to the fact that `kvmclock` is not exposed and +disabled from qemu PoV but is in fact used by `hv-time` (in KVM). + +I think we should enable the `kvmclock` (qemu device) if `hv-time` is +present and add Hyper-V support for the `kvmclock_current_nsec` +function. + +I'm asking for advice because I am unsure this is the _right_ approach +and how to keep migration compatibility between qemu versions. + +Thank you all, + +-- +Antoine 'xdbob' Damhet +signature.asc +Description: +PGP signature + +cc'ing in Vitaly who knows about the hv stuff. + +* Antoine Damhet (antoine.damhet@blade-group.com) wrote: +> +Hi, +> +> +We are experiencing timestamp rollbacks during live-migration of +> +Windows 10 guests with the following qemu configuration (linux 5.4.46 +> +and qemu master): +> +``` +> +$ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] +> +``` +How big a jump are you seeing, and how did you notice it in the guest? + +Dave + +> +I have tracked the bug to the fact that `kvmclock` is not exposed and +> +disabled from qemu PoV but is in fact used by `hv-time` (in KVM). +> +> +I think we should enable the `kvmclock` (qemu device) if `hv-time` is +> +present and add Hyper-V support for the `kvmclock_current_nsec` +> +function. +> +> +I'm asking for advice because I am unsure this is the _right_ approach +> +and how to keep migration compatibility between qemu versions. +> +> +Thank you all, +> +> +-- +> +Antoine 'xdbob' Damhet +-- +Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK + +"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes: + +> +cc'ing in Vitaly who knows about the hv stuff. +> +cc'ing Marcelo who knows about clocksources :-) + +> +* Antoine Damhet (antoine.damhet@blade-group.com) wrote: +> +> Hi, +> +> +> +> We are experiencing timestamp rollbacks during live-migration of +> +> Windows 10 guests +Are you migrating to the same hardware (with the same TSC frequency)? Is +TSC used as the clocksource on the host? + +> +> with the following qemu configuration (linux 5.4.46 +> +> and qemu master): +> +> ``` +> +> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] +> +> ``` +Out of pure curiosity, what's the purpose of doing 'kvm=off'? Windows is +not going to check for KVM identification anyway so we pretend we're +Hyper-V. + +Also, have you tried adding more Hyper-V enlightenments? + +> +> +How big a jump are you seeing, and how did you notice it in the guest? +> +> +Dave +> +> +> I have tracked the bug to the fact that `kvmclock` is not exposed and +> +> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). +> +> +> +> I think we should enable the `kvmclock` (qemu device) if `hv-time` is +> +> present and add Hyper-V support for the `kvmclock_current_nsec` +> +> function. +AFAICT kvmclock_current_nsec() checks whether kvmclock was enabled by +the guest: + + if (!(env->system_time_msr & 1ULL)) { + /* KVM clock not active */ + return 0; + } + +and this is (and way) always false for Windows guests. + +> +> +> +> I'm asking for advice because I am unsure this is the _right_ approach +> +> and how to keep migration compatibility between qemu versions. +> +> +> +> Thank you all, +> +> +> +> -- +> +> Antoine 'xdbob' Damhet +-- +Vitaly + +On Wed, Sep 16, 2020 at 01:59:43PM +0200, Vitaly Kuznetsov wrote: +> +"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes: +> +> +> cc'ing in Vitaly who knows about the hv stuff. +> +> +> +> +cc'ing Marcelo who knows about clocksources :-) +> +> +> * Antoine Damhet (antoine.damhet@blade-group.com) wrote: +> +>> Hi, +> +>> +> +>> We are experiencing timestamp rollbacks during live-migration of +> +>> Windows 10 guests +> +> +Are you migrating to the same hardware (with the same TSC frequency)? Is +> +TSC used as the clocksource on the host? +Yes we are migrating to the exact same hardware. And yes TSC is used as +a clocksource in the host (but the bug is still happening with `hpet` as +a clocksource). + +> +> +>> with the following qemu configuration (linux 5.4.46 +> +>> and qemu master): +> +>> ``` +> +>> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] +> +>> ``` +> +> +Out of pure curiosity, what's the purpose of doing 'kvm=off'? Windows is +> +not going to check for KVM identification anyway so we pretend we're +> +Hyper-V. +Some softwares explicitly checks for the presence of KVM and then crash +if they find it in CPUID :/ + +> +> +Also, have you tried adding more Hyper-V enlightenments? +Yes, I published a stripped-down command-line for a minimal reproducer +but even `hv-frequencies` and `hv-reenlightenment` don't help. + +> +> +> +> +> How big a jump are you seeing, and how did you notice it in the guest? +> +> +> +> Dave +> +> +> +>> I have tracked the bug to the fact that `kvmclock` is not exposed and +> +>> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). +> +>> +> +>> I think we should enable the `kvmclock` (qemu device) if `hv-time` is +> +>> present and add Hyper-V support for the `kvmclock_current_nsec` +> +>> function. +> +> +AFAICT kvmclock_current_nsec() checks whether kvmclock was enabled by +> +the guest: +> +> +if (!(env->system_time_msr & 1ULL)) { +> +/* KVM clock not active */ +> +return 0; +> +} +> +> +and this is (and way) always false for Windows guests. +Hooo, I missed this piece. When is `clock_is_reliable` expected to be +false ? Because if it is I still think we should be able to query at +least `HV_X64_MSR_REFERENCE_TSC` + +> +> +>> +> +>> I'm asking for advice because I am unsure this is the _right_ approach +> +>> and how to keep migration compatibility between qemu versions. +> +>> +> +>> Thank you all, +> +>> +> +>> -- +> +>> Antoine 'xdbob' Damhet +> +> +-- +> +Vitaly +> +-- +Antoine 'xdbob' Damhet +signature.asc +Description: +PGP signature + +On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote: +> +cc'ing in Vitaly who knows about the hv stuff. +Thanks + +> +> +* Antoine Damhet (antoine.damhet@blade-group.com) wrote: +> +> Hi, +> +> +> +> We are experiencing timestamp rollbacks during live-migration of +> +> Windows 10 guests with the following qemu configuration (linux 5.4.46 +> +> and qemu master): +> +> ``` +> +> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] +> +> ``` +> +> +How big a jump are you seeing, and how did you notice it in the guest? +I'm seeing jumps of about the guest uptime (indicating a reset of the +counter). It's expected because we won't call `KVM_SET_CLOCK` to +restore any value. + +We first noticed it because after some migrations `dwm.exe` crashes with +the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in +the past." error code. + +I can also confirm the following hack makes the behavior disappear: + +``` +diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c +index 64283358f9..f334bdf35f 100644 +--- a/hw/i386/kvm/clock.c ++++ b/hw/i386/kvm/clock.c +@@ -332,11 +332,7 @@ void kvmclock_create(void) + { + X86CPU *cpu = X86_CPU(first_cpu); + +- if (kvm_enabled() && +- cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | +- (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { +- sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); +- } ++ sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); + } + + static void kvmclock_register_types(void) +diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c +index 32b1453e6a..11d980ba85 100644 +--- a/hw/i386/pc_piix.c ++++ b/hw/i386/pc_piix.c +@@ -158,9 +158,7 @@ static void pc_init1(MachineState *machine, + + x86_cpus_init(x86ms, pcmc->default_cpu_version); + +- if (kvm_enabled() && pcmc->kvmclock_enabled) { +- kvmclock_create(); +- } ++ kvmclock_create(); + + if (pcmc->pci_enabled) { + pci_memory = g_new(MemoryRegion, 1); +``` + +> +> +Dave +> +> +> I have tracked the bug to the fact that `kvmclock` is not exposed and +> +> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). +> +> +> +> I think we should enable the `kvmclock` (qemu device) if `hv-time` is +> +> present and add Hyper-V support for the `kvmclock_current_nsec` +> +> function. +> +> +> +> I'm asking for advice because I am unsure this is the _right_ approach +> +> and how to keep migration compatibility between qemu versions. +> +> +> +> Thank you all, +> +> +> +> -- +> +> Antoine 'xdbob' Damhet +> +> +> +-- +> +Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK +> +-- +Antoine 'xdbob' Damhet +signature.asc +Description: +PGP signature + +Antoine Damhet <antoine.damhet@blade-group.com> writes: + +> +On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote: +> +> cc'ing in Vitaly who knows about the hv stuff. +> +> +Thanks +> +> +> +> +> * Antoine Damhet (antoine.damhet@blade-group.com) wrote: +> +> > Hi, +> +> > +> +> > We are experiencing timestamp rollbacks during live-migration of +> +> > Windows 10 guests with the following qemu configuration (linux 5.4.46 +> +> > and qemu master): +> +> > ``` +> +> > $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] +> +> > ``` +> +> +> +> How big a jump are you seeing, and how did you notice it in the guest? +> +> +I'm seeing jumps of about the guest uptime (indicating a reset of the +> +counter). It's expected because we won't call `KVM_SET_CLOCK` to +> +restore any value. +> +> +We first noticed it because after some migrations `dwm.exe` crashes with +> +the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in +> +the past." error code. +> +> +I can also confirm the following hack makes the behavior disappear: +> +> +``` +> +diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c +> +index 64283358f9..f334bdf35f 100644 +> +--- a/hw/i386/kvm/clock.c +> ++++ b/hw/i386/kvm/clock.c +> +@@ -332,11 +332,7 @@ void kvmclock_create(void) +> +{ +> +X86CPU *cpu = X86_CPU(first_cpu); +> +> +- if (kvm_enabled() && +> +- cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | +> +- (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { +> +- sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); +> +- } +> ++ sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); +> +} +> +Oh, I think I see what's going on. When you add 'kvm=off' +cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so +kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on +migration. + +In case we really want to support 'kvm=off' I think we can add Hyper-V +features check here along with KVM, this should do the job. + +-- +Vitaly + +Vitaly Kuznetsov <vkuznets@redhat.com> writes: + +> +Antoine Damhet <antoine.damhet@blade-group.com> writes: +> +> +> On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote: +> +>> cc'ing in Vitaly who knows about the hv stuff. +> +> +> +> Thanks +> +> +> +>> +> +>> * Antoine Damhet (antoine.damhet@blade-group.com) wrote: +> +>> > Hi, +> +>> > +> +>> > We are experiencing timestamp rollbacks during live-migration of +> +>> > Windows 10 guests with the following qemu configuration (linux 5.4.46 +> +>> > and qemu master): +> +>> > ``` +> +>> > $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...] +> +>> > ``` +> +>> +> +>> How big a jump are you seeing, and how did you notice it in the guest? +> +> +> +> I'm seeing jumps of about the guest uptime (indicating a reset of the +> +> counter). It's expected because we won't call `KVM_SET_CLOCK` to +> +> restore any value. +> +> +> +> We first noticed it because after some migrations `dwm.exe` crashes with +> +> the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in +> +> the past." error code. +> +> +> +> I can also confirm the following hack makes the behavior disappear: +> +> +> +> ``` +> +> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c +> +> index 64283358f9..f334bdf35f 100644 +> +> --- a/hw/i386/kvm/clock.c +> +> +++ b/hw/i386/kvm/clock.c +> +> @@ -332,11 +332,7 @@ void kvmclock_create(void) +> +> { +> +> X86CPU *cpu = X86_CPU(first_cpu); +> +> +> +> - if (kvm_enabled() && +> +> - cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | +> +> - (1ULL << KVM_FEATURE_CLOCKSOURCE2))) +> +> { +> +> - sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); +> +> - } +> +> + sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); +> +> } +> +> +> +> +> +Oh, I think I see what's going on. When you add 'kvm=off' +> +cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so +> +kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on +> +migration. +> +> +In case we really want to support 'kvm=off' I think we can add Hyper-V +> +features check here along with KVM, this should do the job. +Does the untested + +diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c +index 64283358f91d..e03b2ca6d8f6 100644 +--- a/hw/i386/kvm/clock.c ++++ b/hw/i386/kvm/clock.c +@@ -333,8 +333,9 @@ void kvmclock_create(void) + X86CPU *cpu = X86_CPU(first_cpu); + + if (kvm_enabled() && +- cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | +- (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { ++ ((cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | ++ (1ULL << KVM_FEATURE_CLOCKSOURCE2))) +|| ++ (cpu->env.features[FEAT_HYPERV_EAX] & HV_TIME_REF_COUNT_AVAILABLE))) { + sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); + } + } + +help? + +(I don't think we need to remove all 'if (kvm_enabled())' checks from +machine types as 'kvm=off' should not be related). + +-- +Vitaly + +On Wed, Sep 16, 2020 at 02:50:56PM +0200, Vitaly Kuznetsov wrote: +[...] + +> +>> +> +> +> +> +> +> Oh, I think I see what's going on. When you add 'kvm=off' +> +> cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so +> +> kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on +> +> migration. +> +> +> +> In case we really want to support 'kvm=off' I think we can add Hyper-V +> +> features check here along with KVM, this should do the job. +> +> +Does the untested +> +> +diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c +> +index 64283358f91d..e03b2ca6d8f6 100644 +> +--- a/hw/i386/kvm/clock.c +> ++++ b/hw/i386/kvm/clock.c +> +@@ -333,8 +333,9 @@ void kvmclock_create(void) +> +X86CPU *cpu = X86_CPU(first_cpu); +> +> +if (kvm_enabled() && +> +- cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | +> +- (1ULL << KVM_FEATURE_CLOCKSOURCE2))) { +> ++ ((cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) | +> ++ (1ULL << +> +KVM_FEATURE_CLOCKSOURCE2))) || +> ++ (cpu->env.features[FEAT_HYPERV_EAX] & +> +HV_TIME_REF_COUNT_AVAILABLE))) { +> +sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL); +> +} +> +} +> +> +help? +It appears to work :) + +> +> +(I don't think we need to remove all 'if (kvm_enabled())' checks from +> +machine types as 'kvm=off' should not be related). +Indeed (I didn't look at the macro, it was just quick & dirty). + +> +> +-- +> +Vitaly +> +> +-- +Antoine 'xdbob' Damhet +signature.asc +Description: +PGP signature + +On 16/09/20 13:29, Dr. David Alan Gilbert wrote: +> +> I have tracked the bug to the fact that `kvmclock` is not exposed and +> +> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). +> +> +> +> I think we should enable the `kvmclock` (qemu device) if `hv-time` is +> +> present and add Hyper-V support for the `kvmclock_current_nsec` +> +> function. +Yes, this seems correct. I would have to check but it may even be +better to _always_ send kvmclock data in the live migration stream. + +Paolo + +Paolo Bonzini <pbonzini@redhat.com> writes: + +> +On 16/09/20 13:29, Dr. David Alan Gilbert wrote: +> +>> I have tracked the bug to the fact that `kvmclock` is not exposed and +> +>> disabled from qemu PoV but is in fact used by `hv-time` (in KVM). +> +>> +> +>> I think we should enable the `kvmclock` (qemu device) if `hv-time` is +> +>> present and add Hyper-V support for the `kvmclock_current_nsec` +> +>> function. +> +> +Yes, this seems correct. I would have to check but it may even be +> +better to _always_ send kvmclock data in the live migration stream. +> +The question I have is: with 'kvm=off', do we actually restore TSC +reading on migration? (and I guess the answer is 'no' or Hyper-V TSC +page would 'just work' I guess). So yea, maybe dropping the +'cpu->env.features[FEAT_KVM]' check is the right fix. + +-- +Vitaly + |