summary refs log tree commit diff stats
path: root/results/classifier/zero-shot/012/debug
diff options
context:
space:
mode:
Diffstat (limited to 'results/classifier/zero-shot/012/debug')
-rw-r--r--results/classifier/zero-shot/012/debug/241903402074
-rw-r--r--results/classifier/zero-shot/012/debug/5356818196
-rw-r--r--results/classifier/zero-shot/012/debug/64571620803
3 files changed, 2973 insertions, 0 deletions
diff --git a/results/classifier/zero-shot/012/debug/24190340 b/results/classifier/zero-shot/012/debug/24190340
new file mode 100644
index 00000000..4413b8bc
--- /dev/null
+++ b/results/classifier/zero-shot/012/debug/24190340
@@ -0,0 +1,2074 @@
+debug: 0.876
+risc-v: 0.869
+register: 0.856
+permissions: 0.851
+TCG: 0.846
+device: 0.832
+performance: 0.832
+other: 0.811
+vnc: 0.808
+boot: 0.803
+arm: 0.794
+semantic: 0.793
+assembly: 0.790
+graphic: 0.775
+files: 0.770
+architecture: 0.769
+PID: 0.762
+mistranslation: 0.758
+network: 0.723
+socket: 0.715
+x86: 0.644
+kernel virtual machine: 0.617
+
+[BUG, RFC] Block graph deadlock on job-dismiss
+
+Hi all,
+
+There's a bug in block layer which leads to block graph deadlock.
+Notably, it takes place when blockdev IO is processed within a separate
+iothread.
+
+This was initially caught by our tests, and I was able to reduce it to a
+relatively simple reproducer.  Such deadlocks are probably supposed to
+be covered in iotests/graph-changes-while-io, but this deadlock isn't.
+
+Basically what the reproducer does is launches QEMU with a drive having
+'iothread' option set, creates a chain of 2 snapshots, launches
+block-commit job for a snapshot and then dismisses the job, starting
+from the lower snapshot.  If the guest is issuing IO at the same time,
+there's a race in acquiring block graph lock and a potential deadlock.
+
+Here's how it can be reproduced:
+
+1. Run QEMU:
+>
+SRCDIR=/path/to/srcdir
+>
+>
+>
+>
+>
+$SRCDIR/build/qemu-system-x86_64 -enable-kvm \
+>
+>
+-machine q35 -cpu Nehalem \
+>
+>
+-name guest=alma8-vm,debug-threads=on \
+>
+>
+-m 2g -smp 2 \
+>
+>
+-nographic -nodefaults \
+>
+>
+-qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \
+>
+>
+-serial unix:/var/run/alma8-serial.sock,server=on,wait=off \
+>
+>
+-object iothread,id=iothread0 \
+>
+>
+-blockdev
+>
+node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2
+>
+\
+>
+-device virtio-blk-pci,drive=disk,iothread=iothread0
+2. Launch IO (random reads) from within the guest:
+>
+nc -U /var/run/alma8-serial.sock
+>
+...
+>
+[root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k
+>
+--size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting
+>
+--rw=randread --iodepth=1 --filename=/testfile
+3. Run snapshots creation & removal of lower snapshot operation in a
+loop (script attached):
+>
+while /bin/true ; do ./remove_lower_snap.sh ; done
+And then it occasionally hangs.
+
+Note: I've tried bisecting this, and looks like deadlock occurs starting
+from the following commit:
+
+(BAD)  5bdbaebcce virtio: Re-enable notifications after drain
+(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll
+
+On the latest v10.0.0 it does hang as well.
+
+
+Here's backtrace of the main thread:
+
+>
+#0  0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1,
+>
+timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:43
+>
+#1  0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1,
+>
+timeout=-1) at ../util/qemu-timer.c:329
+>
+#2  0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20,
+>
+ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79
+>
+#3  0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at
+>
+../util/aio-posix.c:730
+>
+#4  0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950,
+>
+parent=0x0, poll=true) at ../block/io.c:378
+>
+#5  0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at
+>
+../block/io.c:391
+>
+#6  0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7682
+>
+#7  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7608
+>
+#8  0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7668
+>
+#9  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7608
+>
+#10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7668
+>
+#11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7608
+>
+#12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../blockjob.c:157
+>
+#13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7592
+>
+#14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7661
+>
+#15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx
+>
+(child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 =
+>
+{...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234
+>
+#16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7592
+>
+#17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0,
+>
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+errp=0x0)
+>
+at ../block.c:7661
+>
+#18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0,
+>
+ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715
+>
+#19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at
+>
+../block.c:3317
+>
+#20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at
+>
+../blockjob.c:209
+>
+#21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at
+>
+../blockjob.c:82
+>
+#22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at
+>
+../job.c:474
+>
+#23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at
+>
+../job.c:771
+>
+#24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400,
+>
+errp=0x7ffd94b4f488) at ../job.c:783
+>
+--Type <RET> for more, q to quit, c to continue without paging--
+>
+#25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 "commit-snap1",
+>
+errp=0x7ffd94b4f488) at ../job-qmp.c:138
+>
+#26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0,
+>
+ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221
+>
+#27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at
+>
+../qapi/qmp-dispatch.c:128
+>
+#28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at
+>
+../util/async.c:172
+>
+#29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at
+>
+../util/async.c:219
+>
+#30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at
+>
+../util/aio-posix.c:436
+>
+#31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200,
+>
+callback=0x0, user_data=0x0) at ../util/async.c:361
+>
+#32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at
+>
+../glib/gmain.c:3364
+>
+#33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079
+>
+#34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287
+>
+#35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at
+>
+../util/main-loop.c:310
+>
+#36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at
+>
+../util/main-loop.c:589
+>
+#37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835
+>
+#38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at
+>
+../system/main.c:50
+>
+#39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at
+>
+../system/main.c:80
+And here's coroutine trying to acquire read lock:
+
+>
+(gdb) qemu coroutine reader_queue->entries.sqh_first
+>
+#0  0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0,
+>
+to_=0x7fc537fff508, action=COROUTINE_YIELD) at
+>
+../util/coroutine-ucontext.c:321
+>
+#1  0x0000557eb47d4d4a in qemu_coroutine_yield () at
+>
+../util/qemu-coroutine.c:339
+>
+#2  0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0
+>
+<reader_queue>, lock=0x7fc53c57de50, flags=0) at
+>
+../util/qemu-coroutine-lock.c:60
+>
+#3  0x0000557eb461fea7 in bdrv_graph_co_rdlock () at ../block/graph-lock.c:231
+>
+#4  0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at
+>
+/home/root/src/qemu/master/include/block/graph-lock.h:213
+>
+#5  0x0000557eb460fa41 in blk_co_do_preadv_part
+>
+(blk=0x557eb84c0810, offset=6890553344, bytes=4096, qiov=0x7fc530006988,
+>
+qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1339
+>
+#6  0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at
+>
+../block/block-backend.c:1619
+>
+#7  0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) at
+>
+../util/coroutine-ucontext.c:175
+>
+#8  0x00007fc547c2a360 in __start_context () at
+>
+../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
+>
+#9  0x00007ffd94b4ea40 in  ()
+>
+#10 0x0000000000000000 in  ()
+So it looks like main thread is processing job-dismiss request and is
+holding write lock taken in block_job_remove_all_bdrv() (frame #20
+above).  At the same time iothread spawns a coroutine which performs IO
+request.  Before the coroutine is spawned, blk_aio_prwv() increases
+'in_flight' counter for Blk.  Then blk_co_do_preadv_part() (frame #5) is
+trying to acquire the read lock.  But main thread isn't releasing the
+lock as blk_root_drained_poll() returns true since blk->in_flight > 0.
+Here's the deadlock.
+
+Any comments and suggestions on the subject are welcomed.  Thanks!
+
+Andrey
+remove_lower_snap.sh
+Description:
+application/shellscript
+
+On 4/24/25 8:32 PM, Andrey Drobyshev wrote:
+>
+Hi all,
+>
+>
+There's a bug in block layer which leads to block graph deadlock.
+>
+Notably, it takes place when blockdev IO is processed within a separate
+>
+iothread.
+>
+>
+This was initially caught by our tests, and I was able to reduce it to a
+>
+relatively simple reproducer.  Such deadlocks are probably supposed to
+>
+be covered in iotests/graph-changes-while-io, but this deadlock isn't.
+>
+>
+Basically what the reproducer does is launches QEMU with a drive having
+>
+'iothread' option set, creates a chain of 2 snapshots, launches
+>
+block-commit job for a snapshot and then dismisses the job, starting
+>
+from the lower snapshot.  If the guest is issuing IO at the same time,
+>
+there's a race in acquiring block graph lock and a potential deadlock.
+>
+>
+Here's how it can be reproduced:
+>
+>
+[...]
+>
+I took a closer look at iotests/graph-changes-while-io, and have managed
+to reproduce the same deadlock in a much simpler setup, without a guest.
+
+1. Run QSD:> ./build/storage-daemon/qemu-storage-daemon --object
+iothread,id=iothread0 \
+>
+--blockdev null-co,node-name=node0,read-zeroes=true \
+>
+>
+--nbd-server addr.type=unix,addr.path=/var/run/qsd_nbd.sock \
+>
+>
+--export
+>
+nbd,id=exp0,node-name=node0,iothread=iothread0,fixed-iothread=true,writable=true
+>
+\
+>
+--chardev
+>
+socket,id=qmp-sock,path=/var/run/qsd_qmp.sock,server=on,wait=off \
+>
+--monitor chardev=qmp-sock
+2. Launch IO:
+>
+qemu-img bench -f raw -c 2000000
+>
+'nbd+unix:///node0?socket=/var/run/qsd_nbd.sock'
+3. Add 2 snapshots and remove lower one (script attached):> while
+/bin/true ; do ./rls_qsd.sh ; done
+
+And then it hangs.
+
+I'll also send a patch with corresponding test case added directly to
+iotests.
+
+This reproduce seems to be hanging starting from Fiona's commit
+67446e605dc ("blockjob: drop AioContext lock before calling
+bdrv_graph_wrlock()").  AioContext locks were dropped entirely later on
+in Stefan's commit b49f4755c7 ("block: remove AioContext locking"), but
+the problem remains.
+
+Andrey
+rls_qsd.sh
+Description:
+application/shellscript
+
+From: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
+
+This case is catching potential deadlock which takes place when job-dismiss
+is issued when I/O requests are processed in a separate iothread.
+
+See
+https://mail.gnu.org/archive/html/qemu-devel/2025-04/msg04421.html
+Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
+---
+ .../qemu-iotests/tests/graph-changes-while-io | 101 ++++++++++++++++--
+ .../tests/graph-changes-while-io.out          |   4 +-
+ 2 files changed, 96 insertions(+), 9 deletions(-)
+
+diff --git a/tests/qemu-iotests/tests/graph-changes-while-io 
+b/tests/qemu-iotests/tests/graph-changes-while-io
+index 194fda500e..e30f823da4 100755
+--- a/tests/qemu-iotests/tests/graph-changes-while-io
++++ b/tests/qemu-iotests/tests/graph-changes-while-io
+@@ -27,6 +27,8 @@ from iotests import imgfmt, qemu_img, qemu_img_create, 
+qemu_io, \
+ 
+ 
+ top = os.path.join(iotests.test_dir, 'top.img')
++snap1 = os.path.join(iotests.test_dir, 'snap1.img')
++snap2 = os.path.join(iotests.test_dir, 'snap2.img')
+ nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock')
+ 
+ 
+@@ -58,6 +60,15 @@ class TestGraphChangesWhileIO(QMPTestCase):
+     def tearDown(self) -> None:
+         self.qsd.stop()
+ 
++    def _wait_for_blockjob(self, status) -> None:
++        done = False
++        while not done:
++            for event in self.qsd.get_qmp().get_events(wait=10.0):
++                if event['event'] != 'JOB_STATUS_CHANGE':
++                    continue
++                if event['data']['status'] == status:
++                    done = True
++
+     def test_blockdev_add_while_io(self) -> None:
+         # Run qemu-img bench in the background
+         bench_thr = Thread(target=do_qemu_img_bench)
+@@ -116,13 +127,89 @@ class TestGraphChangesWhileIO(QMPTestCase):
+                 'device': 'job0',
+             })
+ 
+-            cancelled = False
+-            while not cancelled:
+-                for event in self.qsd.get_qmp().get_events(wait=10.0):
+-                    if event['event'] != 'JOB_STATUS_CHANGE':
+-                        continue
+-                    if event['data']['status'] == 'null':
+-                        cancelled = True
++            self._wait_for_blockjob('null')
++
++        bench_thr.join()
++
++    def test_remove_lower_snapshot_while_io(self) -> None:
++        # Run qemu-img bench in the background
++        bench_thr = Thread(target=do_qemu_img_bench, args=(100000, ))
++        bench_thr.start()
++
++        # While I/O is performed on 'node0' node, consequently add 2 snapshots
++        # on top of it, then remove (commit) them starting from lower one.
++        while bench_thr.is_alive():
++            # Recreate snapshot images on every iteration
++            qemu_img_create('-f', imgfmt, snap1, '1G')
++            qemu_img_create('-f', imgfmt, snap2, '1G')
++
++            self.qsd.cmd('blockdev-add', {
++                'driver': imgfmt,
++                'node-name': 'snap1',
++                'file': {
++                    'driver': 'file',
++                    'filename': snap1
++                }
++            })
++
++            self.qsd.cmd('blockdev-snapshot', {
++                'node': 'node0',
++                'overlay': 'snap1',
++            })
++
++            self.qsd.cmd('blockdev-add', {
++                'driver': imgfmt,
++                'node-name': 'snap2',
++                'file': {
++                    'driver': 'file',
++                    'filename': snap2
++                }
++            })
++
++            self.qsd.cmd('blockdev-snapshot', {
++                'node': 'snap1',
++                'overlay': 'snap2',
++            })
++
++            self.qsd.cmd('block-commit', {
++                'job-id': 'commit-snap1',
++                'device': 'snap2',
++                'top-node': 'snap1',
++                'base-node': 'node0',
++                'auto-finalize': True,
++                'auto-dismiss': False,
++            })
++
++            self._wait_for_blockjob('concluded')
++            self.qsd.cmd('job-dismiss', {
++                'id': 'commit-snap1',
++            })
++
++            self.qsd.cmd('block-commit', {
++                'job-id': 'commit-snap2',
++                'device': 'snap2',
++                'top-node': 'snap2',
++                'base-node': 'node0',
++                'auto-finalize': True,
++                'auto-dismiss': False,
++            })
++
++            self._wait_for_blockjob('ready')
++            self.qsd.cmd('job-complete', {
++                'id': 'commit-snap2',
++            })
++
++            self._wait_for_blockjob('concluded')
++            self.qsd.cmd('job-dismiss', {
++                'id': 'commit-snap2',
++            })
++
++            self.qsd.cmd('blockdev-del', {
++                'node-name': 'snap1'
++            })
++            self.qsd.cmd('blockdev-del', {
++                'node-name': 'snap2'
++            })
+ 
+         bench_thr.join()
+ 
+diff --git a/tests/qemu-iotests/tests/graph-changes-while-io.out 
+b/tests/qemu-iotests/tests/graph-changes-while-io.out
+index fbc63e62f8..8d7e996700 100644
+--- a/tests/qemu-iotests/tests/graph-changes-while-io.out
++++ b/tests/qemu-iotests/tests/graph-changes-while-io.out
+@@ -1,5 +1,5 @@
+-..
++...
+ ----------------------------------------------------------------------
+-Ran 2 tests
++Ran 3 tests
+ 
+ OK
+-- 
+2.43.5
+
+Am 24.04.25 um 19:32 schrieb Andrey Drobyshev:
+>
+So it looks like main thread is processing job-dismiss request and is
+>
+holding write lock taken in block_job_remove_all_bdrv() (frame #20
+>
+above).  At the same time iothread spawns a coroutine which performs IO
+>
+request.  Before the coroutine is spawned, blk_aio_prwv() increases
+>
+'in_flight' counter for Blk.  Then blk_co_do_preadv_part() (frame #5) is
+>
+trying to acquire the read lock.  But main thread isn't releasing the
+>
+lock as blk_root_drained_poll() returns true since blk->in_flight > 0.
+>
+Here's the deadlock.
+And for the IO test you provided, it's client->nb_requests that behaves
+similarly to blk->in_flight here.
+
+The issue also reproduces easily when issuing the following QMP command
+in a loop while doing IO on a device:
+
+>
+void qmp_block_locked_drain(const char *node_name, Error **errp)
+>
+{
+>
+BlockDriverState *bs;
+>
+>
+bs = bdrv_find_node(node_name);
+>
+if (!bs) {
+>
+error_setg(errp, "node not found");
+>
+return;
+>
+}
+>
+>
+bdrv_graph_wrlock();
+>
+bdrv_drained_begin(bs);
+>
+bdrv_drained_end(bs);
+>
+bdrv_graph_wrunlock();
+>
+}
+It seems like either it would be necessary to require:
+1. not draining inside an exclusively locked section
+or
+2. making sure that variables used by drained_poll routines are only set
+while holding the reader lock
+?
+
+Those seem to require rather involved changes, so a third option might
+be to make draining inside an exclusively locked section possible, by
+embedding such locked sections in a drained section:
+
+>
+diff --git a/blockjob.c b/blockjob.c
+>
+index 32007f31a9..9b2f3b3ea9 100644
+>
+--- a/blockjob.c
+>
++++ b/blockjob.c
+>
+@@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
+>
+* one to make sure that such a concurrent access does not attempt
+>
+* to process an already freed BdrvChild.
+>
+*/
+>
++    bdrv_drain_all_begin();
+>
+bdrv_graph_wrlock();
+>
+while (job->nodes) {
+>
+GSList *l = job->nodes;
+>
+@@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
+>
+g_slist_free_1(l);
+>
+}
+>
+bdrv_graph_wrunlock();
+>
++    bdrv_drain_all_end();
+>
+}
+>
+>
+bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
+This seems to fix the issue at hand. I can send a patch if this is
+considered an acceptable approach.
+
+Best Regards,
+Fiona
+
+On 4/30/25 11:47 AM, Fiona Ebner wrote:
+>
+Am 24.04.25 um 19:32 schrieb Andrey Drobyshev:
+>
+> So it looks like main thread is processing job-dismiss request and is
+>
+> holding write lock taken in block_job_remove_all_bdrv() (frame #20
+>
+> above).  At the same time iothread spawns a coroutine which performs IO
+>
+> request.  Before the coroutine is spawned, blk_aio_prwv() increases
+>
+> 'in_flight' counter for Blk.  Then blk_co_do_preadv_part() (frame #5) is
+>
+> trying to acquire the read lock.  But main thread isn't releasing the
+>
+> lock as blk_root_drained_poll() returns true since blk->in_flight > 0.
+>
+> Here's the deadlock.
+>
+>
+And for the IO test you provided, it's client->nb_requests that behaves
+>
+similarly to blk->in_flight here.
+>
+>
+The issue also reproduces easily when issuing the following QMP command
+>
+in a loop while doing IO on a device:
+>
+>
+> void qmp_block_locked_drain(const char *node_name, Error **errp)
+>
+> {
+>
+>     BlockDriverState *bs;
+>
+>
+>
+>     bs = bdrv_find_node(node_name);
+>
+>     if (!bs) {
+>
+>         error_setg(errp, "node not found");
+>
+>         return;
+>
+>     }
+>
+>
+>
+>     bdrv_graph_wrlock();
+>
+>     bdrv_drained_begin(bs);
+>
+>     bdrv_drained_end(bs);
+>
+>     bdrv_graph_wrunlock();
+>
+> }
+>
+>
+It seems like either it would be necessary to require:
+>
+1. not draining inside an exclusively locked section
+>
+or
+>
+2. making sure that variables used by drained_poll routines are only set
+>
+while holding the reader lock
+>
+?
+>
+>
+Those seem to require rather involved changes, so a third option might
+>
+be to make draining inside an exclusively locked section possible, by
+>
+embedding such locked sections in a drained section:
+>
+>
+> diff --git a/blockjob.c b/blockjob.c
+>
+> index 32007f31a9..9b2f3b3ea9 100644
+>
+> --- a/blockjob.c
+>
+> +++ b/blockjob.c
+>
+> @@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
+>
+>       * one to make sure that such a concurrent access does not attempt
+>
+>       * to process an already freed BdrvChild.
+>
+>       */
+>
+> +    bdrv_drain_all_begin();
+>
+>      bdrv_graph_wrlock();
+>
+>      while (job->nodes) {
+>
+>          GSList *l = job->nodes;
+>
+> @@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
+>
+>          g_slist_free_1(l);
+>
+>      }
+>
+>      bdrv_graph_wrunlock();
+>
+> +    bdrv_drain_all_end();
+>
+>  }
+>
+>
+>
+>  bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
+>
+>
+This seems to fix the issue at hand. I can send a patch if this is
+>
+considered an acceptable approach.
+>
+>
+Best Regards,
+>
+Fiona
+>
+Hello Fiona,
+
+Thanks for looking into it.  I've tried your 3rd option above and can
+confirm it does fix the deadlock, at least I can't reproduce it.  Other
+iotests also don't seem to be breaking.  So I personally am fine with
+that patch.  Would be nice to hear a word from the maintainers though on
+whether there're any caveats with such approach.
+
+Andrey
+
+On Wed, Apr 30, 2025 at 10:11 AM Andrey Drobyshev
+<andrey.drobyshev@virtuozzo.com> wrote:
+>
+>
+On 4/30/25 11:47 AM, Fiona Ebner wrote:
+>
+> Am 24.04.25 um 19:32 schrieb Andrey Drobyshev:
+>
+>> So it looks like main thread is processing job-dismiss request and is
+>
+>> holding write lock taken in block_job_remove_all_bdrv() (frame #20
+>
+>> above).  At the same time iothread spawns a coroutine which performs IO
+>
+>> request.  Before the coroutine is spawned, blk_aio_prwv() increases
+>
+>> 'in_flight' counter for Blk.  Then blk_co_do_preadv_part() (frame #5) is
+>
+>> trying to acquire the read lock.  But main thread isn't releasing the
+>
+>> lock as blk_root_drained_poll() returns true since blk->in_flight > 0.
+>
+>> Here's the deadlock.
+>
+>
+>
+> And for the IO test you provided, it's client->nb_requests that behaves
+>
+> similarly to blk->in_flight here.
+>
+>
+>
+> The issue also reproduces easily when issuing the following QMP command
+>
+> in a loop while doing IO on a device:
+>
+>
+>
+>> void qmp_block_locked_drain(const char *node_name, Error **errp)
+>
+>> {
+>
+>>     BlockDriverState *bs;
+>
+>>
+>
+>>     bs = bdrv_find_node(node_name);
+>
+>>     if (!bs) {
+>
+>>         error_setg(errp, "node not found");
+>
+>>         return;
+>
+>>     }
+>
+>>
+>
+>>     bdrv_graph_wrlock();
+>
+>>     bdrv_drained_begin(bs);
+>
+>>     bdrv_drained_end(bs);
+>
+>>     bdrv_graph_wrunlock();
+>
+>> }
+>
+>
+>
+> It seems like either it would be necessary to require:
+>
+> 1. not draining inside an exclusively locked section
+>
+> or
+>
+> 2. making sure that variables used by drained_poll routines are only set
+>
+> while holding the reader lock
+>
+> ?
+>
+>
+>
+> Those seem to require rather involved changes, so a third option might
+>
+> be to make draining inside an exclusively locked section possible, by
+>
+> embedding such locked sections in a drained section:
+>
+>
+>
+>> diff --git a/blockjob.c b/blockjob.c
+>
+>> index 32007f31a9..9b2f3b3ea9 100644
+>
+>> --- a/blockjob.c
+>
+>> +++ b/blockjob.c
+>
+>> @@ -198,6 +198,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
+>
+>>       * one to make sure that such a concurrent access does not attempt
+>
+>>       * to process an already freed BdrvChild.
+>
+>>       */
+>
+>> +    bdrv_drain_all_begin();
+>
+>>      bdrv_graph_wrlock();
+>
+>>      while (job->nodes) {
+>
+>>          GSList *l = job->nodes;
+>
+>> @@ -211,6 +212,7 @@ void block_job_remove_all_bdrv(BlockJob *job)
+>
+>>          g_slist_free_1(l);
+>
+>>      }
+>
+>>      bdrv_graph_wrunlock();
+>
+>> +    bdrv_drain_all_end();
+>
+>>  }
+>
+>>
+>
+>>  bool block_job_has_bdrv(BlockJob *job, BlockDriverState *bs)
+>
+>
+>
+> This seems to fix the issue at hand. I can send a patch if this is
+>
+> considered an acceptable approach.
+Kevin is aware of this thread but it's a public holiday tomorrow so it
+may be a little longer.
+
+Stefan
+
+Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben:
+>
+Hi all,
+>
+>
+There's a bug in block layer which leads to block graph deadlock.
+>
+Notably, it takes place when blockdev IO is processed within a separate
+>
+iothread.
+>
+>
+This was initially caught by our tests, and I was able to reduce it to a
+>
+relatively simple reproducer.  Such deadlocks are probably supposed to
+>
+be covered in iotests/graph-changes-while-io, but this deadlock isn't.
+>
+>
+Basically what the reproducer does is launches QEMU with a drive having
+>
+'iothread' option set, creates a chain of 2 snapshots, launches
+>
+block-commit job for a snapshot and then dismisses the job, starting
+>
+from the lower snapshot.  If the guest is issuing IO at the same time,
+>
+there's a race in acquiring block graph lock and a potential deadlock.
+>
+>
+Here's how it can be reproduced:
+>
+>
+1. Run QEMU:
+>
+> SRCDIR=/path/to/srcdir
+>
+>
+>
+>
+>
+>
+>
+>
+>
+> $SRCDIR/build/qemu-system-x86_64 -enable-kvm \
+>
+>
+>
+>   -machine q35 -cpu Nehalem \
+>
+>
+>
+>   -name guest=alma8-vm,debug-threads=on \
+>
+>
+>
+>   -m 2g -smp 2 \
+>
+>
+>
+>   -nographic -nodefaults \
+>
+>
+>
+>   -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \
+>
+>
+>
+>   -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \
+>
+>
+>
+>   -object iothread,id=iothread0 \
+>
+>
+>
+>   -blockdev
+>
+> node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2
+>
+>  \
+>
+>   -device virtio-blk-pci,drive=disk,iothread=iothread0
+>
+>
+2. Launch IO (random reads) from within the guest:
+>
+> nc -U /var/run/alma8-serial.sock
+>
+> ...
+>
+> [root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k
+>
+> --size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting
+>
+> --rw=randread --iodepth=1 --filename=/testfile
+>
+>
+3. Run snapshots creation & removal of lower snapshot operation in a
+>
+loop (script attached):
+>
+> while /bin/true ; do ./remove_lower_snap.sh ; done
+>
+>
+And then it occasionally hangs.
+>
+>
+Note: I've tried bisecting this, and looks like deadlock occurs starting
+>
+from the following commit:
+>
+>
+(BAD)  5bdbaebcce virtio: Re-enable notifications after drain
+>
+(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll
+>
+>
+On the latest v10.0.0 it does hang as well.
+>
+>
+>
+Here's backtrace of the main thread:
+>
+>
+> #0  0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1,
+>
+> timeout=<optimized out>, sigmask=0x0) at
+>
+> ../sysdeps/unix/sysv/linux/ppoll.c:43
+>
+> #1  0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1,
+>
+> timeout=-1) at ../util/qemu-timer.c:329
+>
+> #2  0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20,
+>
+> ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79
+>
+> #3  0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at
+>
+> ../util/aio-posix.c:730
+>
+> #4  0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950,
+>
+> parent=0x0, poll=true) at ../block/io.c:378
+>
+> #5  0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at
+>
+> ../block/io.c:391
+>
+> #6  0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7682
+>
+> #7  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7608
+>
+> #8  0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7668
+>
+> #9  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7608
+>
+> #10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7668
+>
+> #11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7608
+>
+> #12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../blockjob.c:157
+>
+> #13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7592
+>
+> #14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7661
+>
+> #15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx
+>
+>     (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 =
+>
+> {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234
+>
+> #16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7592
+>
+> #17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0,
+>
+> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160,
+>
+> errp=0x0)
+>
+>     at ../block.c:7661
+>
+> #18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0,
+>
+> ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715
+>
+> #19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at
+>
+> ../block.c:3317
+>
+> #20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at
+>
+> ../blockjob.c:209
+>
+> #21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at
+>
+> ../blockjob.c:82
+>
+> #22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at
+>
+> ../job.c:474
+>
+> #23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at
+>
+> ../job.c:771
+>
+> #24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400,
+>
+> errp=0x7ffd94b4f488) at ../job.c:783
+>
+> --Type <RET> for more, q to quit, c to continue without paging--
+>
+> #25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0
+>
+> "commit-snap1", errp=0x7ffd94b4f488) at ../job-qmp.c:138
+>
+> #26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0,
+>
+> ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221
+>
+> #27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at
+>
+> ../qapi/qmp-dispatch.c:128
+>
+> #28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at
+>
+> ../util/async.c:172
+>
+> #29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at
+>
+> ../util/async.c:219
+>
+> #30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at
+>
+> ../util/aio-posix.c:436
+>
+> #31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200,
+>
+> callback=0x0, user_data=0x0) at ../util/async.c:361
+>
+> #32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at
+>
+> ../glib/gmain.c:3364
+>
+> #33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079
+>
+> #34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287
+>
+> #35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at
+>
+> ../util/main-loop.c:310
+>
+> #36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at
+>
+> ../util/main-loop.c:589
+>
+> #37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835
+>
+> #38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at
+>
+> ../system/main.c:50
+>
+> #39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at
+>
+> ../system/main.c:80
+>
+>
+>
+And here's coroutine trying to acquire read lock:
+>
+>
+> (gdb) qemu coroutine reader_queue->entries.sqh_first
+>
+> #0  0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0,
+>
+> to_=0x7fc537fff508, action=COROUTINE_YIELD) at
+>
+> ../util/coroutine-ucontext.c:321
+>
+> #1  0x0000557eb47d4d4a in qemu_coroutine_yield () at
+>
+> ../util/qemu-coroutine.c:339
+>
+> #2  0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0
+>
+> <reader_queue>, lock=0x7fc53c57de50, flags=0) at
+>
+> ../util/qemu-coroutine-lock.c:60
+>
+> #3  0x0000557eb461fea7 in bdrv_graph_co_rdlock () at
+>
+> ../block/graph-lock.c:231
+>
+> #4  0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at
+>
+> /home/root/src/qemu/master/include/block/graph-lock.h:213
+>
+> #5  0x0000557eb460fa41 in blk_co_do_preadv_part
+>
+>     (blk=0x557eb84c0810, offset=6890553344, bytes=4096,
+>
+> qiov=0x7fc530006988, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at
+>
+> ../block/block-backend.c:1339
+>
+> #6  0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at
+>
+> ../block/block-backend.c:1619
+>
+> #7  0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886)
+>
+> at ../util/coroutine-ucontext.c:175
+>
+> #8  0x00007fc547c2a360 in __start_context () at
+>
+> ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
+>
+> #9  0x00007ffd94b4ea40 in  ()
+>
+> #10 0x0000000000000000 in  ()
+>
+>
+>
+So it looks like main thread is processing job-dismiss request and is
+>
+holding write lock taken in block_job_remove_all_bdrv() (frame #20
+>
+above).  At the same time iothread spawns a coroutine which performs IO
+>
+request.  Before the coroutine is spawned, blk_aio_prwv() increases
+>
+'in_flight' counter for Blk.  Then blk_co_do_preadv_part() (frame #5) is
+>
+trying to acquire the read lock.  But main thread isn't releasing the
+>
+lock as blk_root_drained_poll() returns true since blk->in_flight > 0.
+>
+Here's the deadlock.
+>
+>
+Any comments and suggestions on the subject are welcomed.  Thanks!
+I think this is what the blk_wait_while_drained() call was supposed to
+address in blk_co_do_preadv_part(). However, with the use of multiple
+I/O threads, this is racy.
+
+Do you think that in your case we hit the small race window between the
+checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there
+another reason why blk_wait_while_drained() didn't do its job?
+
+Kevin
+
+On 5/2/25 19:34, Kevin Wolf wrote:
+Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben:
+Hi all,
+
+There's a bug in block layer which leads to block graph deadlock.
+Notably, it takes place when blockdev IO is processed within a separate
+iothread.
+
+This was initially caught by our tests, and I was able to reduce it to a
+relatively simple reproducer.  Such deadlocks are probably supposed to
+be covered in iotests/graph-changes-while-io, but this deadlock isn't.
+
+Basically what the reproducer does is launches QEMU with a drive having
+'iothread' option set, creates a chain of 2 snapshots, launches
+block-commit job for a snapshot and then dismisses the job, starting
+from the lower snapshot.  If the guest is issuing IO at the same time,
+there's a race in acquiring block graph lock and a potential deadlock.
+
+Here's how it can be reproduced:
+
+1. Run QEMU:
+SRCDIR=/path/to/srcdir
+$SRCDIR/build/qemu-system-x86_64 -enable-kvm \
+-machine q35 -cpu Nehalem \
+   -name guest=alma8-vm,debug-threads=on \
+   -m 2g -smp 2 \
+   -nographic -nodefaults \
+   -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \
+   -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \
+   -object iothread,id=iothread0 \
+   -blockdev 
+node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2
+ \
+   -device virtio-blk-pci,drive=disk,iothread=iothread0
+2. Launch IO (random reads) from within the guest:
+nc -U /var/run/alma8-serial.sock
+...
+[root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k 
+--size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting 
+--rw=randread --iodepth=1 --filename=/testfile
+3. Run snapshots creation & removal of lower snapshot operation in a
+loop (script attached):
+while /bin/true ; do ./remove_lower_snap.sh ; done
+And then it occasionally hangs.
+
+Note: I've tried bisecting this, and looks like deadlock occurs starting
+from the following commit:
+
+(BAD)  5bdbaebcce virtio: Re-enable notifications after drain
+(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll
+
+On the latest v10.0.0 it does hang as well.
+
+
+Here's backtrace of the main thread:
+#0  0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, timeout=<optimized 
+out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:43
+#1  0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, timeout=-1) 
+at ../util/qemu-timer.c:329
+#2  0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, 
+ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79
+#3  0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at 
+../util/aio-posix.c:730
+#4  0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, parent=0x0, 
+poll=true) at ../block/io.c:378
+#5  0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at 
+../block/io.c:391
+#6  0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7682
+#7  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7608
+#8  0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7668
+#9  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7608
+#10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7668
+#11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7608
+#12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../blockjob.c:157
+#13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7592
+#14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7661
+#15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx
+     (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, 
+tran=0x557eb7a87160, errp=0x0) at ../block.c:1234
+#16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7592
+#17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, 
+ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
+errp=0x0)
+     at ../block.c:7661
+#18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, 
+ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715
+#19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at 
+../block.c:3317
+#20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at 
+../blockjob.c:209
+#21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at 
+../blockjob.c:82
+#22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at ../job.c:474
+#23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at 
+../job.c:771
+#24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, 
+errp=0x7ffd94b4f488) at ../job.c:783
+--Type <RET> for more, q to quit, c to continue without paging--
+#25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 "commit-snap1", 
+errp=0x7ffd94b4f488) at ../job-qmp.c:138
+#26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, 
+ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221
+#27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at 
+../qapi/qmp-dispatch.c:128
+#28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at ../util/async.c:172
+#29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at 
+../util/async.c:219
+#30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at 
+../util/aio-posix.c:436
+#31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, 
+callback=0x0, user_data=0x0) at ../util/async.c:361
+#32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at 
+../glib/gmain.c:3364
+#33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079
+#34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287
+#35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at 
+../util/main-loop.c:310
+#36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at 
+../util/main-loop.c:589
+#37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835
+#38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at ../system/main.c:50
+#39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at 
+../system/main.c:80
+And here's coroutine trying to acquire read lock:
+(gdb) qemu coroutine reader_queue->entries.sqh_first
+#0  0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, 
+to_=0x7fc537fff508, action=COROUTINE_YIELD) at ../util/coroutine-ucontext.c:321
+#1  0x0000557eb47d4d4a in qemu_coroutine_yield () at 
+../util/qemu-coroutine.c:339
+#2  0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 
+<reader_queue>, lock=0x7fc53c57de50, flags=0) at 
+../util/qemu-coroutine-lock.c:60
+#3  0x0000557eb461fea7 in bdrv_graph_co_rdlock () at ../block/graph-lock.c:231
+#4  0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at 
+/home/root/src/qemu/master/include/block/graph-lock.h:213
+#5  0x0000557eb460fa41 in blk_co_do_preadv_part
+     (blk=0x557eb84c0810, offset=6890553344, bytes=4096, qiov=0x7fc530006988, 
+qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1339
+#6  0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at 
+../block/block-backend.c:1619
+#7  0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) at 
+../util/coroutine-ucontext.c:175
+#8  0x00007fc547c2a360 in __start_context () at 
+../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
+#9  0x00007ffd94b4ea40 in  ()
+#10 0x0000000000000000 in  ()
+So it looks like main thread is processing job-dismiss request and is
+holding write lock taken in block_job_remove_all_bdrv() (frame #20
+above).  At the same time iothread spawns a coroutine which performs IO
+request.  Before the coroutine is spawned, blk_aio_prwv() increases
+'in_flight' counter for Blk.  Then blk_co_do_preadv_part() (frame #5) is
+trying to acquire the read lock.  But main thread isn't releasing the
+lock as blk_root_drained_poll() returns true since blk->in_flight > 0.
+Here's the deadlock.
+
+Any comments and suggestions on the subject are welcomed.  Thanks!
+I think this is what the blk_wait_while_drained() call was supposed to
+address in blk_co_do_preadv_part(). However, with the use of multiple
+I/O threads, this is racy.
+
+Do you think that in your case we hit the small race window between the
+checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there
+another reason why blk_wait_while_drained() didn't do its job?
+
+Kevin
+At my opinion there is very big race window. Main thread has
+eaten graph write lock. After that another coroutine is stalled
+within GRAPH_RDLOCK_GUARD() as there is no drain at the moment and only
+after that main thread has started drain. That is why Fiona's idea is
+looking working. Though this would mean that normally we should always
+do that at the moment when we acquire write lock. May be even inside
+this function. Den
+
+Am 02.05.2025 um 19:52 hat Denis V. Lunev geschrieben:
+>
+On 5/2/25 19:34, Kevin Wolf wrote:
+>
+> Am 24.04.2025 um 19:32 hat Andrey Drobyshev geschrieben:
+>
+> > Hi all,
+>
+> >
+>
+> > There's a bug in block layer which leads to block graph deadlock.
+>
+> > Notably, it takes place when blockdev IO is processed within a separate
+>
+> > iothread.
+>
+> >
+>
+> > This was initially caught by our tests, and I was able to reduce it to a
+>
+> > relatively simple reproducer.  Such deadlocks are probably supposed to
+>
+> > be covered in iotests/graph-changes-while-io, but this deadlock isn't.
+>
+> >
+>
+> > Basically what the reproducer does is launches QEMU with a drive having
+>
+> > 'iothread' option set, creates a chain of 2 snapshots, launches
+>
+> > block-commit job for a snapshot and then dismisses the job, starting
+>
+> > from the lower snapshot.  If the guest is issuing IO at the same time,
+>
+> > there's a race in acquiring block graph lock and a potential deadlock.
+>
+> >
+>
+> > Here's how it can be reproduced:
+>
+> >
+>
+> > 1. Run QEMU:
+>
+> > > SRCDIR=/path/to/srcdir
+>
+> > > $SRCDIR/build/qemu-system-x86_64 -enable-kvm \
+>
+> > >    -machine q35 -cpu Nehalem \
+>
+> > >    -name guest=alma8-vm,debug-threads=on \
+>
+> > >    -m 2g -smp 2 \
+>
+> > >    -nographic -nodefaults \
+>
+> > >    -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \
+>
+> > >    -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \
+>
+> > >    -object iothread,id=iothread0 \
+>
+> > >    -blockdev
+>
+> > > node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2
+>
+> > >  \
+>
+> > >    -device virtio-blk-pci,drive=disk,iothread=iothread0
+>
+> > 2. Launch IO (random reads) from within the guest:
+>
+> > > nc -U /var/run/alma8-serial.sock
+>
+> > > ...
+>
+> > > [root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1
+>
+> > > --bs=4k --size=1G --numjobs=1 --time_based=1 --runtime=300
+>
+> > > --group_reporting --rw=randread --iodepth=1 --filename=/testfile
+>
+> > 3. Run snapshots creation & removal of lower snapshot operation in a
+>
+> > loop (script attached):
+>
+> > > while /bin/true ; do ./remove_lower_snap.sh ; done
+>
+> > And then it occasionally hangs.
+>
+> >
+>
+> > Note: I've tried bisecting this, and looks like deadlock occurs starting
+>
+> > from the following commit:
+>
+> >
+>
+> > (BAD)  5bdbaebcce virtio: Re-enable notifications after drain
+>
+> > (GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll
+>
+> >
+>
+> > On the latest v10.0.0 it does hang as well.
+>
+> >
+>
+> >
+>
+> > Here's backtrace of the main thread:
+>
+> >
+>
+> > > #0  0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1,
+>
+> > > timeout=<optimized out>, sigmask=0x0) at
+>
+> > > ../sysdeps/unix/sysv/linux/ppoll.c:43
+>
+> > > #1  0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1,
+>
+> > > timeout=-1) at ../util/qemu-timer.c:329
+>
+> > > #2  0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20,
+>
+> > > ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79
+>
+> > > #3  0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true)
+>
+> > > at ../util/aio-posix.c:730
+>
+> > > #4  0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950,
+>
+> > > parent=0x0, poll=true) at ../block/io.c:378
+>
+> > > #5  0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at
+>
+> > > ../block/io.c:391
+>
+> > > #6  0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950,
+>
+> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7682
+>
+> > > #7  0x0000557eb45ebf2b in bdrv_child_change_aio_context
+>
+> > > (c=0x557eb7964250, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7608
+>
+> > > #8  0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0,
+>
+> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7668
+>
+> > > #9  0x0000557eb45ebf2b in bdrv_child_change_aio_context
+>
+> > > (c=0x557eb7e59110, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7608
+>
+> > > #10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960,
+>
+> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7668
+>
+> > > #11 0x0000557eb45ebf2b in bdrv_child_change_aio_context
+>
+> > > (c=0x557eb814ed80, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7608
+>
+> > > #12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0,
+>
+> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../blockjob.c:157
+>
+> > > #13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context
+>
+> > > (c=0x557eb7c9d3f0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7592
+>
+> > > #14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310,
+>
+> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7661
+>
+> > > #15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx
+>
+> > >      (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60
+>
+> > > = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234
+>
+> > > #16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context
+>
+> > > (c=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7592
+>
+> > > #17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0,
+>
+> > > ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...},
+>
+> > > tran=0x557eb7a87160, errp=0x0)
+>
+> > >      at ../block.c:7661
+>
+> > > #18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context
+>
+> > > (bs=0x557eb79575e0, ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at
+>
+> > > ../block.c:7715
+>
+> > > #19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30)
+>
+> > > at ../block.c:3317
+>
+> > > #20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv
+>
+> > > (job=0x557eb7952800) at ../blockjob.c:209
+>
+> > > #21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at
+>
+> > > ../blockjob.c:82
+>
+> > > #22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at
+>
+> > > ../job.c:474
+>
+> > > #23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at
+>
+> > > ../job.c:771
+>
+> > > #24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400,
+>
+> > > errp=0x7ffd94b4f488) at ../job.c:783
+>
+> > > --Type <RET> for more, q to quit, c to continue without paging--
+>
+> > > #25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0
+>
+> > > "commit-snap1", errp=0x7ffd94b4f488) at ../job-qmp.c:138
+>
+> > > #26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0,
+>
+> > > ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221
+>
+> > > #27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at
+>
+> > > ../qapi/qmp-dispatch.c:128
+>
+> > > #28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at
+>
+> > > ../util/async.c:172
+>
+> > > #29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at
+>
+> > > ../util/async.c:219
+>
+> > > #30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at
+>
+> > > ../util/aio-posix.c:436
+>
+> > > #31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200,
+>
+> > > callback=0x0, user_data=0x0) at ../util/async.c:361
+>
+> > > #32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at
+>
+> > > ../glib/gmain.c:3364
+>
+> > > #33 g_main_context_dispatch (context=0x557eb76c6430) at
+>
+> > > ../glib/gmain.c:4079
+>
+> > > #34 0x0000557eb47d3ab1 in glib_pollfds_poll () at
+>
+> > > ../util/main-loop.c:287
+>
+> > > #35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at
+>
+> > > ../util/main-loop.c:310
+>
+> > > #36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at
+>
+> > > ../util/main-loop.c:589
+>
+> > > #37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835
+>
+> > > #38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at
+>
+> > > ../system/main.c:50
+>
+> > > #39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at
+>
+> > > ../system/main.c:80
+>
+> >
+>
+> > And here's coroutine trying to acquire read lock:
+>
+> >
+>
+> > > (gdb) qemu coroutine reader_queue->entries.sqh_first
+>
+> > > #0  0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0,
+>
+> > > to_=0x7fc537fff508, action=COROUTINE_YIELD) at
+>
+> > > ../util/coroutine-ucontext.c:321
+>
+> > > #1  0x0000557eb47d4d4a in qemu_coroutine_yield () at
+>
+> > > ../util/qemu-coroutine.c:339
+>
+> > > #2  0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0
+>
+> > > <reader_queue>, lock=0x7fc53c57de50, flags=0) at
+>
+> > > ../util/qemu-coroutine-lock.c:60
+>
+> > > #3  0x0000557eb461fea7 in bdrv_graph_co_rdlock () at
+>
+> > > ../block/graph-lock.c:231
+>
+> > > #4  0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3)
+>
+> > > at /home/root/src/qemu/master/include/block/graph-lock.h:213
+>
+> > > #5  0x0000557eb460fa41 in blk_co_do_preadv_part
+>
+> > >      (blk=0x557eb84c0810, offset=6890553344, bytes=4096,
+>
+> > > qiov=0x7fc530006988, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at
+>
+> > > ../block/block-backend.c:1339
+>
+> > > #6  0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at
+>
+> > > ../block/block-backend.c:1619
+>
+> > > #7  0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040,
+>
+> > > i1=21886) at ../util/coroutine-ucontext.c:175
+>
+> > > #8  0x00007fc547c2a360 in __start_context () at
+>
+> > > ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
+>
+> > > #9  0x00007ffd94b4ea40 in  ()
+>
+> > > #10 0x0000000000000000 in  ()
+>
+> >
+>
+> > So it looks like main thread is processing job-dismiss request and is
+>
+> > holding write lock taken in block_job_remove_all_bdrv() (frame #20
+>
+> > above).  At the same time iothread spawns a coroutine which performs IO
+>
+> > request.  Before the coroutine is spawned, blk_aio_prwv() increases
+>
+> > 'in_flight' counter for Blk.  Then blk_co_do_preadv_part() (frame #5) is
+>
+> > trying to acquire the read lock.  But main thread isn't releasing the
+>
+> > lock as blk_root_drained_poll() returns true since blk->in_flight > 0.
+>
+> > Here's the deadlock.
+>
+> >
+>
+> > Any comments and suggestions on the subject are welcomed.  Thanks!
+>
+> I think this is what the blk_wait_while_drained() call was supposed to
+>
+> address in blk_co_do_preadv_part(). However, with the use of multiple
+>
+> I/O threads, this is racy.
+>
+>
+>
+> Do you think that in your case we hit the small race window between the
+>
+> checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there
+>
+> another reason why blk_wait_while_drained() didn't do its job?
+>
+>
+>
+At my opinion there is very big race window. Main thread has
+>
+eaten graph write lock. After that another coroutine is stalled
+>
+within GRAPH_RDLOCK_GUARD() as there is no drain at the moment and only
+>
+after that main thread has started drain.
+You're right, I confused taking the write lock with draining there.
+
+>
+That is why Fiona's idea is looking working. Though this would mean
+>
+that normally we should always do that at the moment when we acquire
+>
+write lock. May be even inside this function.
+I actually see now that not all of my graph locking patches were merged.
+At least I did have the thought that bdrv_drained_begin() must be marked
+GRAPH_UNLOCKED because it polls. That means that calling it from inside
+bdrv_try_change_aio_context() is actually forbidden (and that's the part
+I didn't see back then because it doesn't have TSA annotations).
+
+If you refactor the code to move the drain out to before the lock is
+taken, I think you end up with Fiona's patch, except you'll remove the
+forbidden inner drain and add more annotations for some functions and
+clarify the rules around them. I don't know, but I wouldn't be surprised
+if along the process we find other bugs, too.
+
+So Fiona's drain looks right to me, but we should probably approach it
+more systematically.
+
+Kevin
+
diff --git a/results/classifier/zero-shot/012/debug/53568181 b/results/classifier/zero-shot/012/debug/53568181
new file mode 100644
index 00000000..c79e6f59
--- /dev/null
+++ b/results/classifier/zero-shot/012/debug/53568181
@@ -0,0 +1,96 @@
+debug: 0.968
+permissions: 0.965
+performance: 0.948
+architecture: 0.945
+semantic: 0.943
+graphic: 0.940
+PID: 0.938
+assembly: 0.936
+device: 0.936
+vnc: 0.935
+x86: 0.932
+register: 0.930
+network: 0.925
+other: 0.921
+TCG: 0.921
+arm: 0.918
+risc-v: 0.907
+files: 0.890
+kernel virtual machine: 0.877
+boot: 0.876
+socket: 0.875
+mistranslation: 0.854
+
+[BUG] x86/PAT handling severely crippled AMD-V SVM KVM performance
+
+Hi, I maintain an out-of-tree 3D APIs pass-through QEMU device models at
+https://github.com/kjliew/qemu-3dfx
+that provide 3D acceleration for legacy
+32-bit Windows guests (Win98SE, WinME, Win2k and WinXP) with the focus on
+playing old legacy games from 1996-2003. It currently supports the now-defunct
+3Dfx propriety API called Glide and an alternative OpenGL pass-through based on
+MESA implementation.
+
+The basic concept of both implementations create memory-mapped virtual
+interfaces consist of host/guest shared memory with guest-push model instead of
+a more common host-pull model for typical QEMU device model implementation.
+Guest uses shared memory as FIFOs for drawing commands and data to bulk up the
+operations until serialization event that flushes the FIFOs into host. This
+achieves extremely good performance since virtual CPUs are fast with hardware
+acceleration (Intel VT/AMD-V) and reduces the overhead of frequent VMEXITs to
+service the device emulation. Both implementations work on Windows 10 with WHPX
+and HAXM accelerators as well as KVM in Linux.
+
+On Windows 10, QEMU WHPX implementation does not sync MSR_IA32_PAT during
+host/guest states sync. There is no visibility into the closed-source WHPX on
+how things are managed behind the scene, but from measuring performance figures
+I can conclude that it didn't handle the MSR_IA32_PAT correctly for both Intel
+and AMD. Call this fair enough, if you will, it didn't flag any concerns, in
+fact games such as Quake2 and Quake3 were still within playable frame rate of
+40~60FPS on Win2k/XP guest. Until the same games were run on Win98/ME guest and
+the frame rate blew off the roof (300~500FPS) on the same CPU and GPU. In fact,
+the later seemed to be more inlined with runnng the games bare-metal with vsync
+off.
+
+On Linux (at the time of writing kernel 5.6.7/Mesa 20.0), the difference
+prevailed. Intel CPUs (and it so happened that I was on laptop with Intel GPU),
+the VMX-based kvm_intel got it right while SVM-based kvm_amd did not.
+To put this in simple exaggeration, an aging Core i3-4010U/HD Graphics 4400
+(Haswell GT2) exhibited an insane performance in Quake2/Quake3 timedemos that
+totally crushed more recent AMD Ryzen 2500U APU/Vega 8 Graphics and AMD
+FX8300/NVIDIA GT730 on desktop. Simply unbelievable!
+
+It turned out that there was something to do with AMD-V NPT. By loading kvm_amd
+with npt=0, AMD Ryzen APU and FX8300 regained a huge performance leap. However,
+AMD NPT issue with KVM was supposedly fixed in 2017 kernel commits. NPT=0 would
+actually incur performance loss for VM due to intervention required by
+hypervisors to maintain the shadow page tables.  Finally, I was able to find the
+pointer that pointed to MSR_IA32_PAT register. By updating the MSR_IA32_PAT to
+0x0606xxxx0606xxxxULL, AMD CPUs now regain their rightful performance without
+taking the hit of NPT=0 for Linux KVM. Taking the same solution into Windows,
+both Intel and AMD CPUs no longer require Win98/ME guest to unleash the full
+performance potentials and performance figures based on games measured on WHPX
+were not very far behind Linux KVM.
+
+So I guess the problem lies in host/guest shared memory regions mapped as
+uncacheable from virtual CPU perspective. As virtual CPUs now completely execute
+in hardware context with x86 hardware virtualiztion extensions, the cacheability
+of memory types would severely impact the performance on guests. WHPX didn't
+handle it for both Intel EPT and AMD NPT, but KVM seems to do it right for Intel
+EPT. I don't have the correct fix for QEMU. But what I can do for my 3D APIs
+pass-through device models is to implement host-side hooks to reprogram and
+restore MSR_IA32_PAT upon activation/deactivation of the 3D APIs. Perhaps there
+is also a better solution of having the proper kernel drivers for virtual
+interfaces to manage the memory types of host/guest shared memory in kernel
+space, but to do that and the needs of Microsoft tools/DDKs, I will just forget
+it. The guest stubs uses the same kernel drivers included in 3Dfx drivers for
+memory mapping and the virtual interfaces remain driver-less from Windows OS
+perspective. Considering the current state of halting progress for QEMU native
+virgil3D to support Windows OS, I am just being pragmatic. I understand that
+QEMU virgil3D will eventually bring 3D acceleration for Windows guests, but I do
+not expect anything to support legacy 32-bit Windows OSes which have out-grown
+their commercial usefulness.
+
+Regards,
+KJ Liew
+
diff --git a/results/classifier/zero-shot/012/debug/64571620 b/results/classifier/zero-shot/012/debug/64571620
new file mode 100644
index 00000000..534d8903
--- /dev/null
+++ b/results/classifier/zero-shot/012/debug/64571620
@@ -0,0 +1,803 @@
+debug: 0.927
+other: 0.922
+mistranslation: 0.917
+assembly: 0.913
+arm: 0.910
+risc-v: 0.909
+semantic: 0.903
+permissions: 0.902
+device: 0.899
+performance: 0.897
+graphic: 0.897
+architecture: 0.895
+PID: 0.887
+boot: 0.879
+register: 0.871
+files: 0.855
+socket: 0.855
+network: 0.853
+kernel virtual machine: 0.852
+TCG: 0.843
+vnc: 0.819
+x86: 0.794
+
+[BUG] Migration hv_time rollback
+
+Hi,
+
+We are experiencing timestamp rollbacks during live-migration of
+Windows 10 guests with the following qemu configuration (linux 5.4.46
+and qemu master):
+```
+$ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+```
+
+I have tracked the bug to the fact that `kvmclock` is not exposed and
+disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+
+I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+present and add Hyper-V support for the `kvmclock_current_nsec`
+function.
+
+I'm asking for advice because I am unsure this is the _right_ approach
+and how to keep migration compatibility between qemu versions.
+
+Thank you all,
+
+-- 
+Antoine 'xdbob' Damhet
+signature.asc
+Description:
+PGP signature
+
+cc'ing in Vitaly who knows about the hv stuff.
+
+* Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+Hi,
+>
+>
+We are experiencing timestamp rollbacks during live-migration of
+>
+Windows 10 guests with the following qemu configuration (linux 5.4.46
+>
+and qemu master):
+>
+```
+>
+$ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+```
+How big a jump are you seeing, and how did you notice it in the guest?
+
+Dave
+
+>
+I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>
+I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+function.
+>
+>
+I'm asking for advice because I am unsure this is the _right_ approach
+>
+and how to keep migration compatibility between qemu versions.
+>
+>
+Thank you all,
+>
+>
+--
+>
+Antoine 'xdbob' Damhet
+-- 
+Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
+
+"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
+
+>
+cc'ing in Vitaly who knows about the hv stuff.
+>
+cc'ing Marcelo who knows about clocksources :-)
+
+>
+* Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+> Hi,
+>
+>
+>
+> We are experiencing timestamp rollbacks during live-migration of
+>
+> Windows 10 guests
+Are you migrating to the same hardware (with the same TSC frequency)? Is
+TSC used as the clocksource on the host?
+
+>
+>  with the following qemu configuration (linux 5.4.46
+>
+> and qemu master):
+>
+> ```
+>
+> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+> ```
+Out of pure curiosity, what's the purpose of doing 'kvm=off'? Windows is
+not going to check for KVM identification anyway so we pretend we're
+Hyper-V. 
+
+Also, have you tried adding more Hyper-V enlightenments? 
+
+>
+>
+How big a jump are you seeing, and how did you notice it in the guest?
+>
+>
+Dave
+>
+>
+> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>
+>
+> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+> function.
+AFAICT kvmclock_current_nsec() checks whether kvmclock was enabled by
+the guest:
+
+   if (!(env->system_time_msr & 1ULL)) {
+        /* KVM clock not active */
+        return 0;
+    }
+
+and this is (and way) always false for Windows guests.
+
+>
+>
+>
+> I'm asking for advice because I am unsure this is the _right_ approach
+>
+> and how to keep migration compatibility between qemu versions.
+>
+>
+>
+> Thank you all,
+>
+>
+>
+> --
+>
+> Antoine 'xdbob' Damhet
+-- 
+Vitaly
+
+On Wed, Sep 16, 2020 at 01:59:43PM +0200, Vitaly Kuznetsov wrote:
+>
+"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
+>
+>
+> cc'ing in Vitaly who knows about the hv stuff.
+>
+>
+>
+>
+cc'ing Marcelo who knows about clocksources :-)
+>
+>
+> * Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+>> Hi,
+>
+>>
+>
+>> We are experiencing timestamp rollbacks during live-migration of
+>
+>> Windows 10 guests
+>
+>
+Are you migrating to the same hardware (with the same TSC frequency)? Is
+>
+TSC used as the clocksource on the host?
+Yes we are migrating to the exact same hardware. And yes TSC is used as
+a clocksource in the host (but the bug is still happening with `hpet` as
+a clocksource).
+
+>
+>
+>>  with the following qemu configuration (linux 5.4.46
+>
+>> and qemu master):
+>
+>> ```
+>
+>> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+>> ```
+>
+>
+Out of pure curiosity, what's the purpose of doing 'kvm=off'? Windows is
+>
+not going to check for KVM identification anyway so we pretend we're
+>
+Hyper-V.
+Some softwares explicitly checks for the presence of KVM and then crash
+if they find it in CPUID :/
+
+>
+>
+Also, have you tried adding more Hyper-V enlightenments?
+Yes, I published a stripped-down command-line for a minimal reproducer
+but even `hv-frequencies` and `hv-reenlightenment` don't help.
+
+>
+>
+>
+>
+> How big a jump are you seeing, and how did you notice it in the guest?
+>
+>
+>
+> Dave
+>
+>
+>
+>> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+>> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>>
+>
+>> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+>> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+>> function.
+>
+>
+AFAICT kvmclock_current_nsec() checks whether kvmclock was enabled by
+>
+the guest:
+>
+>
+if (!(env->system_time_msr & 1ULL)) {
+>
+/* KVM clock not active */
+>
+return 0;
+>
+}
+>
+>
+and this is (and way) always false for Windows guests.
+Hooo, I missed this piece. When is `clock_is_reliable` expected to be
+false ? Because if it is I still think we should be able to query at
+least `HV_X64_MSR_REFERENCE_TSC`
+
+>
+>
+>>
+>
+>> I'm asking for advice because I am unsure this is the _right_ approach
+>
+>> and how to keep migration compatibility between qemu versions.
+>
+>>
+>
+>> Thank you all,
+>
+>>
+>
+>> --
+>
+>> Antoine 'xdbob' Damhet
+>
+>
+--
+>
+Vitaly
+>
+-- 
+Antoine 'xdbob' Damhet
+signature.asc
+Description:
+PGP signature
+
+On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote:
+>
+cc'ing in Vitaly who knows about the hv stuff.
+Thanks
+
+>
+>
+* Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+> Hi,
+>
+>
+>
+> We are experiencing timestamp rollbacks during live-migration of
+>
+> Windows 10 guests with the following qemu configuration (linux 5.4.46
+>
+> and qemu master):
+>
+> ```
+>
+> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+> ```
+>
+>
+How big a jump are you seeing, and how did you notice it in the guest?
+I'm seeing jumps of about the guest uptime (indicating a reset of the
+counter). It's expected because we won't call `KVM_SET_CLOCK` to
+restore any value.
+
+We first noticed it because after some migrations `dwm.exe` crashes with
+the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in
+the past." error code.
+
+I can also confirm the following hack makes the behavior disappear:
+
+```
+diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+index 64283358f9..f334bdf35f 100644
+--- a/hw/i386/kvm/clock.c
++++ b/hw/i386/kvm/clock.c
+@@ -332,11 +332,7 @@ void kvmclock_create(void)
+ {
+     X86CPU *cpu = X86_CPU(first_cpu);
+
+-    if (kvm_enabled() &&
+-        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+-                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
+-        sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+-    }
++    sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+ }
+
+ static void kvmclock_register_types(void)
+diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
+index 32b1453e6a..11d980ba85 100644
+--- a/hw/i386/pc_piix.c
++++ b/hw/i386/pc_piix.c
+@@ -158,9 +158,7 @@ static void pc_init1(MachineState *machine,
+
+     x86_cpus_init(x86ms, pcmc->default_cpu_version);
+
+-    if (kvm_enabled() && pcmc->kvmclock_enabled) {
+-        kvmclock_create();
+-    }
++    kvmclock_create();
+
+     if (pcmc->pci_enabled) {
+         pci_memory = g_new(MemoryRegion, 1);
+```
+
+>
+>
+Dave
+>
+>
+> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>
+>
+> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+> function.
+>
+>
+>
+> I'm asking for advice because I am unsure this is the _right_ approach
+>
+> and how to keep migration compatibility between qemu versions.
+>
+>
+>
+> Thank you all,
+>
+>
+>
+> --
+>
+> Antoine 'xdbob' Damhet
+>
+>
+>
+--
+>
+Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
+>
+-- 
+Antoine 'xdbob' Damhet
+signature.asc
+Description:
+PGP signature
+
+Antoine Damhet <antoine.damhet@blade-group.com> writes:
+
+>
+On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote:
+>
+> cc'ing in Vitaly who knows about the hv stuff.
+>
+>
+Thanks
+>
+>
+>
+>
+> * Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+> > Hi,
+>
+> >
+>
+> > We are experiencing timestamp rollbacks during live-migration of
+>
+> > Windows 10 guests with the following qemu configuration (linux 5.4.46
+>
+> > and qemu master):
+>
+> > ```
+>
+> > $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+> > ```
+>
+>
+>
+> How big a jump are you seeing, and how did you notice it in the guest?
+>
+>
+I'm seeing jumps of about the guest uptime (indicating a reset of the
+>
+counter). It's expected because we won't call `KVM_SET_CLOCK` to
+>
+restore any value.
+>
+>
+We first noticed it because after some migrations `dwm.exe` crashes with
+>
+the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in
+>
+the past." error code.
+>
+>
+I can also confirm the following hack makes the behavior disappear:
+>
+>
+```
+>
+diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+>
+index 64283358f9..f334bdf35f 100644
+>
+--- a/hw/i386/kvm/clock.c
+>
++++ b/hw/i386/kvm/clock.c
+>
+@@ -332,11 +332,7 @@ void kvmclock_create(void)
+>
+{
+>
+X86CPU *cpu = X86_CPU(first_cpu);
+>
+>
+-    if (kvm_enabled() &&
+>
+-        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+>
+-                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
+>
+-        sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+-    }
+>
++    sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+}
+>
+Oh, I think I see what's going on. When you add 'kvm=off'
+cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so
+kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on
+migration.
+
+In case we really want to support 'kvm=off' I think we can add Hyper-V
+features check here along with KVM, this should do the job.
+
+-- 
+Vitaly
+
+Vitaly Kuznetsov <vkuznets@redhat.com> writes:
+
+>
+Antoine Damhet <antoine.damhet@blade-group.com> writes:
+>
+>
+> On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote:
+>
+>> cc'ing in Vitaly who knows about the hv stuff.
+>
+>
+>
+> Thanks
+>
+>
+>
+>>
+>
+>> * Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+>> > Hi,
+>
+>> >
+>
+>> > We are experiencing timestamp rollbacks during live-migration of
+>
+>> > Windows 10 guests with the following qemu configuration (linux 5.4.46
+>
+>> > and qemu master):
+>
+>> > ```
+>
+>> > $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+>> > ```
+>
+>>
+>
+>> How big a jump are you seeing, and how did you notice it in the guest?
+>
+>
+>
+> I'm seeing jumps of about the guest uptime (indicating a reset of the
+>
+> counter). It's expected because we won't call `KVM_SET_CLOCK` to
+>
+> restore any value.
+>
+>
+>
+> We first noticed it because after some migrations `dwm.exe` crashes with
+>
+> the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in
+>
+> the past." error code.
+>
+>
+>
+> I can also confirm the following hack makes the behavior disappear:
+>
+>
+>
+> ```
+>
+> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+>
+> index 64283358f9..f334bdf35f 100644
+>
+> --- a/hw/i386/kvm/clock.c
+>
+> +++ b/hw/i386/kvm/clock.c
+>
+> @@ -332,11 +332,7 @@ void kvmclock_create(void)
+>
+>  {
+>
+>      X86CPU *cpu = X86_CPU(first_cpu);
+>
+>
+>
+> -    if (kvm_enabled() &&
+>
+> -        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+>
+> -                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2)))
+>
+> {
+>
+> -        sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+> -    }
+>
+> +    sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+>  }
+>
+>
+>
+>
+>
+Oh, I think I see what's going on. When you add 'kvm=off'
+>
+cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so
+>
+kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on
+>
+migration.
+>
+>
+In case we really want to support 'kvm=off' I think we can add Hyper-V
+>
+features check here along with KVM, this should do the job.
+Does the untested
+
+diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+index 64283358f91d..e03b2ca6d8f6 100644
+--- a/hw/i386/kvm/clock.c
++++ b/hw/i386/kvm/clock.c
+@@ -333,8 +333,9 @@ void kvmclock_create(void)
+     X86CPU *cpu = X86_CPU(first_cpu);
+ 
+     if (kvm_enabled() &&
+-        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+-                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
++        ((cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
++                                         (1ULL << KVM_FEATURE_CLOCKSOURCE2))) 
+||
++         (cpu->env.features[FEAT_HYPERV_EAX] & HV_TIME_REF_COUNT_AVAILABLE))) {
+         sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+     }
+ }
+
+help?
+
+(I don't think we need to remove all 'if (kvm_enabled())' checks from
+machine types as 'kvm=off' should not be related).
+
+-- 
+Vitaly
+
+On Wed, Sep 16, 2020 at 02:50:56PM +0200, Vitaly Kuznetsov wrote:
+[...]
+
+>
+>>
+>
+>
+>
+>
+>
+> Oh, I think I see what's going on. When you add 'kvm=off'
+>
+> cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so
+>
+> kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on
+>
+> migration.
+>
+>
+>
+> In case we really want to support 'kvm=off' I think we can add Hyper-V
+>
+> features check here along with KVM, this should do the job.
+>
+>
+Does the untested
+>
+>
+diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+>
+index 64283358f91d..e03b2ca6d8f6 100644
+>
+--- a/hw/i386/kvm/clock.c
+>
++++ b/hw/i386/kvm/clock.c
+>
+@@ -333,8 +333,9 @@ void kvmclock_create(void)
+>
+X86CPU *cpu = X86_CPU(first_cpu);
+>
+>
+if (kvm_enabled() &&
+>
+-        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+>
+-                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
+>
++        ((cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+>
++                                         (1ULL <<
+>
+KVM_FEATURE_CLOCKSOURCE2))) ||
+>
++         (cpu->env.features[FEAT_HYPERV_EAX] &
+>
+HV_TIME_REF_COUNT_AVAILABLE))) {
+>
+sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+}
+>
+}
+>
+>
+help?
+It appears to work :)
+
+>
+>
+(I don't think we need to remove all 'if (kvm_enabled())' checks from
+>
+machine types as 'kvm=off' should not be related).
+Indeed (I didn't look at the macro, it was just quick & dirty).
+
+>
+>
+--
+>
+Vitaly
+>
+>
+-- 
+Antoine 'xdbob' Damhet
+signature.asc
+Description:
+PGP signature
+
+On 16/09/20 13:29, Dr. David Alan Gilbert wrote:
+>
+> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>
+>
+> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+> function.
+Yes, this seems correct.  I would have to check but it may even be
+better to _always_ send kvmclock data in the live migration stream.
+
+Paolo
+
+Paolo Bonzini <pbonzini@redhat.com> writes:
+
+>
+On 16/09/20 13:29, Dr. David Alan Gilbert wrote:
+>
+>> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+>> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>>
+>
+>> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+>> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+>> function.
+>
+>
+Yes, this seems correct.  I would have to check but it may even be
+>
+better to _always_ send kvmclock data in the live migration stream.
+>
+The question I have is: with 'kvm=off', do we actually restore TSC
+reading on migration? (and I guess the answer is 'no' or Hyper-V TSC
+page would 'just work' I guess). So yea, maybe dropping the
+'cpu->env.features[FEAT_KVM]' check is the right fix.
+
+-- 
+Vitaly
+