summary refs log tree commit diff stats
path: root/classification_output/04/mistranslation
diff options
context:
space:
mode:
authorChristian Krinitsin <mail@krinitsin.com>2025-06-01 21:35:14 +0200
committerChristian Krinitsin <mail@krinitsin.com>2025-06-01 21:35:14 +0200
commit3e4c5a6261770bced301b5e74233e7866166ea5b (patch)
tree9379fddaba693ef8a045da06efee8529baa5f6f4 /classification_output/04/mistranslation
parente5634e2806195bee44407853c4bf8776f7abfa4f (diff)
downloadqemu-analysis-3e4c5a6261770bced301b5e74233e7866166ea5b.tar.gz
qemu-analysis-3e4c5a6261770bced301b5e74233e7866166ea5b.zip
clean up repository
Diffstat (limited to 'classification_output/04/mistranslation')
-rw-r--r--classification_output/04/mistranslation/14887122266
-rw-r--r--classification_output/04/mistranslation/23270873700
-rw-r--r--classification_output/04/mistranslation/25842545210
-rw-r--r--classification_output/04/mistranslation/365680444589
-rw-r--r--classification_output/04/mistranslation/6432299562
-rw-r--r--classification_output/04/mistranslation/702942551069
-rw-r--r--classification_output/04/mistranslation/744669631886
-rw-r--r--classification_output/04/mistranslation/74545755352
-rw-r--r--classification_output/04/mistranslation/806043141488
9 files changed, 0 insertions, 10622 deletions
diff --git a/classification_output/04/mistranslation/14887122 b/classification_output/04/mistranslation/14887122
deleted file mode 100644
index 1a87937b0..000000000
--- a/classification_output/04/mistranslation/14887122
+++ /dev/null
@@ -1,266 +0,0 @@
-mistranslation: 0.930
-semantic: 0.928
-device: 0.919
-assembly: 0.918
-socket: 0.914
-graphic: 0.910
-instruction: 0.905
-other: 0.890
-vnc: 0.871
-network: 0.855
-boot: 0.831
-KVM: 0.814
-
-[BUG][RFC] CPR transfer Issues: Socket permissions and PID files
-
-Hello,
-
-While testing CPR transfer I encountered two issues. The first is that the 
-transfer fails when running with pidfiles due to the destination qemu process 
-attempting to create the pidfile while it is still locked by the source 
-process. The second is that the transfer fails when running with the -run-with 
-user=$USERID parameter. This is because the destination qemu process creates 
-the UNIX sockets used for the CPR transfer before dropping to the lower 
-permissioned user, which causes them to be owned by the original user. The 
-source qemu process then does not have permission to connect to it because it 
-is already running as the lesser permissioned user.
-
-Reproducing the first issue:
-
-Create a source and destination qemu instance associated with the same VM where 
-both processes have the -pidfile parameter passed on the command line. You 
-should see the following error on the command line of the second process:
-
-qemu-system-x86_64: cannot create PID file: Cannot lock pid file: Resource 
-temporarily unavailable
-
-Reproducing the second issue:
-
-Create a source and destination qemu instance associated with the same VM where 
-both processes have -run-with user=$USERID passed on the command line, where 
-$USERID is a different user from the one launching the processes. Then attempt 
-a CPR transfer using UNIX sockets for the main and cpr sockets. You should 
-receive the following error via QMP:
-{"error": {"class": "GenericError", "desc": "Failed to connect to 'cpr.sock': 
-Permission denied"}}
-
-I provided a minimal patch that works around the second issue.
-
-Thank you,
-Ben Chaney
-
----
-include/system/os-posix.h | 4 ++++
-os-posix.c | 8 --------
-util/qemu-sockets.c | 21 +++++++++++++++++++++
-3 files changed, 25 insertions(+), 8 deletions(-)
-
-diff --git a/include/system/os-posix.h b/include/system/os-posix.h
-index ce5b3bccf8..2a414a914a 100644
---- a/include/system/os-posix.h
-+++ b/include/system/os-posix.h
-@@ -55,6 +55,10 @@ void os_setup_limits(void);
-void os_setup_post(void);
-int os_mlock(bool on_fault);
-
-+extern struct passwd *user_pwd;
-+extern uid_t user_uid;
-+extern gid_t user_gid;
-+
-/**
-* qemu_alloc_stack:
-* @sz: pointer to a size_t holding the requested usable stack size
-diff --git a/os-posix.c b/os-posix.c
-index 52925c23d3..9369b312a0 100644
---- a/os-posix.c
-+++ b/os-posix.c
-@@ -86,14 +86,6 @@ void os_set_proc_name(const char *s)
-}
-
-
--/*
-- * Must set all three of these at once.
-- * Legal combinations are unset by name by uid
-- */
--static struct passwd *user_pwd; /* NULL non-NULL NULL */
--static uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */
--static gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */
--
-/*
-* Prepare to change user ID. user_id can be one of 3 forms:
-* - a username, in which case user ID will be changed to its uid,
-diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
-index 77477c1cd5..987977ead9 100644
---- a/util/qemu-sockets.c
-+++ b/util/qemu-sockets.c
-@@ -871,6 +871,14 @@ static bool saddr_is_tight(UnixSocketAddress *saddr)
-#endif
-}
-
-+/*
-+ * Must set all three of these at once.
-+ * Legal combinations are unset by name by uid
-+ */
-+struct passwd *user_pwd; /* NULL non-NULL NULL */
-+uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */
-+gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */
-+
-static int unix_listen_saddr(UnixSocketAddress *saddr,
-int num,
-Error **errp)
-@@ -947,6 +955,19 @@ static int unix_listen_saddr(UnixSocketAddress *saddr,
-error_setg_errno(errp, errno, "Failed to bind socket to %s", path);
-goto err;
-}
-+ if (user_pwd) {
-+ if (chown(un.sun_path, user_pwd->pw_uid, user_pwd->pw_gid) < 0) {
-+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s", 
-path);
-+ goto err;
-+ }
-+ }
-+ else if (user_uid != -1 && user_gid != -1) {
-+ if (chown(un.sun_path, user_uid, user_gid) < 0) {
-+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s", 
-path);
-+ goto err;
-+ }
-+ }
-+
-if (listen(sock, num) < 0) {
-error_setg_errno(errp, errno, "Failed to listen on socket");
-goto err;
---
-2.40.1
-
-Thank you Ben.  I appreciate you testing CPR and shaking out the bugs.
-I will study these and propose patches.
-
-My initial reaction to the pidfile issue is that the orchestration layer must
-pass a different filename when starting the destination qemu instance.  When
-using live update without containers, these types of resource conflicts in the
-global namespaces are a known issue.
-
-- Steve
-
-On 3/14/2025 2:33 PM, Chaney, Ben wrote:
-Hello,
-
-While testing CPR transfer I encountered two issues. The first is that the 
-transfer fails when running with pidfiles due to the destination qemu process 
-attempting to create the pidfile while it is still locked by the source 
-process. The second is that the transfer fails when running with the -run-with 
-user=$USERID parameter. This is because the destination qemu process creates 
-the UNIX sockets used for the CPR transfer before dropping to the lower 
-permissioned user, which causes them to be owned by the original user. The 
-source qemu process then does not have permission to connect to it because it 
-is already running as the lesser permissioned user.
-
-Reproducing the first issue:
-
-Create a source and destination qemu instance associated with the same VM where 
-both processes have the -pidfile parameter passed on the command line. You 
-should see the following error on the command line of the second process:
-
-qemu-system-x86_64: cannot create PID file: Cannot lock pid file: Resource 
-temporarily unavailable
-
-Reproducing the second issue:
-
-Create a source and destination qemu instance associated with the same VM where 
-both processes have -run-with user=$USERID passed on the command line, where 
-$USERID is a different user from the one launching the processes. Then attempt 
-a CPR transfer using UNIX sockets for the main and cpr sockets. You should 
-receive the following error via QMP:
-{"error": {"class": "GenericError", "desc": "Failed to connect to 'cpr.sock': 
-Permission denied"}}
-
-I provided a minimal patch that works around the second issue.
-
-Thank you,
-Ben Chaney
-
----
-include/system/os-posix.h | 4 ++++
-os-posix.c | 8 --------
-util/qemu-sockets.c | 21 +++++++++++++++++++++
-3 files changed, 25 insertions(+), 8 deletions(-)
-
-diff --git a/include/system/os-posix.h b/include/system/os-posix.h
-index ce5b3bccf8..2a414a914a 100644
---- a/include/system/os-posix.h
-+++ b/include/system/os-posix.h
-@@ -55,6 +55,10 @@ void os_setup_limits(void);
-void os_setup_post(void);
-int os_mlock(bool on_fault);
-
-+extern struct passwd *user_pwd;
-+extern uid_t user_uid;
-+extern gid_t user_gid;
-+
-/**
-* qemu_alloc_stack:
-* @sz: pointer to a size_t holding the requested usable stack size
-diff --git a/os-posix.c b/os-posix.c
-index 52925c23d3..9369b312a0 100644
---- a/os-posix.c
-+++ b/os-posix.c
-@@ -86,14 +86,6 @@ void os_set_proc_name(const char *s)
-}
-
-
--/*
-- * Must set all three of these at once.
-- * Legal combinations are unset by name by uid
-- */
--static struct passwd *user_pwd; /* NULL non-NULL NULL */
--static uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */
--static gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */
--
-/*
-* Prepare to change user ID. user_id can be one of 3 forms:
-* - a username, in which case user ID will be changed to its uid,
-diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
-index 77477c1cd5..987977ead9 100644
---- a/util/qemu-sockets.c
-+++ b/util/qemu-sockets.c
-@@ -871,6 +871,14 @@ static bool saddr_is_tight(UnixSocketAddress *saddr)
-#endif
-}
-
-+/*
-+ * Must set all three of these at once.
-+ * Legal combinations are unset by name by uid
-+ */
-+struct passwd *user_pwd; /* NULL non-NULL NULL */
-+uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */
-+gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */
-+
-static int unix_listen_saddr(UnixSocketAddress *saddr,
-int num,
-Error **errp)
-@@ -947,6 +955,19 @@ static int unix_listen_saddr(UnixSocketAddress *saddr,
-error_setg_errno(errp, errno, "Failed to bind socket to %s", path);
-goto err;
-}
-+ if (user_pwd) {
-+ if (chown(un.sun_path, user_pwd->pw_uid, user_pwd->pw_gid) < 0) {
-+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s", 
-path);
-+ goto err;
-+ }
-+ }
-+ else if (user_uid != -1 && user_gid != -1) {
-+ if (chown(un.sun_path, user_uid, user_gid) < 0) {
-+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s", 
-path);
-+ goto err;
-+ }
-+ }
-+
-if (listen(sock, num) < 0) {
-error_setg_errno(errp, errno, "Failed to listen on socket");
-goto err;
---
-2.40.1
-
diff --git a/classification_output/04/mistranslation/23270873 b/classification_output/04/mistranslation/23270873
deleted file mode 100644
index 4d8b927f0..000000000
--- a/classification_output/04/mistranslation/23270873
+++ /dev/null
@@ -1,700 +0,0 @@
-mistranslation: 0.881
-other: 0.839
-boot: 0.830
-vnc: 0.820
-device: 0.810
-KVM: 0.803
-assembly: 0.768
-network: 0.768
-graphic: 0.763
-socket: 0.758
-instruction: 0.755
-semantic: 0.752
-
-[Qemu-devel] [BUG?] aio_get_linux_aio: Assertion `ctx->linux_aio' failed
-
-Hi,
-
-I am seeing some strange QEMU assertion failures for qemu on s390x,
-which prevents a guest from starting.
-
-Git bisecting points to the following commit as the source of the error.
-
-commit ed6e2161715c527330f936d44af4c547f25f687e
-Author: Nishanth Aravamudan <address@hidden>
-Date:   Fri Jun 22 12:37:00 2018 -0700
-
-    linux-aio: properly bubble up errors from initialization
-
-    laio_init() can fail for a couple of reasons, which will lead to a NULL
-    pointer dereference in laio_attach_aio_context().
-
-    To solve this, add a aio_setup_linux_aio() function which is called
-    early in raw_open_common. If this fails, propagate the error up. The
-    signature of aio_get_linux_aio() was not modified, because it seems
-    preferable to return the actual errno from the possible failing
-    initialization calls.
-
-    Additionally, when the AioContext changes, we need to associate a
-    LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
-    callback and call the new aio_setup_linux_aio(), which will allocate a
-new AioContext if needed, and return errors on failures. If it
-fails for
-any reason, fallback to threaded AIO with an error message, as the
-    device is already in-use by the guest.
-
-    Add an assert that aio_get_linux_aio() cannot return NULL.
-
-    Signed-off-by: Nishanth Aravamudan <address@hidden>
-    Message-id: address@hidden
-    Signed-off-by: Stefan Hajnoczi <address@hidden>
-Not sure what is causing this assertion to fail. Here is the qemu
-command line of the guest, from qemu log, which throws this error:
-LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
-QEMU_AUDIO_DRV=none /usr/local/bin/qemu-system-s390x -name
-guest=rt_vm1,debug-threads=on -S -object
-secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-21-rt_vm1/master-key.aes
--machine s390-ccw-virtio-2.12,accel=kvm,usb=off,dump-guest-core=off -m
-1024 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -object
-iothread,id=iothread1 -uuid 0cde16cd-091d-41bd-9ac2-5243df5c9a0d
--display none -no-user-config -nodefaults -chardev
-socket,id=charmonitor,fd=28,server,nowait -mon
-chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
--boot strict=on -drive
-file=/dev/mapper/360050763998b0883980000002a000031,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native
--device
-virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0001,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on
--netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device
-virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:3a:c8:67:95:84,devno=fe.0.0000
--netdev tap,fd=32,id=hostnet1,vhost=on,vhostfd=33 -device
-virtio-net-ccw,netdev=hostnet1,id=net1,mac=52:54:00:2a:e5:08,devno=fe.0.0002
--chardev pty,id=charconsole0 -device
-sclpconsole,chardev=charconsole0,id=console0 -device
-virtio-balloon-ccw,id=balloon0,devno=fe.3.ffba -sandbox
-on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
--msg timestamp=on
-2018-07-17 15:48:42.252+0000: Domain id=21 is tainted: high-privileges
-2018-07-17T15:48:42.279380Z qemu-system-s390x: -chardev
-pty,id=charconsole0: char device redirected to /dev/pts/3 (label
-charconsole0)
-qemu-system-s390x: util/async.c:339: aio_get_linux_aio: Assertion
-`ctx->linux_aio' failed.
-2018-07-17 15:48:43.309+0000: shutting down, reason=failed
-
-
-Any help debugging this would be greatly appreciated.
-
-Thank you
-Farhan
-
-On 17.07.2018 [13:25:53 -0400], Farhan Ali wrote:
->
-Hi,
->
->
-I am seeing some strange QEMU assertion failures for qemu on s390x,
->
-which prevents a guest from starting.
->
->
-Git bisecting points to the following commit as the source of the error.
->
->
-commit ed6e2161715c527330f936d44af4c547f25f687e
->
-Author: Nishanth Aravamudan <address@hidden>
->
-Date:   Fri Jun 22 12:37:00 2018 -0700
->
->
-linux-aio: properly bubble up errors from initialization
->
->
-laio_init() can fail for a couple of reasons, which will lead to a NULL
->
-pointer dereference in laio_attach_aio_context().
->
->
-To solve this, add a aio_setup_linux_aio() function which is called
->
-early in raw_open_common. If this fails, propagate the error up. The
->
-signature of aio_get_linux_aio() was not modified, because it seems
->
-preferable to return the actual errno from the possible failing
->
-initialization calls.
->
->
-Additionally, when the AioContext changes, we need to associate a
->
-LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
->
-callback and call the new aio_setup_linux_aio(), which will allocate a
->
-new AioContext if needed, and return errors on failures. If it fails for
->
-any reason, fallback to threaded AIO with an error message, as the
->
-device is already in-use by the guest.
->
->
-Add an assert that aio_get_linux_aio() cannot return NULL.
->
->
-Signed-off-by: Nishanth Aravamudan <address@hidden>
->
-Message-id: address@hidden
->
-Signed-off-by: Stefan Hajnoczi <address@hidden>
->
->
->
-Not sure what is causing this assertion to fail. Here is the qemu command
->
-line of the guest, from qemu log, which throws this error:
->
->
->
-LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
->
-QEMU_AUDIO_DRV=none /usr/local/bin/qemu-system-s390x -name
->
-guest=rt_vm1,debug-threads=on -S -object
->
-secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-21-rt_vm1/master-key.aes
->
--machine s390-ccw-virtio-2.12,accel=kvm,usb=off,dump-guest-core=off -m 1024
->
--realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -object
->
-iothread,id=iothread1 -uuid 0cde16cd-091d-41bd-9ac2-5243df5c9a0d -display
->
-none -no-user-config -nodefaults -chardev
->
-socket,id=charmonitor,fd=28,server,nowait -mon
->
-chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot
->
-strict=on -drive
->
-file=/dev/mapper/360050763998b0883980000002a000031,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native
->
--device
->
-virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0001,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on
->
--netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device
->
-virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:3a:c8:67:95:84,devno=fe.0.0000
->
--netdev tap,fd=32,id=hostnet1,vhost=on,vhostfd=33 -device
->
-virtio-net-ccw,netdev=hostnet1,id=net1,mac=52:54:00:2a:e5:08,devno=fe.0.0002
->
--chardev pty,id=charconsole0 -device
->
-sclpconsole,chardev=charconsole0,id=console0 -device
->
-virtio-balloon-ccw,id=balloon0,devno=fe.3.ffba -sandbox
->
-on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg
->
-timestamp=on
->
->
->
->
-2018-07-17 15:48:42.252+0000: Domain id=21 is tainted: high-privileges
->
-2018-07-17T15:48:42.279380Z qemu-system-s390x: -chardev pty,id=charconsole0:
->
-char device redirected to /dev/pts/3 (label charconsole0)
->
-qemu-system-s390x: util/async.c:339: aio_get_linux_aio: Assertion
->
-`ctx->linux_aio' failed.
->
-2018-07-17 15:48:43.309+0000: shutting down, reason=failed
->
->
->
-Any help debugging this would be greatly appreciated.
-iiuc, this possibly implies AIO was not actually used previously on this
-guest (it might have silently been falling back to threaded IO?). I
-don't have access to s390x, but would it be possible to run qemu under
-gdb and see if aio_setup_linux_aio is being called at all (I think it
-might not be, but I'm not sure why), and if so, if it's for the context
-in question?
-
-If it's not being called first, could you see what callpath is calling
-aio_get_linux_aio when this assertion trips?
-
-Thanks!
--Nish
-
-On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote:
-iiuc, this possibly implies AIO was not actually used previously on this
-guest (it might have silently been falling back to threaded IO?). I
-don't have access to s390x, but would it be possible to run qemu under
-gdb and see if aio_setup_linux_aio is being called at all (I think it
-might not be, but I'm not sure why), and if so, if it's for the context
-in question?
-
-If it's not being called first, could you see what callpath is calling
-aio_get_linux_aio when this assertion trips?
-
-Thanks!
--Nish
-Hi Nishant,
-From the coredump of the guest this is the call trace that calls
-aio_get_linux_aio:
-Stack trace of thread 145158:
-#0  0x000003ff94dbe274 raise (libc.so.6)
-#1  0x000003ff94da39a8 abort (libc.so.6)
-#2  0x000003ff94db62ce __assert_fail_base (libc.so.6)
-#3  0x000003ff94db634c __assert_fail (libc.so.6)
-#4  0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x)
-#5  0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x)
-#6  0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x)
-#7  0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x)
-#8  0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x)
-#9  0x000002aa20db3c34 aio_poll (qemu-system-s390x)
-#10 0x000002aa20be32a2 iothread_run (qemu-system-s390x)
-#11 0x000003ff94f879a8 start_thread (libpthread.so.0)
-#12 0x000003ff94e797ee thread_start (libc.so.6)
-
-
-Thanks for taking a look and responding.
-
-Thanks
-Farhan
-
-On 07/18/2018 09:42 AM, Farhan Ali wrote:
-On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote:
-iiuc, this possibly implies AIO was not actually used previously on this
-guest (it might have silently been falling back to threaded IO?). I
-don't have access to s390x, but would it be possible to run qemu under
-gdb and see if aio_setup_linux_aio is being called at all (I think it
-might not be, but I'm not sure why), and if so, if it's for the context
-in question?
-
-If it's not being called first, could you see what callpath is calling
-aio_get_linux_aio when this assertion trips?
-
-Thanks!
--Nish
-Hi Nishant,
-From the coredump of the guest this is the call trace that calls
-aio_get_linux_aio:
-Stack trace of thread 145158:
-#0  0x000003ff94dbe274 raise (libc.so.6)
-#1  0x000003ff94da39a8 abort (libc.so.6)
-#2  0x000003ff94db62ce __assert_fail_base (libc.so.6)
-#3  0x000003ff94db634c __assert_fail (libc.so.6)
-#4  0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x)
-#5  0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x)
-#6  0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x)
-#7  0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x)
-#8  0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x)
-#9  0x000002aa20db3c34 aio_poll (qemu-system-s390x)
-#10 0x000002aa20be32a2 iothread_run (qemu-system-s390x)
-#11 0x000003ff94f879a8 start_thread (libpthread.so.0)
-#12 0x000003ff94e797ee thread_start (libc.so.6)
-
-
-Thanks for taking a look and responding.
-
-Thanks
-Farhan
-Trying to debug a little further, the block device in this case is a
-"host device". And looking at your commit carefully you use the
-bdrv_attach_aio_context callback to setup a Linux AioContext.
-For some reason the "host device" struct (BlockDriver bdrv_host_device
-in block/file-posix.c) does not have a bdrv_attach_aio_context defined.
-So a simple change of adding the callback to the struct solves the issue
-and the guest starts fine.
-diff --git a/block/file-posix.c b/block/file-posix.c
-index 28824aa..b8d59fb 100644
---- a/block/file-posix.c
-+++ b/block/file-posix.c
-@@ -3135,6 +3135,7 @@ static BlockDriver bdrv_host_device = {
-     .bdrv_refresh_limits = raw_refresh_limits,
-     .bdrv_io_plug = raw_aio_plug,
-     .bdrv_io_unplug = raw_aio_unplug,
-+    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
-
-     .bdrv_co_truncate       = raw_co_truncate,
-     .bdrv_getlength    = raw_getlength,
-I am not too familiar with block device code in QEMU, so not sure if
-this is the right fix or if there are some underlying problems.
-Thanks
-Farhan
-
-On 18.07.2018 [11:10:27 -0400], Farhan Ali wrote:
->
->
->
-On 07/18/2018 09:42 AM, Farhan Ali wrote:
->
->
->
->
->
-> On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote:
->
-> > iiuc, this possibly implies AIO was not actually used previously on this
->
-> > guest (it might have silently been falling back to threaded IO?). I
->
-> > don't have access to s390x, but would it be possible to run qemu under
->
-> > gdb and see if aio_setup_linux_aio is being called at all (I think it
->
-> > might not be, but I'm not sure why), and if so, if it's for the context
->
-> > in question?
->
-> >
->
-> > If it's not being called first, could you see what callpath is calling
->
-> > aio_get_linux_aio when this assertion trips?
->
-> >
->
-> > Thanks!
->
-> > -Nish
->
->
->
->
->
-> Hi Nishant,
->
->
->
->  From the coredump of the guest this is the call trace that calls
->
-> aio_get_linux_aio:
->
->
->
->
->
-> Stack trace of thread 145158:
->
-> #0  0x000003ff94dbe274 raise (libc.so.6)
->
-> #1  0x000003ff94da39a8 abort (libc.so.6)
->
-> #2  0x000003ff94db62ce __assert_fail_base (libc.so.6)
->
-> #3  0x000003ff94db634c __assert_fail (libc.so.6)
->
-> #4  0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x)
->
-> #5  0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x)
->
-> #6  0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x)
->
-> #7  0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x)
->
-> #8  0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x)
->
-> #9  0x000002aa20db3c34 aio_poll (qemu-system-s390x)
->
-> #10 0x000002aa20be32a2 iothread_run (qemu-system-s390x)
->
-> #11 0x000003ff94f879a8 start_thread (libpthread.so.0)
->
-> #12 0x000003ff94e797ee thread_start (libc.so.6)
->
->
->
->
->
-> Thanks for taking a look and responding.
->
->
->
-> Thanks
->
-> Farhan
->
->
->
->
->
->
->
->
-Trying to debug a little further, the block device in this case is a "host
->
-device". And looking at your commit carefully you use the
->
-bdrv_attach_aio_context callback to setup a Linux AioContext.
->
->
-For some reason the "host device" struct (BlockDriver bdrv_host_device in
->
-block/file-posix.c) does not have a bdrv_attach_aio_context defined.
->
-So a simple change of adding the callback to the struct solves the issue and
->
-the guest starts fine.
->
->
->
-diff --git a/block/file-posix.c b/block/file-posix.c
->
-index 28824aa..b8d59fb 100644
->
---- a/block/file-posix.c
->
-+++ b/block/file-posix.c
->
-@@ -3135,6 +3135,7 @@ static BlockDriver bdrv_host_device = {
->
-.bdrv_refresh_limits = raw_refresh_limits,
->
-.bdrv_io_plug = raw_aio_plug,
->
-.bdrv_io_unplug = raw_aio_unplug,
->
-+    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
->
->
-.bdrv_co_truncate       = raw_co_truncate,
->
-.bdrv_getlength    = raw_getlength,
->
->
->
->
-I am not too familiar with block device code in QEMU, so not sure if
->
-this is the right fix or if there are some underlying problems.
-Oh this is quite embarassing! I only added the bdrv_attach_aio_context
-callback for the file-backed device. Your fix is definitely corect for
-host device. Let me make sure there weren't any others missed and I will
-send out a properly formatted patch. Thank you for the quick testing and
-turnaround!
-
--Nish
-
-On 07/18/2018 08:52 PM, Nishanth Aravamudan wrote:
->
-On 18.07.2018 [11:10:27 -0400], Farhan Ali wrote:
->
->
->
->
->
-> On 07/18/2018 09:42 AM, Farhan Ali wrote:
->
->>
->
->>
->
->> On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote:
->
->>> iiuc, this possibly implies AIO was not actually used previously on this
->
->>> guest (it might have silently been falling back to threaded IO?). I
->
->>> don't have access to s390x, but would it be possible to run qemu under
->
->>> gdb and see if aio_setup_linux_aio is being called at all (I think it
->
->>> might not be, but I'm not sure why), and if so, if it's for the context
->
->>> in question?
->
->>>
->
->>> If it's not being called first, could you see what callpath is calling
->
->>> aio_get_linux_aio when this assertion trips?
->
->>>
->
->>> Thanks!
->
->>> -Nish
->
->>
->
->>
->
->> Hi Nishant,
->
->>
->
->>  From the coredump of the guest this is the call trace that calls
->
->> aio_get_linux_aio:
->
->>
->
->>
->
->> Stack trace of thread 145158:
->
->> #0  0x000003ff94dbe274 raise (libc.so.6)
->
->> #1  0x000003ff94da39a8 abort (libc.so.6)
->
->> #2  0x000003ff94db62ce __assert_fail_base (libc.so.6)
->
->> #3  0x000003ff94db634c __assert_fail (libc.so.6)
->
->> #4  0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x)
->
->> #5  0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x)
->
->> #6  0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x)
->
->> #7  0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x)
->
->> #8  0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x)
->
->> #9  0x000002aa20db3c34 aio_poll (qemu-system-s390x)
->
->> #10 0x000002aa20be32a2 iothread_run (qemu-system-s390x)
->
->> #11 0x000003ff94f879a8 start_thread (libpthread.so.0)
->
->> #12 0x000003ff94e797ee thread_start (libc.so.6)
->
->>
->
->>
->
->> Thanks for taking a look and responding.
->
->>
->
->> Thanks
->
->> Farhan
->
->>
->
->>
->
->>
->
->
->
-> Trying to debug a little further, the block device in this case is a "host
->
-> device". And looking at your commit carefully you use the
->
-> bdrv_attach_aio_context callback to setup a Linux AioContext.
->
->
->
-> For some reason the "host device" struct (BlockDriver bdrv_host_device in
->
-> block/file-posix.c) does not have a bdrv_attach_aio_context defined.
->
-> So a simple change of adding the callback to the struct solves the issue and
->
-> the guest starts fine.
->
->
->
->
->
-> diff --git a/block/file-posix.c b/block/file-posix.c
->
-> index 28824aa..b8d59fb 100644
->
-> --- a/block/file-posix.c
->
-> +++ b/block/file-posix.c
->
-> @@ -3135,6 +3135,7 @@ static BlockDriver bdrv_host_device = {
->
->      .bdrv_refresh_limits = raw_refresh_limits,
->
->      .bdrv_io_plug = raw_aio_plug,
->
->      .bdrv_io_unplug = raw_aio_unplug,
->
-> +    .bdrv_attach_aio_context = raw_aio_attach_aio_context,
->
->
->
->      .bdrv_co_truncate       = raw_co_truncate,
->
->      .bdrv_getlength    = raw_getlength,
->
->
->
->
->
->
->
-> I am not too familiar with block device code in QEMU, so not sure if
->
-> this is the right fix or if there are some underlying problems.
->
->
-Oh this is quite embarassing! I only added the bdrv_attach_aio_context
->
-callback for the file-backed device. Your fix is definitely corect for
->
-host device. Let me make sure there weren't any others missed and I will
->
-send out a properly formatted patch. Thank you for the quick testing and
->
-turnaround!
-Farhan, can you respin your patch with proper sign-off and patch description?
-Adding qemu-block.
-
-Hi Christian,
-
-On 19.07.2018 [08:55:20 +0200], Christian Borntraeger wrote:
->
->
->
-On 07/18/2018 08:52 PM, Nishanth Aravamudan wrote:
->
-> On 18.07.2018 [11:10:27 -0400], Farhan Ali wrote:
->
->>
->
->>
->
->> On 07/18/2018 09:42 AM, Farhan Ali wrote:
-<snip>
-
->
->> I am not too familiar with block device code in QEMU, so not sure if
->
->> this is the right fix or if there are some underlying problems.
->
->
->
-> Oh this is quite embarassing! I only added the bdrv_attach_aio_context
->
-> callback for the file-backed device. Your fix is definitely corect for
->
-> host device. Let me make sure there weren't any others missed and I will
->
-> send out a properly formatted patch. Thank you for the quick testing and
->
-> turnaround!
->
->
-Farhan, can you respin your patch with proper sign-off and patch description?
->
-Adding qemu-block.
-I sent it yesterday, sorry I didn't cc everyone from this e-mail:
-http://lists.nongnu.org/archive/html/qemu-block/2018-07/msg00516.html
-Thanks,
-Nish
-
diff --git a/classification_output/04/mistranslation/25842545 b/classification_output/04/mistranslation/25842545
deleted file mode 100644
index 088ed7a19..000000000
--- a/classification_output/04/mistranslation/25842545
+++ /dev/null
@@ -1,210 +0,0 @@
-mistranslation: 0.928
-other: 0.912
-KVM: 0.867
-vnc: 0.862
-device: 0.847
-instruction: 0.835
-semantic: 0.829
-boot: 0.824
-assembly: 0.824
-graphic: 0.822
-socket: 0.808
-network: 0.796
-
-[Qemu-devel] [Bug?] Guest pause because VMPTRLD failed in KVM
-
-Hello,
-
-  We encountered a problem that a guest paused because the KMOD report VMPTRLD 
-failed.
-
-The related information is as follows:
-
-1) Qemu command:
-   /usr/bin/qemu-kvm -name omu1 -S -machine pc-i440fx-2.3,accel=kvm,usb=off -cpu
-host -m 15625 -realtime mlock=off -smp 8,sockets=1,cores=8,threads=1 -uuid
-a2aacfff-6583-48b4-b6a4-e6830e519931 -no-user-config -nodefaults -chardev
-socket,id=charmonitor,path=/var/lib/libvirt/qemu/omu1.monitor,server,nowait
--mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
--boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
-virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive
-file=/home/env/guest1.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native
-  -device
-virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0
-  -drive
-file=/home/env/guest_300G.img,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native
-  -device
-virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1
-  -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26 -device
-virtio-net-pci,netdev=hostnet0,id=net0,mac=00:00:80:05:00:00,bus=pci.0,addr=0x3
--netdev tap,fd=27,id=hostnet1,vhost=on,vhostfd=28 -device
-virtio-net-pci,netdev=hostnet1,id=net1,mac=00:00:80:05:00:01,bus=pci.0,addr=0x4
--chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0
--device usb-tablet,id=input0 -vnc 0.0.0.0:0 -device
-cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device
-virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
-
-   2) Qemu log:
-   KVM: entry failed, hardware error 0x4
-   RAX=00000000ffffffed RBX=ffff8803fa2d7fd8 RCX=0100000000000000
-RDX=0000000000000000
-   RSI=0000000000000000 RDI=0000000000000046 RBP=ffff8803fa2d7e90
-RSP=ffff8803fa2efe90
-   R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000
-R11=000000000000b69a
-   R12=0000000000000001 R13=ffffffff81a25b40 R14=0000000000000000
-R15=ffff8803fa2d7fd8
-   RIP=ffffffff81053e16 RFL=00000286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
-   ES =0000 0000000000000000 ffffffff 00c00000
-   CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
-   SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
-   DS =0000 0000000000000000 ffffffff 00c00000
-   FS =0000 0000000000000000 ffffffff 00c00000
-   GS =0000 ffff88040f540000 ffffffff 00c00000
-   LDT=0000 0000000000000000 ffffffff 00c00000
-   TR =0040 ffff88040f550a40 00002087 00008b00 DPL=0 TSS64-busy
-   GDT=     ffff88040f549000 0000007f
-   IDT=     ffffffffff529000 00000fff
-   CR0=80050033 CR2=00007f81ca0c5000 CR3=00000003f5081000 CR4=000407e0
-   DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
-DR3=0000000000000000
-   DR6=00000000ffff0ff0 DR7=0000000000000400
-   EFER=0000000000000d01
-   Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ??
-?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
-
-   3) Demsg
-   [347315.028339] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
-   klogd 1.4.1, ---------- state change ----------
-   [347315.039506] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
-   [347315.051728] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
-   [347315.057472] vmwrite error: reg 6c0a value ffff88307e66e480 (err
-2120672384)
-   [347315.064567] Pid: 69523, comm: qemu-kvm Tainted: GF           X
-3.0.93-0.8-default #1
-   [347315.064569] Call Trace:
-   [347315.064587]  [<ffffffff810049d5>] dump_trace+0x75/0x300
-   [347315.064595]  [<ffffffff8145e3e3>] dump_stack+0x69/0x6f
-   [347315.064617]  [<ffffffffa03738de>] vmx_vcpu_load+0x11e/0x1d0 [kvm_intel]
-   [347315.064647]  [<ffffffffa029a204>] kvm_arch_vcpu_load+0x44/0x1d0 [kvm]
-   [347315.064669]  [<ffffffff81054ee1>] finish_task_switch+0x81/0xe0
-   [347315.064676]  [<ffffffff8145f0b4>] thread_return+0x3b/0x2a7
-   [347315.064687]  [<ffffffffa028d9b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
-   [347315.064703]  [<ffffffffa02a16d1>] __vcpu_run+0xd1/0x260 [kvm]
-   [347315.064732]  [<ffffffffa02a2418>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 
-[kvm]
-   [347315.064759]  [<ffffffffa028ecee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
-   [347315.064771]  [<ffffffff8116bdfb>] do_vfs_ioctl+0x8b/0x3b0
-   [347315.064776]  [<ffffffff8116c1c1>] sys_ioctl+0xa1/0xb0
-   [347315.064783]  [<ffffffff81469272>] system_call_fastpath+0x16/0x1b
-   [347315.064797]  [<00007fee51969ce7>] 0x7fee51969ce6
-   [347315.064799] vmwrite error: reg 6c0c value ffff88307e664000 (err
-2120630272)
-   [347315.064802] Pid: 69523, comm: qemu-kvm Tainted: GF           X
-3.0.93-0.8-default #1
-   [347315.064803] Call Trace:
-   [347315.064807]  [<ffffffff810049d5>] dump_trace+0x75/0x300
-   [347315.064811]  [<ffffffff8145e3e3>] dump_stack+0x69/0x6f
-   [347315.064817]  [<ffffffffa03738ec>] vmx_vcpu_load+0x12c/0x1d0 [kvm_intel]
-   [347315.064832]  [<ffffffffa029a204>] kvm_arch_vcpu_load+0x44/0x1d0 [kvm]
-   [347315.064851]  [<ffffffff81054ee1>] finish_task_switch+0x81/0xe0
-   [347315.064855]  [<ffffffff8145f0b4>] thread_return+0x3b/0x2a7
-   [347315.064865]  [<ffffffffa028d9b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
-   [347315.064880]  [<ffffffffa02a16d1>] __vcpu_run+0xd1/0x260 [kvm]
-   [347315.064907]  [<ffffffffa02a2418>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 
-[kvm]
-   [347315.064933]  [<ffffffffa028ecee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
-   [347315.064943]  [<ffffffff8116bdfb>] do_vfs_ioctl+0x8b/0x3b0
-   [347315.064947]  [<ffffffff8116c1c1>] sys_ioctl+0xa1/0xb0
-   [347315.064951]  [<ffffffff81469272>] system_call_fastpath+0x16/0x1b
-   [347315.064957]  [<00007fee51969ce7>] 0x7fee51969ce6
-   [347315.064959] vmwrite error: reg 6c10 value 0 (err 0)
-
-   4) The isssue can't be reporduced. I search the Intel VMX sepc about reaseons
-of vmptrld failure:
-   The instruction fails if its operand is not properly aligned, sets
-unsupported physical-address bits, or is equal to the VMXON
-   pointer. In addition, the instruction fails if the 32 bits in memory
-referenced by the operand do not match the VMCS
-   revision identifier supported by this processor.
-
-   But I can't find any cues from the KVM source code. It seems each
-   error conditions is impossible in theory. :(
-
-Any suggestions will be appreciated! Paolo?
-
--- 
-Regards,
--Gonglei
-
-On 10/11/2016 15:10, gong lei wrote:
->
-4) The isssue can't be reporduced. I search the Intel VMX sepc about
->
-reaseons
->
-of vmptrld failure:
->
-The instruction fails if its operand is not properly aligned, sets
->
-unsupported physical-address bits, or is equal to the VMXON
->
-pointer. In addition, the instruction fails if the 32 bits in memory
->
-referenced by the operand do not match the VMCS
->
-revision identifier supported by this processor.
->
->
-But I can't find any cues from the KVM source code. It seems each
->
-error conditions is impossible in theory. :(
-Yes, it should not happen. :(
-
-If it's not reproducible, it's really hard to say what it was, except a
-random memory corruption elsewhere or even a bit flip (!).
-
-Paolo
-
-On 2016/11/17 20:39, Paolo Bonzini wrote:
->
->
-On 10/11/2016 15:10, gong lei wrote:
->
->     4) The isssue can't be reporduced. I search the Intel VMX sepc about
->
-> reaseons
->
-> of vmptrld failure:
->
->     The instruction fails if its operand is not properly aligned, sets
->
-> unsupported physical-address bits, or is equal to the VMXON
->
->     pointer. In addition, the instruction fails if the 32 bits in memory
->
-> referenced by the operand do not match the VMCS
->
->     revision identifier supported by this processor.
->
->
->
->     But I can't find any cues from the KVM source code. It seems each
->
->     error conditions is impossible in theory. :(
->
-Yes, it should not happen. :(
->
->
-If it's not reproducible, it's really hard to say what it was, except a
->
-random memory corruption elsewhere or even a bit flip (!).
->
->
-Paolo
-Thanks for your reply, Paolo :)
-
--- 
-Regards,
--Gonglei
-
diff --git a/classification_output/04/mistranslation/36568044 b/classification_output/04/mistranslation/36568044
deleted file mode 100644
index ba6cad70a..000000000
--- a/classification_output/04/mistranslation/36568044
+++ /dev/null
@@ -1,4589 +0,0 @@
-mistranslation: 0.962
-device: 0.931
-graphic: 0.931
-instruction: 0.930
-other: 0.930
-assembly: 0.926
-semantic: 0.923
-KVM: 0.914
-socket: 0.907
-vnc: 0.905
-network: 0.904
-boot: 0.895
-
-[BUG, RFC] cpr-transfer: qxl guest driver crashes after migration
-
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
->
-EMULATOR=/path/to/emulator
->
-ROOTFS=/path/to/image
->
-QMPSOCK=/var/run/alma8qmp-src.sock
->
->
-$EMULATOR -enable-kvm \
->
--machine q35 \
->
--cpu host -smp 2 -m 2G \
->
--object
->
-memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
->
--machine memory-backend=ram0 \
->
--machine aux-ram-share=on \
->
--drive file=$ROOTFS,media=disk,if=virtio \
->
--qmp unix:$QMPSOCK,server=on,wait=off \
->
--nographic \
->
--device qxl-vga
-Run migration target:
->
-EMULATOR=/path/to/emulator
->
-ROOTFS=/path/to/image
->
-QMPSOCK=/var/run/alma8qmp-dst.sock
->
->
->
->
-$EMULATOR -enable-kvm \
->
--machine q35 \
->
--cpu host -smp 2 -m 2G \
->
--object
->
-memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
->
--machine memory-backend=ram0 \
->
--machine aux-ram-share=on \
->
--drive file=$ROOTFS,media=disk,if=virtio \
->
--qmp unix:$QMPSOCK,server=on,wait=off \
->
--nographic \
->
--device qxl-vga \
->
--incoming tcp:0:44444 \
->
--incoming '{"channel-type": "cpr", "addr": { "transport": "socket",
->
-"type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
->
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
->
-QMPSOCK=/var/run/alma8qmp-src.sock
->
->
-$QMPSHELL -p $QMPSOCK <<EOF
->
-migrate-set-parameters mode=cpr-transfer
->
-migrate
->
-channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
->
-EOF
-Then, after a while, QXL guest driver on target crashes spewing the
-following messages:
->
-[   73.962002] [TTM] Buffer eviction failed
->
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
->
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate
->
-VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which speeds up
-the crash in the guest):
->
-#!/bin/bash
->
->
-chvt 3
->
->
-for j in $(seq 80); do
->
-echo "$(date) starting round $j"
->
-if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != ""
->
-]; then
->
-echo "bug was reproduced after $j tries"
->
-exit 1
->
-fi
->
-for i in $(seq 100); do
->
-dmesg > /dev/tty3
->
-done
->
-done
->
->
-echo "bug could not be reproduced"
->
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the crash -
-without the cpr-transfer migration the above reproduce doesn't lead to
-crash on the source VM.
-
-I suspect that, as cpr-transfer doesn't migrate the guest memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-
-Could somebody help the investigation and take a look into this?  Any
-suggestions would be appreciated.  Thanks!
-
-Andrey
-
-On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$EMULATOR -enable-kvm \
-     -machine q35 \
-     -cpu host -smp 2 -m 2G \
-     -object 
-memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
-     -machine memory-backend=ram0 \
-     -machine aux-ram-share=on \
-     -drive file=$ROOTFS,media=disk,if=virtio \
-     -qmp unix:$QMPSOCK,server=on,wait=off \
-     -nographic \
-     -device qxl-vga
-Run migration target:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-dst.sock
-$EMULATOR -enable-kvm \
--machine q35 \
-     -cpu host -smp 2 -m 2G \
-     -object 
-memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
-     -machine memory-backend=ram0 \
-     -machine aux-ram-share=on \
-     -drive file=$ROOTFS,media=disk,if=virtio \
-     -qmp unix:$QMPSOCK,server=on,wait=off \
-     -nographic \
-     -device qxl-vga \
-     -incoming tcp:0:44444 \
-     -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix", 
-"path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$QMPSHELL -p $QMPSOCK <<EOF
-     migrate-set-parameters mode=cpr-transfer
-     migrate 
-channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
-EOF
-Then, after a while, QXL guest driver on target crashes spewing the
-following messages:
-[   73.962002] [TTM] Buffer eviction failed
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate 
-VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which speeds up
-the crash in the guest):
-#!/bin/bash
-
-chvt 3
-
-for j in $(seq 80); do
-         echo "$(date) starting round $j"
-         if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" 
-]; then
-                 echo "bug was reproduced after $j tries"
-                 exit 1
-         fi
-         for i in $(seq 100); do
-                 dmesg > /dev/tty3
-         done
-done
-
-echo "bug could not be reproduced"
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the crash -
-without the cpr-transfer migration the above reproduce doesn't lead to
-crash on the source VM.
-
-I suspect that, as cpr-transfer doesn't migrate the guest memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-
-Could somebody help the investigation and take a look into this?  Any
-suggestions would be appreciated.  Thanks!
-Possibly some memory region created by qxl is not being preserved.
-Try adding these traces to see what is preserved:
-
--trace enable='*cpr*'
--trace enable='*ram_alloc*'
-
-- Steve
-
-On 2/28/2025 1:13 PM, Steven Sistare wrote:
-On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$EMULATOR -enable-kvm \
-     -machine q35 \
-     -cpu host -smp 2 -m 2G \
-     -object 
-memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
-     -machine memory-backend=ram0 \
-     -machine aux-ram-share=on \
-     -drive file=$ROOTFS,media=disk,if=virtio \
-     -qmp unix:$QMPSOCK,server=on,wait=off \
-     -nographic \
-     -device qxl-vga
-Run migration target:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-dst.sock
-$EMULATOR -enable-kvm \
-     -machine q35 \
-     -cpu host -smp 2 -m 2G \
-     -object 
-memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
-     -machine memory-backend=ram0 \
-     -machine aux-ram-share=on \
-     -drive file=$ROOTFS,media=disk,if=virtio \
-     -qmp unix:$QMPSOCK,server=on,wait=off \
-     -nographic \
-     -device qxl-vga \
-     -incoming tcp:0:44444 \
-     -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix", 
-"path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$QMPSHELL -p $QMPSOCK <<EOF
-     migrate-set-parameters mode=cpr-transfer
-     migrate 
-channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
-EOF
-Then, after a while, QXL guest driver on target crashes spewing the
-following messages:
-[   73.962002] [TTM] Buffer eviction failed
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate 
-VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which speeds up
-the crash in the guest):
-#!/bin/bash
-
-chvt 3
-
-for j in $(seq 80); do
-         echo "$(date) starting round $j"
-         if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" 
-]; then
-                 echo "bug was reproduced after $j tries"
-                 exit 1
-         fi
-         for i in $(seq 100); do
-                 dmesg > /dev/tty3
-         done
-done
-
-echo "bug could not be reproduced"
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the crash -
-without the cpr-transfer migration the above reproduce doesn't lead to
-crash on the source VM.
-
-I suspect that, as cpr-transfer doesn't migrate the guest memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-
-Could somebody help the investigation and take a look into this?  Any
-suggestions would be appreciated.  Thanks!
-Possibly some memory region created by qxl is not being preserved.
-Try adding these traces to see what is preserved:
-
--trace enable='*cpr*'
--trace enable='*ram_alloc*'
-Also try adding this patch to see if it flags any ram blocks as not
-compatible with cpr.  A message is printed at migration start time.
-1740667681-257312-1-git-send-email-steven.sistare@oracle.com
-/">https://lore.kernel.org/qemu-devel/
-1740667681-257312-1-git-send-email-steven.sistare@oracle.com
-/
-- Steve
-
-On 2/28/25 8:20 PM, Steven Sistare wrote:
->
-On 2/28/2025 1:13 PM, Steven Sistare wrote:
->
-> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
->
->> Hi all,
->
->>
->
->> We've been experimenting with cpr-transfer migration mode recently and
->
->> have discovered the following issue with the guest QXL driver:
->
->>
->
->> Run migration source:
->
->>> EMULATOR=/path/to/emulator
->
->>> ROOTFS=/path/to/image
->
->>> QMPSOCK=/var/run/alma8qmp-src.sock
->
->>>
->
->>> $EMULATOR -enable-kvm \
->
->>>      -machine q35 \
->
->>>      -cpu host -smp 2 -m 2G \
->
->>>      -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
->
->>> ram0,share=on\
->
->>>      -machine memory-backend=ram0 \
->
->>>      -machine aux-ram-share=on \
->
->>>      -drive file=$ROOTFS,media=disk,if=virtio \
->
->>>      -qmp unix:$QMPSOCK,server=on,wait=off \
->
->>>      -nographic \
->
->>>      -device qxl-vga
->
->>
->
->> Run migration target:
->
->>> EMULATOR=/path/to/emulator
->
->>> ROOTFS=/path/to/image
->
->>> QMPSOCK=/var/run/alma8qmp-dst.sock
->
->>> $EMULATOR -enable-kvm \
->
->>>      -machine q35 \
->
->>>      -cpu host -smp 2 -m 2G \
->
->>>      -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
->
->>> ram0,share=on\
->
->>>      -machine memory-backend=ram0 \
->
->>>      -machine aux-ram-share=on \
->
->>>      -drive file=$ROOTFS,media=disk,if=virtio \
->
->>>      -qmp unix:$QMPSOCK,server=on,wait=off \
->
->>>      -nographic \
->
->>>      -device qxl-vga \
->
->>>      -incoming tcp:0:44444 \
->
->>>      -incoming '{"channel-type": "cpr", "addr": { "transport":
->
->>> "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
->
->>
->
->>
->
->> Launch the migration:
->
->>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
->
->>> QMPSOCK=/var/run/alma8qmp-src.sock
->
->>>
->
->>> $QMPSHELL -p $QMPSOCK <<EOF
->
->>>      migrate-set-parameters mode=cpr-transfer
->
->>>      migrate channels=[{"channel-type":"main","addr":
->
->>> {"transport":"socket","type":"inet","host":"0","port":"44444"}},
->
->>> {"channel-type":"cpr","addr":
->
->>> {"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
->
->>> dst.sock"}}]
->
->>> EOF
->
->>
->
->> Then, after a while, QXL guest driver on target crashes spewing the
->
->> following messages:
->
->>> [   73.962002] [TTM] Buffer eviction failed
->
->>> [   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
->
->>> 0x00000001)
->
->>> [   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
->
->>> allocate VRAM BO
->
->>
->
->> That seems to be a known kernel QXL driver bug:
->
->>
->
->>
-https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
->
->>
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
->
->>
->
->> (the latter discussion contains that reproduce script which speeds up
->
->> the crash in the guest):
->
->>> #!/bin/bash
->
->>>
->
->>> chvt 3
->
->>>
->
->>> for j in $(seq 80); do
->
->>>          echo "$(date) starting round $j"
->
->>>          if [ "$(journalctl --boot | grep "failed to allocate VRAM
->
->>> BO")" != "" ]; then
->
->>>                  echo "bug was reproduced after $j tries"
->
->>>                  exit 1
->
->>>          fi
->
->>>          for i in $(seq 100); do
->
->>>                  dmesg > /dev/tty3
->
->>>          done
->
->>> done
->
->>>
->
->>> echo "bug could not be reproduced"
->
->>> exit 0
->
->>
->
->> The bug itself seems to remain unfixed, as I was able to reproduce that
->
->> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
->
->> cpr-transfer code also seems to be buggy as it triggers the crash -
->
->> without the cpr-transfer migration the above reproduce doesn't lead to
->
->> crash on the source VM.
->
->>
->
->> I suspect that, as cpr-transfer doesn't migrate the guest memory, but
->
->> rather passes it through the memory backend object, our code might
->
->> somehow corrupt the VRAM.  However, I wasn't able to trace the
->
->> corruption so far.
->
->>
->
->> Could somebody help the investigation and take a look into this?  Any
->
->> suggestions would be appreciated.  Thanks!
->
->
->
-> Possibly some memory region created by qxl is not being preserved.
->
-> Try adding these traces to see what is preserved:
->
->
->
-> -trace enable='*cpr*'
->
-> -trace enable='*ram_alloc*'
->
->
-Also try adding this patch to see if it flags any ram blocks as not
->
-compatible with cpr.  A message is printed at migration start time.
->

-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-
->
-steven.sistare@oracle.com/
->
->
-- Steve
->
-With the traces enabled + the "migration: ram block cpr blockers" patch
-applied:
-
-Source:
->
-cpr_find_fd pc.bios, id 0 returns -1
->
-cpr_save_fd pc.bios, id 0, fd 22
->
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
->
-0x7fec18e00000
->
-cpr_find_fd pc.rom, id 0 returns -1
->
-cpr_save_fd pc.rom, id 0, fd 23
->
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
->
-0x7fec18c00000
->
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
->
-cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
->
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd
->
-24 host 0x7fec18a00000
->
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
->
-cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
->
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864
->
-fd 25 host 0x7feb77e00000
->
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
->
-cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
->
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27
->
-host 0x7fec18800000
->
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
->
-cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
->
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864
->
-fd 28 host 0x7feb73c00000
->
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
->
-cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
->
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34
->
-host 0x7fec18600000
->
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
->
-cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
->
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 35
->
-host 0x7fec18200000
->
-cpr_find_fd /rom@etc/table-loader, id 0 returns -1
->
-cpr_save_fd /rom@etc/table-loader, id 0, fd 36
->
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36
->
-host 0x7feb8b600000
->
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
->
-cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
->
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host
->
-0x7feb8b400000
->
->
-cpr_state_save cpr-transfer mode
->
-cpr_transfer_output /var/run/alma8cpr-dst.sock
-Target:
->
-cpr_transfer_input /var/run/alma8cpr-dst.sock
->
-cpr_state_load cpr-transfer mode
->
-cpr_find_fd pc.bios, id 0 returns 20
->
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
->
-0x7fcdc9800000
->
-cpr_find_fd pc.rom, id 0 returns 19
->
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
->
-0x7fcdc9600000
->
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
->
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd
->
-18 host 0x7fcdc9400000
->
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
->
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864
->
-fd 17 host 0x7fcd27e00000
->
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
->
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16
->
-host 0x7fcdc9200000
->
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
->
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864
->
-fd 15 host 0x7fcd23c00000
->
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
->
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14
->
-host 0x7fcdc8800000
->
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
->
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 13
->
-host 0x7fcdc8400000
->
-cpr_find_fd /rom@etc/table-loader, id 0 returns 11
->
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11
->
-host 0x7fcdc8200000
->
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
->
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host
->
-0x7fcd3be00000
-Looks like both vga.vram and qxl.vram are being preserved (with the same
-addresses), and no incompatible ram blocks are found during migration.
-
-Andrey
-
-On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
->
-On 2/28/25 8:20 PM, Steven Sistare wrote:
->
-> On 2/28/2025 1:13 PM, Steven Sistare wrote:
->
->> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
->
->>> Hi all,
->
->>>
->
->>> We've been experimenting with cpr-transfer migration mode recently and
->
->>> have discovered the following issue with the guest QXL driver:
->
->>>
->
->>> Run migration source:
->
->>>> EMULATOR=/path/to/emulator
->
->>>> ROOTFS=/path/to/image
->
->>>> QMPSOCK=/var/run/alma8qmp-src.sock
->
->>>>
->
->>>> $EMULATOR -enable-kvm \
->
->>>>      -machine q35 \
->
->>>>      -cpu host -smp 2 -m 2G \
->
->>>>      -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
->
->>>> ram0,share=on\
->
->>>>      -machine memory-backend=ram0 \
->
->>>>      -machine aux-ram-share=on \
->
->>>>      -drive file=$ROOTFS,media=disk,if=virtio \
->
->>>>      -qmp unix:$QMPSOCK,server=on,wait=off \
->
->>>>      -nographic \
->
->>>>      -device qxl-vga
->
->>>
->
->>> Run migration target:
->
->>>> EMULATOR=/path/to/emulator
->
->>>> ROOTFS=/path/to/image
->
->>>> QMPSOCK=/var/run/alma8qmp-dst.sock
->
->>>> $EMULATOR -enable-kvm \
->
->>>>      -machine q35 \
->
->>>>      -cpu host -smp 2 -m 2G \
->
->>>>      -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
->
->>>> ram0,share=on\
->
->>>>      -machine memory-backend=ram0 \
->
->>>>      -machine aux-ram-share=on \
->
->>>>      -drive file=$ROOTFS,media=disk,if=virtio \
->
->>>>      -qmp unix:$QMPSOCK,server=on,wait=off \
->
->>>>      -nographic \
->
->>>>      -device qxl-vga \
->
->>>>      -incoming tcp:0:44444 \
->
->>>>      -incoming '{"channel-type": "cpr", "addr": { "transport":
->
->>>> "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
->
->>>
->
->>>
->
->>> Launch the migration:
->
->>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
->
->>>> QMPSOCK=/var/run/alma8qmp-src.sock
->
->>>>
->
->>>> $QMPSHELL -p $QMPSOCK <<EOF
->
->>>>      migrate-set-parameters mode=cpr-transfer
->
->>>>      migrate channels=[{"channel-type":"main","addr":
->
->>>> {"transport":"socket","type":"inet","host":"0","port":"44444"}},
->
->>>> {"channel-type":"cpr","addr":
->
->>>> {"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
->
->>>> dst.sock"}}]
->
->>>> EOF
->
->>>
->
->>> Then, after a while, QXL guest driver on target crashes spewing the
->
->>> following messages:
->
->>>> [   73.962002] [TTM] Buffer eviction failed
->
->>>> [   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
->
->>>> 0x00000001)
->
->>>> [   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
->
->>>> allocate VRAM BO
->
->>>
->
->>> That seems to be a known kernel QXL driver bug:
->
->>>
->
->>>
-https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
->
->>>
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
->
->>>
->
->>> (the latter discussion contains that reproduce script which speeds up
->
->>> the crash in the guest):
->
->>>> #!/bin/bash
->
->>>>
->
->>>> chvt 3
->
->>>>
->
->>>> for j in $(seq 80); do
->
->>>>          echo "$(date) starting round $j"
->
->>>>          if [ "$(journalctl --boot | grep "failed to allocate VRAM
->
->>>> BO")" != "" ]; then
->
->>>>                  echo "bug was reproduced after $j tries"
->
->>>>                  exit 1
->
->>>>          fi
->
->>>>          for i in $(seq 100); do
->
->>>>                  dmesg > /dev/tty3
->
->>>>          done
->
->>>> done
->
->>>>
->
->>>> echo "bug could not be reproduced"
->
->>>> exit 0
->
->>>
->
->>> The bug itself seems to remain unfixed, as I was able to reproduce that
->
->>> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
->
->>> cpr-transfer code also seems to be buggy as it triggers the crash -
->
->>> without the cpr-transfer migration the above reproduce doesn't lead to
->
->>> crash on the source VM.
->
->>>
->
->>> I suspect that, as cpr-transfer doesn't migrate the guest memory, but
->
->>> rather passes it through the memory backend object, our code might
->
->>> somehow corrupt the VRAM.  However, I wasn't able to trace the
->
->>> corruption so far.
->
->>>
->
->>> Could somebody help the investigation and take a look into this?  Any
->
->>> suggestions would be appreciated.  Thanks!
->
->>
->
->> Possibly some memory region created by qxl is not being preserved.
->
->> Try adding these traces to see what is preserved:
->
->>
->
->> -trace enable='*cpr*'
->
->> -trace enable='*ram_alloc*'
->
->
->
-> Also try adding this patch to see if it flags any ram blocks as not
->
-> compatible with cpr.  A message is printed at migration start time.
->
-> Â
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-
->
-> steven.sistare@oracle.com/
->
->
->
-> - Steve
->
->
->
->
-With the traces enabled + the "migration: ram block cpr blockers" patch
->
-applied:
->
->
-Source:
->
-> cpr_find_fd pc.bios, id 0 returns -1
->
-> cpr_save_fd pc.bios, id 0, fd 22
->
-> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
->
-> 0x7fec18e00000
->
-> cpr_find_fd pc.rom, id 0 returns -1
->
-> cpr_save_fd pc.rom, id 0, fd 23
->
-> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
->
-> 0x7fec18c00000
->
-> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
->
-> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
->
-> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd
->
-> 24 host 0x7fec18a00000
->
-> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
->
-> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
->
-> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864
->
-> fd 25 host 0x7feb77e00000
->
-> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
->
-> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
->
-> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27
->
-> host 0x7fec18800000
->
-> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
->
-> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
->
-> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864
->
-> fd 28 host 0x7feb73c00000
->
-> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
->
-> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
->
-> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34
->
-> host 0x7fec18600000
->
-> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
->
-> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
->
-> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd
->
-> 35 host 0x7fec18200000
->
-> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
->
-> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
->
-> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36
->
-> host 0x7feb8b600000
->
-> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
->
-> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
->
-> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host
->
-> 0x7feb8b400000
->
->
->
-> cpr_state_save cpr-transfer mode
->
-> cpr_transfer_output /var/run/alma8cpr-dst.sock
->
->
-Target:
->
-> cpr_transfer_input /var/run/alma8cpr-dst.sock
->
-> cpr_state_load cpr-transfer mode
->
-> cpr_find_fd pc.bios, id 0 returns 20
->
-> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
->
-> 0x7fcdc9800000
->
-> cpr_find_fd pc.rom, id 0 returns 19
->
-> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
->
-> 0x7fcdc9600000
->
-> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
->
-> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd
->
-> 18 host 0x7fcdc9400000
->
-> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
->
-> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864
->
-> fd 17 host 0x7fcd27e00000
->
-> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
->
-> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16
->
-> host 0x7fcdc9200000
->
-> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
->
-> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864
->
-> fd 15 host 0x7fcd23c00000
->
-> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
->
-> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14
->
-> host 0x7fcdc8800000
->
-> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
->
-> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd
->
-> 13 host 0x7fcdc8400000
->
-> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
->
-> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11
->
-> host 0x7fcdc8200000
->
-> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
->
-> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host
->
-> 0x7fcd3be00000
->
->
-Looks like both vga.vram and qxl.vram are being preserved (with the same
->
-addresses), and no incompatible ram blocks are found during migration.
->
-Sorry, addressed are not the same, of course.  However corresponding ram
-blocks do seem to be preserved and initialized.
-
-On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:20 PM, Steven Sistare wrote:
-On 2/28/2025 1:13 PM, Steven Sistare wrote:
-On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$EMULATOR -enable-kvm \
-      -machine q35 \
-      -cpu host -smp 2 -m 2G \
-      -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
-ram0,share=on\
-      -machine memory-backend=ram0 \
-      -machine aux-ram-share=on \
-      -drive file=$ROOTFS,media=disk,if=virtio \
-      -qmp unix:$QMPSOCK,server=on,wait=off \
-      -nographic \
-      -device qxl-vga
-Run migration target:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-dst.sock
-$EMULATOR -enable-kvm \
-      -machine q35 \
-      -cpu host -smp 2 -m 2G \
-      -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
-ram0,share=on\
-      -machine memory-backend=ram0 \
-      -machine aux-ram-share=on \
-      -drive file=$ROOTFS,media=disk,if=virtio \
-      -qmp unix:$QMPSOCK,server=on,wait=off \
-      -nographic \
-      -device qxl-vga \
-      -incoming tcp:0:44444 \
-      -incoming '{"channel-type": "cpr", "addr": { "transport":
-"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$QMPSHELL -p $QMPSOCK <<EOF
-      migrate-set-parameters mode=cpr-transfer
-      migrate channels=[{"channel-type":"main","addr":
-{"transport":"socket","type":"inet","host":"0","port":"44444"}},
-{"channel-type":"cpr","addr":
-{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
-dst.sock"}}]
-EOF
-Then, after a while, QXL guest driver on target crashes spewing the
-following messages:
-[   73.962002] [TTM] Buffer eviction failed
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
-0x00000001)
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
-allocate VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which speeds up
-the crash in the guest):
-#!/bin/bash
-
-chvt 3
-
-for j in $(seq 80); do
-          echo "$(date) starting round $j"
-          if [ "$(journalctl --boot | grep "failed to allocate VRAM
-BO")" != "" ]; then
-                  echo "bug was reproduced after $j tries"
-                  exit 1
-          fi
-          for i in $(seq 100); do
-                  dmesg > /dev/tty3
-          done
-done
-
-echo "bug could not be reproduced"
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the crash -
-without the cpr-transfer migration the above reproduce doesn't lead to
-crash on the source VM.
-
-I suspect that, as cpr-transfer doesn't migrate the guest memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-
-Could somebody help the investigation and take a look into this?  Any
-suggestions would be appreciated.  Thanks!
-Possibly some memory region created by qxl is not being preserved.
-Try adding these traces to see what is preserved:
-
--trace enable='*cpr*'
--trace enable='*ram_alloc*'
-Also try adding this patch to see if it flags any ram blocks as not
-compatible with cpr.  A message is printed at migration start time.
- Â
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-
-steven.sistare@oracle.com/
-
-- Steve
-With the traces enabled + the "migration: ram block cpr blockers" patch
-applied:
-
-Source:
-cpr_find_fd pc.bios, id 0 returns -1
-cpr_save_fd pc.bios, id 0, fd 22
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host 
-0x7fec18e00000
-cpr_find_fd pc.rom, id 0 returns -1
-cpr_save_fd pc.rom, id 0, fd 23
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host 
-0x7fec18c00000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
-cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 24 
-host 0x7fec18a00000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 
-25 host 0x7feb77e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27 host 
-0x7fec18800000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 
-28 host 0x7feb73c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34 host 
-0x7fec18600000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 35 
-host 0x7fec18200000
-cpr_find_fd /rom@etc/table-loader, id 0 returns -1
-cpr_save_fd /rom@etc/table-loader, id 0, fd 36
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36 host 
-0x7feb8b600000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host 
-0x7feb8b400000
-
-cpr_state_save cpr-transfer mode
-cpr_transfer_output /var/run/alma8cpr-dst.sock
-Target:
-cpr_transfer_input /var/run/alma8cpr-dst.sock
-cpr_state_load cpr-transfer mode
-cpr_find_fd pc.bios, id 0 returns 20
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host 
-0x7fcdc9800000
-cpr_find_fd pc.rom, id 0 returns 19
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host 
-0x7fcdc9600000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 18 
-host 0x7fcdc9400000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 
-17 host 0x7fcd27e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16 host 
-0x7fcdc9200000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 
-15 host 0x7fcd23c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14 host 
-0x7fcdc8800000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 13 
-host 0x7fcdc8400000
-cpr_find_fd /rom@etc/table-loader, id 0 returns 11
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11 host 
-0x7fcdc8200000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host 
-0x7fcd3be00000
-Looks like both vga.vram and qxl.vram are being preserved (with the same
-addresses), and no incompatible ram blocks are found during migration.
-Sorry, addressed are not the same, of course.  However corresponding ram
-blocks do seem to be preserved and initialized.
-So far, I have not reproduced the guest driver failure.
-
-However, I have isolated places where new QEMU improperly writes to
-the qxl memory regions prior to starting the guest, by mmap'ing them
-readonly after cpr:
-
-  qemu_ram_alloc_internal()
-    if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
-        ram_flags |= RAM_READONLY;
-    new_block = qemu_ram_alloc_from_fd(...)
-
-I have attached a draft fix; try it and let me know.
-My console window looks fine before and after cpr, using
--vnc $hostip:0 -vga qxl
-
-- Steve
-0001-hw-qxl-cpr-support-preliminary.patch
-Description:
-Text document
-
-On 3/4/25 9:05 PM, Steven Sistare wrote:
->
-On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
->
-> On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
->
->> On 2/28/25 8:20 PM, Steven Sistare wrote:
->
->>> On 2/28/2025 1:13 PM, Steven Sistare wrote:
->
->>>> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
->
->>>>> Hi all,
->
->>>>>
->
->>>>> We've been experimenting with cpr-transfer migration mode recently
->
->>>>> and
->
->>>>> have discovered the following issue with the guest QXL driver:
->
->>>>>
->
->>>>> Run migration source:
->
->>>>>> EMULATOR=/path/to/emulator
->
->>>>>> ROOTFS=/path/to/image
->
->>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
->
->>>>>>
->
->>>>>> $EMULATOR -enable-kvm \
->
->>>>>>       -machine q35 \
->
->>>>>>       -cpu host -smp 2 -m 2G \
->
->>>>>>       -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
->
->>>>>> ram0,share=on\
->
->>>>>>       -machine memory-backend=ram0 \
->
->>>>>>       -machine aux-ram-share=on \
->
->>>>>>       -drive file=$ROOTFS,media=disk,if=virtio \
->
->>>>>>       -qmp unix:$QMPSOCK,server=on,wait=off \
->
->>>>>>       -nographic \
->
->>>>>>       -device qxl-vga
->
->>>>>
->
->>>>> Run migration target:
->
->>>>>> EMULATOR=/path/to/emulator
->
->>>>>> ROOTFS=/path/to/image
->
->>>>>> QMPSOCK=/var/run/alma8qmp-dst.sock
->
->>>>>> $EMULATOR -enable-kvm \
->
->>>>>>       -machine q35 \
->
->>>>>>       -cpu host -smp 2 -m 2G \
->
->>>>>>       -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
->
->>>>>> ram0,share=on\
->
->>>>>>       -machine memory-backend=ram0 \
->
->>>>>>       -machine aux-ram-share=on \
->
->>>>>>       -drive file=$ROOTFS,media=disk,if=virtio \
->
->>>>>>       -qmp unix:$QMPSOCK,server=on,wait=off \
->
->>>>>>       -nographic \
->
->>>>>>       -device qxl-vga \
->
->>>>>>       -incoming tcp:0:44444 \
->
->>>>>>       -incoming '{"channel-type": "cpr", "addr": { "transport":
->
->>>>>> "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
->
->>>>>
->
->>>>>
->
->>>>> Launch the migration:
->
->>>>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
->
->>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
->
->>>>>>
->
->>>>>> $QMPSHELL -p $QMPSOCK <<EOF
->
->>>>>>       migrate-set-parameters mode=cpr-transfer
->
->>>>>>       migrate channels=[{"channel-type":"main","addr":
->
->>>>>> {"transport":"socket","type":"inet","host":"0","port":"44444"}},
->
->>>>>> {"channel-type":"cpr","addr":
->
->>>>>> {"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
->
->>>>>> dst.sock"}}]
->
->>>>>> EOF
->
->>>>>
->
->>>>> Then, after a while, QXL guest driver on target crashes spewing the
->
->>>>> following messages:
->
->>>>>> [   73.962002] [TTM] Buffer eviction failed
->
->>>>>> [   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
->
->>>>>> 0x00000001)
->
->>>>>> [   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
->
->>>>>> allocate VRAM BO
->
->>>>>
->
->>>>> That seems to be a known kernel QXL driver bug:
->
->>>>>
->
->>>>>
-https://lore.kernel.org/all/20220907094423.93581-1-
->
->>>>> min_halo@163.com/T/
->
->>>>>
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
->
->>>>>
->
->>>>> (the latter discussion contains that reproduce script which speeds up
->
->>>>> the crash in the guest):
->
->>>>>> #!/bin/bash
->
->>>>>>
->
->>>>>> chvt 3
->
->>>>>>
->
->>>>>> for j in $(seq 80); do
->
->>>>>>           echo "$(date) starting round $j"
->
->>>>>>           if [ "$(journalctl --boot | grep "failed to allocate VRAM
->
->>>>>> BO")" != "" ]; then
->
->>>>>>                   echo "bug was reproduced after $j tries"
->
->>>>>>                   exit 1
->
->>>>>>           fi
->
->>>>>>           for i in $(seq 100); do
->
->>>>>>                   dmesg > /dev/tty3
->
->>>>>>           done
->
->>>>>> done
->
->>>>>>
->
->>>>>> echo "bug could not be reproduced"
->
->>>>>> exit 0
->
->>>>>
->
->>>>> The bug itself seems to remain unfixed, as I was able to reproduce
->
->>>>> that
->
->>>>> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
->
->>>>> cpr-transfer code also seems to be buggy as it triggers the crash -
->
->>>>> without the cpr-transfer migration the above reproduce doesn't
->
->>>>> lead to
->
->>>>> crash on the source VM.
->
->>>>>
->
->>>>> I suspect that, as cpr-transfer doesn't migrate the guest memory, but
->
->>>>> rather passes it through the memory backend object, our code might
->
->>>>> somehow corrupt the VRAM.  However, I wasn't able to trace the
->
->>>>> corruption so far.
->
->>>>>
->
->>>>> Could somebody help the investigation and take a look into this?  Any
->
->>>>> suggestions would be appreciated.  Thanks!
->
->>>>
->
->>>> Possibly some memory region created by qxl is not being preserved.
->
->>>> Try adding these traces to see what is preserved:
->
->>>>
->
->>>> -trace enable='*cpr*'
->
->>>> -trace enable='*ram_alloc*'
->
->>>
->
->>> Also try adding this patch to see if it flags any ram blocks as not
->
->>> compatible with cpr.  A message is printed at migration start time.
->
->>>  Â
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
->
->>> email-
->
->>> steven.sistare@oracle.com/
->
->>>
->
->>> - Steve
->
->>>
->
->>
->
->> With the traces enabled + the "migration: ram block cpr blockers" patch
->
->> applied:
->
->>
->
->> Source:
->
->>> cpr_find_fd pc.bios, id 0 returns -1
->
->>> cpr_save_fd pc.bios, id 0, fd 22
->
->>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
->
->>> 0x7fec18e00000
->
->>> cpr_find_fd pc.rom, id 0 returns -1
->
->>> cpr_save_fd pc.rom, id 0, fd 23
->
->>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
->
->>> 0x7fec18c00000
->
->>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
->
->>> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
->
->>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
->
->>> 262144 fd 24 host 0x7fec18a00000
->
->>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
->
->>> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
->
->>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
->
->>> 67108864 fd 25 host 0x7feb77e00000
->
->>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
->
->>> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
->
->>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
->
->>> fd 27 host 0x7fec18800000
->
->>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
->
->>> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
->
->>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
->
->>> 67108864 fd 28 host 0x7feb73c00000
->
->>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
->
->>> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
->
->>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
->
->>> fd 34 host 0x7fec18600000
->
->>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
->
->>> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
->
->>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
->
->>> 2097152 fd 35 host 0x7fec18200000
->
->>> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
->
->>> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
->
->>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
->
->>> fd 36 host 0x7feb8b600000
->
->>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
->
->>> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
->
->>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
->
->>> 37 host 0x7feb8b400000
->
->>>
->
->>> cpr_state_save cpr-transfer mode
->
->>> cpr_transfer_output /var/run/alma8cpr-dst.sock
->
->>
->
->> Target:
->
->>> cpr_transfer_input /var/run/alma8cpr-dst.sock
->
->>> cpr_state_load cpr-transfer mode
->
->>> cpr_find_fd pc.bios, id 0 returns 20
->
->>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
->
->>> 0x7fcdc9800000
->
->>> cpr_find_fd pc.rom, id 0 returns 19
->
->>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
->
->>> 0x7fcdc9600000
->
->>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
->
->>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
->
->>> 262144 fd 18 host 0x7fcdc9400000
->
->>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
->
->>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
->
->>> 67108864 fd 17 host 0x7fcd27e00000
->
->>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
->
->>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
->
->>> fd 16 host 0x7fcdc9200000
->
->>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
->
->>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
->
->>> 67108864 fd 15 host 0x7fcd23c00000
->
->>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
->
->>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
->
->>> fd 14 host 0x7fcdc8800000
->
->>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
->
->>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
->
->>> 2097152 fd 13 host 0x7fcdc8400000
->
->>> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
->
->>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
->
->>> fd 11 host 0x7fcdc8200000
->
->>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
->
->>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
->
->>> 10 host 0x7fcd3be00000
->
->>
->
->> Looks like both vga.vram and qxl.vram are being preserved (with the same
->
->> addresses), and no incompatible ram blocks are found during migration.
->
->
->
-> Sorry, addressed are not the same, of course.  However corresponding ram
->
-> blocks do seem to be preserved and initialized.
->
->
-So far, I have not reproduced the guest driver failure.
->
->
-However, I have isolated places where new QEMU improperly writes to
->
-the qxl memory regions prior to starting the guest, by mmap'ing them
->
-readonly after cpr:
->
->
-  qemu_ram_alloc_internal()
->
-    if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
->
-        ram_flags |= RAM_READONLY;
->
-    new_block = qemu_ram_alloc_from_fd(...)
->
->
-I have attached a draft fix; try it and let me know.
->
-My console window looks fine before and after cpr, using
->
--vnc $hostip:0 -vga qxl
->
->
-- Steve
-Regarding the reproduce: when I launch the buggy version with the same
-options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
-my VNC client silently hangs on the target after a while.  Could it
-happen on your stand as well?  Could you try launching VM with
-"-nographic -device qxl-vga"?  That way VM's serial console is given you
-directly in the shell, so when qxl driver crashes you're still able to
-inspect the kernel messages.
-
-As for your patch, I can report that it doesn't resolve the issue as it
-is.  But I was able to track down another possible memory corruption
-using your approach with readonly mmap'ing:
-
->
-Program terminated with signal SIGSEGV, Segmentation fault.
->
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
->
-412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
->
-[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
->
-(gdb) bt
->
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
->
-#1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
->
-errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
->
-#2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
->
-errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
->
-#3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
->
-errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
->
-#4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
->
-value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
->
-#5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
->
-v=0x5638996f3770, name=0x56389759b141 "realized", opaque=0x5638987893d0,
->
-errp=0x7ffd3c2b84e0)
->
-at ../qom/object.c:2374
->
-#6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
->
-name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
->
-at ../qom/object.c:1449
->
-#7  0x00005638970f8586 in object_property_set_qobject (obj=0x5638996e0e70,
->
-name=0x56389759b141 "realized", value=0x5638996df900, errp=0x7ffd3c2b84e0)
->
-at ../qom/qom-qobject.c:28
->
-#8  0x00005638970f3d8d in object_property_set_bool (obj=0x5638996e0e70,
->
-name=0x56389759b141 "realized", value=true, errp=0x7ffd3c2b84e0)
->
-at ../qom/object.c:1519
->
-#9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
->
-bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
->
-#10 0x0000563896dba675 in qdev_device_add_from_qdict (opts=0x5638996dfe50,
->
-from_json=false, errp=0x7ffd3c2b84e0) at ../system/qdev-monitor.c:714
->
-#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
->
-errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
->
-#12 0x0000563896dc48f1 in device_init_func (opaque=0x0, opts=0x563898786150,
->
-errp=0x56389855dc40 <error_fatal>) at ../system/vl.c:1207
->
-#13 0x000056389737a6cc in qemu_opts_foreach
->
-(list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
->
-<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
->
-at ../util/qemu-option.c:1135
->
-#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/vl.c:2745
->
-#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
->
-<error_fatal>) at ../system/vl.c:2806
->
-#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948) at
->
-../system/vl.c:3838
->
-#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at
->
-../system/main.c:72
-So the attached adjusted version of your patch does seem to help.  At
-least I can't reproduce the crash on my stand.
-
-I'm wondering, could it be useful to explicitly mark all the reused
-memory regions readonly upon cpr-transfer, and then make them writable
-back again after the migration is done?  That way we will be segfaulting
-early on instead of debugging tricky memory corruptions.
-
-Andrey
-0001-hw-qxl-cpr-support-preliminary.patch
-Description:
-Text Data
-
-On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
-On 3/4/25 9:05 PM, Steven Sistare wrote:
-On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:20 PM, Steven Sistare wrote:
-On 2/28/2025 1:13 PM, Steven Sistare wrote:
-On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently
-and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$EMULATOR -enable-kvm \
-       -machine q35 \
-       -cpu host -smp 2 -m 2G \
-       -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
-ram0,share=on\
-       -machine memory-backend=ram0 \
-       -machine aux-ram-share=on \
-       -drive file=$ROOTFS,media=disk,if=virtio \
-       -qmp unix:$QMPSOCK,server=on,wait=off \
-       -nographic \
-       -device qxl-vga
-Run migration target:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-dst.sock
-$EMULATOR -enable-kvm \
-       -machine q35 \
-       -cpu host -smp 2 -m 2G \
-       -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
-ram0,share=on\
-       -machine memory-backend=ram0 \
-       -machine aux-ram-share=on \
-       -drive file=$ROOTFS,media=disk,if=virtio \
-       -qmp unix:$QMPSOCK,server=on,wait=off \
-       -nographic \
-       -device qxl-vga \
-       -incoming tcp:0:44444 \
-       -incoming '{"channel-type": "cpr", "addr": { "transport":
-"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$QMPSHELL -p $QMPSOCK <<EOF
-       migrate-set-parameters mode=cpr-transfer
-       migrate channels=[{"channel-type":"main","addr":
-{"transport":"socket","type":"inet","host":"0","port":"44444"}},
-{"channel-type":"cpr","addr":
-{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
-dst.sock"}}]
-EOF
-Then, after a while, QXL guest driver on target crashes spewing the
-following messages:
-[   73.962002] [TTM] Buffer eviction failed
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
-0x00000001)
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
-allocate VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-
-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which speeds up
-the crash in the guest):
-#!/bin/bash
-
-chvt 3
-
-for j in $(seq 80); do
-           echo "$(date) starting round $j"
-           if [ "$(journalctl --boot | grep "failed to allocate VRAM
-BO")" != "" ]; then
-                   echo "bug was reproduced after $j tries"
-                   exit 1
-           fi
-           for i in $(seq 100); do
-                   dmesg > /dev/tty3
-           done
-done
-
-echo "bug could not be reproduced"
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce
-that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the crash -
-without the cpr-transfer migration the above reproduce doesn't
-lead to
-crash on the source VM.
-
-I suspect that, as cpr-transfer doesn't migrate the guest memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-
-Could somebody help the investigation and take a look into this?  Any
-suggestions would be appreciated.  Thanks!
-Possibly some memory region created by qxl is not being preserved.
-Try adding these traces to see what is preserved:
-
--trace enable='*cpr*'
--trace enable='*ram_alloc*'
-Also try adding this patch to see if it flags any ram blocks as not
-compatible with cpr.  A message is printed at migration start time.
-  Â
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
-email-
-steven.sistare@oracle.com/
-
-- Steve
-With the traces enabled + the "migration: ram block cpr blockers" patch
-applied:
-
-Source:
-cpr_find_fd pc.bios, id 0 returns -1
-cpr_save_fd pc.bios, id 0, fd 22
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
-0x7fec18e00000
-cpr_find_fd pc.rom, id 0 returns -1
-cpr_save_fd pc.rom, id 0, fd 23
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
-0x7fec18c00000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
-cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 24 host 0x7fec18a00000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 25 host 0x7feb77e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 27 host 0x7fec18800000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 28 host 0x7feb73c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 34 host 0x7fec18600000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 35 host 0x7fec18200000
-cpr_find_fd /rom@etc/table-loader, id 0 returns -1
-cpr_save_fd /rom@etc/table-loader, id 0, fd 36
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 36 host 0x7feb8b600000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-37 host 0x7feb8b400000
-
-cpr_state_save cpr-transfer mode
-cpr_transfer_output /var/run/alma8cpr-dst.sock
-Target:
-cpr_transfer_input /var/run/alma8cpr-dst.sock
-cpr_state_load cpr-transfer mode
-cpr_find_fd pc.bios, id 0 returns 20
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
-0x7fcdc9800000
-cpr_find_fd pc.rom, id 0 returns 19
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
-0x7fcdc9600000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 18 host 0x7fcdc9400000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 17 host 0x7fcd27e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 16 host 0x7fcdc9200000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 15 host 0x7fcd23c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 14 host 0x7fcdc8800000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 13 host 0x7fcdc8400000
-cpr_find_fd /rom@etc/table-loader, id 0 returns 11
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 11 host 0x7fcdc8200000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-10 host 0x7fcd3be00000
-Looks like both vga.vram and qxl.vram are being preserved (with the same
-addresses), and no incompatible ram blocks are found during migration.
-Sorry, addressed are not the same, of course.  However corresponding ram
-blocks do seem to be preserved and initialized.
-So far, I have not reproduced the guest driver failure.
-
-However, I have isolated places where new QEMU improperly writes to
-the qxl memory regions prior to starting the guest, by mmap'ing them
-readonly after cpr:
-
-   qemu_ram_alloc_internal()
-     if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
-         ram_flags |= RAM_READONLY;
-     new_block = qemu_ram_alloc_from_fd(...)
-
-I have attached a draft fix; try it and let me know.
-My console window looks fine before and after cpr, using
--vnc $hostip:0 -vga qxl
-
-- Steve
-Regarding the reproduce: when I launch the buggy version with the same
-options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
-my VNC client silently hangs on the target after a while.  Could it
-happen on your stand as well?
-cpr does not preserve the vnc connection and session.  To test, I specify
-port 0 for the source VM and port 1 for the dest.  When the src vnc goes
-dormant the dest vnc becomes active.
-Could you try launching VM with
-"-nographic -device qxl-vga"?  That way VM's serial console is given you
-directly in the shell, so when qxl driver crashes you're still able to
-inspect the kernel messages.
-I have been running like that, but have not reproduced the qxl driver crash,
-and I suspect my guest image+kernel is too old.  However, once I realized the
-issue was post-cpr modification of qxl memory, I switched my attention to the
-fix.
-As for your patch, I can report that it doesn't resolve the issue as it
-is.  But I was able to track down another possible memory corruption
-using your approach with readonly mmap'ing:
-Program terminated with signal SIGSEGV, Segmentation fault.
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
-[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
-(gdb) bt
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-#1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70, 
-errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
-#2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70, 
-errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
-#3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70, 
-errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
-#4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70, value=true, 
-errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
-#5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70, v=0x5638996f3770, 
-name=0x56389759b141 "realized", opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
-     at ../qom/object.c:2374
-#6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70, name=0x56389759b141 
-"realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
-     at ../qom/object.c:1449
-#7  0x00005638970f8586 in object_property_set_qobject (obj=0x5638996e0e70, 
-name=0x56389759b141 "realized", value=0x5638996df900, errp=0x7ffd3c2b84e0)
-     at ../qom/qom-qobject.c:28
-#8  0x00005638970f3d8d in object_property_set_bool (obj=0x5638996e0e70, 
-name=0x56389759b141 "realized", value=true, errp=0x7ffd3c2b84e0)
-     at ../qom/object.c:1519
-#9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70, bus=0x563898cf3c20, 
-errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
-#10 0x0000563896dba675 in qdev_device_add_from_qdict (opts=0x5638996dfe50, 
-from_json=false, errp=0x7ffd3c2b84e0) at ../system/qdev-monitor.c:714
-#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150, errp=0x56389855dc40 
-<error_fatal>) at ../system/qdev-monitor.c:733
-#12 0x0000563896dc48f1 in device_init_func (opaque=0x0, opts=0x563898786150, 
-errp=0x56389855dc40 <error_fatal>) at ../system/vl.c:1207
-#13 0x000056389737a6cc in qemu_opts_foreach
-     (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca <device_init_func>, 
-opaque=0x0, errp=0x56389855dc40 <error_fatal>)
-     at ../util/qemu-option.c:1135
-#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/vl.c:2745
-#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40 
-<error_fatal>) at ../system/vl.c:2806
-#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948) at 
-../system/vl.c:3838
-#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at 
-../system/main.c:72
-So the attached adjusted version of your patch does seem to help.  At
-least I can't reproduce the crash on my stand.
-Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram are
-definitely harmful.  Try V2 of the patch, attached, which skips the lines
-of init_qxl_ram that modify guest memory.
-I'm wondering, could it be useful to explicitly mark all the reused
-memory regions readonly upon cpr-transfer, and then make them writable
-back again after the migration is done?  That way we will be segfaulting
-early on instead of debugging tricky memory corruptions.
-It's a useful debugging technique, but changing protection on a large memory 
-region
-can be too expensive for production due to TLB shootdowns.
-
-Also, there are cases where writes are performed but the value is guaranteed to
-be the same:
-  qxl_post_load()
-    qxl_set_mode()
-      d->rom->mode = cpu_to_le32(modenr);
-The value is the same because mode and shadow_rom.mode were passed in vmstate
-from old qemu.
-
-- Steve
-0001-hw-qxl-cpr-support-preliminary-V2.patch
-Description:
-Text document
-
-On 3/5/25 22:19, Steven Sistare wrote:
-On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
-On 3/4/25 9:05 PM, Steven Sistare wrote:
-On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:20 PM, Steven Sistare wrote:
-On 2/28/2025 1:13 PM, Steven Sistare wrote:
-On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently
-and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$EMULATOR -enable-kvm \
-       -machine q35 \
-       -cpu host -smp 2 -m 2G \
-       -object
-memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
-ram0,share=on\
-       -machine memory-backend=ram0 \
-       -machine aux-ram-share=on \
-       -drive file=$ROOTFS,media=disk,if=virtio \
-       -qmp unix:$QMPSOCK,server=on,wait=off \
-       -nographic \
-       -device qxl-vga
-Run migration target:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-dst.sock
-$EMULATOR -enable-kvm \
-       -machine q35 \
-       -cpu host -smp 2 -m 2G \
-       -object
-memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
-ram0,share=on\
-       -machine memory-backend=ram0 \
-       -machine aux-ram-share=on \
-       -drive file=$ROOTFS,media=disk,if=virtio \
-       -qmp unix:$QMPSOCK,server=on,wait=off \
-       -nographic \
-       -device qxl-vga \
-       -incoming tcp:0:44444 \
-       -incoming '{"channel-type": "cpr", "addr": { "transport":
-"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$QMPSHELL -p $QMPSOCK <<EOF
-       migrate-set-parameters mode=cpr-transfer
-       migrate channels=[{"channel-type":"main","addr":
-{"transport":"socket","type":"inet","host":"0","port":"44444"}},
-{"channel-type":"cpr","addr":
-{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
-dst.sock"}}]
-EOF
-Then, after a while, QXL guest driver on target crashes spewing
-the
-following messages:
-[   73.962002] [TTM] Buffer eviction failed
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
-0x00000001)
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR*
-failed to
-allocate VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-
-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which
-speeds up
-the crash in the guest):
-#!/bin/bash
-
-chvt 3
-
-for j in $(seq 80); do
-           echo "$(date) starting round $j"
-           if [ "$(journalctl --boot | grep "failed to
-allocate VRAM
-BO")" != "" ]; then
-                   echo "bug was reproduced after $j tries"
-                   exit 1
-           fi
-           for i in $(seq 100); do
-                   dmesg > /dev/tty3
-           done
-done
-
-echo "bug could not be reproduced"
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce
-that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the
-crash -
-without the cpr-transfer migration the above reproduce doesn't
-lead to
-crash on the source VM.
-I suspect that, as cpr-transfer doesn't migrate the guest
-memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-Could somebody help the investigation and take a look into
-this?  Any
-suggestions would be appreciated.  Thanks!
-Possibly some memory region created by qxl is not being preserved.
-Try adding these traces to see what is preserved:
-
--trace enable='*cpr*'
--trace enable='*ram_alloc*'
-Also try adding this patch to see if it flags any ram blocks as not
-compatible with cpr.  A message is printed at migration start time.
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
-email-
-steven.sistare@oracle.com/
-
-- Steve
-With the traces enabled + the "migration: ram block cpr blockers"
-patch
-applied:
-
-Source:
-cpr_find_fd pc.bios, id 0 returns -1
-cpr_save_fd pc.bios, id 0, fd 22
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
-0x7fec18e00000
-cpr_find_fd pc.rom, id 0 returns -1
-cpr_save_fd pc.rom, id 0, fd 23
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
-0x7fec18c00000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
-cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 24 host 0x7fec18a00000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 25 host 0x7feb77e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 27 host 0x7fec18800000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 28 host 0x7feb73c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 34 host 0x7fec18600000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 35 host 0x7fec18200000
-cpr_find_fd /rom@etc/table-loader, id 0 returns -1
-cpr_save_fd /rom@etc/table-loader, id 0, fd 36
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 36 host 0x7feb8b600000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-37 host 0x7feb8b400000
-
-cpr_state_save cpr-transfer mode
-cpr_transfer_output /var/run/alma8cpr-dst.sock
-Target:
-cpr_transfer_input /var/run/alma8cpr-dst.sock
-cpr_state_load cpr-transfer mode
-cpr_find_fd pc.bios, id 0 returns 20
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
-0x7fcdc9800000
-cpr_find_fd pc.rom, id 0 returns 19
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
-0x7fcdc9600000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 18 host 0x7fcdc9400000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 17 host 0x7fcd27e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 16 host 0x7fcdc9200000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 15 host 0x7fcd23c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 14 host 0x7fcdc8800000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 13 host 0x7fcdc8400000
-cpr_find_fd /rom@etc/table-loader, id 0 returns 11
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 11 host 0x7fcdc8200000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-10 host 0x7fcd3be00000
-Looks like both vga.vram and qxl.vram are being preserved (with
-the same
-addresses), and no incompatible ram blocks are found during
-migration.
-Sorry, addressed are not the same, of course.  However
-corresponding ram
-blocks do seem to be preserved and initialized.
-So far, I have not reproduced the guest driver failure.
-
-However, I have isolated places where new QEMU improperly writes to
-the qxl memory regions prior to starting the guest, by mmap'ing them
-readonly after cpr:
-
-   qemu_ram_alloc_internal()
-     if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
-         ram_flags |= RAM_READONLY;
-     new_block = qemu_ram_alloc_from_fd(...)
-
-I have attached a draft fix; try it and let me know.
-My console window looks fine before and after cpr, using
--vnc $hostip:0 -vga qxl
-
-- Steve
-Regarding the reproduce: when I launch the buggy version with the same
-options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
-my VNC client silently hangs on the target after a while.  Could it
-happen on your stand as well?
-cpr does not preserve the vnc connection and session.  To test, I specify
-port 0 for the source VM and port 1 for the dest.  When the src vnc goes
-dormant the dest vnc becomes active.
-Could you try launching VM with
-"-nographic -device qxl-vga"?  That way VM's serial console is given you
-directly in the shell, so when qxl driver crashes you're still able to
-inspect the kernel messages.
-I have been running like that, but have not reproduced the qxl driver
-crash,
-and I suspect my guest image+kernel is too old.  However, once I
-realized the
-issue was post-cpr modification of qxl memory, I switched my attention
-to the
-fix.
-As for your patch, I can report that it doesn't resolve the issue as it
-is.  But I was able to track down another possible memory corruption
-using your approach with readonly mmap'ing:
-Program terminated with signal SIGSEGV, Segmentation fault.
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
-[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
-(gdb) bt
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-#1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
-errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
-#2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
-errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
-#3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
-errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
-#4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
-value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
-#5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
-v=0x5638996f3770, name=0x56389759b141 "realized",
-opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
-     at ../qom/object.c:2374
-#6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
-name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
-     at ../qom/object.c:1449
-#7  0x00005638970f8586 in object_property_set_qobject
-(obj=0x5638996e0e70, name=0x56389759b141 "realized",
-value=0x5638996df900, errp=0x7ffd3c2b84e0)
-     at ../qom/qom-qobject.c:28
-#8  0x00005638970f3d8d in object_property_set_bool
-(obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
-errp=0x7ffd3c2b84e0)
-     at ../qom/object.c:1519
-#9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
-bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
-#10 0x0000563896dba675 in qdev_device_add_from_qdict
-(opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at
-../system/qdev-monitor.c:714
-#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
-errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
-#12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
-opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at
-../system/vl.c:1207
-#13 0x000056389737a6cc in qemu_opts_foreach
-     (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
-<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
-     at ../util/qemu-option.c:1135
-#14 0x0000563896dc89b5 in qemu_create_cli_devices () at
-../system/vl.c:2745
-#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
-<error_fatal>) at ../system/vl.c:2806
-#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
-at ../system/vl.c:3838
-#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at
-../system/main.c:72
-So the attached adjusted version of your patch does seem to help.  At
-least I can't reproduce the crash on my stand.
-Thanks for the stack trace; the calls to SPICE_RING_INIT in
-init_qxl_ram are
-definitely harmful.  Try V2 of the patch, attached, which skips the lines
-of init_qxl_ram that modify guest memory.
-I'm wondering, could it be useful to explicitly mark all the reused
-memory regions readonly upon cpr-transfer, and then make them writable
-back again after the migration is done?  That way we will be segfaulting
-early on instead of debugging tricky memory corruptions.
-It's a useful debugging technique, but changing protection on a large
-memory region
-can be too expensive for production due to TLB shootdowns.
-Good point. Though we could move this code under non-default option to
-avoid re-writing.
-
-Den
-
-On 3/5/25 11:19 PM, Steven Sistare wrote:
->
-On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
->
-> On 3/4/25 9:05 PM, Steven Sistare wrote:
->
->> On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
->
->>> On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
->
->>>> On 2/28/25 8:20 PM, Steven Sistare wrote:
->
->>>>> On 2/28/2025 1:13 PM, Steven Sistare wrote:
->
->>>>>> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
->
->>>>>>> Hi all,
->
->>>>>>>
->
->>>>>>> We've been experimenting with cpr-transfer migration mode recently
->
->>>>>>> and
->
->>>>>>> have discovered the following issue with the guest QXL driver:
->
->>>>>>>
->
->>>>>>> Run migration source:
->
->>>>>>>> EMULATOR=/path/to/emulator
->
->>>>>>>> ROOTFS=/path/to/image
->
->>>>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
->
->>>>>>>>
->
->>>>>>>> $EMULATOR -enable-kvm \
->
->>>>>>>>        -machine q35 \
->
->>>>>>>>        -cpu host -smp 2 -m 2G \
->
->>>>>>>>        -object memory-backend-file,id=ram0,size=2G,mem-path=/
->
->>>>>>>> dev/shm/
->
->>>>>>>> ram0,share=on\
->
->>>>>>>>        -machine memory-backend=ram0 \
->
->>>>>>>>        -machine aux-ram-share=on \
->
->>>>>>>>        -drive file=$ROOTFS,media=disk,if=virtio \
->
->>>>>>>>        -qmp unix:$QMPSOCK,server=on,wait=off \
->
->>>>>>>>        -nographic \
->
->>>>>>>>        -device qxl-vga
->
->>>>>>>
->
->>>>>>> Run migration target:
->
->>>>>>>> EMULATOR=/path/to/emulator
->
->>>>>>>> ROOTFS=/path/to/image
->
->>>>>>>> QMPSOCK=/var/run/alma8qmp-dst.sock
->
->>>>>>>> $EMULATOR -enable-kvm \
->
->>>>>>>>        -machine q35 \
->
->>>>>>>>        -cpu host -smp 2 -m 2G \
->
->>>>>>>>        -object memory-backend-file,id=ram0,size=2G,mem-path=/
->
->>>>>>>> dev/shm/
->
->>>>>>>> ram0,share=on\
->
->>>>>>>>        -machine memory-backend=ram0 \
->
->>>>>>>>        -machine aux-ram-share=on \
->
->>>>>>>>        -drive file=$ROOTFS,media=disk,if=virtio \
->
->>>>>>>>        -qmp unix:$QMPSOCK,server=on,wait=off \
->
->>>>>>>>        -nographic \
->
->>>>>>>>        -device qxl-vga \
->
->>>>>>>>        -incoming tcp:0:44444 \
->
->>>>>>>>        -incoming '{"channel-type": "cpr", "addr": { "transport":
->
->>>>>>>> "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
->
->>>>>>>
->
->>>>>>>
->
->>>>>>> Launch the migration:
->
->>>>>>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
->
->>>>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
->
->>>>>>>>
->
->>>>>>>> $QMPSHELL -p $QMPSOCK <<EOF
->
->>>>>>>>        migrate-set-parameters mode=cpr-transfer
->
->>>>>>>>        migrate channels=[{"channel-type":"main","addr":
->
->>>>>>>> {"transport":"socket","type":"inet","host":"0","port":"44444"}},
->
->>>>>>>> {"channel-type":"cpr","addr":
->
->>>>>>>> {"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
->
->>>>>>>> dst.sock"}}]
->
->>>>>>>> EOF
->
->>>>>>>
->
->>>>>>> Then, after a while, QXL guest driver on target crashes spewing the
->
->>>>>>> following messages:
->
->>>>>>>> [   73.962002] [TTM] Buffer eviction failed
->
->>>>>>>> [   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
->
->>>>>>>> 0x00000001)
->
->>>>>>>> [   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
->
->>>>>>>> allocate VRAM BO
->
->>>>>>>
->
->>>>>>> That seems to be a known kernel QXL driver bug:
->
->>>>>>>
->
->>>>>>>
-https://lore.kernel.org/all/20220907094423.93581-1-
->
->>>>>>> min_halo@163.com/T/
->
->>>>>>>
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
->
->>>>>>>
->
->>>>>>> (the latter discussion contains that reproduce script which
->
->>>>>>> speeds up
->
->>>>>>> the crash in the guest):
->
->>>>>>>> #!/bin/bash
->
->>>>>>>>
->
->>>>>>>> chvt 3
->
->>>>>>>>
->
->>>>>>>> for j in $(seq 80); do
->
->>>>>>>>            echo "$(date) starting round $j"
->
->>>>>>>>            if [ "$(journalctl --boot | grep "failed to allocate
->
->>>>>>>> VRAM
->
->>>>>>>> BO")" != "" ]; then
->
->>>>>>>>                    echo "bug was reproduced after $j tries"
->
->>>>>>>>                    exit 1
->
->>>>>>>>            fi
->
->>>>>>>>            for i in $(seq 100); do
->
->>>>>>>>                    dmesg > /dev/tty3
->
->>>>>>>>            done
->
->>>>>>>> done
->
->>>>>>>>
->
->>>>>>>> echo "bug could not be reproduced"
->
->>>>>>>> exit 0
->
->>>>>>>
->
->>>>>>> The bug itself seems to remain unfixed, as I was able to reproduce
->
->>>>>>> that
->
->>>>>>> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
->
->>>>>>> cpr-transfer code also seems to be buggy as it triggers the crash -
->
->>>>>>> without the cpr-transfer migration the above reproduce doesn't
->
->>>>>>> lead to
->
->>>>>>> crash on the source VM.
->
->>>>>>>
->
->>>>>>> I suspect that, as cpr-transfer doesn't migrate the guest
->
->>>>>>> memory, but
->
->>>>>>> rather passes it through the memory backend object, our code might
->
->>>>>>> somehow corrupt the VRAM.  However, I wasn't able to trace the
->
->>>>>>> corruption so far.
->
->>>>>>>
->
->>>>>>> Could somebody help the investigation and take a look into
->
->>>>>>> this?  Any
->
->>>>>>> suggestions would be appreciated.  Thanks!
->
->>>>>>
->
->>>>>> Possibly some memory region created by qxl is not being preserved.
->
->>>>>> Try adding these traces to see what is preserved:
->
->>>>>>
->
->>>>>> -trace enable='*cpr*'
->
->>>>>> -trace enable='*ram_alloc*'
->
->>>>>
->
->>>>> Also try adding this patch to see if it flags any ram blocks as not
->
->>>>> compatible with cpr.  A message is printed at migration start time.
->
->>>>>   Â
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
->
->>>>> email-
->
->>>>> steven.sistare@oracle.com/
->
->>>>>
->
->>>>> - Steve
->
->>>>>
->
->>>>
->
->>>> With the traces enabled + the "migration: ram block cpr blockers"
->
->>>> patch
->
->>>> applied:
->
->>>>
->
->>>> Source:
->
->>>>> cpr_find_fd pc.bios, id 0 returns -1
->
->>>>> cpr_save_fd pc.bios, id 0, fd 22
->
->>>>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
->
->>>>> 0x7fec18e00000
->
->>>>> cpr_find_fd pc.rom, id 0 returns -1
->
->>>>> cpr_save_fd pc.rom, id 0, fd 23
->
->>>>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
->
->>>>> 0x7fec18c00000
->
->>>>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
->
->>>>> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
->
->>>>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
->
->>>>> 262144 fd 24 host 0x7fec18a00000
->
->>>>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
->
->>>>> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
->
->>>>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
->
->>>>> 67108864 fd 25 host 0x7feb77e00000
->
->>>>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
->
->>>>> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
->
->>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
->
->>>>> fd 27 host 0x7fec18800000
->
->>>>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
->
->>>>> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
->
->>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
->
->>>>> 67108864 fd 28 host 0x7feb73c00000
->
->>>>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
->
->>>>> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
->
->>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
->
->>>>> fd 34 host 0x7fec18600000
->
->>>>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
->
->>>>> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
->
->>>>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
->
->>>>> 2097152 fd 35 host 0x7fec18200000
->
->>>>> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
->
->>>>> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
->
->>>>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
->
->>>>> fd 36 host 0x7feb8b600000
->
->>>>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
->
->>>>> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
->
->>>>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
->
->>>>> 37 host 0x7feb8b400000
->
->>>>>
->
->>>>> cpr_state_save cpr-transfer mode
->
->>>>> cpr_transfer_output /var/run/alma8cpr-dst.sock
->
->>>>
->
->>>> Target:
->
->>>>> cpr_transfer_input /var/run/alma8cpr-dst.sock
->
->>>>> cpr_state_load cpr-transfer mode
->
->>>>> cpr_find_fd pc.bios, id 0 returns 20
->
->>>>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
->
->>>>> 0x7fcdc9800000
->
->>>>> cpr_find_fd pc.rom, id 0 returns 19
->
->>>>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
->
->>>>> 0x7fcdc9600000
->
->>>>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
->
->>>>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
->
->>>>> 262144 fd 18 host 0x7fcdc9400000
->
->>>>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
->
->>>>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
->
->>>>> 67108864 fd 17 host 0x7fcd27e00000
->
->>>>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
->
->>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
->
->>>>> fd 16 host 0x7fcdc9200000
->
->>>>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
->
->>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
->
->>>>> 67108864 fd 15 host 0x7fcd23c00000
->
->>>>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
->
->>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
->
->>>>> fd 14 host 0x7fcdc8800000
->
->>>>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
->
->>>>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
->
->>>>> 2097152 fd 13 host 0x7fcdc8400000
->
->>>>> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
->
->>>>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
->
->>>>> fd 11 host 0x7fcdc8200000
->
->>>>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
->
->>>>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
->
->>>>> 10 host 0x7fcd3be00000
->
->>>>
->
->>>> Looks like both vga.vram and qxl.vram are being preserved (with the
->
->>>> same
->
->>>> addresses), and no incompatible ram blocks are found during migration.
->
->>>
->
->>> Sorry, addressed are not the same, of course.  However corresponding
->
->>> ram
->
->>> blocks do seem to be preserved and initialized.
->
->>
->
->> So far, I have not reproduced the guest driver failure.
->
->>
->
->> However, I have isolated places where new QEMU improperly writes to
->
->> the qxl memory regions prior to starting the guest, by mmap'ing them
->
->> readonly after cpr:
->
->>
->
->>    qemu_ram_alloc_internal()
->
->>      if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
->
->>          ram_flags |= RAM_READONLY;
->
->>      new_block = qemu_ram_alloc_from_fd(...)
->
->>
->
->> I have attached a draft fix; try it and let me know.
->
->> My console window looks fine before and after cpr, using
->
->> -vnc $hostip:0 -vga qxl
->
->>
->
->> - Steve
->
->
->
-> Regarding the reproduce: when I launch the buggy version with the same
->
-> options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
->
-> my VNC client silently hangs on the target after a while.  Could it
->
-> happen on your stand as well?Â
->
->
-cpr does not preserve the vnc connection and session.  To test, I specify
->
-port 0 for the source VM and port 1 for the dest.  When the src vnc goes
->
-dormant the dest vnc becomes active.
->
-Sure, I meant that VNC on the dest (on the port 1) works for a while
-after the migration and then hangs, apparently after the guest QXL crash.
-
->
-> Could you try launching VM with
->
-> "-nographic -device qxl-vga"?  That way VM's serial console is given you
->
-> directly in the shell, so when qxl driver crashes you're still able to
->
-> inspect the kernel messages.
->
->
-I have been running like that, but have not reproduced the qxl driver
->
-crash,
->
-and I suspect my guest image+kernel is too old.
-Yes, that's probably the case.  But the crash occurs on my Fedora 41
-guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
-be buggy.
-
-
->
-However, once I realized the
->
-issue was post-cpr modification of qxl memory, I switched my attention
->
-to the
->
-fix.
->
->
-> As for your patch, I can report that it doesn't resolve the issue as it
->
-> is.  But I was able to track down another possible memory corruption
->
-> using your approach with readonly mmap'ing:
->
->
->
->> Program terminated with signal SIGSEGV, Segmentation fault.
->
->> #0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
->
->> 412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
->
->> [Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
->
->> (gdb) bt
->
->> #0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
->
->> #1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
->
->> errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
->
->> #2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
->
->> errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
->
->> #3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
->
->> errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
->
->> #4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
->
->> value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
->
->> #5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
->
->> v=0x5638996f3770, name=0x56389759b141 "realized",
->
->> opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
->
->>      at ../qom/object.c:2374
->
->> #6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
->
->> name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
->
->>      at ../qom/object.c:1449
->
->> #7  0x00005638970f8586 in object_property_set_qobject
->
->> (obj=0x5638996e0e70, name=0x56389759b141 "realized",
->
->> value=0x5638996df900, errp=0x7ffd3c2b84e0)
->
->>      at ../qom/qom-qobject.c:28
->
->> #8  0x00005638970f3d8d in object_property_set_bool
->
->> (obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
->
->> errp=0x7ffd3c2b84e0)
->
->>      at ../qom/object.c:1519
->
->> #9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
->
->> bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
->
->> #10 0x0000563896dba675 in qdev_device_add_from_qdict
->
->> (opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../
->
->> system/qdev-monitor.c:714
->
->> #11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
->
->> errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
->
->> #12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
->
->> opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/
->
->> vl.c:1207
->
->> #13 0x000056389737a6cc in qemu_opts_foreach
->
->>      (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
->
->> <device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
->
->>      at ../util/qemu-option.c:1135
->
->> #14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/
->
->> vl.c:2745
->
->> #15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
->
->> <error_fatal>) at ../system/vl.c:2806
->
->> #16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
->
->> at ../system/vl.c:3838
->
->> #17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../
->
->> system/main.c:72
->
->
->
-> So the attached adjusted version of your patch does seem to help.  At
->
-> least I can't reproduce the crash on my stand.
->
->
-Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
->
-are
->
-definitely harmful.  Try V2 of the patch, attached, which skips the lines
->
-of init_qxl_ram that modify guest memory.
->
-Thanks, your v2 patch does seem to prevent the crash.  Would you re-send
-it to the list as a proper fix?
-
->
-> I'm wondering, could it be useful to explicitly mark all the reused
->
-> memory regions readonly upon cpr-transfer, and then make them writable
->
-> back again after the migration is done?  That way we will be segfaulting
->
-> early on instead of debugging tricky memory corruptions.
->
->
-It's a useful debugging technique, but changing protection on a large
->
-memory region
->
-can be too expensive for production due to TLB shootdowns.
->
->
-Also, there are cases where writes are performed but the value is
->
-guaranteed to
->
-be the same:
->
-  qxl_post_load()
->
-    qxl_set_mode()
->
-      d->rom->mode = cpu_to_le32(modenr);
->
-The value is the same because mode and shadow_rom.mode were passed in
->
-vmstate
->
-from old qemu.
->
-There're also cases where devices' ROM might be re-initialized.  E.g.
-this segfault occures upon further exploration of RO mapped RAM blocks:
-
->
-Program terminated with signal SIGSEGV, Segmentation fault.
->
-#0  __memmove_avx_unaligned_erms () at
->
-../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
->
-664             rep     movsb
->
-[Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
->
-(gdb) bt
->
-#0  __memmove_avx_unaligned_erms () at
->
-../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
->
-#1  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380,
->
-owner=0x55aa2019ac10, name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true)
->
-at ../hw/core/loader.c:1032
->
-#2  0x000055aa1d031577 in rom_add_blob
->
-(name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072,
->
-max_len=2097152, addr=18446744073709551615, fw_file_name=0x55aa1da51f13
->
-"etc/acpi/tables", fw_callback=0x55aa1d441f59 <acpi_build_update>,
->
-callback_opaque=0x55aa20ff0010, as=0x0, read_only=true) at
->
-../hw/core/loader.c:1147
->
-#3  0x000055aa1cfd788d in acpi_add_rom_blob
->
-(update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010,
->
-blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at
->
-../hw/acpi/utils.c:46
->
-#4  0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
->
-#5  0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0)
->
-at ../hw/i386/pc.c:638
->
-#6  0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10
->
-<machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
->
-#7  0x000055aa1d039ee5 in qdev_machine_creation_done () at
->
-../hw/core/machine.c:1749
->
-#8  0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40
->
-<error_fatal>) at ../system/vl.c:2779
->
-#9  0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40
->
-<error_fatal>) at ../system/vl.c:2807
->
-#10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at
->
-../system/vl.c:3838
->
-#11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at
->
-../system/main.c:72
-I'm not sure whether ACPI tables ROM in particular is rewritten with the
-same content, but there might be cases where ROM can be read from file
-system upon initialization.  That is undesirable as guest kernel
-certainly won't be too happy about sudden change of the device's ROM
-content.
-
-So the issue we're dealing with here is any unwanted memory related
-device initialization upon cpr.
-
-For now the only thing that comes to my mind is to make a test where we
-put as many devices as we can into a VM, make ram blocks RO upon cpr
-(and remap them as RW later after migration is done, if needed), and
-catch any unwanted memory violations.  As Den suggested, we might
-consider adding that behaviour as a separate non-default option (or
-"migrate" command flag specific to cpr-transfer), which would only be
-used in the testing.
-
-Andrey
-
-On 3/6/25 16:16, Andrey Drobyshev wrote:
-On 3/5/25 11:19 PM, Steven Sistare wrote:
-On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
-On 3/4/25 9:05 PM, Steven Sistare wrote:
-On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:20 PM, Steven Sistare wrote:
-On 2/28/2025 1:13 PM, Steven Sistare wrote:
-On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently
-and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$EMULATOR -enable-kvm \
-        -machine q35 \
-        -cpu host -smp 2 -m 2G \
-        -object memory-backend-file,id=ram0,size=2G,mem-path=/
-dev/shm/
-ram0,share=on\
-        -machine memory-backend=ram0 \
-        -machine aux-ram-share=on \
-        -drive file=$ROOTFS,media=disk,if=virtio \
-        -qmp unix:$QMPSOCK,server=on,wait=off \
-        -nographic \
-        -device qxl-vga
-Run migration target:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-dst.sock
-$EMULATOR -enable-kvm \
-        -machine q35 \
-        -cpu host -smp 2 -m 2G \
-        -object memory-backend-file,id=ram0,size=2G,mem-path=/
-dev/shm/
-ram0,share=on\
-        -machine memory-backend=ram0 \
-        -machine aux-ram-share=on \
-        -drive file=$ROOTFS,media=disk,if=virtio \
-        -qmp unix:$QMPSOCK,server=on,wait=off \
-        -nographic \
-        -device qxl-vga \
-        -incoming tcp:0:44444 \
-        -incoming '{"channel-type": "cpr", "addr": { "transport":
-"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$QMPSHELL -p $QMPSOCK <<EOF
-        migrate-set-parameters mode=cpr-transfer
-        migrate channels=[{"channel-type":"main","addr":
-{"transport":"socket","type":"inet","host":"0","port":"44444"}},
-{"channel-type":"cpr","addr":
-{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
-dst.sock"}}]
-EOF
-Then, after a while, QXL guest driver on target crashes spewing the
-following messages:
-[   73.962002] [TTM] Buffer eviction failed
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
-0x00000001)
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
-allocate VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-
-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which
-speeds up
-the crash in the guest):
-#!/bin/bash
-
-chvt 3
-
-for j in $(seq 80); do
-            echo "$(date) starting round $j"
-            if [ "$(journalctl --boot | grep "failed to allocate
-VRAM
-BO")" != "" ]; then
-                    echo "bug was reproduced after $j tries"
-                    exit 1
-            fi
-            for i in $(seq 100); do
-                    dmesg > /dev/tty3
-            done
-done
-
-echo "bug could not be reproduced"
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce
-that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the crash -
-without the cpr-transfer migration the above reproduce doesn't
-lead to
-crash on the source VM.
-
-I suspect that, as cpr-transfer doesn't migrate the guest
-memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-
-Could somebody help the investigation and take a look into
-this?  Any
-suggestions would be appreciated.  Thanks!
-Possibly some memory region created by qxl is not being preserved.
-Try adding these traces to see what is preserved:
-
--trace enable='*cpr*'
--trace enable='*ram_alloc*'
-Also try adding this patch to see if it flags any ram blocks as not
-compatible with cpr.  A message is printed at migration start time.
-   Â
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
-email-
-steven.sistare@oracle.com/
-
-- Steve
-With the traces enabled + the "migration: ram block cpr blockers"
-patch
-applied:
-
-Source:
-cpr_find_fd pc.bios, id 0 returns -1
-cpr_save_fd pc.bios, id 0, fd 22
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
-0x7fec18e00000
-cpr_find_fd pc.rom, id 0 returns -1
-cpr_save_fd pc.rom, id 0, fd 23
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
-0x7fec18c00000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
-cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 24 host 0x7fec18a00000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 25 host 0x7feb77e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 27 host 0x7fec18800000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 28 host 0x7feb73c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 34 host 0x7fec18600000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 35 host 0x7fec18200000
-cpr_find_fd /rom@etc/table-loader, id 0 returns -1
-cpr_save_fd /rom@etc/table-loader, id 0, fd 36
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 36 host 0x7feb8b600000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-37 host 0x7feb8b400000
-
-cpr_state_save cpr-transfer mode
-cpr_transfer_output /var/run/alma8cpr-dst.sock
-Target:
-cpr_transfer_input /var/run/alma8cpr-dst.sock
-cpr_state_load cpr-transfer mode
-cpr_find_fd pc.bios, id 0 returns 20
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
-0x7fcdc9800000
-cpr_find_fd pc.rom, id 0 returns 19
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
-0x7fcdc9600000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 18 host 0x7fcdc9400000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 17 host 0x7fcd27e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 16 host 0x7fcdc9200000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 15 host 0x7fcd23c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 14 host 0x7fcdc8800000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 13 host 0x7fcdc8400000
-cpr_find_fd /rom@etc/table-loader, id 0 returns 11
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 11 host 0x7fcdc8200000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-10 host 0x7fcd3be00000
-Looks like both vga.vram and qxl.vram are being preserved (with the
-same
-addresses), and no incompatible ram blocks are found during migration.
-Sorry, addressed are not the same, of course.  However corresponding
-ram
-blocks do seem to be preserved and initialized.
-So far, I have not reproduced the guest driver failure.
-
-However, I have isolated places where new QEMU improperly writes to
-the qxl memory regions prior to starting the guest, by mmap'ing them
-readonly after cpr:
-
-    qemu_ram_alloc_internal()
-      if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
-          ram_flags |= RAM_READONLY;
-      new_block = qemu_ram_alloc_from_fd(...)
-
-I have attached a draft fix; try it and let me know.
-My console window looks fine before and after cpr, using
--vnc $hostip:0 -vga qxl
-
-- Steve
-Regarding the reproduce: when I launch the buggy version with the same
-options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
-my VNC client silently hangs on the target after a while.  Could it
-happen on your stand as well?
-cpr does not preserve the vnc connection and session.  To test, I specify
-port 0 for the source VM and port 1 for the dest.  When the src vnc goes
-dormant the dest vnc becomes active.
-Sure, I meant that VNC on the dest (on the port 1) works for a while
-after the migration and then hangs, apparently after the guest QXL crash.
-Could you try launching VM with
-"-nographic -device qxl-vga"?  That way VM's serial console is given you
-directly in the shell, so when qxl driver crashes you're still able to
-inspect the kernel messages.
-I have been running like that, but have not reproduced the qxl driver
-crash,
-and I suspect my guest image+kernel is too old.
-Yes, that's probably the case.  But the crash occurs on my Fedora 41
-guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
-be buggy.
-However, once I realized the
-issue was post-cpr modification of qxl memory, I switched my attention
-to the
-fix.
-As for your patch, I can report that it doesn't resolve the issue as it
-is.  But I was able to track down another possible memory corruption
-using your approach with readonly mmap'ing:
-Program terminated with signal SIGSEGV, Segmentation fault.
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
-[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
-(gdb) bt
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-#1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
-errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
-#2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
-errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
-#3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
-errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
-#4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
-value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
-#5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
-v=0x5638996f3770, name=0x56389759b141 "realized",
-opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:2374
-#6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
-name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:1449
-#7  0x00005638970f8586 in object_property_set_qobject
-(obj=0x5638996e0e70, name=0x56389759b141 "realized",
-value=0x5638996df900, errp=0x7ffd3c2b84e0)
-      at ../qom/qom-qobject.c:28
-#8  0x00005638970f3d8d in object_property_set_bool
-(obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
-errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:1519
-#9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
-bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
-#10 0x0000563896dba675 in qdev_device_add_from_qdict
-(opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../
-system/qdev-monitor.c:714
-#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
-errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
-#12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
-opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/
-vl.c:1207
-#13 0x000056389737a6cc in qemu_opts_foreach
-      (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
-<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
-      at ../util/qemu-option.c:1135
-#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/
-vl.c:2745
-#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
-<error_fatal>) at ../system/vl.c:2806
-#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
-at ../system/vl.c:3838
-#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../
-system/main.c:72
-So the attached adjusted version of your patch does seem to help.  At
-least I can't reproduce the crash on my stand.
-Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
-are
-definitely harmful.  Try V2 of the patch, attached, which skips the lines
-of init_qxl_ram that modify guest memory.
-Thanks, your v2 patch does seem to prevent the crash.  Would you re-send
-it to the list as a proper fix?
-I'm wondering, could it be useful to explicitly mark all the reused
-memory regions readonly upon cpr-transfer, and then make them writable
-back again after the migration is done?  That way we will be segfaulting
-early on instead of debugging tricky memory corruptions.
-It's a useful debugging technique, but changing protection on a large
-memory region
-can be too expensive for production due to TLB shootdowns.
-
-Also, there are cases where writes are performed but the value is
-guaranteed to
-be the same:
-   qxl_post_load()
-     qxl_set_mode()
-       d->rom->mode = cpu_to_le32(modenr);
-The value is the same because mode and shadow_rom.mode were passed in
-vmstate
-from old qemu.
-There're also cases where devices' ROM might be re-initialized.  E.g.
-this segfault occures upon further exploration of RO mapped RAM blocks:
-Program terminated with signal SIGSEGV, Segmentation fault.
-#0  __memmove_avx_unaligned_erms () at 
-../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
-664             rep     movsb
-[Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
-(gdb) bt
-#0  __memmove_avx_unaligned_erms () at 
-../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
-#1  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380, owner=0x55aa2019ac10, 
-name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true)
-     at ../hw/core/loader.c:1032
-#2  0x000055aa1d031577 in rom_add_blob
-     (name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072, max_len=2097152, 
-addr=18446744073709551615, fw_file_name=0x55aa1da51f13 "etc/acpi/tables", 
-fw_callback=0x55aa1d441f59 <acpi_build_update>, callback_opaque=0x55aa20ff0010, as=0x0, 
-read_only=true) at ../hw/core/loader.c:1147
-#3  0x000055aa1cfd788d in acpi_add_rom_blob
-     (update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010, 
-blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at ../hw/acpi/utils.c:46
-#4  0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
-#5  0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0) 
-at ../hw/i386/pc.c:638
-#6  0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10 
-<machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
-#7  0x000055aa1d039ee5 in qdev_machine_creation_done () at 
-../hw/core/machine.c:1749
-#8  0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40 
-<error_fatal>) at ../system/vl.c:2779
-#9  0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40 
-<error_fatal>) at ../system/vl.c:2807
-#10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at 
-../system/vl.c:3838
-#11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at 
-../system/main.c:72
-I'm not sure whether ACPI tables ROM in particular is rewritten with the
-same content, but there might be cases where ROM can be read from file
-system upon initialization.  That is undesirable as guest kernel
-certainly won't be too happy about sudden change of the device's ROM
-content.
-
-So the issue we're dealing with here is any unwanted memory related
-device initialization upon cpr.
-
-For now the only thing that comes to my mind is to make a test where we
-put as many devices as we can into a VM, make ram blocks RO upon cpr
-(and remap them as RW later after migration is done, if needed), and
-catch any unwanted memory violations.  As Den suggested, we might
-consider adding that behaviour as a separate non-default option (or
-"migrate" command flag specific to cpr-transfer), which would only be
-used in the testing.
-
-Andrey
-No way. ACPI with the source must be used in the same way as BIOSes
-and optional ROMs.
-
-Den
-
-On 3/6/2025 10:52 AM, Denis V. Lunev wrote:
-On 3/6/25 16:16, Andrey Drobyshev wrote:
-On 3/5/25 11:19 PM, Steven Sistare wrote:
-On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
-On 3/4/25 9:05 PM, Steven Sistare wrote:
-On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:20 PM, Steven Sistare wrote:
-On 2/28/2025 1:13 PM, Steven Sistare wrote:
-On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently
-and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$EMULATOR -enable-kvm \
-        -machine q35 \
-        -cpu host -smp 2 -m 2G \
-        -object memory-backend-file,id=ram0,size=2G,mem-path=/
-dev/shm/
-ram0,share=on\
-        -machine memory-backend=ram0 \
-        -machine aux-ram-share=on \
-        -drive file=$ROOTFS,media=disk,if=virtio \
-        -qmp unix:$QMPSOCK,server=on,wait=off \
-        -nographic \
-        -device qxl-vga
-Run migration target:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-dst.sock
-$EMULATOR -enable-kvm \
-        -machine q35 \
-        -cpu host -smp 2 -m 2G \
-        -object memory-backend-file,id=ram0,size=2G,mem-path=/
-dev/shm/
-ram0,share=on\
-        -machine memory-backend=ram0 \
-        -machine aux-ram-share=on \
-        -drive file=$ROOTFS,media=disk,if=virtio \
-        -qmp unix:$QMPSOCK,server=on,wait=off \
-        -nographic \
-        -device qxl-vga \
-        -incoming tcp:0:44444 \
-        -incoming '{"channel-type": "cpr", "addr": { "transport":
-"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$QMPSHELL -p $QMPSOCK <<EOF
-        migrate-set-parameters mode=cpr-transfer
-        migrate channels=[{"channel-type":"main","addr":
-{"transport":"socket","type":"inet","host":"0","port":"44444"}},
-{"channel-type":"cpr","addr":
-{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
-dst.sock"}}]
-EOF
-Then, after a while, QXL guest driver on target crashes spewing the
-following messages:
-[   73.962002] [TTM] Buffer eviction failed
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
-0x00000001)
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
-allocate VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-
-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which
-speeds up
-the crash in the guest):
-#!/bin/bash
-
-chvt 3
-
-for j in $(seq 80); do
-            echo "$(date) starting round $j"
-            if [ "$(journalctl --boot | grep "failed to allocate
-VRAM
-BO")" != "" ]; then
-                    echo "bug was reproduced after $j tries"
-                    exit 1
-            fi
-            for i in $(seq 100); do
-                    dmesg > /dev/tty3
-            done
-done
-
-echo "bug could not be reproduced"
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce
-that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the crash -
-without the cpr-transfer migration the above reproduce doesn't
-lead to
-crash on the source VM.
-
-I suspect that, as cpr-transfer doesn't migrate the guest
-memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-
-Could somebody help the investigation and take a look into
-this?  Any
-suggestions would be appreciated.  Thanks!
-Possibly some memory region created by qxl is not being preserved.
-Try adding these traces to see what is preserved:
-
--trace enable='*cpr*'
--trace enable='*ram_alloc*'
-Also try adding this patch to see if it flags any ram blocks as not
-compatible with cpr.  A message is printed at migration start time.
-   Â
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
-email-
-steven.sistare@oracle.com/
-
-- Steve
-With the traces enabled + the "migration: ram block cpr blockers"
-patch
-applied:
-
-Source:
-cpr_find_fd pc.bios, id 0 returns -1
-cpr_save_fd pc.bios, id 0, fd 22
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
-0x7fec18e00000
-cpr_find_fd pc.rom, id 0 returns -1
-cpr_save_fd pc.rom, id 0, fd 23
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
-0x7fec18c00000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
-cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 24 host 0x7fec18a00000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 25 host 0x7feb77e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 27 host 0x7fec18800000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 28 host 0x7feb73c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 34 host 0x7fec18600000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 35 host 0x7fec18200000
-cpr_find_fd /rom@etc/table-loader, id 0 returns -1
-cpr_save_fd /rom@etc/table-loader, id 0, fd 36
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 36 host 0x7feb8b600000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-37 host 0x7feb8b400000
-
-cpr_state_save cpr-transfer mode
-cpr_transfer_output /var/run/alma8cpr-dst.sock
-Target:
-cpr_transfer_input /var/run/alma8cpr-dst.sock
-cpr_state_load cpr-transfer mode
-cpr_find_fd pc.bios, id 0 returns 20
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
-0x7fcdc9800000
-cpr_find_fd pc.rom, id 0 returns 19
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
-0x7fcdc9600000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 18 host 0x7fcdc9400000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 17 host 0x7fcd27e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 16 host 0x7fcdc9200000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 15 host 0x7fcd23c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 14 host 0x7fcdc8800000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 13 host 0x7fcdc8400000
-cpr_find_fd /rom@etc/table-loader, id 0 returns 11
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 11 host 0x7fcdc8200000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-10 host 0x7fcd3be00000
-Looks like both vga.vram and qxl.vram are being preserved (with the
-same
-addresses), and no incompatible ram blocks are found during migration.
-Sorry, addressed are not the same, of course.  However corresponding
-ram
-blocks do seem to be preserved and initialized.
-So far, I have not reproduced the guest driver failure.
-
-However, I have isolated places where new QEMU improperly writes to
-the qxl memory regions prior to starting the guest, by mmap'ing them
-readonly after cpr:
-
-    qemu_ram_alloc_internal()
-      if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
-          ram_flags |= RAM_READONLY;
-      new_block = qemu_ram_alloc_from_fd(...)
-
-I have attached a draft fix; try it and let me know.
-My console window looks fine before and after cpr, using
--vnc $hostip:0 -vga qxl
-
-- Steve
-Regarding the reproduce: when I launch the buggy version with the same
-options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
-my VNC client silently hangs on the target after a while.  Could it
-happen on your stand as well?
-cpr does not preserve the vnc connection and session.  To test, I specify
-port 0 for the source VM and port 1 for the dest.  When the src vnc goes
-dormant the dest vnc becomes active.
-Sure, I meant that VNC on the dest (on the port 1) works for a while
-after the migration and then hangs, apparently after the guest QXL crash.
-Could you try launching VM with
-"-nographic -device qxl-vga"?  That way VM's serial console is given you
-directly in the shell, so when qxl driver crashes you're still able to
-inspect the kernel messages.
-I have been running like that, but have not reproduced the qxl driver
-crash,
-and I suspect my guest image+kernel is too old.
-Yes, that's probably the case.  But the crash occurs on my Fedora 41
-guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
-be buggy.
-However, once I realized the
-issue was post-cpr modification of qxl memory, I switched my attention
-to the
-fix.
-As for your patch, I can report that it doesn't resolve the issue as it
-is.  But I was able to track down another possible memory corruption
-using your approach with readonly mmap'ing:
-Program terminated with signal SIGSEGV, Segmentation fault.
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
-[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
-(gdb) bt
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-#1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
-errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
-#2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
-errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
-#3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
-errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
-#4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
-value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
-#5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
-v=0x5638996f3770, name=0x56389759b141 "realized",
-opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:2374
-#6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
-name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:1449
-#7  0x00005638970f8586 in object_property_set_qobject
-(obj=0x5638996e0e70, name=0x56389759b141 "realized",
-value=0x5638996df900, errp=0x7ffd3c2b84e0)
-      at ../qom/qom-qobject.c:28
-#8  0x00005638970f3d8d in object_property_set_bool
-(obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
-errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:1519
-#9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
-bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
-#10 0x0000563896dba675 in qdev_device_add_from_qdict
-(opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../
-system/qdev-monitor.c:714
-#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
-errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
-#12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
-opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/
-vl.c:1207
-#13 0x000056389737a6cc in qemu_opts_foreach
-      (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
-<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
-      at ../util/qemu-option.c:1135
-#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/
-vl.c:2745
-#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
-<error_fatal>) at ../system/vl.c:2806
-#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
-at ../system/vl.c:3838
-#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../
-system/main.c:72
-So the attached adjusted version of your patch does seem to help.  At
-least I can't reproduce the crash on my stand.
-Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
-are
-definitely harmful.  Try V2 of the patch, attached, which skips the lines
-of init_qxl_ram that modify guest memory.
-Thanks, your v2 patch does seem to prevent the crash.  Would you re-send
-it to the list as a proper fix?
-Yes.  Was waiting for your confirmation.
-I'm wondering, could it be useful to explicitly mark all the reused
-memory regions readonly upon cpr-transfer, and then make them writable
-back again after the migration is done?  That way we will be segfaulting
-early on instead of debugging tricky memory corruptions.
-It's a useful debugging technique, but changing protection on a large
-memory region
-can be too expensive for production due to TLB shootdowns.
-
-Also, there are cases where writes are performed but the value is
-guaranteed to
-be the same:
-   qxl_post_load()
-     qxl_set_mode()
-       d->rom->mode = cpu_to_le32(modenr);
-The value is the same because mode and shadow_rom.mode were passed in
-vmstate
-from old qemu.
-There're also cases where devices' ROM might be re-initialized.  E.g.
-this segfault occures upon further exploration of RO mapped RAM blocks:
-Program terminated with signal SIGSEGV, Segmentation fault.
-#0  __memmove_avx_unaligned_erms () at 
-../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
-664             rep     movsb
-[Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
-(gdb) bt
-#0  __memmove_avx_unaligned_erms () at 
-../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
-#1  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380, owner=0x55aa2019ac10, 
-name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true)
-     at ../hw/core/loader.c:1032
-#2  0x000055aa1d031577 in rom_add_blob
-     (name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072, max_len=2097152, 
-addr=18446744073709551615, fw_file_name=0x55aa1da51f13 "etc/acpi/tables", 
-fw_callback=0x55aa1d441f59 <acpi_build_update>, callback_opaque=0x55aa20ff0010, as=0x0, 
-read_only=true) at ../hw/core/loader.c:1147
-#3  0x000055aa1cfd788d in acpi_add_rom_blob
-     (update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010, 
-blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at ../hw/acpi/utils.c:46
-#4  0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
-#5  0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0) 
-at ../hw/i386/pc.c:638
-#6  0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10 
-<machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
-#7  0x000055aa1d039ee5 in qdev_machine_creation_done () at 
-../hw/core/machine.c:1749
-#8  0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40 
-<error_fatal>) at ../system/vl.c:2779
-#9  0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40 
-<error_fatal>) at ../system/vl.c:2807
-#10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at 
-../system/vl.c:3838
-#11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at 
-../system/main.c:72
-I'm not sure whether ACPI tables ROM in particular is rewritten with the
-same content, but there might be cases where ROM can be read from file
-system upon initialization.  That is undesirable as guest kernel
-certainly won't be too happy about sudden change of the device's ROM
-content.
-
-So the issue we're dealing with here is any unwanted memory related
-device initialization upon cpr.
-
-For now the only thing that comes to my mind is to make a test where we
-put as many devices as we can into a VM, make ram blocks RO upon cpr
-(and remap them as RW later after migration is done, if needed), and
-catch any unwanted memory violations.  As Den suggested, we might
-consider adding that behaviour as a separate non-default option (or
-"migrate" command flag specific to cpr-transfer), which would only be
-used in the testing.
-I'll look into adding an option, but there may be too many false positives,
-such as the qxl_set_mode case above.  And the maintainers may object to me
-eliminating the false positives by adding more CPR_IN tests, due to gratuitous
-(from their POV) ugliness.
-
-But I will use the technique to look for more write violations.
-Andrey
-No way. ACPI with the source must be used in the same way as BIOSes
-and optional ROMs.
-Yup, its a bug.  Will fix.
-
-- Steve
-
-see
-1741380954-341079-1-git-send-email-steven.sistare@oracle.com
-/">https://lore.kernel.org/qemu-devel/
-1741380954-341079-1-git-send-email-steven.sistare@oracle.com
-/
-- Steve
-
-On 3/6/2025 11:13 AM, Steven Sistare wrote:
-On 3/6/2025 10:52 AM, Denis V. Lunev wrote:
-On 3/6/25 16:16, Andrey Drobyshev wrote:
-On 3/5/25 11:19 PM, Steven Sistare wrote:
-On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
-On 3/4/25 9:05 PM, Steven Sistare wrote:
-On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
-On 2/28/25 8:20 PM, Steven Sistare wrote:
-On 2/28/2025 1:13 PM, Steven Sistare wrote:
-On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
-Hi all,
-
-We've been experimenting with cpr-transfer migration mode recently
-and
-have discovered the following issue with the guest QXL driver:
-
-Run migration source:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$EMULATOR -enable-kvm \
-        -machine q35 \
-        -cpu host -smp 2 -m 2G \
-        -object memory-backend-file,id=ram0,size=2G,mem-path=/
-dev/shm/
-ram0,share=on\
-        -machine memory-backend=ram0 \
-        -machine aux-ram-share=on \
-        -drive file=$ROOTFS,media=disk,if=virtio \
-        -qmp unix:$QMPSOCK,server=on,wait=off \
-        -nographic \
-        -device qxl-vga
-Run migration target:
-EMULATOR=/path/to/emulator
-ROOTFS=/path/to/image
-QMPSOCK=/var/run/alma8qmp-dst.sock
-$EMULATOR -enable-kvm \
-        -machine q35 \
-        -cpu host -smp 2 -m 2G \
-        -object memory-backend-file,id=ram0,size=2G,mem-path=/
-dev/shm/
-ram0,share=on\
-        -machine memory-backend=ram0 \
-        -machine aux-ram-share=on \
-        -drive file=$ROOTFS,media=disk,if=virtio \
-        -qmp unix:$QMPSOCK,server=on,wait=off \
-        -nographic \
-        -device qxl-vga \
-        -incoming tcp:0:44444 \
-        -incoming '{"channel-type": "cpr", "addr": { "transport":
-"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
-Launch the migration:
-QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
-QMPSOCK=/var/run/alma8qmp-src.sock
-
-$QMPSHELL -p $QMPSOCK <<EOF
-        migrate-set-parameters mode=cpr-transfer
-        migrate channels=[{"channel-type":"main","addr":
-{"transport":"socket","type":"inet","host":"0","port":"44444"}},
-{"channel-type":"cpr","addr":
-{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
-dst.sock"}}]
-EOF
-Then, after a while, QXL guest driver on target crashes spewing the
-following messages:
-[   73.962002] [TTM] Buffer eviction failed
-[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
-0x00000001)
-[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
-allocate VRAM BO
-That seems to be a known kernel QXL driver bug:
-https://lore.kernel.org/all/20220907094423.93581-1-
-min_halo@163.com/T/
-https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
-(the latter discussion contains that reproduce script which
-speeds up
-the crash in the guest):
-#!/bin/bash
-
-chvt 3
-
-for j in $(seq 80); do
-            echo "$(date) starting round $j"
-            if [ "$(journalctl --boot | grep "failed to allocate
-VRAM
-BO")" != "" ]; then
-                    echo "bug was reproduced after $j tries"
-                    exit 1
-            fi
-            for i in $(seq 100); do
-                    dmesg > /dev/tty3
-            done
-done
-
-echo "bug could not be reproduced"
-exit 0
-The bug itself seems to remain unfixed, as I was able to reproduce
-that
-with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
-cpr-transfer code also seems to be buggy as it triggers the crash -
-without the cpr-transfer migration the above reproduce doesn't
-lead to
-crash on the source VM.
-
-I suspect that, as cpr-transfer doesn't migrate the guest
-memory, but
-rather passes it through the memory backend object, our code might
-somehow corrupt the VRAM.  However, I wasn't able to trace the
-corruption so far.
-
-Could somebody help the investigation and take a look into
-this?  Any
-suggestions would be appreciated.  Thanks!
-Possibly some memory region created by qxl is not being preserved.
-Try adding these traces to see what is preserved:
-
--trace enable='*cpr*'
--trace enable='*ram_alloc*'
-Also try adding this patch to see if it flags any ram blocks as not
-compatible with cpr.  A message is printed at migration start time.
-   Â
-https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
-email-
-steven.sistare@oracle.com/
-
-- Steve
-With the traces enabled + the "migration: ram block cpr blockers"
-patch
-applied:
-
-Source:
-cpr_find_fd pc.bios, id 0 returns -1
-cpr_save_fd pc.bios, id 0, fd 22
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
-0x7fec18e00000
-cpr_find_fd pc.rom, id 0 returns -1
-cpr_save_fd pc.rom, id 0, fd 23
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
-0x7fec18c00000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
-cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 24 host 0x7fec18a00000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 25 host 0x7feb77e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 27 host 0x7fec18800000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 28 host 0x7feb73c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
-cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 34 host 0x7fec18600000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 35 host 0x7fec18200000
-cpr_find_fd /rom@etc/table-loader, id 0 returns -1
-cpr_save_fd /rom@etc/table-loader, id 0, fd 36
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 36 host 0x7feb8b600000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
-cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-37 host 0x7feb8b400000
-
-cpr_state_save cpr-transfer mode
-cpr_transfer_output /var/run/alma8cpr-dst.sock
-Target:
-cpr_transfer_input /var/run/alma8cpr-dst.sock
-cpr_state_load cpr-transfer mode
-cpr_find_fd pc.bios, id 0 returns 20
-qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
-0x7fcdc9800000
-cpr_find_fd pc.rom, id 0 returns 19
-qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
-0x7fcdc9600000
-cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
-qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
-262144 fd 18 host 0x7fcdc9400000
-cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
-qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
-67108864 fd 17 host 0x7fcd27e00000
-cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
-fd 16 host 0x7fcdc9200000
-cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
-qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
-67108864 fd 15 host 0x7fcd23c00000
-cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
-qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
-fd 14 host 0x7fcdc8800000
-cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
-qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
-2097152 fd 13 host 0x7fcdc8400000
-cpr_find_fd /rom@etc/table-loader, id 0 returns 11
-qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
-fd 11 host 0x7fcdc8200000
-cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
-qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
-10 host 0x7fcd3be00000
-Looks like both vga.vram and qxl.vram are being preserved (with the
-same
-addresses), and no incompatible ram blocks are found during migration.
-Sorry, addressed are not the same, of course.  However corresponding
-ram
-blocks do seem to be preserved and initialized.
-So far, I have not reproduced the guest driver failure.
-
-However, I have isolated places where new QEMU improperly writes to
-the qxl memory regions prior to starting the guest, by mmap'ing them
-readonly after cpr:
-
-    qemu_ram_alloc_internal()
-      if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
-          ram_flags |= RAM_READONLY;
-      new_block = qemu_ram_alloc_from_fd(...)
-
-I have attached a draft fix; try it and let me know.
-My console window looks fine before and after cpr, using
--vnc $hostip:0 -vga qxl
-
-- Steve
-Regarding the reproduce: when I launch the buggy version with the same
-options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
-my VNC client silently hangs on the target after a while.  Could it
-happen on your stand as well?
-cpr does not preserve the vnc connection and session.  To test, I specify
-port 0 for the source VM and port 1 for the dest.  When the src vnc goes
-dormant the dest vnc becomes active.
-Sure, I meant that VNC on the dest (on the port 1) works for a while
-after the migration and then hangs, apparently after the guest QXL crash.
-Could you try launching VM with
-"-nographic -device qxl-vga"?  That way VM's serial console is given you
-directly in the shell, so when qxl driver crashes you're still able to
-inspect the kernel messages.
-I have been running like that, but have not reproduced the qxl driver
-crash,
-and I suspect my guest image+kernel is too old.
-Yes, that's probably the case.  But the crash occurs on my Fedora 41
-guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
-be buggy.
-However, once I realized the
-issue was post-cpr modification of qxl memory, I switched my attention
-to the
-fix.
-As for your patch, I can report that it doesn't resolve the issue as it
-is.  But I was able to track down another possible memory corruption
-using your approach with readonly mmap'ing:
-Program terminated with signal SIGSEGV, Segmentation fault.
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
-[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
-(gdb) bt
-#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
-#1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
-errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
-#2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
-errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
-#3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
-errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
-#4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
-value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
-#5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
-v=0x5638996f3770, name=0x56389759b141 "realized",
-opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:2374
-#6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
-name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:1449
-#7  0x00005638970f8586 in object_property_set_qobject
-(obj=0x5638996e0e70, name=0x56389759b141 "realized",
-value=0x5638996df900, errp=0x7ffd3c2b84e0)
-      at ../qom/qom-qobject.c:28
-#8  0x00005638970f3d8d in object_property_set_bool
-(obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
-errp=0x7ffd3c2b84e0)
-      at ../qom/object.c:1519
-#9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
-bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
-#10 0x0000563896dba675 in qdev_device_add_from_qdict
-(opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../
-system/qdev-monitor.c:714
-#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
-errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
-#12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
-opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/
-vl.c:1207
-#13 0x000056389737a6cc in qemu_opts_foreach
-      (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
-<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
-      at ../util/qemu-option.c:1135
-#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/
-vl.c:2745
-#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
-<error_fatal>) at ../system/vl.c:2806
-#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
-at ../system/vl.c:3838
-#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../
-system/main.c:72
-So the attached adjusted version of your patch does seem to help.  At
-least I can't reproduce the crash on my stand.
-Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
-are
-definitely harmful.  Try V2 of the patch, attached, which skips the lines
-of init_qxl_ram that modify guest memory.
-Thanks, your v2 patch does seem to prevent the crash.  Would you re-send
-it to the list as a proper fix?
-Yes.  Was waiting for your confirmation.
-I'm wondering, could it be useful to explicitly mark all the reused
-memory regions readonly upon cpr-transfer, and then make them writable
-back again after the migration is done?  That way we will be segfaulting
-early on instead of debugging tricky memory corruptions.
-It's a useful debugging technique, but changing protection on a large
-memory region
-can be too expensive for production due to TLB shootdowns.
-
-Also, there are cases where writes are performed but the value is
-guaranteed to
-be the same:
-   qxl_post_load()
-     qxl_set_mode()
-       d->rom->mode = cpu_to_le32(modenr);
-The value is the same because mode and shadow_rom.mode were passed in
-vmstate
-from old qemu.
-There're also cases where devices' ROM might be re-initialized.  E.g.
-this segfault occures upon further exploration of RO mapped RAM blocks:
-Program terminated with signal SIGSEGV, Segmentation fault.
-#0  __memmove_avx_unaligned_erms () at 
-../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
-664             rep     movsb
-[Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
-(gdb) bt
-#0  __memmove_avx_unaligned_erms () at 
-../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
-#1  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380, owner=0x55aa2019ac10, 
-name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true)
-     at ../hw/core/loader.c:1032
-#2  0x000055aa1d031577 in rom_add_blob
-     (name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072, max_len=2097152, 
-addr=18446744073709551615, fw_file_name=0x55aa1da51f13 "etc/acpi/tables", 
-fw_callback=0x55aa1d441f59 <acpi_build_update>, callback_opaque=0x55aa20ff0010, as=0x0, 
-read_only=true) at ../hw/core/loader.c:1147
-#3  0x000055aa1cfd788d in acpi_add_rom_blob
-     (update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010, 
-blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at ../hw/acpi/utils.c:46
-#4  0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
-#5  0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0) 
-at ../hw/i386/pc.c:638
-#6  0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10 
-<machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
-#7  0x000055aa1d039ee5 in qdev_machine_creation_done () at 
-../hw/core/machine.c:1749
-#8  0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40 
-<error_fatal>) at ../system/vl.c:2779
-#9  0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40 
-<error_fatal>) at ../system/vl.c:2807
-#10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at 
-../system/vl.c:3838
-#11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at 
-../system/main.c:72
-I'm not sure whether ACPI tables ROM in particular is rewritten with the
-same content, but there might be cases where ROM can be read from file
-system upon initialization.  That is undesirable as guest kernel
-certainly won't be too happy about sudden change of the device's ROM
-content.
-
-So the issue we're dealing with here is any unwanted memory related
-device initialization upon cpr.
-
-For now the only thing that comes to my mind is to make a test where we
-put as many devices as we can into a VM, make ram blocks RO upon cpr
-(and remap them as RW later after migration is done, if needed), and
-catch any unwanted memory violations.  As Den suggested, we might
-consider adding that behaviour as a separate non-default option (or
-"migrate" command flag specific to cpr-transfer), which would only be
-used in the testing.
-I'll look into adding an option, but there may be too many false positives,
-such as the qxl_set_mode case above.  And the maintainers may object to me
-eliminating the false positives by adding more CPR_IN tests, due to gratuitous
-(from their POV) ugliness.
-
-But I will use the technique to look for more write violations.
-Andrey
-No way. ACPI with the source must be used in the same way as BIOSes
-and optional ROMs.
-Yup, its a bug.  Will fix.
-
-- Steve
-
diff --git a/classification_output/04/mistranslation/64322995 b/classification_output/04/mistranslation/64322995
deleted file mode 100644
index 7330769b7..000000000
--- a/classification_output/04/mistranslation/64322995
+++ /dev/null
@@ -1,62 +0,0 @@
-mistranslation: 0.936
-device: 0.915
-network: 0.914
-semantic: 0.906
-graphic: 0.904
-other: 0.881
-socket: 0.866
-instruction: 0.864
-vnc: 0.801
-boot: 0.780
-KVM: 0.742
-assembly: 0.653
-
-[Qemu-devel] [BUG] trace: QEMU hangs on initialization with the	"simple" backend
-
-While starting the softmmu version of QEMU, the simple backend waits for the
-writeout thread to signal a condition variable when initializing the output file
-path. But since the writeout thread has not been created, it just waits forever.
-
-Thanks,
-  Lluis
-
-On Tue, Feb 09, 2016 at 09:24:04PM +0100, Lluís Vilanova wrote:
->
-While starting the softmmu version of QEMU, the simple backend waits for the
->
-writeout thread to signal a condition variable when initializing the output
->
-file
->
-path. But since the writeout thread has not been created, it just waits
->
-forever.
-Denis Lunev posted a fix:
-https://patchwork.ozlabs.org/patch/580968/
-Stefan
-signature.asc
-Description:
-PGP signature
-
-Stefan Hajnoczi writes:
-
->
-On Tue, Feb 09, 2016 at 09:24:04PM +0100, Lluís Vilanova wrote:
->
-> While starting the softmmu version of QEMU, the simple backend waits for the
->
-> writeout thread to signal a condition variable when initializing the output
->
-> file
->
-> path. But since the writeout thread has not been created, it just waits
->
-> forever.
->
-Denis Lunev posted a fix:
->
-https://patchwork.ozlabs.org/patch/580968/
-Great, thanks.
-
-Lluis
-
diff --git a/classification_output/04/mistranslation/70294255 b/classification_output/04/mistranslation/70294255
deleted file mode 100644
index 2f154bf28..000000000
--- a/classification_output/04/mistranslation/70294255
+++ /dev/null
@@ -1,1069 +0,0 @@
-mistranslation: 0.862
-assembly: 0.861
-semantic: 0.858
-socket: 0.858
-device: 0.857
-graphic: 0.857
-instruction: 0.856
-other: 0.852
-network: 0.846
-vnc: 0.837
-boot: 0.811
-KVM: 0.806
-
-[Qemu-devel] 答复: Re:   答复: Re:  答复: Re: 答复: Re: [BUG]COLO failover hang
-
-hi:
-
-yes.it is better.
-
-And should we delete 
-
-
-
-
-#ifdef WIN32
-
-    QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL)
-
-#endif
-
-
-
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already have it.
-
-
-
-
-
-
-
-
-
-
-
-
-原始邮件
-
-
-
-发件人: address@hidden
-收件人:王广10165992
-抄送人: address@hidden address@hidden address@hidden address@hidden
-日 期 :2017年03月22日 15:03
-主 题 :Re: [Qemu-devel]  答复: Re:  答复: Re: 答复: Re: [BUG]COLO failover hang
-
-
-
-
-
-Hi,
-
-On 2017/3/22 9:42, address@hidden wrote:
-> diff --git a/migration/socket.c b/migration/socket.c
->
->
-> index 13966f1..d65a0ea 100644
->
->
-> --- a/migration/socket.c
->
->
-> +++ b/migration/socket.c
->
->
-> @@ -147,8 +147,9 @@ static gboolean 
-socket_accept_incoming_migration(QIOChannel *ioc,
->
->
->       }
->
->
->
->
->
->       trace_migration_socket_incoming_accepted()
->
->
->
->
->
->       qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming")
->
->
-> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN)
->
->
->       migration_channel_process_incoming(migrate_get_current(),
->
->
->                                          QIO_CHANNEL(sioc))
->
->
->       object_unref(OBJECT(sioc))
->
->
->
->
-> Is this patch ok?
->
-
-Yes, i think this works, but a better way maybe to call 
-qio_channel_set_feature()
-in qio_channel_socket_accept(), we didn't set the SHUTDOWN feature for the 
-socket accept fd,
-Or fix it by this:
-
-diff --git a/io/channel-socket.c b/io/channel-socket.c
-index f546c68..ce6894c 100644
---- a/io/channel-socket.c
-+++ b/io/channel-socket.c
-@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
-                            Error **errp)
-  {
-      QIOChannelSocket *cioc
--
--    cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET))
--    cioc->fd = -1
-+
-+    cioc = qio_channel_socket_new()
-      cioc->remoteAddrLen = sizeof(ioc->remoteAddr)
-      cioc->localAddrLen = sizeof(ioc->localAddr)
-
-
-Thanks,
-Hailiang
-
-> I have test it . The test could not hang any more.
->
->
->
->
->
->
->
->
->
->
->
->
-> 原始邮件
->
->
->
-> 发件人: address@hidden
-> 收件人: address@hidden address@hidden
-> 抄送人: address@hidden address@hidden address@hidden
-> 日 期 :2017年03月22日 09:11
-> 主 题 :Re: [Qemu-devel]  答复: Re:  答复: Re: [BUG]COLO failover hang
->
->
->
->
->
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
-> > * Hailiang Zhang (address@hidden) wrote:
-> >> Hi,
-> >>
-> >> Thanks for reporting this, and i confirmed it in my test, and it is a bug.
-> >>
-> >> Though we tried to call qemu_file_shutdown() to shutdown the related fd, in
-> >> case COLO thread/incoming thread is stuck in read/write() while do 
-failover,
-> >> but it didn't take effect, because all the fd used by COLO (also migration)
-> >> has been wrapped by qio channel, and it will not call the shutdown API if
-> >> we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), 
-QIO_CHANNEL_FEATURE_SHUTDOWN).
-> >>
-> >> Cc: Dr. David Alan Gilbert address@hidden
-> >>
-> >> I doubted migration cancel has the same problem, it may be stuck in write()
-> >> if we tried to cancel migration.
-> >>
-> >> void fd_start_outgoing_migration(MigrationState *s, const char *fdname, 
-Error **errp)
-> >> {
-> >>      qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing")
-> >>      migration_channel_connect(s, ioc, NULL)
-> >>      ... ...
-> >> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), 
-QIO_CHANNEL_FEATURE_SHUTDOWN) above,
-> >> and the
-> >> migrate_fd_cancel()
-> >> {
-> >>   ... ...
-> >>      if (s->state == MIGRATION_STATUS_CANCELLING && f) {
-> >>          qemu_file_shutdown(f)  --> This will not take effect. No ?
-> >>      }
-> >> }
-> >
-> > (cc'd in Daniel Berrange).
-> > I see that we call qio_channel_set_feature(ioc, 
-QIO_CHANNEL_FEATURE_SHUTDOWN) at the
-> > top of qio_channel_socket_new  so I think that's safe isn't it?
-> >
->
-> Hmm, you are right, this problem is only exist for the migration incoming fd, 
-thanks.
->
-> > Dave
-> >
-> >> Thanks,
-> >> Hailiang
-> >>
-> >> On 2017/3/21 16:10, address@hidden wrote:
-> >>> Thank you。
-> >>>
-> >>> I have test aready。
-> >>>
-> >>> When the Primary Node panic,the Secondary Node qemu hang at the same 
-place。
-> >>>
-> >>> Incorrding
-http://wiki.qemu-project.org/Features/COLO
-,kill Primary Node 
-qemu will not produce the problem,but Primary Node panic can。
-> >>>
-> >>> I think due to the feature of channel does not support 
-QIO_CHANNEL_FEATURE_SHUTDOWN.
-> >>>
-> >>>
-> >>> when failover,channel_shutdown could not shut down the channel.
-> >>>
-> >>>
-> >>> so the colo_process_incoming_thread will hang at recvmsg.
-> >>>
-> >>>
-> >>> I test a patch:
-> >>>
-> >>>
-> >>> diff --git a/migration/socket.c b/migration/socket.c
-> >>>
-> >>>
-> >>> index 13966f1..d65a0ea 100644
-> >>>
-> >>>
-> >>> --- a/migration/socket.c
-> >>>
-> >>>
-> >>> +++ b/migration/socket.c
-> >>>
-> >>>
-> >>> @@ -147,8 +147,9 @@ static gboolean 
-socket_accept_incoming_migration(QIOChannel *ioc,
-> >>>
-> >>>
-> >>>        }
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>        trace_migration_socket_incoming_accepted()
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>        qio_channel_set_name(QIO_CHANNEL(sioc), 
-"migration-socket-incoming")
-> >>>
-> >>>
-> >>> +    qio_channel_set_feature(QIO_CHANNEL(sioc), 
-QIO_CHANNEL_FEATURE_SHUTDOWN)
-> >>>
-> >>>
-> >>>        migration_channel_process_incoming(migrate_get_current(),
-> >>>
-> >>>
-> >>>                                           QIO_CHANNEL(sioc))
-> >>>
-> >>>
-> >>>        object_unref(OBJECT(sioc))
-> >>>
-> >>>
-> >>>
-> >>>
-> >>> My test will not hang any more.
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>> 原始邮件
-> >>>
-> >>>
-> >>>
-> >>> 发件人: address@hidden
-> >>> 收件人:王广10165992 address@hidden
-> >>> 抄送人: address@hidden address@hidden
-> >>> 日 期 :2017年03月21日 15:58
-> >>> 主 题 :Re: [Qemu-devel]  答复: Re:  [BUG]COLO failover hang
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>> Hi,Wang.
-> >>>
-> >>> You can test this branch:
-> >>>
-> >>>
-https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
-> >>>
-> >>> and please follow wiki ensure your own configuration correctly.
-> >>>
-> >>>
-http://wiki.qemu-project.org/Features/COLO
-> >>>
-> >>>
-> >>> Thanks
-> >>>
-> >>> Zhang Chen
-> >>>
-> >>>
-> >>> On 03/21/2017 03:27 PM, address@hidden wrote:
-> >>> >
-> >>> > hi.
-> >>> >
-> >>> > I test the git qemu master have the same problem.
-> >>> >
-> >>> > (gdb) bt
-> >>> >
-> >>> > #0  qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
-> >>> > niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
-> >>> >
-> >>> > #1  0x00007f658e4aa0c2 in qio_channel_read
-> >>> > (address@hidden, address@hidden "",
-> >>> > address@hidden, address@hidden) at io/channel.c:114
-> >>> >
-> >>> > #2  0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
-> >>> > buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at
-> >>> > migration/qemu-file-channel.c:78
-> >>> >
-> >>> > #3  0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at
-> >>> > migration/qemu-file.c:295
-> >>> >
-> >>> > #4  0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden,
-> >>> > address@hidden) at migration/qemu-file.c:555
-> >>> >
-> >>> > #5  0x00007f658e3ea34b in qemu_get_byte (address@hidden) at
-> >>> > migration/qemu-file.c:568
-> >>> >
-> >>> > #6  0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at
-> >>> > migration/qemu-file.c:648
-> >>> >
-> >>> > #7  0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800,
-> >>> > address@hidden) at migration/colo.c:244
-> >>> >
-> >>> > #8  0x00007f658e3e681e in colo_receive_check_message (f=<optimized
-> >>> > out>, address@hidden,
-> >>> > address@hidden)
-> >>> >
-> >>> >     at migration/colo.c:264
-> >>> >
-> >>> > #9  0x00007f658e3e740e in colo_process_incoming_thread
-> >>> > (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
-> >>> >
-> >>> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
-> >>> >
-> >>> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> >
-> >>> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
-> >>> >
-> >>> > (gdb) p ioc->features        Do not support QIO_CHANNEL_FEATURE_SHUTDOWN
-> >>> >
-> >>> > $3 = 0
-> >>> >
-> >>> >
-> >>> > (gdb) bt
-> >>> >
-> >>> > #0  socket_accept_incoming_migration (ioc=0x7fdcceeafa90,
-> >>> > condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
-> >>> >
-> >>> > #1  0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at
-> >>> > gmain.c:3054
-> >>> >
-> >>> > #2  g_main_context_dispatch (context=<optimized out>,
-> >>> > address@hidden) at gmain.c:3630
-> >>> >
-> >>> > #3  0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
-> >>> >
-> >>> > #4  os_host_main_loop_wait (timeout=<optimized out>) at
-> >>> > util/main-loop.c:258
-> >>> >
-> >>> > #5  main_loop_wait (address@hidden) at
-> >>> > util/main-loop.c:506
-> >>> >
-> >>> > #6  0x00007fdccb526187 in main_loop () at vl.c:1898
-> >>> >
-> >>> > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
-> >>> > out>) at vl.c:4709
-> >>> >
-> >>> > (gdb) p ioc->features
-> >>> >
-> >>> > $1 = 6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> >
-> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
-> >>> >
-> >>> >
-> >>> > May be socket_accept_incoming_migration should
-> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)??
-> >>> >
-> >>> >
-> >>> > thank you.
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> > 原始邮件
-> >>> > address@hidden
-> >>> > address@hidden
-> >>> > address@hidden@huawei.com>
-> >>> > *日 期 :*2017年03月16日 14:46
-> >>> > *主 题 :**Re: [Qemu-devel] COLO failover hang*
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> > On 03/15/2017 05:06 PM, wangguang wrote:
-> >>> > >   am testing QEMU COLO feature described here [QEMU
-> >>> > > Wiki](
-http://wiki.qemu-project.org/Features/COLO
-).
-> >>> > >
-> >>> > > When the Primary Node panic,the Secondary Node qemu hang.
-> >>> > > hang at recvmsg in qio_channel_socket_readv.
-> >>> > > And  I run  { 'execute': 'nbd-server-stop' } and { "execute":
-> >>> > > "x-colo-lost-heartbeat" } in Secondary VM's
-> >>> > > monitor,the  Secondary Node qemu still hang at recvmsg .
-> >>> > >
-> >>> > > I found that the colo in qemu is not complete yet.
-> >>> > > Do the colo have any plan for development?
-> >>> >
-> >>> > Yes, We are developing. You can see some of patch we pushing.
-> >>> >
-> >>> > > Has anyone ever run it successfully? Any help is appreciated!
-> >>> >
-> >>> > In our internal version can run it successfully,
-> >>> > The failover detail you can ask Zhanghailiang for help.
-> >>> > Next time if you have some question about COLO,
-> >>> > please cc me and zhanghailiang address@hidden
-> >>> >
-> >>> >
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>> >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > > centos7.2+qemu2.7.50
-> >>> > > (gdb) bt
-> >>> > > #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
-> >>> > > #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized 
-out>,
-> >>> > > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, 
-errp=0x0) at
-> >>> > > io/channel-socket.c:497
-> >>> > > #2  0x00007f3e03329472 in qio_channel_read (address@hidden,
-> >>> > > address@hidden "", address@hidden,
-> >>> > > address@hidden) at io/channel.c:97
-> >>> > > #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
-> >>> > > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at
-> >>> > > migration/qemu-file-channel.c:78
-> >>> > > #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at
-> >>> > > migration/qemu-file.c:257
-> >>> > > #5  0x00007f3e03274a41 in qemu_peek_byte (address@hidden,
-> >>> > > address@hidden) at migration/qemu-file.c:510
-> >>> > > #6  0x00007f3e03274aab in qemu_get_byte (address@hidden) at
-> >>> > > migration/qemu-file.c:523
-> >>> > > #7  0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at
-> >>> > > migration/qemu-file.c:603
-> >>> > > #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00,
-> >>> > > address@hidden) at migration/colo.c:215
-> >>> > > #9  0x00007f3e0327250d in colo_wait_handle_message 
-(errp=0x7f3d62bfaa48,
-> >>> > > checkpoint_request=<synthetic pointer>, f=<optimized out>) at
-> >>> > > migration/colo.c:546
-> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at
-> >>> > > migration/colo.c:649
-> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
-> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc..so.6
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > > --
-> >>> > > View this message in context:
-http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
-> >>> > > Sent from the Developer mailing list archive at Nabble.com.
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> >
-> >>> > --
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>>
-> >>
-> > --
-> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
-> >
-> > .
-> >
->
-
-On 2017/3/22 16:09, address@hidden wrote:
-hi:
-
-yes.it is better.
-
-And should we delete
-Yes, you are right.
-#ifdef WIN32
-
-     QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL)
-
-#endif
-
-
-
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already have it.
-
-
-
-
-
-
-
-
-
-
-
-
-原始邮件
-
-
-
-发件人: address@hidden
-收件人:王广10165992
-抄送人: address@hidden address@hidden address@hidden address@hidden
-日 期 :2017年03月22日 15:03
-主 题 :Re: [Qemu-devel]  答复: Re:  答复: Re: 答复: Re: [BUG]COLO failover hang
-
-
-
-
-
-Hi,
-
-On 2017/3/22 9:42, address@hidden wrote:
-> diff --git a/migration/socket.c b/migration/socket.c
->
->
-> index 13966f1..d65a0ea 100644
->
->
-> --- a/migration/socket.c
->
->
-> +++ b/migration/socket.c
->
->
-> @@ -147,8 +147,9 @@ static gboolean 
-socket_accept_incoming_migration(QIOChannel *ioc,
->
->
->       }
->
->
->
->
->
->       trace_migration_socket_incoming_accepted()
->
->
->
->
->
->       qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming")
->
->
-> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN)
->
->
->       migration_channel_process_incoming(migrate_get_current(),
->
->
->                                          QIO_CHANNEL(sioc))
->
->
->       object_unref(OBJECT(sioc))
->
->
->
->
-> Is this patch ok?
->
-
-Yes, i think this works, but a better way maybe to call 
-qio_channel_set_feature()
-in qio_channel_socket_accept(), we didn't set the SHUTDOWN feature for the 
-socket accept fd,
-Or fix it by this:
-
-diff --git a/io/channel-socket.c b/io/channel-socket.c
-index f546c68..ce6894c 100644
---- a/io/channel-socket.c
-+++ b/io/channel-socket.c
-@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
-                             Error **errp)
-   {
-       QIOChannelSocket *cioc
--
--    cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET))
--    cioc->fd = -1
-+
-+    cioc = qio_channel_socket_new()
-       cioc->remoteAddrLen = sizeof(ioc->remoteAddr)
-       cioc->localAddrLen = sizeof(ioc->localAddr)
-
-
-Thanks,
-Hailiang
-
-> I have test it . The test could not hang any more.
->
->
->
->
->
->
->
->
->
->
->
->
-> 原始邮件
->
->
->
-> 发件人: address@hidden
-> 收件人: address@hidden address@hidden
-> 抄送人: address@hidden address@hidden address@hidden
-> 日 期 :2017年03月22日 09:11
-> 主 题 :Re: [Qemu-devel]  答复: Re:  答复: Re: [BUG]COLO failover hang
->
->
->
->
->
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
-> > * Hailiang Zhang (address@hidden) wrote:
-> >> Hi,
-> >>
-> >> Thanks for reporting this, and i confirmed it in my test, and it is a bug.
-> >>
-> >> Though we tried to call qemu_file_shutdown() to shutdown the related fd, in
-> >> case COLO thread/incoming thread is stuck in read/write() while do 
-failover,
-> >> but it didn't take effect, because all the fd used by COLO (also migration)
-> >> has been wrapped by qio channel, and it will not call the shutdown API if
-> >> we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), 
-QIO_CHANNEL_FEATURE_SHUTDOWN).
-> >>
-> >> Cc: Dr. David Alan Gilbert address@hidden
-> >>
-> >> I doubted migration cancel has the same problem, it may be stuck in write()
-> >> if we tried to cancel migration.
-> >>
-> >> void fd_start_outgoing_migration(MigrationState *s, const char *fdname, 
-Error **errp)
-> >> {
-> >>      qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing")
-> >>      migration_channel_connect(s, ioc, NULL)
-> >>      ... ...
-> >> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), 
-QIO_CHANNEL_FEATURE_SHUTDOWN) above,
-> >> and the
-> >> migrate_fd_cancel()
-> >> {
-> >>   ... ...
-> >>      if (s->state == MIGRATION_STATUS_CANCELLING && f) {
-> >>          qemu_file_shutdown(f)  --> This will not take effect. No ?
-> >>      }
-> >> }
-> >
-> > (cc'd in Daniel Berrange).
-> > I see that we call qio_channel_set_feature(ioc, 
-QIO_CHANNEL_FEATURE_SHUTDOWN) at the
-> > top of qio_channel_socket_new  so I think that's safe isn't it?
-> >
->
-> Hmm, you are right, this problem is only exist for the migration incoming fd, 
-thanks.
->
-> > Dave
-> >
-> >> Thanks,
-> >> Hailiang
-> >>
-> >> On 2017/3/21 16:10, address@hidden wrote:
-> >>> Thank you。
-> >>>
-> >>> I have test aready。
-> >>>
-> >>> When the Primary Node panic,the Secondary Node qemu hang at the same 
-place。
-> >>>
-> >>> Incorrding
-http://wiki.qemu-project.org/Features/COLO
-,kill Primary Node 
-qemu will not produce the problem,but Primary Node panic can。
-> >>>
-> >>> I think due to the feature of channel does not support 
-QIO_CHANNEL_FEATURE_SHUTDOWN.
-> >>>
-> >>>
-> >>> when failover,channel_shutdown could not shut down the channel.
-> >>>
-> >>>
-> >>> so the colo_process_incoming_thread will hang at recvmsg.
-> >>>
-> >>>
-> >>> I test a patch:
-> >>>
-> >>>
-> >>> diff --git a/migration/socket.c b/migration/socket.c
-> >>>
-> >>>
-> >>> index 13966f1..d65a0ea 100644
-> >>>
-> >>>
-> >>> --- a/migration/socket.c
-> >>>
-> >>>
-> >>> +++ b/migration/socket.c
-> >>>
-> >>>
-> >>> @@ -147,8 +147,9 @@ static gboolean 
-socket_accept_incoming_migration(QIOChannel *ioc,
-> >>>
-> >>>
-> >>>        }
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>        trace_migration_socket_incoming_accepted()
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>        qio_channel_set_name(QIO_CHANNEL(sioc), 
-"migration-socket-incoming")
-> >>>
-> >>>
-> >>> +    qio_channel_set_feature(QIO_CHANNEL(sioc), 
-QIO_CHANNEL_FEATURE_SHUTDOWN)
-> >>>
-> >>>
-> >>>        migration_channel_process_incoming(migrate_get_current(),
-> >>>
-> >>>
-> >>>                                           QIO_CHANNEL(sioc))
-> >>>
-> >>>
-> >>>        object_unref(OBJECT(sioc))
-> >>>
-> >>>
-> >>>
-> >>>
-> >>> My test will not hang any more.
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>> 原始邮件
-> >>>
-> >>>
-> >>>
-> >>> 发件人: address@hidden
-> >>> 收件人:王广10165992 address@hidden
-> >>> 抄送人: address@hidden address@hidden
-> >>> 日 期 :2017年03月21日 15:58
-> >>> 主 题 :Re: [Qemu-devel]  答复: Re:  [BUG]COLO failover hang
-> >>>
-> >>>
-> >>>
-> >>>
-> >>>
-> >>> Hi,Wang.
-> >>>
-> >>> You can test this branch:
-> >>>
-> >>>
-https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
-> >>>
-> >>> and please follow wiki ensure your own configuration correctly.
-> >>>
-> >>>
-http://wiki.qemu-project.org/Features/COLO
-> >>>
-> >>>
-> >>> Thanks
-> >>>
-> >>> Zhang Chen
-> >>>
-> >>>
-> >>> On 03/21/2017 03:27 PM, address@hidden wrote:
-> >>> >
-> >>> > hi.
-> >>> >
-> >>> > I test the git qemu master have the same problem.
-> >>> >
-> >>> > (gdb) bt
-> >>> >
-> >>> > #0  qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
-> >>> > niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
-> >>> >
-> >>> > #1  0x00007f658e4aa0c2 in qio_channel_read
-> >>> > (address@hidden, address@hidden "",
-> >>> > address@hidden, address@hidden) at io/channel.c:114
-> >>> >
-> >>> > #2  0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
-> >>> > buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at
-> >>> > migration/qemu-file-channel.c:78
-> >>> >
-> >>> > #3  0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at
-> >>> > migration/qemu-file.c:295
-> >>> >
-> >>> > #4  0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden,
-> >>> > address@hidden) at migration/qemu-file.c:555
-> >>> >
-> >>> > #5  0x00007f658e3ea34b in qemu_get_byte (address@hidden) at
-> >>> > migration/qemu-file.c:568
-> >>> >
-> >>> > #6  0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at
-> >>> > migration/qemu-file.c:648
-> >>> >
-> >>> > #7  0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800,
-> >>> > address@hidden) at migration/colo.c:244
-> >>> >
-> >>> > #8  0x00007f658e3e681e in colo_receive_check_message (f=<optimized
-> >>> > out>, address@hidden,
-> >>> > address@hidden)
-> >>> >
-> >>> >     at migration/colo.c:264
-> >>> >
-> >>> > #9  0x00007f658e3e740e in colo_process_incoming_thread
-> >>> > (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
-> >>> >
-> >>> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
-> >>> >
-> >>> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> >
-> >>> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
-> >>> >
-> >>> > (gdb) p ioc->features        Do not support QIO_CHANNEL_FEATURE_SHUTDOWN
-> >>> >
-> >>> > $3 = 0
-> >>> >
-> >>> >
-> >>> > (gdb) bt
-> >>> >
-> >>> > #0  socket_accept_incoming_migration (ioc=0x7fdcceeafa90,
-> >>> > condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
-> >>> >
-> >>> > #1  0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at
-> >>> > gmain.c:3054
-> >>> >
-> >>> > #2  g_main_context_dispatch (context=<optimized out>,
-> >>> > address@hidden) at gmain.c:3630
-> >>> >
-> >>> > #3  0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
-> >>> >
-> >>> > #4  os_host_main_loop_wait (timeout=<optimized out>) at
-> >>> > util/main-loop.c:258
-> >>> >
-> >>> > #5  main_loop_wait (address@hidden) at
-> >>> > util/main-loop.c:506
-> >>> >
-> >>> > #6  0x00007fdccb526187 in main_loop () at vl.c:1898
-> >>> >
-> >>> > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
-> >>> > out>) at vl.c:4709
-> >>> >
-> >>> > (gdb) p ioc->features
-> >>> >
-> >>> > $1 = 6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> >
-> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
-> >>> >
-> >>> >
-> >>> > May be socket_accept_incoming_migration should
-> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)??
-> >>> >
-> >>> >
-> >>> > thank you.
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> > 原始邮件
-> >>> > address@hidden
-> >>> > address@hidden
-> >>> > address@hidden@huawei.com>
-> >>> > *日 期 :*2017年03月16日 14:46
-> >>> > *主 题 :**Re: [Qemu-devel] COLO failover hang*
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> > On 03/15/2017 05:06 PM, wangguang wrote:
-> >>> > >   am testing QEMU COLO feature described here [QEMU
-> >>> > > Wiki](
-http://wiki.qemu-project.org/Features/COLO
-).
-> >>> > >
-> >>> > > When the Primary Node panic,the Secondary Node qemu hang.
-> >>> > > hang at recvmsg in qio_channel_socket_readv.
-> >>> > > And  I run  { 'execute': 'nbd-server-stop' } and { "execute":
-> >>> > > "x-colo-lost-heartbeat" } in Secondary VM's
-> >>> > > monitor,the  Secondary Node qemu still hang at recvmsg .
-> >>> > >
-> >>> > > I found that the colo in qemu is not complete yet.
-> >>> > > Do the colo have any plan for development?
-> >>> >
-> >>> > Yes, We are developing. You can see some of patch we pushing.
-> >>> >
-> >>> > > Has anyone ever run it successfully? Any help is appreciated!
-> >>> >
-> >>> > In our internal version can run it successfully,
-> >>> > The failover detail you can ask Zhanghailiang for help.
-> >>> > Next time if you have some question about COLO,
-> >>> > please cc me and zhanghailiang address@hidden
-> >>> >
-> >>> >
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>> >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > > centos7.2+qemu2.7.50
-> >>> > > (gdb) bt
-> >>> > > #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
-> >>> > > #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized 
-out>,
-> >>> > > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, 
-errp=0x0) at
-> >>> > > io/channel-socket.c:497
-> >>> > > #2  0x00007f3e03329472 in qio_channel_read (address@hidden,
-> >>> > > address@hidden "", address@hidden,
-> >>> > > address@hidden) at io/channel.c:97
-> >>> > > #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
-> >>> > > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at
-> >>> > > migration/qemu-file-channel.c:78
-> >>> > > #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at
-> >>> > > migration/qemu-file.c:257
-> >>> > > #5  0x00007f3e03274a41 in qemu_peek_byte (address@hidden,
-> >>> > > address@hidden) at migration/qemu-file.c:510
-> >>> > > #6  0x00007f3e03274aab in qemu_get_byte (address@hidden) at
-> >>> > > migration/qemu-file.c:523
-> >>> > > #7  0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at
-> >>> > > migration/qemu-file.c:603
-> >>> > > #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00,
-> >>> > > address@hidden) at migration/colo.c:215
-> >>> > > #9  0x00007f3e0327250d in colo_wait_handle_message 
-(errp=0x7f3d62bfaa48,
-> >>> > > checkpoint_request=<synthetic pointer>, f=<optimized out>) at
-> >>> > > migration/colo.c:546
-> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at
-> >>> > > migration/colo.c:649
-> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
-> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc..so.6
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > > --
-> >>> > > View this message in context:
-http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
-> >>> > > Sent from the Developer mailing list archive at Nabble.com.
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> >
-> >>> > --
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>>
-> >>
-> > --
-> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
-> >
-> > .
-> >
->
-
diff --git a/classification_output/04/mistranslation/74466963 b/classification_output/04/mistranslation/74466963
deleted file mode 100644
index ceba02704..000000000
--- a/classification_output/04/mistranslation/74466963
+++ /dev/null
@@ -1,1886 +0,0 @@
-mistranslation: 0.927
-assembly: 0.910
-device: 0.909
-instruction: 0.903
-KVM: 0.903
-graphic: 0.895
-boot: 0.894
-semantic: 0.891
-socket: 0.879
-vnc: 0.878
-other: 0.877
-network: 0.871
-
-[Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
-
-Hi all,
-
-Does anyboday remember the similar issue post by hailiang months ago
-http://patchwork.ozlabs.org/patch/454322/
-At least tow bugs about migration had been fixed since that.
-And now we found the same issue at the tcg vm(kvm is fine), after
-migration, the content VM's memory is inconsistent.
-we add a patch to check memory content, you can find it from affix
-
-steps to reporduce:
-1) apply the patch and re-build qemu
-2) prepare the ubuntu guest and run memtest in grub.
-soruce side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off
-destination side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
-3) start migration
-with 1000M NIC, migration will finish within 3 min.
-
-at source:
-(qemu) migrate tcp:192.168.2.66:8881
-after saving ram complete
-e9e725df678d392b1a83b3a917f332bb
-qemu-system-x86_64: end ram md5
-(qemu)
-
-at destination:
-...skip...
-Completed load of VM with exit code 0 seq iteration 1264
-Completed load of VM with exit code 0 seq iteration 1265
-Completed load of VM with exit code 0 seq iteration 1266
-qemu-system-x86_64: after loading state section id 2(ram)
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
-
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-
-This occurs occasionally and only at tcg machine. It seems that
-some pages dirtied in source side don't transferred to destination.
-This problem can be reproduced even if we disable virtio.
-Is it OK for some pages that not transferred to destination when do
-migration ? Or is it a bug?
-Any idea...
-
-=================md5 check patch=============================
-
-diff --git a/Makefile.target b/Makefile.target
-index 962d004..e2cb8e9 100644
---- a/Makefile.target
-+++ b/Makefile.target
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
- obj-y += memory_mapping.o
- obj-y += dump.o
- obj-y += migration/ram.o migration/savevm.o
--LIBS := $(libs_softmmu) $(LIBS)
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
-
- # xen support
- obj-$(CONFIG_XEN) += xen-common.o
-diff --git a/migration/ram.c b/migration/ram.c
-index 1eb155a..3b7a09d 100644
---- a/migration/ram.c
-+++ b/migration/ram.c
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
-version_id)
-}
-
-     rcu_read_unlock();
--    DPRINTF("Completed load of VM with exit code %d seq iteration "
-+    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
-             "%" PRIu64 "\n", ret, seq_iter);
-     return ret;
- }
-diff --git a/migration/savevm.c b/migration/savevm.c
-index 0ad1b93..3feaa61 100644
---- a/migration/savevm.c
-+++ b/migration/savevm.c
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
-
- }
-
-+#include "exec/ram_addr.h"
-+#include "qemu/rcu_queue.h"
-+#include <clplumbing/md5.h>
-+#ifndef MD5_DIGEST_LENGTH
-+#define MD5_DIGEST_LENGTH 16
-+#endif
-+
-+static void check_host_md5(void)
-+{
-+    int i;
-+    unsigned char md[MD5_DIGEST_LENGTH];
-+    rcu_read_lock();
-+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
-'pc.ram' block */
-+    rcu_read_unlock();
-+
-+    MD5(block->host, block->used_length, md);
-+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
-+        fprintf(stderr, "%02x", md[i]);
-+    }
-+    fprintf(stderr, "\n");
-+    error_report("end ram md5");
-+}
-+
- void qemu_savevm_state_begin(QEMUFile *f,
-                              const MigrationParams *params)
- {
-@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile
-*f, bool iterable_only)
-save_section_header(f, se, QEMU_VM_SECTION_END);
-
-         ret = se->ops->save_live_complete_precopy(f, se->opaque);
-+
-+        fprintf(stderr, "after saving %s complete\n", se->idstr);
-+        check_host_md5();
-+
-         trace_savevm_section_end(se->idstr, se->section_id, ret);
-         save_section_footer(f, se);
-         if (ret < 0) {
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
-MigrationIncomingState *mis)
-section_id, le->se->idstr);
-                 return ret;
-             }
-+            if (section_type == QEMU_VM_SECTION_END) {
-+                error_report("after loading state section id %d(%s)",
-+                             section_id, le->se->idstr);
-+                check_host_md5();
-+            }
-             if (!check_section_footer(f, le)) {
-                 return -EINVAL;
-             }
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
-     }
-
-     cpu_synchronize_all_post_init();
-+    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
-+    check_host_md5();
-
-     return ret;
- }
-
-* Li Zhijian (address@hidden) wrote:
->
-Hi all,
->
->
-Does anyboday remember the similar issue post by hailiang months ago
->
-http://patchwork.ozlabs.org/patch/454322/
->
-At least tow bugs about migration had been fixed since that.
-Yes, I wondered what happened to that.
-
->
-And now we found the same issue at the tcg vm(kvm is fine), after migration,
->
-the content VM's memory is inconsistent.
-Hmm, TCG only - I don't know much about that; but I guess something must
-be accessing memory without using the proper macros/functions so
-it doesn't mark it as dirty.
-
->
-we add a patch to check memory content, you can find it from affix
->
->
-steps to reporduce:
->
-1) apply the patch and re-build qemu
->
-2) prepare the ubuntu guest and run memtest in grub.
->
-soruce side:
->
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
-pc-i440fx-2.3,accel=tcg,usb=off
->
->
-destination side:
->
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
->
->
-3) start migration
->
-with 1000M NIC, migration will finish within 3 min.
->
->
-at source:
->
-(qemu) migrate tcp:192.168.2.66:8881
->
-after saving ram complete
->
-e9e725df678d392b1a83b3a917f332bb
->
-qemu-system-x86_64: end ram md5
->
-(qemu)
->
->
-at destination:
->
-...skip...
->
-Completed load of VM with exit code 0 seq iteration 1264
->
-Completed load of VM with exit code 0 seq iteration 1265
->
-Completed load of VM with exit code 0 seq iteration 1266
->
-qemu-system-x86_64: after loading state section id 2(ram)
->
-49c2dac7bde0e5e22db7280dcb3824f9
->
-qemu-system-x86_64: end ram md5
->
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
->
->
-49c2dac7bde0e5e22db7280dcb3824f9
->
-qemu-system-x86_64: end ram md5
->
->
-This occurs occasionally and only at tcg machine. It seems that
->
-some pages dirtied in source side don't transferred to destination.
->
-This problem can be reproduced even if we disable virtio.
->
->
-Is it OK for some pages that not transferred to destination when do
->
-migration ? Or is it a bug?
-I'm pretty sure that means it's a bug.  Hard to find though, I guess
-at least memtest is smaller than a big OS.  I think I'd dump the whole
-of memory on both sides, hexdump and diff them  - I'd guess it would
-just be one byte/word different, maybe that would offer some idea what
-wrote it.
-
-Dave
-
->
-Any idea...
->
->
-=================md5 check patch=============================
->
->
-diff --git a/Makefile.target b/Makefile.target
->
-index 962d004..e2cb8e9 100644
->
---- a/Makefile.target
->
-+++ b/Makefile.target
->
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
->
-obj-y += memory_mapping.o
->
-obj-y += dump.o
->
-obj-y += migration/ram.o migration/savevm.o
->
--LIBS := $(libs_softmmu) $(LIBS)
->
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
->
->
-# xen support
->
-obj-$(CONFIG_XEN) += xen-common.o
->
-diff --git a/migration/ram.c b/migration/ram.c
->
-index 1eb155a..3b7a09d 100644
->
---- a/migration/ram.c
->
-+++ b/migration/ram.c
->
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
->
-version_id)
->
-}
->
->
-rcu_read_unlock();
->
--    DPRINTF("Completed load of VM with exit code %d seq iteration "
->
-+    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
->
-"%" PRIu64 "\n", ret, seq_iter);
->
-return ret;
->
-}
->
-diff --git a/migration/savevm.c b/migration/savevm.c
->
-index 0ad1b93..3feaa61 100644
->
---- a/migration/savevm.c
->
-+++ b/migration/savevm.c
->
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
->
->
-}
->
->
-+#include "exec/ram_addr.h"
->
-+#include "qemu/rcu_queue.h"
->
-+#include <clplumbing/md5.h>
->
-+#ifndef MD5_DIGEST_LENGTH
->
-+#define MD5_DIGEST_LENGTH 16
->
-+#endif
->
-+
->
-+static void check_host_md5(void)
->
-+{
->
-+    int i;
->
-+    unsigned char md[MD5_DIGEST_LENGTH];
->
-+    rcu_read_lock();
->
-+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
->
-'pc.ram' block */
->
-+    rcu_read_unlock();
->
-+
->
-+    MD5(block->host, block->used_length, md);
->
-+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
->
-+        fprintf(stderr, "%02x", md[i]);
->
-+    }
->
-+    fprintf(stderr, "\n");
->
-+    error_report("end ram md5");
->
-+}
->
-+
->
-void qemu_savevm_state_begin(QEMUFile *f,
->
-const MigrationParams *params)
->
-{
->
-@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
->
-bool iterable_only)
->
-save_section_header(f, se, QEMU_VM_SECTION_END);
->
->
-ret = se->ops->save_live_complete_precopy(f, se->opaque);
->
-+
->
-+        fprintf(stderr, "after saving %s complete\n", se->idstr);
->
-+        check_host_md5();
->
-+
->
-trace_savevm_section_end(se->idstr, se->section_id, ret);
->
-save_section_footer(f, se);
->
-if (ret < 0) {
->
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
->
-MigrationIncomingState *mis)
->
-section_id, le->se->idstr);
->
-return ret;
->
-}
->
-+            if (section_type == QEMU_VM_SECTION_END) {
->
-+                error_report("after loading state section id %d(%s)",
->
-+                             section_id, le->se->idstr);
->
-+                check_host_md5();
->
-+            }
->
-if (!check_section_footer(f, le)) {
->
-return -EINVAL;
->
-}
->
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
->
-}
->
->
-cpu_synchronize_all_post_init();
->
-+    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
->
-+    check_host_md5();
->
->
-return ret;
->
-}
->
->
->
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-On 2015/12/3 17:24, Dr. David Alan Gilbert wrote:
-* Li Zhijian (address@hidden) wrote:
-Hi all,
-
-Does anyboday remember the similar issue post by hailiang months ago
-http://patchwork.ozlabs.org/patch/454322/
-At least tow bugs about migration had been fixed since that.
-Yes, I wondered what happened to that.
-And now we found the same issue at the tcg vm(kvm is fine), after migration,
-the content VM's memory is inconsistent.
-Hmm, TCG only - I don't know much about that; but I guess something must
-be accessing memory without using the proper macros/functions so
-it doesn't mark it as dirty.
-we add a patch to check memory content, you can find it from affix
-
-steps to reporduce:
-1) apply the patch and re-build qemu
-2) prepare the ubuntu guest and run memtest in grub.
-soruce side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off
-
-destination side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
-
-3) start migration
-with 1000M NIC, migration will finish within 3 min.
-
-at source:
-(qemu) migrate tcp:192.168.2.66:8881
-after saving ram complete
-e9e725df678d392b1a83b3a917f332bb
-qemu-system-x86_64: end ram md5
-(qemu)
-
-at destination:
-...skip...
-Completed load of VM with exit code 0 seq iteration 1264
-Completed load of VM with exit code 0 seq iteration 1265
-Completed load of VM with exit code 0 seq iteration 1266
-qemu-system-x86_64: after loading state section id 2(ram)
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
-
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-
-This occurs occasionally and only at tcg machine. It seems that
-some pages dirtied in source side don't transferred to destination.
-This problem can be reproduced even if we disable virtio.
-
-Is it OK for some pages that not transferred to destination when do
-migration ? Or is it a bug?
-I'm pretty sure that means it's a bug.  Hard to find though, I guess
-at least memtest is smaller than a big OS.  I think I'd dump the whole
-of memory on both sides, hexdump and diff them  - I'd guess it would
-just be one byte/word different, maybe that would offer some idea what
-wrote it.
-Maybe one better way to do that is with the help of userfaultfd's write-protect
-capability. It is still in the development by Andrea Arcangeli, but there
-is a RFC version available, please refer to
-http://www.spinics.net/lists/linux-mm/msg97422.html
-(I'm developing live memory snapshot which based on it, maybe this is another 
-scene where we
-can use userfaultfd's WP ;) ).
-Dave
-Any idea...
-
-=================md5 check patch=============================
-
-diff --git a/Makefile.target b/Makefile.target
-index 962d004..e2cb8e9 100644
---- a/Makefile.target
-+++ b/Makefile.target
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
-  obj-y += memory_mapping.o
-  obj-y += dump.o
-  obj-y += migration/ram.o migration/savevm.o
--LIBS := $(libs_softmmu) $(LIBS)
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
-
-  # xen support
-  obj-$(CONFIG_XEN) += xen-common.o
-diff --git a/migration/ram.c b/migration/ram.c
-index 1eb155a..3b7a09d 100644
---- a/migration/ram.c
-+++ b/migration/ram.c
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
-version_id)
-      }
-
-      rcu_read_unlock();
--    DPRINTF("Completed load of VM with exit code %d seq iteration "
-+    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
-              "%" PRIu64 "\n", ret, seq_iter);
-      return ret;
-  }
-diff --git a/migration/savevm.c b/migration/savevm.c
-index 0ad1b93..3feaa61 100644
---- a/migration/savevm.c
-+++ b/migration/savevm.c
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
-
-  }
-
-+#include "exec/ram_addr.h"
-+#include "qemu/rcu_queue.h"
-+#include <clplumbing/md5.h>
-+#ifndef MD5_DIGEST_LENGTH
-+#define MD5_DIGEST_LENGTH 16
-+#endif
-+
-+static void check_host_md5(void)
-+{
-+    int i;
-+    unsigned char md[MD5_DIGEST_LENGTH];
-+    rcu_read_lock();
-+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
-'pc.ram' block */
-+    rcu_read_unlock();
-+
-+    MD5(block->host, block->used_length, md);
-+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
-+        fprintf(stderr, "%02x", md[i]);
-+    }
-+    fprintf(stderr, "\n");
-+    error_report("end ram md5");
-+}
-+
-  void qemu_savevm_state_begin(QEMUFile *f,
-                               const MigrationParams *params)
-  {
-@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
-bool iterable_only)
-          save_section_header(f, se, QEMU_VM_SECTION_END);
-
-          ret = se->ops->save_live_complete_precopy(f, se->opaque);
-+
-+        fprintf(stderr, "after saving %s complete\n", se->idstr);
-+        check_host_md5();
-+
-          trace_savevm_section_end(se->idstr, se->section_id, ret);
-          save_section_footer(f, se);
-          if (ret < 0) {
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
-MigrationIncomingState *mis)
-                               section_id, le->se->idstr);
-                  return ret;
-              }
-+            if (section_type == QEMU_VM_SECTION_END) {
-+                error_report("after loading state section id %d(%s)",
-+                             section_id, le->se->idstr);
-+                check_host_md5();
-+            }
-              if (!check_section_footer(f, le)) {
-                  return -EINVAL;
-              }
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
-      }
-
-      cpu_synchronize_all_post_init();
-+    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
-+    check_host_md5();
-
-      return ret;
-  }
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-.
-
-On 12/03/2015 05:37 PM, Hailiang Zhang wrote:
-On 2015/12/3 17:24, Dr. David Alan Gilbert wrote:
-* Li Zhijian (address@hidden) wrote:
-Hi all,
-
-Does anyboday remember the similar issue post by hailiang months ago
-http://patchwork.ozlabs.org/patch/454322/
-At least tow bugs about migration had been fixed since that.
-Yes, I wondered what happened to that.
-And now we found the same issue at the tcg vm(kvm is fine), after
-migration,
-the content VM's memory is inconsistent.
-Hmm, TCG only - I don't know much about that; but I guess something must
-be accessing memory without using the proper macros/functions so
-it doesn't mark it as dirty.
-we add a patch to check memory content, you can find it from affix
-
-steps to reporduce:
-1) apply the patch and re-build qemu
-2) prepare the ubuntu guest and run memtest in grub.
-soruce side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off
-
-destination side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
-
-3) start migration
-with 1000M NIC, migration will finish within 3 min.
-
-at source:
-(qemu) migrate tcp:192.168.2.66:8881
-after saving ram complete
-e9e725df678d392b1a83b3a917f332bb
-qemu-system-x86_64: end ram md5
-(qemu)
-
-at destination:
-...skip...
-Completed load of VM with exit code 0 seq iteration 1264
-Completed load of VM with exit code 0 seq iteration 1265
-Completed load of VM with exit code 0 seq iteration 1266
-qemu-system-x86_64: after loading state section id 2(ram)
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-qemu-system-x86_64: qemu_loadvm_state: after
-cpu_synchronize_all_post_init
-
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-
-This occurs occasionally and only at tcg machine. It seems that
-some pages dirtied in source side don't transferred to destination.
-This problem can be reproduced even if we disable virtio.
-
-Is it OK for some pages that not transferred to destination when do
-migration ? Or is it a bug?
-I'm pretty sure that means it's a bug.  Hard to find though, I guess
-at least memtest is smaller than a big OS.  I think I'd dump the whole
-of memory on both sides, hexdump and diff them  - I'd guess it would
-just be one byte/word different, maybe that would offer some idea what
-wrote it.
-Maybe one better way to do that is with the help of userfaultfd's
-write-protect
-capability. It is still in the development by Andrea Arcangeli, but there
-is a RFC version available, please refer to
-http://www.spinics.net/lists/linux-mm/msg97422.html
-(I'm developing live memory snapshot which based on it, maybe this is
-another scene where we
-can use userfaultfd's WP ;) ).
-sounds good.
-
-thanks
-Li
-Dave
-Any idea...
-
-=================md5 check patch=============================
-
-diff --git a/Makefile.target b/Makefile.target
-index 962d004..e2cb8e9 100644
---- a/Makefile.target
-+++ b/Makefile.target
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
-  obj-y += memory_mapping.o
-  obj-y += dump.o
-  obj-y += migration/ram.o migration/savevm.o
--LIBS := $(libs_softmmu) $(LIBS)
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
-
-  # xen support
-  obj-$(CONFIG_XEN) += xen-common.o
-diff --git a/migration/ram.c b/migration/ram.c
-index 1eb155a..3b7a09d 100644
---- a/migration/ram.c
-+++ b/migration/ram.c
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
-version_id)
-      }
-
-      rcu_read_unlock();
--    DPRINTF("Completed load of VM with exit code %d seq iteration "
-+    fprintf(stderr, "Completed load of VM with exit code %d seq
-iteration "
-              "%" PRIu64 "\n", ret, seq_iter);
-      return ret;
-  }
-diff --git a/migration/savevm.c b/migration/savevm.c
-index 0ad1b93..3feaa61 100644
---- a/migration/savevm.c
-+++ b/migration/savevm.c
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
-
-  }
-
-+#include "exec/ram_addr.h"
-+#include "qemu/rcu_queue.h"
-+#include <clplumbing/md5.h>
-+#ifndef MD5_DIGEST_LENGTH
-+#define MD5_DIGEST_LENGTH 16
-+#endif
-+
-+static void check_host_md5(void)
-+{
-+    int i;
-+    unsigned char md[MD5_DIGEST_LENGTH];
-+    rcu_read_lock();
-+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
-'pc.ram' block */
-+    rcu_read_unlock();
-+
-+    MD5(block->host, block->used_length, md);
-+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
-+        fprintf(stderr, "%02x", md[i]);
-+    }
-+    fprintf(stderr, "\n");
-+    error_report("end ram md5");
-+}
-+
-  void qemu_savevm_state_begin(QEMUFile *f,
-                               const MigrationParams *params)
-  {
-@@ -1056,6 +1079,10 @@ void
-qemu_savevm_state_complete_precopy(QEMUFile *f,
-bool iterable_only)
-          save_section_header(f, se, QEMU_VM_SECTION_END);
-
-          ret = se->ops->save_live_complete_precopy(f, se->opaque);
-+
-+        fprintf(stderr, "after saving %s complete\n", se->idstr);
-+        check_host_md5();
-+
-          trace_savevm_section_end(se->idstr, se->section_id, ret);
-          save_section_footer(f, se);
-          if (ret < 0) {
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
-MigrationIncomingState *mis)
-                               section_id, le->se->idstr);
-                  return ret;
-              }
-+            if (section_type == QEMU_VM_SECTION_END) {
-+                error_report("after loading state section id %d(%s)",
-+                             section_id, le->se->idstr);
-+                check_host_md5();
-+            }
-              if (!check_section_footer(f, le)) {
-                  return -EINVAL;
-              }
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
-      }
-
-      cpu_synchronize_all_post_init();
-+    error_report("%s: after cpu_synchronize_all_post_init\n",
-__func__);
-+    check_host_md5();
-
-      return ret;
-  }
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-.
-.
---
-Best regards.
-Li Zhijian (8555)
-
-On 12/03/2015 05:24 PM, Dr. David Alan Gilbert wrote:
-* Li Zhijian (address@hidden) wrote:
-Hi all,
-
-Does anyboday remember the similar issue post by hailiang months ago
-http://patchwork.ozlabs.org/patch/454322/
-At least tow bugs about migration had been fixed since that.
-Yes, I wondered what happened to that.
-And now we found the same issue at the tcg vm(kvm is fine), after migration,
-the content VM's memory is inconsistent.
-Hmm, TCG only - I don't know much about that; but I guess something must
-be accessing memory without using the proper macros/functions so
-it doesn't mark it as dirty.
-we add a patch to check memory content, you can find it from affix
-
-steps to reporduce:
-1) apply the patch and re-build qemu
-2) prepare the ubuntu guest and run memtest in grub.
-soruce side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off
-
-destination side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
-
-3) start migration
-with 1000M NIC, migration will finish within 3 min.
-
-at source:
-(qemu) migrate tcp:192.168.2.66:8881
-after saving ram complete
-e9e725df678d392b1a83b3a917f332bb
-qemu-system-x86_64: end ram md5
-(qemu)
-
-at destination:
-...skip...
-Completed load of VM with exit code 0 seq iteration 1264
-Completed load of VM with exit code 0 seq iteration 1265
-Completed load of VM with exit code 0 seq iteration 1266
-qemu-system-x86_64: after loading state section id 2(ram)
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
-
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-
-This occurs occasionally and only at tcg machine. It seems that
-some pages dirtied in source side don't transferred to destination.
-This problem can be reproduced even if we disable virtio.
-
-Is it OK for some pages that not transferred to destination when do
-migration ? Or is it a bug?
-I'm pretty sure that means it's a bug.  Hard to find though, I guess
-at least memtest is smaller than a big OS.  I think I'd dump the whole
-of memory on both sides, hexdump and diff them  - I'd guess it would
-just be one byte/word different, maybe that would offer some idea what
-wrote it.
-I try to dump and compare them, more than 10 pages are different.
-in source side, they are random value rather than always 'FF' 'FB' 'EF'
-'BF'... in destination.
-and not all of the different pages are continuous.
-
-thanks
-Li
-Dave
-Any idea...
-
-=================md5 check patch=============================
-
-diff --git a/Makefile.target b/Makefile.target
-index 962d004..e2cb8e9 100644
---- a/Makefile.target
-+++ b/Makefile.target
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
-  obj-y += memory_mapping.o
-  obj-y += dump.o
-  obj-y += migration/ram.o migration/savevm.o
--LIBS := $(libs_softmmu) $(LIBS)
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
-
-  # xen support
-  obj-$(CONFIG_XEN) += xen-common.o
-diff --git a/migration/ram.c b/migration/ram.c
-index 1eb155a..3b7a09d 100644
---- a/migration/ram.c
-+++ b/migration/ram.c
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
-version_id)
-      }
-
-      rcu_read_unlock();
--    DPRINTF("Completed load of VM with exit code %d seq iteration "
-+    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
-              "%" PRIu64 "\n", ret, seq_iter);
-      return ret;
-  }
-diff --git a/migration/savevm.c b/migration/savevm.c
-index 0ad1b93..3feaa61 100644
---- a/migration/savevm.c
-+++ b/migration/savevm.c
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
-
-  }
-
-+#include "exec/ram_addr.h"
-+#include "qemu/rcu_queue.h"
-+#include <clplumbing/md5.h>
-+#ifndef MD5_DIGEST_LENGTH
-+#define MD5_DIGEST_LENGTH 16
-+#endif
-+
-+static void check_host_md5(void)
-+{
-+    int i;
-+    unsigned char md[MD5_DIGEST_LENGTH];
-+    rcu_read_lock();
-+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
-'pc.ram' block */
-+    rcu_read_unlock();
-+
-+    MD5(block->host, block->used_length, md);
-+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
-+        fprintf(stderr, "%02x", md[i]);
-+    }
-+    fprintf(stderr, "\n");
-+    error_report("end ram md5");
-+}
-+
-  void qemu_savevm_state_begin(QEMUFile *f,
-                               const MigrationParams *params)
-  {
-@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
-bool iterable_only)
-          save_section_header(f, se, QEMU_VM_SECTION_END);
-
-          ret = se->ops->save_live_complete_precopy(f, se->opaque);
-+
-+        fprintf(stderr, "after saving %s complete\n", se->idstr);
-+        check_host_md5();
-+
-          trace_savevm_section_end(se->idstr, se->section_id, ret);
-          save_section_footer(f, se);
-          if (ret < 0) {
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
-MigrationIncomingState *mis)
-                               section_id, le->se->idstr);
-                  return ret;
-              }
-+            if (section_type == QEMU_VM_SECTION_END) {
-+                error_report("after loading state section id %d(%s)",
-+                             section_id, le->se->idstr);
-+                check_host_md5();
-+            }
-              if (!check_section_footer(f, le)) {
-                  return -EINVAL;
-              }
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
-      }
-
-      cpu_synchronize_all_post_init();
-+    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
-+    check_host_md5();
-
-      return ret;
-  }
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-
-.
---
-Best regards.
-Li Zhijian (8555)
-
-* Li Zhijian (address@hidden) wrote:
->
->
->
-On 12/03/2015 05:24 PM, Dr. David Alan Gilbert wrote:
->
->* Li Zhijian (address@hidden) wrote:
->
->>Hi all,
->
->>
->
->>Does anyboday remember the similar issue post by hailiang months ago
->
->>
-http://patchwork.ozlabs.org/patch/454322/
->
->>At least tow bugs about migration had been fixed since that.
->
->
->
->Yes, I wondered what happened to that.
->
->
->
->>And now we found the same issue at the tcg vm(kvm is fine), after migration,
->
->>the content VM's memory is inconsistent.
->
->
->
->Hmm, TCG only - I don't know much about that; but I guess something must
->
->be accessing memory without using the proper macros/functions so
->
->it doesn't mark it as dirty.
->
->
->
->>we add a patch to check memory content, you can find it from affix
->
->>
->
->>steps to reporduce:
->
->>1) apply the patch and re-build qemu
->
->>2) prepare the ubuntu guest and run memtest in grub.
->
->>soruce side:
->
->>x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
->>e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
->>if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
->>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
->>-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
->>tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
->>pc-i440fx-2.3,accel=tcg,usb=off
->
->>
->
->>destination side:
->
->>x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
->>e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
->>if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
->>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
->>-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
->>tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
->>pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
->
->>
->
->>3) start migration
->
->>with 1000M NIC, migration will finish within 3 min.
->
->>
->
->>at source:
->
->>(qemu) migrate tcp:192.168.2.66:8881
->
->>after saving ram complete
->
->>e9e725df678d392b1a83b3a917f332bb
->
->>qemu-system-x86_64: end ram md5
->
->>(qemu)
->
->>
->
->>at destination:
->
->>...skip...
->
->>Completed load of VM with exit code 0 seq iteration 1264
->
->>Completed load of VM with exit code 0 seq iteration 1265
->
->>Completed load of VM with exit code 0 seq iteration 1266
->
->>qemu-system-x86_64: after loading state section id 2(ram)
->
->>49c2dac7bde0e5e22db7280dcb3824f9
->
->>qemu-system-x86_64: end ram md5
->
->>qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
->
->>
->
->>49c2dac7bde0e5e22db7280dcb3824f9
->
->>qemu-system-x86_64: end ram md5
->
->>
->
->>This occurs occasionally and only at tcg machine. It seems that
->
->>some pages dirtied in source side don't transferred to destination.
->
->>This problem can be reproduced even if we disable virtio.
->
->>
->
->>Is it OK for some pages that not transferred to destination when do
->
->>migration ? Or is it a bug?
->
->
->
->I'm pretty sure that means it's a bug.  Hard to find though, I guess
->
->at least memtest is smaller than a big OS.  I think I'd dump the whole
->
->of memory on both sides, hexdump and diff them  - I'd guess it would
->
->just be one byte/word different, maybe that would offer some idea what
->
->wrote it.
->
->
-I try to dump and compare them, more than 10 pages are different.
->
-in source side, they are random value rather than always 'FF' 'FB' 'EF'
->
-'BF'... in destination.
->
->
-and not all of the different pages are continuous.
-I wonder if it happens on all of memtest's different test patterns,
-perhaps it might be possible to narrow it down if you tell memtest
-to only run one test at a time.
-
-Dave
-
->
->
-thanks
->
-Li
->
->
->
->
->
->Dave
->
->
->
->>Any idea...
->
->>
->
->>=================md5 check patch=============================
->
->>
->
->>diff --git a/Makefile.target b/Makefile.target
->
->>index 962d004..e2cb8e9 100644
->
->>--- a/Makefile.target
->
->>+++ b/Makefile.target
->
->>@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
->
->>  obj-y += memory_mapping.o
->
->>  obj-y += dump.o
->
->>  obj-y += migration/ram.o migration/savevm.o
->
->>-LIBS := $(libs_softmmu) $(LIBS)
->
->>+LIBS := $(libs_softmmu) $(LIBS) -lplumb
->
->>
->
->>  # xen support
->
->>  obj-$(CONFIG_XEN) += xen-common.o
->
->>diff --git a/migration/ram.c b/migration/ram.c
->
->>index 1eb155a..3b7a09d 100644
->
->>--- a/migration/ram.c
->
->>+++ b/migration/ram.c
->
->>@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
->
->>version_id)
->
->>      }
->
->>
->
->>      rcu_read_unlock();
->
->>-    DPRINTF("Completed load of VM with exit code %d seq iteration "
->
->>+    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
->
->>              "%" PRIu64 "\n", ret, seq_iter);
->
->>      return ret;
->
->>  }
->
->>diff --git a/migration/savevm.c b/migration/savevm.c
->
->>index 0ad1b93..3feaa61 100644
->
->>--- a/migration/savevm.c
->
->>+++ b/migration/savevm.c
->
->>@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
->
->>
->
->>  }
->
->>
->
->>+#include "exec/ram_addr.h"
->
->>+#include "qemu/rcu_queue.h"
->
->>+#include <clplumbing/md5.h>
->
->>+#ifndef MD5_DIGEST_LENGTH
->
->>+#define MD5_DIGEST_LENGTH 16
->
->>+#endif
->
->>+
->
->>+static void check_host_md5(void)
->
->>+{
->
->>+    int i;
->
->>+    unsigned char md[MD5_DIGEST_LENGTH];
->
->>+    rcu_read_lock();
->
->>+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
->
->>'pc.ram' block */
->
->>+    rcu_read_unlock();
->
->>+
->
->>+    MD5(block->host, block->used_length, md);
->
->>+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
->
->>+        fprintf(stderr, "%02x", md[i]);
->
->>+    }
->
->>+    fprintf(stderr, "\n");
->
->>+    error_report("end ram md5");
->
->>+}
->
->>+
->
->>  void qemu_savevm_state_begin(QEMUFile *f,
->
->>                               const MigrationParams *params)
->
->>  {
->
->>@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
->
->>bool iterable_only)
->
->>          save_section_header(f, se, QEMU_VM_SECTION_END);
->
->>
->
->>          ret = se->ops->save_live_complete_precopy(f, se->opaque);
->
->>+
->
->>+        fprintf(stderr, "after saving %s complete\n", se->idstr);
->
->>+        check_host_md5();
->
->>+
->
->>          trace_savevm_section_end(se->idstr, se->section_id, ret);
->
->>          save_section_footer(f, se);
->
->>          if (ret < 0) {
->
->>@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
->
->>MigrationIncomingState *mis)
->
->>                               section_id, le->se->idstr);
->
->>                  return ret;
->
->>              }
->
->>+            if (section_type == QEMU_VM_SECTION_END) {
->
->>+                error_report("after loading state section id %d(%s)",
->
->>+                             section_id, le->se->idstr);
->
->>+                check_host_md5();
->
->>+            }
->
->>              if (!check_section_footer(f, le)) {
->
->>                  return -EINVAL;
->
->>              }
->
->>@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
->
->>      }
->
->>
->
->>      cpu_synchronize_all_post_init();
->
->>+    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
->
->>+    check_host_md5();
->
->>
->
->>      return ret;
->
->>  }
->
->>
->
->>
->
->>
->
->--
->
->Dr. David Alan Gilbert / address@hidden / Manchester, UK
->
->
->
->
->
->.
->
->
->
->
---
->
-Best regards.
->
-Li Zhijian (8555)
->
->
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-Li Zhijian <address@hidden> wrote:
->
-Hi all,
->
->
-Does anyboday remember the similar issue post by hailiang months ago
->
-http://patchwork.ozlabs.org/patch/454322/
->
-At least tow bugs about migration had been fixed since that.
->
->
-And now we found the same issue at the tcg vm(kvm is fine), after
->
-migration, the content VM's memory is inconsistent.
->
->
-we add a patch to check memory content, you can find it from affix
->
->
-steps to reporduce:
->
-1) apply the patch and re-build qemu
->
-2) prepare the ubuntu guest and run memtest in grub.
->
-soruce side:
->
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
-pc-i440fx-2.3,accel=tcg,usb=off
->
->
-destination side:
->
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
->
->
-3) start migration
->
-with 1000M NIC, migration will finish within 3 min.
->
->
-at source:
->
-(qemu) migrate tcp:192.168.2.66:8881
->
-after saving ram complete
->
-e9e725df678d392b1a83b3a917f332bb
->
-qemu-system-x86_64: end ram md5
->
-(qemu)
->
->
-at destination:
->
-...skip...
->
-Completed load of VM with exit code 0 seq iteration 1264
->
-Completed load of VM with exit code 0 seq iteration 1265
->
-Completed load of VM with exit code 0 seq iteration 1266
->
-qemu-system-x86_64: after loading state section id 2(ram)
->
-49c2dac7bde0e5e22db7280dcb3824f9
->
-qemu-system-x86_64: end ram md5
->
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
->
->
-49c2dac7bde0e5e22db7280dcb3824f9
->
-qemu-system-x86_64: end ram md5
->
->
-This occurs occasionally and only at tcg machine. It seems that
->
-some pages dirtied in source side don't transferred to destination.
->
-This problem can be reproduced even if we disable virtio.
->
->
-Is it OK for some pages that not transferred to destination when do
->
-migration ? Or is it a bug?
->
->
-Any idea...
-Thanks for describing how to reproduce the bug.
-If some pages are not transferred to destination then it is a bug, so we
-need to know what the problem is, notice that the problem can be that
-TCG is not marking dirty some page, that Migration code "forgets" about
-that page, or anything eles altogether, that is what we need to find.
-
-There are more posibilities, I am not sure that memtest is on 32bit
-mode, and it is inside posibility that we are missing some state when we
-are on real mode.
-
-Will try to take a look at this.
-
-THanks, again.
-
-
->
->
-=================md5 check patch=============================
->
->
-diff --git a/Makefile.target b/Makefile.target
->
-index 962d004..e2cb8e9 100644
->
---- a/Makefile.target
->
-+++ b/Makefile.target
->
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
->
-obj-y += memory_mapping.o
->
-obj-y += dump.o
->
-obj-y += migration/ram.o migration/savevm.o
->
--LIBS := $(libs_softmmu) $(LIBS)
->
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
->
->
-# xen support
->
-obj-$(CONFIG_XEN) += xen-common.o
->
-diff --git a/migration/ram.c b/migration/ram.c
->
-index 1eb155a..3b7a09d 100644
->
---- a/migration/ram.c
->
-+++ b/migration/ram.c
->
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque,
->
-int version_id)
->
-}
->
->
-rcu_read_unlock();
->
--    DPRINTF("Completed load of VM with exit code %d seq iteration "
->
-+    fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
->
-"%" PRIu64 "\n", ret, seq_iter);
->
-return ret;
->
-}
->
-diff --git a/migration/savevm.c b/migration/savevm.c
->
-index 0ad1b93..3feaa61 100644
->
---- a/migration/savevm.c
->
-+++ b/migration/savevm.c
->
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
->
->
-}
->
->
-+#include "exec/ram_addr.h"
->
-+#include "qemu/rcu_queue.h"
->
-+#include <clplumbing/md5.h>
->
-+#ifndef MD5_DIGEST_LENGTH
->
-+#define MD5_DIGEST_LENGTH 16
->
-+#endif
->
-+
->
-+static void check_host_md5(void)
->
-+{
->
-+    int i;
->
-+    unsigned char md[MD5_DIGEST_LENGTH];
->
-+    rcu_read_lock();
->
-+    RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
->
-'pc.ram' block */
->
-+    rcu_read_unlock();
->
-+
->
-+    MD5(block->host, block->used_length, md);
->
-+    for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
->
-+        fprintf(stderr, "%02x", md[i]);
->
-+    }
->
-+    fprintf(stderr, "\n");
->
-+    error_report("end ram md5");
->
-+}
->
-+
->
-void qemu_savevm_state_begin(QEMUFile *f,
->
-const MigrationParams *params)
->
-{
->
-@@ -1056,6 +1079,10 @@ void
->
-qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only)
->
-save_section_header(f, se, QEMU_VM_SECTION_END);
->
->
-ret = se->ops->save_live_complete_precopy(f, se->opaque);
->
-+
->
-+        fprintf(stderr, "after saving %s complete\n", se->idstr);
->
-+        check_host_md5();
->
-+
->
-trace_savevm_section_end(se->idstr, se->section_id, ret);
->
-save_section_footer(f, se);
->
-if (ret < 0) {
->
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
->
-MigrationIncomingState *mis)
->
-section_id, le->se->idstr);
->
-return ret;
->
-}
->
-+            if (section_type == QEMU_VM_SECTION_END) {
->
-+                error_report("after loading state section id %d(%s)",
->
-+                             section_id, le->se->idstr);
->
-+                check_host_md5();
->
-+            }
->
-if (!check_section_footer(f, le)) {
->
-return -EINVAL;
->
-}
->
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
->
-}
->
->
-cpu_synchronize_all_post_init();
->
-+    error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
->
-+    check_host_md5();
->
->
-return ret;
->
-}
-
->
->
-Thanks for describing how to reproduce the bug.
->
-If some pages are not transferred to destination then it is a bug, so we need
->
-to know what the problem is, notice that the problem can be that TCG is not
->
-marking dirty some page, that Migration code "forgets" about that page, or
->
-anything eles altogether, that is what we need to find.
->
->
-There are more posibilities, I am not sure that memtest is on 32bit mode, and
->
-it is inside posibility that we are missing some state when we are on real
->
-mode.
->
->
-Will try to take a look at this.
->
->
-THanks, again.
->
-Hi Juan & Amit
-
- Do you think we should add a mechanism to check the data integrity during LM 
-like Zhijian's patch did?  it may be very helpful for developers. 
- Actually, I did the similar thing before in order to make sure that I did the 
-right thing we I change the code related to LM.
-
-Liang
-
-On (Fri) 04 Dec 2015 [01:43:07], Li, Liang Z wrote:
->
->
->
-> Thanks for describing how to reproduce the bug.
->
-> If some pages are not transferred to destination then it is a bug, so we
->
-> need
->
-> to know what the problem is, notice that the problem can be that TCG is not
->
-> marking dirty some page, that Migration code "forgets" about that page, or
->
-> anything eles altogether, that is what we need to find.
->
->
->
-> There are more posibilities, I am not sure that memtest is on 32bit mode,
->
-> and
->
-> it is inside posibility that we are missing some state when we are on real
->
-> mode.
->
->
->
-> Will try to take a look at this.
->
->
->
-> THanks, again.
->
->
->
->
-Hi Juan & Amit
->
->
-Do you think we should add a mechanism to check the data integrity during LM
->
-like Zhijian's patch did?  it may be very helpful for developers.
->
-Actually, I did the similar thing before in order to make sure that I did
->
-the right thing we I change the code related to LM.
-If you mean for debugging, something that's not always on, then I'm
-fine with it.
-
-A script that goes along that shows the result of comparison of the
-diff will be helpful too, something that shows how many pages are
-differnt, how many bytes in a page on average, and so on.
-
-                Amit
-
diff --git a/classification_output/04/mistranslation/74545755 b/classification_output/04/mistranslation/74545755
deleted file mode 100644
index 7f5ace505..000000000
--- a/classification_output/04/mistranslation/74545755
+++ /dev/null
@@ -1,352 +0,0 @@
-mistranslation: 0.752
-device: 0.720
-instruction: 0.700
-other: 0.683
-semantic: 0.669
-KVM: 0.661
-graphic: 0.660
-vnc: 0.650
-assembly: 0.648
-boot: 0.607
-network: 0.550
-socket: 0.549
-
-[Bug Report][RFC PATCH 0/1] block: fix failing assert on paused VM migration
-
-There's a bug (failing assert) which is reproduced during migration of
-a paused VM.  I am able to reproduce it on a stand with 2 nodes and a common
-NFS share, with VM's disk on that share.
-
-root@fedora40-1-vm:~# virsh domblklist alma8-vm
- Target   Source
-------------------------------------------
- sda      /mnt/shared/images/alma8.qcow2
-
-root@fedora40-1-vm:~# df -Th /mnt/shared
-Filesystem          Type  Size  Used Avail Use% Mounted on
-127.0.0.1:/srv/nfsd nfs4   63G   16G   48G  25% /mnt/shared
-
-On the 1st node:
-
-root@fedora40-1-vm:~# virsh start alma8-vm ; virsh suspend alma8-vm
-root@fedora40-1-vm:~# virsh migrate --compressed --p2p --persistent 
---undefinesource --live alma8-vm qemu+ssh://fedora40-2-vm/system
-
-Then on the 2nd node:
-
-root@fedora40-2-vm:~# virsh migrate --compressed --p2p --persistent 
---undefinesource --live alma8-vm qemu+ssh://fedora40-1-vm/system
-error: operation failed: domain is not running
-
-root@fedora40-2-vm:~# tail -3 /var/log/libvirt/qemu/alma8-vm.log
-2024-09-19 13:53:33.336+0000: initiating migration
-qemu-system-x86_64: ../block.c:6976: int 
-bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & 
-BDRV_O_INACTIVE)' failed.
-2024-09-19 13:53:42.991+0000: shutting down, reason=crashed
-
-Backtrace:
-
-(gdb) bt
-#0  0x00007f7eaa2f1664 in __pthread_kill_implementation () at /lib64/libc.so.6
-#1  0x00007f7eaa298c4e in raise () at /lib64/libc.so.6
-#2  0x00007f7eaa280902 in abort () at /lib64/libc.so.6
-#3  0x00007f7eaa28081e in __assert_fail_base.cold () at /lib64/libc.so.6
-#4  0x00007f7eaa290d87 in __assert_fail () at /lib64/libc.so.6
-#5  0x0000563c38b95eb8 in bdrv_inactivate_recurse (bs=0x563c3b6c60c0) at 
-../block.c:6976
-#6  0x0000563c38b95aeb in bdrv_inactivate_all () at ../block.c:7038
-#7  0x0000563c3884d354 in qemu_savevm_state_complete_precopy_non_iterable 
-(f=0x563c3b700c20, in_postcopy=false, inactivate_disks=true)
-    at ../migration/savevm.c:1571
-#8  0x0000563c3884dc1a in qemu_savevm_state_complete_precopy (f=0x563c3b700c20, 
-iterable_only=false, inactivate_disks=true) at ../migration/savevm.c:1631
-#9  0x0000563c3883a340 in migration_completion_precopy (s=0x563c3b4d51f0, 
-current_active_state=<optimized out>) at ../migration/migration.c:2780
-#10 migration_completion (s=0x563c3b4d51f0) at ../migration/migration.c:2844
-#11 migration_iteration_run (s=0x563c3b4d51f0) at ../migration/migration.c:3270
-#12 migration_thread (opaque=0x563c3b4d51f0) at ../migration/migration.c:3536
-#13 0x0000563c38dbcf14 in qemu_thread_start (args=0x563c3c2d5bf0) at 
-../util/qemu-thread-posix.c:541
-#14 0x00007f7eaa2ef6d7 in start_thread () at /lib64/libc.so.6
-#15 0x00007f7eaa373414 in clone () at /lib64/libc.so.6
-
-What happens here is that after 1st migration BDS related to HDD remains
-inactive as VM is still paused.  Then when we initiate 2nd migration,
-bdrv_inactivate_all() leads to the attempt to set BDRV_O_INACTIVE flag
-on that node which is already set, thus assert fails.
-
-Attached patch which simply skips setting flag if it's already set is more
-of a kludge than a clean solution.  Should we use more sophisticated logic
-which allows some of the nodes be in inactive state prior to the migration,
-and takes them into account during bdrv_inactivate_all()?  Comments would
-be appreciated.
-
-Andrey
-
-Andrey Drobyshev (1):
-  block: do not fail when inactivating node which is inactive
-
- block.c | 10 +++++++++-
- 1 file changed, 9 insertions(+), 1 deletion(-)
-
--- 
-2.39.3
-
-Instead of throwing an assert let's just ignore that flag is already set
-and return.  We assume that it's going to be safe to ignore.  Otherwise
-this assert fails when migrating a paused VM back and forth.
-
-Ideally we'd like to have a more sophisticated solution, e.g. not even
-scan the nodes which should be inactive at this point.
-
-Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
----
- block.c | 10 +++++++++-
- 1 file changed, 9 insertions(+), 1 deletion(-)
-
-diff --git a/block.c b/block.c
-index 7d90007cae..c1dcf906d1 100644
---- a/block.c
-+++ b/block.c
-@@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK 
-bdrv_inactivate_recurse(BlockDriverState *bs)
-         return 0;
-     }
- 
--    assert(!(bs->open_flags & BDRV_O_INACTIVE));
-+    if (bs->open_flags & BDRV_O_INACTIVE) {
-+        /*
-+         * Return here instead of throwing assert as a workaround to
-+         * prevent failure on migrating paused VM.
-+         * Here we assume that if we're trying to inactivate BDS that's
-+         * already inactive, it's safe to just ignore it.
-+         */
-+        return 0;
-+    }
- 
-     /* Inactivate this node */
-     if (bs->drv->bdrv_inactivate) {
--- 
-2.39.3
-
-[add migration maintainers]
-
-On 24.09.24 15:56, Andrey Drobyshev wrote:
-Instead of throwing an assert let's just ignore that flag is already set
-and return.  We assume that it's going to be safe to ignore.  Otherwise
-this assert fails when migrating a paused VM back and forth.
-
-Ideally we'd like to have a more sophisticated solution, e.g. not even
-scan the nodes which should be inactive at this point.
-
-Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
----
-  block.c | 10 +++++++++-
-  1 file changed, 9 insertions(+), 1 deletion(-)
-
-diff --git a/block.c b/block.c
-index 7d90007cae..c1dcf906d1 100644
---- a/block.c
-+++ b/block.c
-@@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK 
-bdrv_inactivate_recurse(BlockDriverState *bs)
-          return 0;
-      }
--    assert(!(bs->open_flags & BDRV_O_INACTIVE));
-+    if (bs->open_flags & BDRV_O_INACTIVE) {
-+        /*
-+         * Return here instead of throwing assert as a workaround to
-+         * prevent failure on migrating paused VM.
-+         * Here we assume that if we're trying to inactivate BDS that's
-+         * already inactive, it's safe to just ignore it.
-+         */
-+        return 0;
-+    }
-/* Inactivate this node */
-if (bs->drv->bdrv_inactivate) {
-I doubt that this a correct way to go.
-
-As far as I understand, "inactive" actually means that "storage is not belong to 
-qemu, but to someone else (another qemu process for example), and may be changed 
-transparently". In turn this means that Qemu should do nothing with inactive disks. So the 
-problem is that nobody called bdrv_activate_all on target, and we shouldn't ignore that.
-
-Hmm, I see in process_incoming_migration_bh() we do call bdrv_activate_all(), 
-but only in some scenarios. May be, the condition should be less strict here.
-
-Why we need any condition here at all? Don't we want to activate block-layer on 
-target after migration anyway?
-
---
-Best regards,
-Vladimir
-
-On 9/30/24 12:25 PM, Vladimir Sementsov-Ogievskiy wrote:
->
-[add migration maintainers]
->
->
-On 24.09.24 15:56, Andrey Drobyshev wrote:
->
-> [...]
->
->
-I doubt that this a correct way to go.
->
->
-As far as I understand, "inactive" actually means that "storage is not
->
-belong to qemu, but to someone else (another qemu process for example),
->
-and may be changed transparently". In turn this means that Qemu should
->
-do nothing with inactive disks. So the problem is that nobody called
->
-bdrv_activate_all on target, and we shouldn't ignore that.
->
->
-Hmm, I see in process_incoming_migration_bh() we do call
->
-bdrv_activate_all(), but only in some scenarios. May be, the condition
->
-should be less strict here.
->
->
-Why we need any condition here at all? Don't we want to activate
->
-block-layer on target after migration anyway?
->
-Hmm I'm not sure about the unconditional activation, since we at least
-have to honor LATE_BLOCK_ACTIVATE cap if it's set (and probably delay it
-in such a case).  In current libvirt upstream I see such code:
-
->
-/* Migration capabilities which should always be enabled as long as they
->
->
-* are supported by QEMU. If the capability is supposed to be enabled on both
->
->
-* sides of migration, it won't be enabled unless both sides support it.
->
->
-*/
->
->
-static const qemuMigrationParamsAlwaysOnItem qemuMigrationParamsAlwaysOn[] =
->
-{
->
->
-{QEMU_MIGRATION_CAP_PAUSE_BEFORE_SWITCHOVER,
->
->
-QEMU_MIGRATION_SOURCE},
->
->
->
->
-{QEMU_MIGRATION_CAP_LATE_BLOCK_ACTIVATE,
->
->
-QEMU_MIGRATION_DESTINATION},
->
->
-};
-which means that libvirt always wants LATE_BLOCK_ACTIVATE to be set.
-
-The code from process_incoming_migration_bh() you're referring to:
-
->
-/* If capability late_block_activate is set:
->
->
-* Only fire up the block code now if we're going to restart the
->
->
-* VM, else 'cont' will do it.
->
->
-* This causes file locking to happen; so we don't want it to happen
->
->
-* unless we really are starting the VM.
->
->
-*/
->
->
-if (!migrate_late_block_activate() ||
->
->
-(autostart && (!global_state_received() ||
->
->
-runstate_is_live(global_state_get_runstate())))) {
->
->
-/* Make sure all file formats throw away their mutable metadata.
->
->
->
-* If we get an error here, just don't restart the VM yet. */
->
->
-bdrv_activate_all(&local_err);
->
->
-if (local_err) {
->
->
-error_report_err(local_err);
->
->
-local_err = NULL;
->
->
-autostart = false;
->
->
-}
->
->
-}
-It states explicitly that we're either going to start VM right at this
-point if (autostart == true), or we wait till "cont" command happens.
-None of this is going to happen if we start another migration while
-still being in PAUSED state.  So I think it seems reasonable to take
-such case into account.  For instance, this patch does prevent the crash:
-
->
-diff --git a/migration/migration.c b/migration/migration.c
->
-index ae2be31557..3222f6745b 100644
->
---- a/migration/migration.c
->
-+++ b/migration/migration.c
->
-@@ -733,7 +733,8 @@ static void process_incoming_migration_bh(void *opaque)
->
-*/
->
-if (!migrate_late_block_activate() ||
->
-(autostart && (!global_state_received() ||
->
--            runstate_is_live(global_state_get_runstate())))) {
->
-+            runstate_is_live(global_state_get_runstate()))) ||
->
-+         (!autostart && global_state_get_runstate() == RUN_STATE_PAUSED)) {
->
-/* Make sure all file formats throw away their mutable metadata.
->
-* If we get an error here, just don't restart the VM yet. */
->
-bdrv_activate_all(&local_err);
-What are your thoughts on it?
-
-Andrey
-
diff --git a/classification_output/04/mistranslation/80604314 b/classification_output/04/mistranslation/80604314
deleted file mode 100644
index cb64e7d6d..000000000
--- a/classification_output/04/mistranslation/80604314
+++ /dev/null
@@ -1,1488 +0,0 @@
-mistranslation: 0.922
-device: 0.917
-graphic: 0.901
-other: 0.898
-KVM: 0.891
-semantic: 0.890
-assembly: 0.886
-socket: 0.884
-vnc: 0.881
-instruction: 0.877
-network: 0.865
-boot: 0.860
-
-[BUG] vhost-vdpa: qemu-system-s390x crashes with second virtio-net-ccw device
-
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, 
-    config=0x55d6ad9e3f80 "RT") at /home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-
-Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
-the autogenerated virtio-net-ccw device is present) works. Specifying
-several "-device virtio-net-pci" works as well.
-
-Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
-client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
-works (in-between state does not compile).
-
-This is reproducible with tcg as well. Same problem both with
---enable-vhost-vdpa and --disable-vhost-vdpa.
-
-Have not yet tried to figure out what might be special with
-virtio-ccw... anyone have an idea?
-
-[This should probably be considered a blocker?]
-
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-When I start qemu with a second virtio-net-ccw device (i.e. adding
->
--device virtio-net-ccw in addition to the autogenerated device), I get
->
-a segfault. gdb points to
->
->
-#0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
-config=0x55d6ad9e3f80 "RT") at
->
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-146       if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
->
-(backtrace doesn't go further)
->
->
-Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
->
-the autogenerated virtio-net-ccw device is present) works. Specifying
->
-several "-device virtio-net-pci" works as well.
->
->
-Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
->
-client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
->
-works (in-between state does not compile).
-Ouch. I didn't test all in-between states :(
-But I wish we had a 0-day instrastructure like kernel has,
-that catches things like that.
-
->
-This is reproducible with tcg as well. Same problem both with
->
---enable-vhost-vdpa and --disable-vhost-vdpa.
->
->
-Have not yet tried to figure out what might be special with
->
-virtio-ccw... anyone have an idea?
->
->
-[This should probably be considered a blocker?]
-
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin" <mst@redhat.com> wrote:
-
->
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-> When I start qemu with a second virtio-net-ccw device (i.e. adding
->
-> -device virtio-net-ccw in addition to the autogenerated device), I get
->
-> a segfault. gdb points to
->
->
->
-> #0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
->     config=0x55d6ad9e3f80 "RT") at
->
-> /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> 146     if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
->
->
-> (backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-
->
->
->
-> Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
->
-> the autogenerated virtio-net-ccw device is present) works. Specifying
->
-> several "-device virtio-net-pci" works as well.
->
->
->
-> Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
->
-> client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
->
-> works (in-between state does not compile).
->
->
-Ouch. I didn't test all in-between states :(
->
-But I wish we had a 0-day instrastructure like kernel has,
->
-that catches things like that.
-Yep, that would be useful... so patchew only builds the complete series?
-
->
->
-> This is reproducible with tcg as well. Same problem both with
->
-> --enable-vhost-vdpa and --disable-vhost-vdpa.
->
->
->
-> Have not yet tried to figure out what might be special with
->
-> virtio-ccw... anyone have an idea?
->
->
->
-> [This should probably be considered a blocker?]
-I think so, as it makes s390x unusable with more that one
-virtio-net-ccw device, and I don't even see a workaround.
-
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
-On Fri, 24 Jul 2020 09:30:58 -0400
->
-"Michael S. Tsirkin" <mst@redhat.com> wrote:
->
->
-> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-> > When I start qemu with a second virtio-net-ccw device (i.e. adding
->
-> > -device virtio-net-ccw in addition to the autogenerated device), I get
->
-> > a segfault. gdb points to
->
-> >
->
-> > #0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
-> >     config=0x55d6ad9e3f80 "RT") at
->
-> > /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> > 146           if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
-> >
->
-> > (backtrace doesn't go further)
->
->
-The core was incomplete, but running under gdb directly shows that it
->
-is just a bog-standard config space access (first for that device).
->
->
-The cause of the crash is that nc->peer is not set... no idea how that
->
-can happen, not that familiar with that part of QEMU. (Should the code
->
-check, or is that really something that should not happen?)
->
->
-What I don't understand is why it is set correctly for the first,
->
-autogenerated virtio-net-ccw device, but not for the second one, and
->
-why virtio-net-pci doesn't show these problems. The only difference
->
-between -ccw and -pci that comes to my mind here is that config space
->
-accesses for ccw are done via an asynchronous operation, so timing
->
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-
->
-> >
->
-> > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
->
-> > the autogenerated virtio-net-ccw device is present) works. Specifying
->
-> > several "-device virtio-net-pci" works as well.
->
-> >
->
-> > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
->
-> > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
->
-> > works (in-between state does not compile).
->
->
->
-> Ouch. I didn't test all in-between states :(
->
-> But I wish we had a 0-day instrastructure like kernel has,
->
-> that catches things like that.
->
->
-Yep, that would be useful... so patchew only builds the complete series?
->
->
->
->
-> > This is reproducible with tcg as well. Same problem both with
->
-> > --enable-vhost-vdpa and --disable-vhost-vdpa.
->
-> >
->
-> > Have not yet tried to figure out what might be special with
->
-> > virtio-ccw... anyone have an idea?
->
-> >
->
-> > [This should probably be considered a blocker?]
->
->
-I think so, as it makes s390x unusable with more that one
->
-virtio-net-ccw device, and I don't even see a workaround.
-
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin" <mst@redhat.com> wrote:
-
->
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
-> On Fri, 24 Jul 2020 09:30:58 -0400
->
-> "Michael S. Tsirkin" <mst@redhat.com> wrote:
->
->
->
-> > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-> > > When I start qemu with a second virtio-net-ccw device (i.e. adding
->
-> > > -device virtio-net-ccw in addition to the autogenerated device), I get
->
-> > > a segfault. gdb points to
->
-> > >
->
-> > > #0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
-> > >     config=0x55d6ad9e3f80 "RT") at
->
-> > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> > > 146         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
-> > >
->
-> > > (backtrace doesn't go further)
->
->
->
-> The core was incomplete, but running under gdb directly shows that it
->
-> is just a bog-standard config space access (first for that device).
->
->
->
-> The cause of the crash is that nc->peer is not set... no idea how that
->
-> can happen, not that familiar with that part of QEMU. (Should the code
->
-> check, or is that really something that should not happen?)
->
->
->
-> What I don't understand is why it is set correctly for the first,
->
-> autogenerated virtio-net-ccw device, but not for the second one, and
->
-> why virtio-net-pci doesn't show these problems. The only difference
->
-> between -ccw and -pci that comes to my mind here is that config space
->
-> accesses for ccw are done via an asynchronous operation, so timing
->
-> might be different.
->
->
-Hopefully Jason has an idea. Could you post a full command line
->
-please? Do you need a working guest to trigger this? Does this trigger
->
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on 
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 
--device 
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
- 
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-
->
->
-> > >
->
-> > > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
->
-> > > the autogenerated virtio-net-ccw device is present) works. Specifying
->
-> > > several "-device virtio-net-pci" works as well.
->
-> > >
->
-> > > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
->
-> > > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
->
-> > > works (in-between state does not compile).
->
-> >
->
-> > Ouch. I didn't test all in-between states :(
->
-> > But I wish we had a 0-day instrastructure like kernel has,
->
-> > that catches things like that.
->
->
->
-> Yep, that would be useful... so patchew only builds the complete series?
->
->
->
-> >
->
-> > > This is reproducible with tcg as well. Same problem both with
->
-> > > --enable-vhost-vdpa and --disable-vhost-vdpa.
->
-> > >
->
-> > > Have not yet tried to figure out what might be special with
->
-> > > virtio-ccw... anyone have an idea?
->
-> > >
->
-> > > [This should probably be considered a blocker?]
->
->
->
-> I think so, as it makes s390x unusable with more that one
->
-> virtio-net-ccw device, and I don't even see a workaround.
->
-
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com>  wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com>  wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
-     config=0x55d6ad9e3f80 "RT") at 
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device 
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me we forget the check the existence of peer.
-
-Please try the attached patch to see if it works.
-
-Thanks
-0001-virtio-net-check-the-existence-of-peer-before-accesi.patch
-Description:
-Text Data
-
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-
->
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
->
-> On Fri, 24 Jul 2020 11:17:57 -0400
->
-> "Michael S. Tsirkin"<mst@redhat.com>  wrote:
->
->
->
->> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
->>> On Fri, 24 Jul 2020 09:30:58 -0400
->
->>> "Michael S. Tsirkin"<mst@redhat.com>  wrote:
->
->>>
->
->>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
->>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding
->
->>>>> -device virtio-net-ccw in addition to the autogenerated device), I get
->
->>>>> a segfault. gdb points to
->
->>>>>
->
->>>>> #0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
->>>>>      config=0x55d6ad9e3f80 "RT") at
->
->>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
->>>>> 146         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
->>>>>
->
->>>>> (backtrace doesn't go further)
->
->>> The core was incomplete, but running under gdb directly shows that it
->
->>> is just a bog-standard config space access (first for that device).
->
->>>
->
->>> The cause of the crash is that nc->peer is not set... no idea how that
->
->>> can happen, not that familiar with that part of QEMU. (Should the code
->
->>> check, or is that really something that should not happen?)
->
->>>
->
->>> What I don't understand is why it is set correctly for the first,
->
->>> autogenerated virtio-net-ccw device, but not for the second one, and
->
->>> why virtio-net-pci doesn't show these problems. The only difference
->
->>> between -ccw and -pci that comes to my mind here is that config space
->
->>> accesses for ccw are done via an asynchronous operation, so timing
->
->>> might be different.
->
->> Hopefully Jason has an idea. Could you post a full command line
->
->> please? Do you need a working guest to trigger this? Does this trigger
->
->> on an x86 host?
->
-> Yes, it does trigger with tcg-on-x86 as well. I've been using
->
->
->
-> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu
->
-> qemu,zpci=on
->
-> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
->
-> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
->
-> -device
->
-> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
->
-> -device virtio-net-ccw
->
->
->
-> It seems it needs the guest actually doing something with the nics; I
->
-> cannot reproduce the crash if I use the old advent calendar moon buggy
->
-> image and just add a virtio-net-ccw device.
->
->
->
-> (I don't think it's a problem with my local build, as I see the problem
->
-> both on my laptop and on an LPAR.)
->
->
->
-It looks to me we forget the check the existence of peer.
->
->
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck <cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com>  wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com>  wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
-      config=0x55d6ad9e3f80 "RT") at 
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device 
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me we forget the check the existence of peer.
-
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck <cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-It can be hit with virtio-net-pci as well (just start without peer).
-For autogenerated virtio-net-cww, I think the reason is that it has
-already had a peer set.
-Thanks
-
-On Mon, 27 Jul 2020 15:38:12 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-
->
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
->
-> On Sat, 25 Jul 2020 08:40:07 +0800
->
-> Jason Wang <jasowang@redhat.com> wrote:
->
->
->
->> On 2020/7/24 下午11:34, Cornelia Huck wrote:
->
->>> On Fri, 24 Jul 2020 11:17:57 -0400
->
->>> "Michael S. Tsirkin"<mst@redhat.com>  wrote:
->
->>>
->
->>>> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
->>>>> On Fri, 24 Jul 2020 09:30:58 -0400
->
->>>>> "Michael S. Tsirkin"<mst@redhat.com>  wrote:
->
->>>>>
->
->>>>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
->>>>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding
->
->>>>>>> -device virtio-net-ccw in addition to the autogenerated device), I get
->
->>>>>>> a segfault. gdb points to
->
->>>>>>>
->
->>>>>>> #0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
->>>>>>>       config=0x55d6ad9e3f80 "RT") at
->
->>>>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
->>>>>>> 146       if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
->>>>>>>
->
->>>>>>> (backtrace doesn't go further)
->
->>>>> The core was incomplete, but running under gdb directly shows that it
->
->>>>> is just a bog-standard config space access (first for that device).
->
->>>>>
->
->>>>> The cause of the crash is that nc->peer is not set... no idea how that
->
->>>>> can happen, not that familiar with that part of QEMU. (Should the code
->
->>>>> check, or is that really something that should not happen?)
->
->>>>>
->
->>>>> What I don't understand is why it is set correctly for the first,
->
->>>>> autogenerated virtio-net-ccw device, but not for the second one, and
->
->>>>> why virtio-net-pci doesn't show these problems. The only difference
->
->>>>> between -ccw and -pci that comes to my mind here is that config space
->
->>>>> accesses for ccw are done via an asynchronous operation, so timing
->
->>>>> might be different.
->
->>>> Hopefully Jason has an idea. Could you post a full command line
->
->>>> please? Do you need a working guest to trigger this? Does this trigger
->
->>>> on an x86 host?
->
->>> Yes, it does trigger with tcg-on-x86 as well. I've been using
->
->>>
->
->>> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu
->
->>> qemu,zpci=on
->
->>> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
->
->>> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
->
->>> -device
->
->>> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
->
->>> -device virtio-net-ccw
->
->>>
->
->>> It seems it needs the guest actually doing something with the nics; I
->
->>> cannot reproduce the crash if I use the old advent calendar moon buggy
->
->>> image and just add a virtio-net-ccw device.
->
->>>
->
->>> (I don't think it's a problem with my local build, as I see the problem
->
->>> both on my laptop and on an LPAR.)
->
->>
->
->> It looks to me we forget the check the existence of peer.
->
->>
->
->> Please try the attached patch to see if it works.
->
-> Thanks, that patch gets my guest up and running again. So, FWIW,
->
->
->
-> Tested-by: Cornelia Huck <cohuck@redhat.com>
->
->
->
-> Any idea why this did not hit with virtio-net-pci (or the autogenerated
->
-> virtio-net-ccw device)?
->
->
->
-It can be hit with virtio-net-pci as well (just start without peer).
-Hm, I had not been able to reproduce the crash with a 'naked' -device
-virtio-net-pci. But checking seems to be the right idea anyway.
-
->
->
-For autogenerated virtio-net-cww, I think the reason is that it has
->
-already had a peer set.
-Ok, that might well be.
-
-On 2020/7/27 下午4:41, Cornelia Huck wrote:
-On Mon, 27 Jul 2020 15:38:12 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com>  wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com>  wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
-       config=0x55d6ad9e3f80 "RT") at 
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device 
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me we forget the check the existence of peer.
-
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck <cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-It can be hit with virtio-net-pci as well (just start without peer).
-Hm, I had not been able to reproduce the crash with a 'naked' -device
-virtio-net-pci. But checking seems to be the right idea anyway.
-Sorry for being unclear, I meant for networking part, you just need
-start without peer, and you need a real guest (any Linux) that is trying
-to access the config space of virtio-net.
-Thanks
-For autogenerated virtio-net-cww, I think the reason is that it has
-already had a peer set.
-Ok, that might well be.
-
-On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote:
->
->
-On 2020/7/27 下午4:41, Cornelia Huck wrote:
->
-> On Mon, 27 Jul 2020 15:38:12 +0800
->
-> Jason Wang <jasowang@redhat.com> wrote:
->
->
->
-> > On 2020/7/27 下午2:43, Cornelia Huck wrote:
->
-> > > On Sat, 25 Jul 2020 08:40:07 +0800
->
-> > > Jason Wang <jasowang@redhat.com> wrote:
->
-> > > > On 2020/7/24 下午11:34, Cornelia Huck wrote:
->
-> > > > > On Fri, 24 Jul 2020 11:17:57 -0400
->
-> > > > > "Michael S. Tsirkin"<mst@redhat.com>  wrote:
->
-> > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
-> > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400
->
-> > > > > > > "Michael S. Tsirkin"<mst@redhat.com>  wrote:
->
-> > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-> > > > > > > > > When I start qemu with a second virtio-net-ccw device (i.e.
->
-> > > > > > > > > adding
->
-> > > > > > > > > -device virtio-net-ccw in addition to the autogenerated
->
-> > > > > > > > > device), I get
->
-> > > > > > > > > a segfault. gdb points to
->
-> > > > > > > > >
->
-> > > > > > > > > #0  0x000055d6ab52681d in virtio_net_get_config
->
-> > > > > > > > > (vdev=<optimized out>,
->
-> > > > > > > > >        config=0x55d6ad9e3f80 "RT") at
->
-> > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> > > > > > > > > 146     if (nc->peer->info->type ==
->
-> > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) {
->
-> > > > > > > > >
->
-> > > > > > > > > (backtrace doesn't go further)
->
-> > > > > > > The core was incomplete, but running under gdb directly shows
->
-> > > > > > > that it
->
-> > > > > > > is just a bog-standard config space access (first for that
->
-> > > > > > > device).
->
-> > > > > > >
->
-> > > > > > > The cause of the crash is that nc->peer is not set... no idea
->
-> > > > > > > how that
->
-> > > > > > > can happen, not that familiar with that part of QEMU. (Should
->
-> > > > > > > the code
->
-> > > > > > > check, or is that really something that should not happen?)
->
-> > > > > > >
->
-> > > > > > > What I don't understand is why it is set correctly for the
->
-> > > > > > > first,
->
-> > > > > > > autogenerated virtio-net-ccw device, but not for the second
->
-> > > > > > > one, and
->
-> > > > > > > why virtio-net-pci doesn't show these problems. The only
->
-> > > > > > > difference
->
-> > > > > > > between -ccw and -pci that comes to my mind here is that config
->
-> > > > > > > space
->
-> > > > > > > accesses for ccw are done via an asynchronous operation, so
->
-> > > > > > > timing
->
-> > > > > > > might be different.
->
-> > > > > > Hopefully Jason has an idea. Could you post a full command line
->
-> > > > > > please? Do you need a working guest to trigger this? Does this
->
-> > > > > > trigger
->
-> > > > > > on an x86 host?
->
-> > > > > Yes, it does trigger with tcg-on-x86 as well. I've been using
->
-> > > > >
->
-> > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu
->
-> > > > > qemu,zpci=on
->
-> > > > > -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
->
-> > > > > -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
->
-> > > > > -device
->
-> > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
->
-> > > > > -device virtio-net-ccw
->
-> > > > >
->
-> > > > > It seems it needs the guest actually doing something with the nics;
->
-> > > > > I
->
-> > > > > cannot reproduce the crash if I use the old advent calendar moon
->
-> > > > > buggy
->
-> > > > > image and just add a virtio-net-ccw device.
->
-> > > > >
->
-> > > > > (I don't think it's a problem with my local build, as I see the
->
-> > > > > problem
->
-> > > > > both on my laptop and on an LPAR.)
->
-> > > > It looks to me we forget the check the existence of peer.
->
-> > > >
->
-> > > > Please try the attached patch to see if it works.
->
-> > > Thanks, that patch gets my guest up and running again. So, FWIW,
->
-> > >
->
-> > > Tested-by: Cornelia Huck <cohuck@redhat.com>
->
-> > >
->
-> > > Any idea why this did not hit with virtio-net-pci (or the autogenerated
->
-> > > virtio-net-ccw device)?
->
-> >
->
-> > It can be hit with virtio-net-pci as well (just start without peer).
->
-> Hm, I had not been able to reproduce the crash with a 'naked' -device
->
-> virtio-net-pci. But checking seems to be the right idea anyway.
->
->
->
-Sorry for being unclear, I meant for networking part, you just need start
->
-without peer, and you need a real guest (any Linux) that is trying to access
->
-the config space of virtio-net.
->
->
-Thanks
-A pxe guest will do it, but that doesn't support ccw, right?
-
-I'm still unclear why this triggers with ccw but not pci -
-any idea?
-
->
->
->
->
-> > For autogenerated virtio-net-cww, I think the reason is that it has
->
-> > already had a peer set.
->
-> Ok, that might well be.
->
->
->
->
-
-On 2020/7/27 下午7:43, Michael S. Tsirkin wrote:
-On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote:
-On 2020/7/27 下午4:41, Cornelia Huck wrote:
-On Mon, 27 Jul 2020 15:38:12 +0800
-Jason Wang<jasowang@redhat.com>  wrote:
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang<jasowang@redhat.com>  wrote:
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com>   wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com>   wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
-        config=0x55d6ad9e3f80 "RT") at 
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device 
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me we forget the check the existence of peer.
-
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck<cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-It can be hit with virtio-net-pci as well (just start without peer).
-Hm, I had not been able to reproduce the crash with a 'naked' -device
-virtio-net-pci. But checking seems to be the right idea anyway.
-Sorry for being unclear, I meant for networking part, you just need start
-without peer, and you need a real guest (any Linux) that is trying to access
-the config space of virtio-net.
-
-Thanks
-A pxe guest will do it, but that doesn't support ccw, right?
-Yes, it depends on the cli actually.
-I'm still unclear why this triggers with ccw but not pci -
-any idea?
-I don't test pxe but I can reproduce this with pci (just start a linux
-guest without a peer).
-Thanks
-
-On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote:
->
->
-On 2020/7/27 下午7:43, Michael S. Tsirkin wrote:
->
-> On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote:
->
-> > On 2020/7/27 下午4:41, Cornelia Huck wrote:
->
-> > > On Mon, 27 Jul 2020 15:38:12 +0800
->
-> > > Jason Wang<jasowang@redhat.com>  wrote:
->
-> > >
->
-> > > > On 2020/7/27 下午2:43, Cornelia Huck wrote:
->
-> > > > > On Sat, 25 Jul 2020 08:40:07 +0800
->
-> > > > > Jason Wang<jasowang@redhat.com>  wrote:
->
-> > > > > > On 2020/7/24 下午11:34, Cornelia Huck wrote:
->
-> > > > > > > On Fri, 24 Jul 2020 11:17:57 -0400
->
-> > > > > > > "Michael S. Tsirkin"<mst@redhat.com>   wrote:
->
-> > > > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
-> > > > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400
->
-> > > > > > > > > "Michael S. Tsirkin"<mst@redhat.com>   wrote:
->
-> > > > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck
->
-> > > > > > > > > > wrote:
->
-> > > > > > > > > > > When I start qemu with a second virtio-net-ccw device
->
-> > > > > > > > > > > (i.e. adding
->
-> > > > > > > > > > > -device virtio-net-ccw in addition to the autogenerated
->
-> > > > > > > > > > > device), I get
->
-> > > > > > > > > > > a segfault. gdb points to
->
-> > > > > > > > > > >
->
-> > > > > > > > > > > #0  0x000055d6ab52681d in virtio_net_get_config
->
-> > > > > > > > > > > (vdev=<optimized out>,
->
-> > > > > > > > > > >         config=0x55d6ad9e3f80 "RT") at
->
-> > > > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> > > > > > > > > > > 146         if (nc->peer->info->type ==
->
-> > > > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) {
->
-> > > > > > > > > > >
->
-> > > > > > > > > > > (backtrace doesn't go further)
->
-> > > > > > > > > The core was incomplete, but running under gdb directly
->
-> > > > > > > > > shows that it
->
-> > > > > > > > > is just a bog-standard config space access (first for that
->
-> > > > > > > > > device).
->
-> > > > > > > > >
->
-> > > > > > > > > The cause of the crash is that nc->peer is not set... no
->
-> > > > > > > > > idea how that
->
-> > > > > > > > > can happen, not that familiar with that part of QEMU.
->
-> > > > > > > > > (Should the code
->
-> > > > > > > > > check, or is that really something that should not happen?)
->
-> > > > > > > > >
->
-> > > > > > > > > What I don't understand is why it is set correctly for the
->
-> > > > > > > > > first,
->
-> > > > > > > > > autogenerated virtio-net-ccw device, but not for the second
->
-> > > > > > > > > one, and
->
-> > > > > > > > > why virtio-net-pci doesn't show these problems. The only
->
-> > > > > > > > > difference
->
-> > > > > > > > > between -ccw and -pci that comes to my mind here is that
->
-> > > > > > > > > config space
->
-> > > > > > > > > accesses for ccw are done via an asynchronous operation, so
->
-> > > > > > > > > timing
->
-> > > > > > > > > might be different.
->
-> > > > > > > > Hopefully Jason has an idea. Could you post a full command
->
-> > > > > > > > line
->
-> > > > > > > > please? Do you need a working guest to trigger this? Does
->
-> > > > > > > > this trigger
->
-> > > > > > > > on an x86 host?
->
-> > > > > > > Yes, it does trigger with tcg-on-x86 as well. I've been using
->
-> > > > > > >
->
-> > > > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg
->
-> > > > > > > -cpu qemu,zpci=on
->
-> > > > > > > -m 1024 -nographic -device
->
-> > > > > > > virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
->
-> > > > > > > -drive
->
-> > > > > > > file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
->
-> > > > > > > -device
->
-> > > > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
->
-> > > > > > > -device virtio-net-ccw
->
-> > > > > > >
->
-> > > > > > > It seems it needs the guest actually doing something with the
->
-> > > > > > > nics; I
->
-> > > > > > > cannot reproduce the crash if I use the old advent calendar
->
-> > > > > > > moon buggy
->
-> > > > > > > image and just add a virtio-net-ccw device.
->
-> > > > > > >
->
-> > > > > > > (I don't think it's a problem with my local build, as I see the
->
-> > > > > > > problem
->
-> > > > > > > both on my laptop and on an LPAR.)
->
-> > > > > > It looks to me we forget the check the existence of peer.
->
-> > > > > >
->
-> > > > > > Please try the attached patch to see if it works.
->
-> > > > > Thanks, that patch gets my guest up and running again. So, FWIW,
->
-> > > > >
->
-> > > > > Tested-by: Cornelia Huck<cohuck@redhat.com>
->
-> > > > >
->
-> > > > > Any idea why this did not hit with virtio-net-pci (or the
->
-> > > > > autogenerated
->
-> > > > > virtio-net-ccw device)?
->
-> > > > It can be hit with virtio-net-pci as well (just start without peer).
->
-> > > Hm, I had not been able to reproduce the crash with a 'naked' -device
->
-> > > virtio-net-pci. But checking seems to be the right idea anyway.
->
-> > Sorry for being unclear, I meant for networking part, you just need start
->
-> > without peer, and you need a real guest (any Linux) that is trying to
->
-> > access
->
-> > the config space of virtio-net.
->
-> >
->
-> > Thanks
->
-> A pxe guest will do it, but that doesn't support ccw, right?
->
->
->
-Yes, it depends on the cli actually.
->
->
->
->
->
-> I'm still unclear why this triggers with ccw but not pci -
->
-> any idea?
->
->
->
-I don't test pxe but I can reproduce this with pci (just start a linux guest
->
-without a peer).
->
->
-Thanks
->
-Might be a good addition to a unit test. Not sure what would the
-test do exactly: just make sure guest runs? Looks like a lot of work
-for an empty test ... maybe we can poke at the guest config with
-qtest commands at least.
-
--- 
-MST
-
-On 2020/7/27 下午9:16, Michael S. Tsirkin wrote:
-On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote:
-On 2020/7/27 下午7:43, Michael S. Tsirkin wrote:
-On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote:
-On 2020/7/27 下午4:41, Cornelia Huck wrote:
-On Mon, 27 Jul 2020 15:38:12 +0800
-Jason Wang<jasowang@redhat.com>  wrote:
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang<jasowang@redhat.com>  wrote:
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com>   wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com>   wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0  0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
-         config=0x55d6ad9e3f80 "RT") at 
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146         if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device 
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me we forget the check the existence of peer.
-
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck<cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-It can be hit with virtio-net-pci as well (just start without peer).
-Hm, I had not been able to reproduce the crash with a 'naked' -device
-virtio-net-pci. But checking seems to be the right idea anyway.
-Sorry for being unclear, I meant for networking part, you just need start
-without peer, and you need a real guest (any Linux) that is trying to access
-the config space of virtio-net.
-
-Thanks
-A pxe guest will do it, but that doesn't support ccw, right?
-Yes, it depends on the cli actually.
-I'm still unclear why this triggers with ccw but not pci -
-any idea?
-I don't test pxe but I can reproduce this with pci (just start a linux guest
-without a peer).
-
-Thanks
-Might be a good addition to a unit test. Not sure what would the
-test do exactly: just make sure guest runs? Looks like a lot of work
-for an empty test ... maybe we can poke at the guest config with
-qtest commands at least.
-That should work or we can simply extend the exist virtio-net qtest to
-do that.
-Thanks
-