author     Christian Krinitsin <mail@krinitsin.com>    2025-06-01 21:35:14 +0200
committer  Christian Krinitsin <mail@krinitsin.com>    2025-06-01 21:35:14 +0200
commit     3e4c5a6261770bced301b5e74233e7866166ea5b (patch)
tree       9379fddaba693ef8a045da06efee8529baa5f6f4 /classification_output/05/mistranslation
parent     e5634e2806195bee44407853c4bf8776f7abfa4f (diff)
download   emulator-bug-study-3e4c5a6261770bced301b5e74233e7866166ea5b.tar.gz
           emulator-bug-study-3e4c5a6261770bced301b5e74233e7866166ea5b.zip
clean up repository
Diffstat (limited to 'classification_output/05/mistranslation')
-rw-r--r--  classification_output/05/mistranslation/14887122    266
-rw-r--r--  classification_output/05/mistranslation/23270873    700
-rw-r--r--  classification_output/05/mistranslation/25842545    210
-rw-r--r--  classification_output/05/mistranslation/64322995     62
-rw-r--r--  classification_output/05/mistranslation/70294255   1069
-rw-r--r--  classification_output/05/mistranslation/74466963   1886
-rw-r--r--  classification_output/05/mistranslation/74545755    352
-rw-r--r--  classification_output/05/mistranslation/80604314   1488
8 files changed, 0 insertions, 6033 deletions
diff --git a/classification_output/05/mistranslation/14887122 b/classification_output/05/mistranslation/14887122
deleted file mode 100644
index 1a87937b..00000000
--- a/classification_output/05/mistranslation/14887122
+++ /dev/null
@@ -1,266 +0,0 @@
-mistranslation: 0.930
-semantic: 0.928
-device: 0.919
-assembly: 0.918
-socket: 0.914
-graphic: 0.910
-instruction: 0.905
-other: 0.890
-vnc: 0.871
-network: 0.855
-boot: 0.831
-KVM: 0.814
-
-[BUG][RFC] CPR transfer Issues: Socket permissions and PID files
-
-Hello,
-
-While testing CPR transfer I encountered two issues. The first is that the
-transfer fails when running with pidfiles due to the destination qemu process
-attempting to create the pidfile while it is still locked by the source
-process. The second is that the transfer fails when running with the -run-with
-user=$USERID parameter. This is because the destination qemu process creates
-the UNIX sockets used for the CPR transfer before dropping to the lower
-permissioned user, which causes them to be owned by the original user. The
-source qemu process then does not have permission to connect to it because it
-is already running as the lesser permissioned user.
-
-Reproducing the first issue:
-
-Create a source and destination qemu instance associated with the same VM where
-both processes have the -pidfile parameter passed on the command line. You
-should see the following error on the command line of the second process:
-
-qemu-system-x86_64: cannot create PID file: Cannot lock pid file: Resource
-temporarily unavailable
-
-Reproducing the second issue:
-
-Create a source and destination qemu instance associated with the same VM where
-both processes have -run-with user=$USERID passed on the command line, where
-$USERID is a different user from the one launching the processes. Then attempt
-a CPR transfer using UNIX sockets for the main and cpr sockets. You should
-receive the following error via QMP:
-{"error": {"class": "GenericError", "desc": "Failed to connect to 'cpr.sock':
-Permission denied"}}
-
-I provided a minimal patch that works around the second issue.
-
-Thank you,
-Ben Chaney
-
----
-include/system/os-posix.h | 4 ++++
-os-posix.c | 8 --------
-util/qemu-sockets.c | 21 +++++++++++++++++++++
-3 files changed, 25 insertions(+), 8 deletions(-)
-
-diff --git a/include/system/os-posix.h b/include/system/os-posix.h
-index ce5b3bccf8..2a414a914a 100644
---- a/include/system/os-posix.h
-+++ b/include/system/os-posix.h
-@@ -55,6 +55,10 @@ void os_setup_limits(void);
-void os_setup_post(void);
-int os_mlock(bool on_fault);
-
-+extern struct passwd *user_pwd;
-+extern uid_t user_uid;
-+extern gid_t user_gid;
-+
-/**
-* qemu_alloc_stack:
-* @sz: pointer to a size_t holding the requested usable stack size
-diff --git a/os-posix.c b/os-posix.c
-index 52925c23d3..9369b312a0 100644
---- a/os-posix.c
-+++ b/os-posix.c
-@@ -86,14 +86,6 @@ void os_set_proc_name(const char *s)
-}
-
-
--/*
-- * Must set all three of these at once.
-- * Legal combinations are unset by name by uid
-- */
--static struct passwd *user_pwd; /* NULL non-NULL NULL */
--static uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */
--static gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */
--
-/*
-* Prepare to change user ID. user_id can be one of 3 forms:
-* - a username, in which case user ID will be changed to its uid,
-diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
-index 77477c1cd5..987977ead9 100644
---- a/util/qemu-sockets.c
-+++ b/util/qemu-sockets.c
-@@ -871,6 +871,14 @@ static bool saddr_is_tight(UnixSocketAddress *saddr)
-#endif
-}
-
-+/*
-+ * Must set all three of these at once.
-+ * Legal combinations are unset by name by uid
-+ */
-+struct passwd *user_pwd; /* NULL non-NULL NULL */
-+uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */
-+gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */
-+
-static int unix_listen_saddr(UnixSocketAddress *saddr,
-int num,
-Error **errp)
-@@ -947,6 +955,19 @@ static int unix_listen_saddr(UnixSocketAddress *saddr,
-error_setg_errno(errp, errno, "Failed to bind socket to %s", path);
-goto err;
-}
-+ if (user_pwd) {
-+ if (chown(un.sun_path, user_pwd->pw_uid, user_pwd->pw_gid) < 0) {
-+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s",
-path);
-+ goto err;
-+ }
-+ }
-+ else if (user_uid != -1 && user_gid != -1) {
-+ if (chown(un.sun_path, user_uid, user_gid) < 0) {
-+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s",
-path);
-+ goto err;
-+ }
-+ }
-+
-if (listen(sock, num) < 0) {
-error_setg_errno(errp, errno, "Failed to listen on socket");
-goto err;
---
-2.40.1
-
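-For context, a minimal standalone sketch (not part of the original thread, and not
-QEMU code) of the bind-then-chown pattern the patch above applies: the listening
-UNIX socket is created by the launching user and its filesystem node is handed to
-a hypothetical target uid/gid before a less-privileged peer connects. The socket
-path "cpr.sock" and TARGET_UID/TARGET_GID are assumptions for illustration only.
-
-#include <stdio.h>
-#include <string.h>
-#include <sys/socket.h>
-#include <sys/un.h>
-#include <unistd.h>
-
-#define TARGET_UID 1001   /* hypothetical uid/gid of the -run-with user */
-#define TARGET_GID 1001
-
-int main(void)
-{
-    struct sockaddr_un un = { .sun_family = AF_UNIX };
-    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
-
-    if (sock < 0) {
-        perror("socket");
-        return 1;
-    }
-    strncpy(un.sun_path, "cpr.sock", sizeof(un.sun_path) - 1);
-    unlink(un.sun_path);
-
-    if (bind(sock, (struct sockaddr *)&un, sizeof(un)) < 0) {
-        perror("bind");
-        return 1;
-    }
-    /* Hand the socket node to the user the process will later drop to, so a
-     * peer already running as that user is allowed to connect(). */
-    if (chown(un.sun_path, TARGET_UID, TARGET_GID) < 0) {
-        perror("chown");
-        return 1;
-    }
-    if (listen(sock, 1) < 0) {
-        perror("listen");
-        return 1;
-    }
-    close(sock);
-    return 0;
-}
-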
-Thank you Ben. I appreciate you testing CPR and shaking out the bugs.
-I will study these and propose patches.
-
-My initial reaction to the pidfile issue is that the orchestration layer must
-pass a different filename when starting the destination qemu instance. When
-using live update without containers, these types of resource conflicts in the
-global namespaces are a known issue.
-
-- Steve
-
-On 3/14/2025 2:33 PM, Chaney, Ben wrote:
-> <snip>
-
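-For the first (pidfile) issue reported at the top of this thread, a minimal
-illustration, not QEMU code: the destination process tries to take the same
-advisory lock on the pidfile that the source still holds, so the lock attempt
-fails with EAGAIN ("Resource temporarily unavailable"). The file name and the
-lockf()-style locking are assumptions for the sketch.
-
-#include <errno.h>
-#include <fcntl.h>
-#include <stdio.h>
-#include <string.h>
-#include <unistd.h>
-
-int main(void)
-{
-    int fd = open("qemu.pid", O_CREAT | O_WRONLY, 0644);   /* hypothetical path */
-
-    if (fd < 0) {
-        perror("open");
-        return 1;
-    }
-    /* Non-blocking exclusive lock: the first (source) process succeeds,
-     * the second (destination) process gets EAGAIN/EACCES. */
-    if (lockf(fd, F_TLOCK, 0) < 0) {
-        fprintf(stderr, "cannot lock pid file: %s\n", strerror(errno));
-        return 1;
-    }
-    pause();   /* keep holding the lock, like the still-running source QEMU */
-    return 0;
-}
-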
diff --git a/classification_output/05/mistranslation/23270873 b/classification_output/05/mistranslation/23270873
deleted file mode 100644
index 4d8b927f..00000000
--- a/classification_output/05/mistranslation/23270873
+++ /dev/null
@@ -1,700 +0,0 @@
-mistranslation: 0.881
-other: 0.839
-boot: 0.830
-vnc: 0.820
-device: 0.810
-KVM: 0.803
-assembly: 0.768
-network: 0.768
-graphic: 0.763
-socket: 0.758
-instruction: 0.755
-semantic: 0.752
-
-[Qemu-devel] [BUG?] aio_get_linux_aio: Assertion `ctx->linux_aio' failed
-
-Hi,
-
-I am seeing some strange QEMU assertion failures for qemu on s390x,
-which prevents a guest from starting.
-
-Git bisecting points to the following commit as the source of the error.
-
-commit ed6e2161715c527330f936d44af4c547f25f687e
-Author: Nishanth Aravamudan <address@hidden>
-Date: Fri Jun 22 12:37:00 2018 -0700
-
- linux-aio: properly bubble up errors from initialization
-
- laio_init() can fail for a couple of reasons, which will lead to a NULL
- pointer dereference in laio_attach_aio_context().
-
- To solve this, add a aio_setup_linux_aio() function which is called
- early in raw_open_common. If this fails, propagate the error up. The
- signature of aio_get_linux_aio() was not modified, because it seems
- preferable to return the actual errno from the possible failing
- initialization calls.
-
- Additionally, when the AioContext changes, we need to associate a
- LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context
- callback and call the new aio_setup_linux_aio(), which will allocate a
- new AioContext if needed, and return errors on failures. If it fails for
- any reason, fallback to threaded AIO with an error message, as the
- device is already in-use by the guest.
-
- Add an assert that aio_get_linux_aio() cannot return NULL.
-
- Signed-off-by: Nishanth Aravamudan <address@hidden>
- Message-id: address@hidden
- Signed-off-by: Stefan Hajnoczi <address@hidden>
-Not sure what is causing this assertion to fail. Here is the qemu
-command line of the guest, from qemu log, which throws this error:
-LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
-QEMU_AUDIO_DRV=none /usr/local/bin/qemu-system-s390x -name
-guest=rt_vm1,debug-threads=on -S -object
-secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-21-rt_vm1/master-key.aes
--machine s390-ccw-virtio-2.12,accel=kvm,usb=off,dump-guest-core=off -m
-1024 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -object
-iothread,id=iothread1 -uuid 0cde16cd-091d-41bd-9ac2-5243df5c9a0d
--display none -no-user-config -nodefaults -chardev
-socket,id=charmonitor,fd=28,server,nowait -mon
-chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
--boot strict=on -drive
-file=/dev/mapper/360050763998b0883980000002a000031,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native
--device
-virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0001,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on
--netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device
-virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:3a:c8:67:95:84,devno=fe.0.0000
--netdev tap,fd=32,id=hostnet1,vhost=on,vhostfd=33 -device
-virtio-net-ccw,netdev=hostnet1,id=net1,mac=52:54:00:2a:e5:08,devno=fe.0.0002
--chardev pty,id=charconsole0 -device
-sclpconsole,chardev=charconsole0,id=console0 -device
-virtio-balloon-ccw,id=balloon0,devno=fe.3.ffba -sandbox
-on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
--msg timestamp=on
-2018-07-17 15:48:42.252+0000: Domain id=21 is tainted: high-privileges
-2018-07-17T15:48:42.279380Z qemu-system-s390x: -chardev
-pty,id=charconsole0: char device redirected to /dev/pts/3 (label
-charconsole0)
-qemu-system-s390x: util/async.c:339: aio_get_linux_aio: Assertion
-`ctx->linux_aio' failed.
-2018-07-17 15:48:43.309+0000: shutting down, reason=failed
-
-
-Any help debugging this would be greatly appreciated.
-
-Thank you
-Farhan
-
-On 17.07.2018 [13:25:53 -0400], Farhan Ali wrote:
-> <snip>
-iiuc, this possibly implies AIO was not actually used previously on this
-guest (it might have silently been falling back to threaded IO?). I
-don't have access to s390x, but would it be possible to run qemu under
-gdb and see if aio_setup_linux_aio is being called at all (I think it
-might not be, but I'm not sure why), and if so, if it's for the context
-in question?
-
-If it's not being called first, could you see what callpath is calling
-aio_get_linux_aio when this assertion trips?
-
-Thanks!
--Nish
-
-On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote:
-> iiuc, this possibly implies AIO was not actually used previously on this
-> guest (it might have silently been falling back to threaded IO?). I
-> don't have access to s390x, but would it be possible to run qemu under
-> gdb and see if aio_setup_linux_aio is being called at all (I think it
-> might not be, but I'm not sure why), and if so, if it's for the context
-> in question?
->
-> If it's not being called first, could you see what callpath is calling
-> aio_get_linux_aio when this assertion trips?
->
-> Thanks!
-> -Nish
-Hi Nishant,
-From the coredump of the guest this is the call trace that calls
-aio_get_linux_aio:
-Stack trace of thread 145158:
-#0 0x000003ff94dbe274 raise (libc.so.6)
-#1 0x000003ff94da39a8 abort (libc.so.6)
-#2 0x000003ff94db62ce __assert_fail_base (libc.so.6)
-#3 0x000003ff94db634c __assert_fail (libc.so.6)
-#4 0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x)
-#5 0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x)
-#6 0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x)
-#7 0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x)
-#8 0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x)
-#9 0x000002aa20db3c34 aio_poll (qemu-system-s390x)
-#10 0x000002aa20be32a2 iothread_run (qemu-system-s390x)
-#11 0x000003ff94f879a8 start_thread (libpthread.so.0)
-#12 0x000003ff94e797ee thread_start (libc.so.6)
-
-
-Thanks for taking a look and responding.
-
-Thanks
-Farhan
-
-On 07/18/2018 09:42 AM, Farhan Ali wrote:
-> <snip>
-Trying to debug a little further, the block device in this case is a
-"host device". And looking at your commit carefully you use the
-bdrv_attach_aio_context callback to setup a Linux AioContext.
-For some reason the "host device" struct (BlockDriver bdrv_host_device
-in block/file-posix.c) does not have a bdrv_attach_aio_context defined.
-So a simple change of adding the callback to the struct solves the issue
-and the guest starts fine.
-diff --git a/block/file-posix.c b/block/file-posix.c
-index 28824aa..b8d59fb 100644
---- a/block/file-posix.c
-+++ b/block/file-posix.c
-@@ -3135,6 +3135,7 @@ static BlockDriver bdrv_host_device = {
- .bdrv_refresh_limits = raw_refresh_limits,
- .bdrv_io_plug = raw_aio_plug,
- .bdrv_io_unplug = raw_aio_unplug,
-+ .bdrv_attach_aio_context = raw_aio_attach_aio_context,
-
- .bdrv_co_truncate = raw_co_truncate,
- .bdrv_getlength = raw_getlength,
-I am not too familiar with block device code in QEMU, so not sure if
-this is the right fix or if there are some underlying problems.
-Thanks
-Farhan
-
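-To make the failure mode concrete, a simplified, hypothetical sketch (not QEMU's
-real API; the names Ctx, Driver, attach, get_linux_aio are inventions for this
-example): initialization happens through an optional per-driver callback, and a
-later getter asserts that it ran. A driver whose callback slot is left NULL trips
-the assertion only when that driver is exercised, as with bdrv_host_device above.
-
-#include <assert.h>
-#include <stddef.h>
-#include <stdio.h>
-
-typedef struct Ctx {
-    void *linux_aio;
-} Ctx;
-
-typedef struct Driver {
-    const char *name;
-    void (*attach_aio_context)(Ctx *ctx);   /* optional hook, may be NULL */
-} Driver;
-
-static int laio_state;                      /* stands in for a LinuxAioState */
-
-static void raw_attach_aio_context(Ctx *ctx)
-{
-    ctx->linux_aio = &laio_state;           /* setup succeeded */
-}
-
-static void *get_linux_aio(Ctx *ctx)
-{
-    assert(ctx->linux_aio);                 /* mirrors the aio_get_linux_aio() assert */
-    return ctx->linux_aio;
-}
-
-static void attach(const Driver *drv, Ctx *ctx)
-{
-    if (drv->attach_aio_context) {          /* silently skipped when the hook is NULL */
-        drv->attach_aio_context(ctx);
-    }
-}
-
-int main(void)
-{
-    Ctx file_ctx = { NULL };
-    Ctx host_ctx = { NULL };
-    Driver file_drv = { "file", raw_attach_aio_context };
-    Driver host_drv = { "host_device", NULL };   /* hook missing, as in the report */
-
-    attach(&file_drv, &file_ctx);
-    printf("file driver: linux_aio=%p\n", get_linux_aio(&file_ctx));
-
-    attach(&host_drv, &host_ctx);
-    /* get_linux_aio(&host_ctx) would abort here, like the s390x assertion. */
-    return 0;
-}
-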
-On 18.07.2018 [11:10:27 -0400], Farhan Ali wrote:
-> <snip>
-Oh this is quite embarrassing! I only added the bdrv_attach_aio_context
-callback for the file-backed device. Your fix is definitely correct for
-host device. Let me make sure there weren't any others missed and I will
-send out a properly formatted patch. Thank you for the quick testing and
-turnaround!
-
--Nish
-
-On 07/18/2018 08:52 PM, Nishanth Aravamudan wrote:
-> <snip>
-Farhan, can you respin your patch with proper sign-off and patch description?
-Adding qemu-block.
-
-Hi Christian,
-
-On 19.07.2018 [08:55:20 +0200], Christian Borntraeger wrote:
-> <snip>
->
-> Farhan, can you respin your patch with proper sign-off and patch description?
-> Adding qemu-block.
-I sent it yesterday, sorry I didn't cc everyone from this e-mail:
-http://lists.nongnu.org/archive/html/qemu-block/2018-07/msg00516.html
-Thanks,
-Nish
-
diff --git a/classification_output/05/mistranslation/25842545 b/classification_output/05/mistranslation/25842545
deleted file mode 100644
index 088ed7a1..00000000
--- a/classification_output/05/mistranslation/25842545
+++ /dev/null
@@ -1,210 +0,0 @@
-mistranslation: 0.928
-other: 0.912
-KVM: 0.867
-vnc: 0.862
-device: 0.847
-instruction: 0.835
-semantic: 0.829
-boot: 0.824
-assembly: 0.824
-graphic: 0.822
-socket: 0.808
-network: 0.796
-
-[Qemu-devel] [Bug?] Guest pause because VMPTRLD failed in KVM
-
-Hello,
-
- We encountered a problem where a guest paused because the kernel module (KMOD)
-reported that VMPTRLD failed.
-
-The related information is as follows:
-
-1) Qemu command:
- /usr/bin/qemu-kvm -name omu1 -S -machine pc-i440fx-2.3,accel=kvm,usb=off -cpu
-host -m 15625 -realtime mlock=off -smp 8,sockets=1,cores=8,threads=1 -uuid
-a2aacfff-6583-48b4-b6a4-e6830e519931 -no-user-config -nodefaults -chardev
-socket,id=charmonitor,path=/var/lib/libvirt/qemu/omu1.monitor,server,nowait
--mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
--boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
-virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive
-file=/home/env/guest1.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native
- -device
-virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0
- -drive
-file=/home/env/guest_300G.img,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native
- -device
-virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1
- -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26 -device
-virtio-net-pci,netdev=hostnet0,id=net0,mac=00:00:80:05:00:00,bus=pci.0,addr=0x3
--netdev tap,fd=27,id=hostnet1,vhost=on,vhostfd=28 -device
-virtio-net-pci,netdev=hostnet1,id=net1,mac=00:00:80:05:00:01,bus=pci.0,addr=0x4
--chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0
--device usb-tablet,id=input0 -vnc 0.0.0.0:0 -device
-cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device
-virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
-
- 2) Qemu log:
- KVM: entry failed, hardware error 0x4
- RAX=00000000ffffffed RBX=ffff8803fa2d7fd8 RCX=0100000000000000
-RDX=0000000000000000
- RSI=0000000000000000 RDI=0000000000000046 RBP=ffff8803fa2d7e90
-RSP=ffff8803fa2efe90
- R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000
-R11=000000000000b69a
- R12=0000000000000001 R13=ffffffff81a25b40 R14=0000000000000000
-R15=ffff8803fa2d7fd8
- RIP=ffffffff81053e16 RFL=00000286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
- ES =0000 0000000000000000 ffffffff 00c00000
- CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
- SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
- DS =0000 0000000000000000 ffffffff 00c00000
- FS =0000 0000000000000000 ffffffff 00c00000
- GS =0000 ffff88040f540000 ffffffff 00c00000
- LDT=0000 0000000000000000 ffffffff 00c00000
- TR =0040 ffff88040f550a40 00002087 00008b00 DPL=0 TSS64-busy
- GDT= ffff88040f549000 0000007f
- IDT= ffffffffff529000 00000fff
- CR0=80050033 CR2=00007f81ca0c5000 CR3=00000003f5081000 CR4=000407e0
- DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
-DR3=0000000000000000
- DR6=00000000ffff0ff0 DR7=0000000000000400
- EFER=0000000000000d01
- Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ??
-?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
-
- 3) Demsg
- [347315.028339] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
- klogd 1.4.1, ---------- state change ----------
- [347315.039506] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
- [347315.051728] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
- [347315.057472] vmwrite error: reg 6c0a value ffff88307e66e480 (err
-2120672384)
- [347315.064567] Pid: 69523, comm: qemu-kvm Tainted: GF X
-3.0.93-0.8-default #1
- [347315.064569] Call Trace:
- [347315.064587] [<ffffffff810049d5>] dump_trace+0x75/0x300
- [347315.064595] [<ffffffff8145e3e3>] dump_stack+0x69/0x6f
- [347315.064617] [<ffffffffa03738de>] vmx_vcpu_load+0x11e/0x1d0 [kvm_intel]
- [347315.064647] [<ffffffffa029a204>] kvm_arch_vcpu_load+0x44/0x1d0 [kvm]
- [347315.064669] [<ffffffff81054ee1>] finish_task_switch+0x81/0xe0
- [347315.064676] [<ffffffff8145f0b4>] thread_return+0x3b/0x2a7
- [347315.064687] [<ffffffffa028d9b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
- [347315.064703] [<ffffffffa02a16d1>] __vcpu_run+0xd1/0x260 [kvm]
- [347315.064732] [<ffffffffa02a2418>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0
-[kvm]
- [347315.064759] [<ffffffffa028ecee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
- [347315.064771] [<ffffffff8116bdfb>] do_vfs_ioctl+0x8b/0x3b0
- [347315.064776] [<ffffffff8116c1c1>] sys_ioctl+0xa1/0xb0
- [347315.064783] [<ffffffff81469272>] system_call_fastpath+0x16/0x1b
- [347315.064797] [<00007fee51969ce7>] 0x7fee51969ce6
- [347315.064799] vmwrite error: reg 6c0c value ffff88307e664000 (err
-2120630272)
- [347315.064802] Pid: 69523, comm: qemu-kvm Tainted: GF X
-3.0.93-0.8-default #1
- [347315.064803] Call Trace:
- [347315.064807] [<ffffffff810049d5>] dump_trace+0x75/0x300
- [347315.064811] [<ffffffff8145e3e3>] dump_stack+0x69/0x6f
- [347315.064817] [<ffffffffa03738ec>] vmx_vcpu_load+0x12c/0x1d0 [kvm_intel]
- [347315.064832] [<ffffffffa029a204>] kvm_arch_vcpu_load+0x44/0x1d0 [kvm]
- [347315.064851] [<ffffffff81054ee1>] finish_task_switch+0x81/0xe0
- [347315.064855] [<ffffffff8145f0b4>] thread_return+0x3b/0x2a7
- [347315.064865] [<ffffffffa028d9b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
- [347315.064880] [<ffffffffa02a16d1>] __vcpu_run+0xd1/0x260 [kvm]
- [347315.064907] [<ffffffffa02a2418>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0
-[kvm]
- [347315.064933] [<ffffffffa028ecee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
- [347315.064943] [<ffffffff8116bdfb>] do_vfs_ioctl+0x8b/0x3b0
- [347315.064947] [<ffffffff8116c1c1>] sys_ioctl+0xa1/0xb0
- [347315.064951] [<ffffffff81469272>] system_call_fastpath+0x16/0x1b
- [347315.064957] [<00007fee51969ce7>] 0x7fee51969ce6
- [347315.064959] vmwrite error: reg 6c10 value 0 (err 0)
-
- 4) The issue can't be reproduced. I searched the Intel VMX spec for the reasons
-for VMPTRLD failure:
-    The instruction fails if its operand is not properly aligned, sets
-    unsupported physical-address bits, or is equal to the VMXON
-    pointer. In addition, the instruction fails if the 32 bits in memory
-    referenced by the operand do not match the VMCS
-    revision identifier supported by this processor.
-
- But I can't find any clues in the KVM source code. It seems each
- error condition is impossible in theory. :(
-
-Any suggestions will be appreciated! Paolo?
-
---
-Regards,
--Gonglei
-
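-For reference, a small hypothetical sketch (not KVM code) that only restates the
-SDM failure conditions quoted above in C form; the function name and all
-parameters are assumptions for illustration.
-
-#include <stdbool.h>
-#include <stdint.h>
-
-/* Returns true if a VMPTRLD with this operand would fail according to the
- * conditions listed above.  All inputs are hypothetical. */
-static bool vmptrld_would_fail(uint64_t vmcs_pa, uint64_t vmxon_pa,
-                               unsigned phys_bits, uint32_t vmcs_rev,
-                               uint32_t cpu_rev)
-{
-    if (vmcs_pa & 0xfffULL) {
-        return true;                  /* operand not 4 KiB aligned */
-    }
-    if (phys_bits < 64 && (vmcs_pa >> phys_bits) != 0) {
-        return true;                  /* sets unsupported physical-address bits */
-    }
-    if (vmcs_pa == vmxon_pa) {
-        return true;                  /* equal to the VMXON pointer */
-    }
-    if (vmcs_rev != cpu_rev) {
-        return true;                  /* VMCS revision identifier mismatch */
-    }
-    return false;
-}
-
-int main(void)
-{
-    /* A misaligned VMCS pointer is one of the documented failure cases. */
-    return vmptrld_would_fail(0x1234, 0x2000, 46, 1, 1) ? 0 : 1;
-}
-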
-On 10/11/2016 15:10, gong lei wrote:
-> <snip>
-> But I can't find any clues in the KVM source code. It seems each
-> error condition is impossible in theory. :(
-Yes, it should not happen. :(
-
-If it's not reproducible, it's really hard to say what it was, except a
-random memory corruption elsewhere or even a bit flip (!).
-
-Paolo
-
-On 2016/11/17 20:39, Paolo Bonzini wrote:
-> On 10/11/2016 15:10, gong lei wrote:
-> > <snip>
->
-> Yes, it should not happen. :(
->
-> If it's not reproducible, it's really hard to say what it was, except a
-> random memory corruption elsewhere or even a bit flip (!).
->
-> Paolo
-Thanks for your reply, Paolo :)
-
---
-Regards,
--Gonglei
-
diff --git a/classification_output/05/mistranslation/64322995 b/classification_output/05/mistranslation/64322995
deleted file mode 100644
index 7330769b..00000000
--- a/classification_output/05/mistranslation/64322995
+++ /dev/null
@@ -1,62 +0,0 @@
-mistranslation: 0.936
-device: 0.915
-network: 0.914
-semantic: 0.906
-graphic: 0.904
-other: 0.881
-socket: 0.866
-instruction: 0.864
-vnc: 0.801
-boot: 0.780
-KVM: 0.742
-assembly: 0.653
-
-[Qemu-devel] [BUG] trace: QEMU hangs on initialization with the "simple" backend
-
-While starting the softmmu version of QEMU, the simple backend waits for the
-writeout thread to signal a condition variable when initializing the output file
-path. But since the writeout thread has not been created, it just waits forever.
-
-Thanks,
- Lluis
-
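-For illustration only, a minimal self-contained example (not the actual trace
-backend code) of this class of hang: the initializing thread waits on a condition
-variable that only a writeout thread would signal, but that thread was never
-started. Variable names are assumptions for the sketch.
-
-#include <pthread.h>
-#include <stdbool.h>
-
-static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
-static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
-static bool written_out;
-
-/* The thread that would set written_out and signal cond is never created,
- * mirroring the missing writeout thread described above. */
-
-int main(void)
-{
-    pthread_mutex_lock(&lock);
-    while (!written_out) {
-        pthread_cond_wait(&cond, &lock);   /* blocks forever: no signaller exists */
-    }
-    pthread_mutex_unlock(&lock);
-    return 0;
-}
-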
-On Tue, Feb 09, 2016 at 09:24:04PM +0100, Lluís Vilanova wrote:
-> <snip>
-Denis Lunev posted a fix:
-https://patchwork.ozlabs.org/patch/580968/
-Stefan
-
-Stefan Hajnoczi writes:
-
-> On Tue, Feb 09, 2016 at 09:24:04PM +0100, Lluís Vilanova wrote:
-> > <snip>
->
-> Denis Lunev posted a fix:
-> https://patchwork.ozlabs.org/patch/580968/
-Great, thanks.
-
-Lluis
-
diff --git a/classification_output/05/mistranslation/70294255 b/classification_output/05/mistranslation/70294255
deleted file mode 100644
index 2f154bf2..00000000
--- a/classification_output/05/mistranslation/70294255
+++ /dev/null
@@ -1,1069 +0,0 @@
-mistranslation: 0.862
-assembly: 0.861
-semantic: 0.858
-socket: 0.858
-device: 0.857
-graphic: 0.857
-instruction: 0.856
-other: 0.852
-network: 0.846
-vnc: 0.837
-boot: 0.811
-KVM: 0.806
-
-[Qemu-devel] Reply: Re: Reply: Re: Reply: Re: Reply: Re: [BUG]COLO failover hang
-
-hi:
-
-Yes, it is better.
-
-And should we delete
-
-#ifdef WIN32
-    QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL);
-#endif
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already has it.
-
-
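-For readers following this thread, a simplified, hypothetical sketch (not the
-real QIOChannel API; Channel, FEATURE_SHUTDOWN and the function names are
-inventions for this example) of the mechanism under discussion: a shutdown
-request is honoured only if a feature bit was set when the fd was wrapped, so an
-accepted socket that never had the bit set leaves a blocked reader stuck in
-recvmsg().
-
-#include <stdio.h>
-#include <sys/socket.h>
-
-enum { FEATURE_SHUTDOWN = 1u << 2 };   /* hypothetical feature bit */
-
-typedef struct Channel {
-    int fd;
-    unsigned features;
-} Channel;
-
-static void channel_set_feature(Channel *ioc, unsigned feature)
-{
-    ioc->features |= feature;
-}
-
-static int channel_shutdown(Channel *ioc)
-{
-    if (!(ioc->features & FEATURE_SHUTDOWN)) {
-        return -1;                         /* request ignored, reader stays blocked */
-    }
-    return shutdown(ioc->fd, SHUT_RDWR);   /* would wake a blocked recvmsg() */
-}
-
-int main(void)
-{
-    Channel accepted = { .fd = -1, .features = 0 };   /* accept() path forgot the bit */
-
-    if (channel_shutdown(&accepted) < 0) {
-        printf("shutdown skipped: SHUTDOWN feature not set on accepted fd\n");
-    }
-
-    channel_set_feature(&accepted, FEATURE_SHUTDOWN); /* what the proposed patch adds */
-    /* channel_shutdown(&accepted) would now reach shutdown(2). */
-    return 0;
-}
-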
-Original Mail
-
-From: address@hidden
-To: 王广10165992
-Cc: address@hidden address@hidden address@hidden address@hidden
-Date: 2017-03-22 15:03
-Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: Reply: Re: [BUG]COLO failover hang
-
-
-
-
-
-Hi,
-
-On 2017/3/22 9:42, address@hidden wrote:
-> diff --git a/migration/socket.c b/migration/socket.c
-> index 13966f1..d65a0ea 100644
-> --- a/migration/socket.c
-> +++ b/migration/socket.c
-> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
->      }
->
->      trace_migration_socket_incoming_accepted();
->
->      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
-> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
->      migration_channel_process_incoming(migrate_get_current(),
->                                         QIO_CHANNEL(sioc));
->      object_unref(OBJECT(sioc));
->
-> Is this patch ok?
->
-
-Yes, I think this works, but a better way may be to call qio_channel_set_feature()
-in qio_channel_socket_accept(); we didn't set the SHUTDOWN feature for the
-socket accept fd. Or fix it like this:
-
-diff --git a/io/channel-socket.c b/io/channel-socket.c
-index f546c68..ce6894c 100644
---- a/io/channel-socket.c
-+++ b/io/channel-socket.c
-@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
-                           Error **errp)
- {
-     QIOChannelSocket *cioc;
--
--    cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
--    cioc->fd = -1;
-+
-+    cioc = qio_channel_socket_new();
-     cioc->remoteAddrLen = sizeof(ioc->remoteAddr);
-     cioc->localAddrLen = sizeof(ioc->localAddr);
-
-Thanks,
-Hailiang
-
-> I have test it. The test could not hang any more.
->
-> Original Mail
->
-> From: address@hidden
-> To: address@hidden address@hidden
-> Cc: address@hidden address@hidden address@hidden
-> Date: 2017-03-22 09:11
-> Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: [BUG]COLO failover hang
->
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
-> > * Hailiang Zhang (address@hidden) wrote:
-> >> Hi,
-> >>
-> >> Thanks for reporting this, and i confirmed it in my test, and it is a bug.
-> >>
-> >> Though we tried to call qemu_file_shutdown() to shutdown the related fd, in
-> >> case COLO thread/incoming thread is stuck in read/write() while do failover,
-> >> but it didn't take effect, because all the fd used by COLO (also migration)
-> >> has been wrapped by qio channel, and it will not call the shutdown API if
-> >> we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN).
-> >>
-> >> Cc: Dr. David Alan Gilbert address@hidden
-> >>
-> >> I doubted migration cancel has the same problem, it may be stuck in write()
-> >> if we tried to cancel migration.
-> >>
-> >> void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error **errp)
-> >> {
-> >>     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
-> >>     migration_channel_connect(s, ioc, NULL);
-> >>     ... ...
-> >> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) above,
-> >> and the
-> >> migrate_fd_cancel()
-> >> {
-> >>     ... ...
-> >>     if (s->state == MIGRATION_STATUS_CANCELLING && f) {
-> >>         qemu_file_shutdown(f)  --> This will not take effect. No ?
-> >>     }
-> >> }
-> >
-> > (cc'd in Daniel Berrange).
-> > I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN) at the
-> > top of qio_channel_socket_new so I think that's safe isn't it?
-> >
->
-> Hmm, you are right, this problem is only exist for the migration incoming fd, thanks.
->
-> > Dave
-> >
-> >> Thanks,
-> >> Hailiang
-> >>
-> >> On 2017/3/21 16:10, address@hidden wrote:
-> >>> Thank you.
-> >>>
-> >>> I have tested already.
-> >>>
-> >>> When the Primary Node panics, the Secondary Node qemu hangs at the same place.
-> >>>
-> >>> According to http://wiki.qemu-project.org/Features/COLO, killing the Primary Node
-> >>> qemu will not produce the problem, but a Primary Node panic can.
-> >>>
-> >>> I think this is because the channel does not support QIO_CHANNEL_FEATURE_SHUTDOWN.
-> >>>
-> >>> When failover happens, channel_shutdown could not shut down the channel,
-> >>> so colo_process_incoming_thread will hang at recvmsg.
-> >>>
-> >>> I test a patch:
-> >>>
-> >>> <snip>
-> >>>
-> >>> My test will not hang any more.
-> >>>
-> >>> Original Mail
-> >>>
-> >>> From: address@hidden
-> >>> To: 王广10165992 address@hidden
-> >>> Cc: address@hidden address@hidden
-> >>> Date: 2017-03-21 15:58
-> >>> Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang
-> >>>
-> >>> Hi, Wang.
-> >>>
-> >>> You can test this branch:
-> >>> https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
-> >>> and please follow the wiki to ensure your own configuration is correct.
-> >>> http://wiki.qemu-project.org/Features/COLO
-> >>>
-> >>> Thanks
-> >>> Zhang Chen
-> >>>
-> >>> On 03/21/2017 03:27 PM, address@hidden wrote:
-> >>> >
-> >>> > hi.
-> >>> >
-> >>> > I test the git qemu master have the same problem.
-> >>> >
-> >>> > (gdb) bt
-> >>> > #0  qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
-> >>> >     niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
-> >>> > #1  0x00007f658e4aa0c2 in qio_channel_read (address@hidden, address@hidden "",
-> >>> >     address@hidden, address@hidden) at io/channel.c:114
-> >>> > #2  0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
-> >>> >     buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at migration/qemu-file-channel.c:78
-> >>> > #3  0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at migration/qemu-file.c:295
-> >>> > #4  0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, address@hidden) at migration/qemu-file.c:555
-> >>> > #5  0x00007f658e3ea34b in qemu_get_byte (address@hidden) at migration/qemu-file.c:568
-> >>> > #6  0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at migration/qemu-file.c:648
-> >>> > #7  0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, address@hidden) at migration/colo.c:244
-> >>> > #8  0x00007f658e3e681e in colo_receive_check_message (f=<optimized out>, address@hidden, address@hidden) at migration/colo.c:264
-> >>> > #9  0x00007f658e3e740e in colo_process_incoming_thread (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
-> >>> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
-> >>> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
-> >>> > (gdb) p ioc->features     Do not support QIO_CHANNEL_FEATURE_SHUTDOWN
-> >>> > $3 = 0
-> >>> >
-> >>> > (gdb) bt
-> >>> > #0  socket_accept_incoming_migration (ioc=0x7fdcceeafa90, condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
-> >>> > #1  0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at gmain.c:3054
-> >>> > #2  g_main_context_dispatch (context=<optimized out>, address@hidden) at gmain.c:3630
-> >>> > #3  0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
-> >>> > #4  os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:258
-> >>> > #5  main_loop_wait (address@hidden) at util/main-loop.c:506
-> >>> > #6  0x00007fdccb526187 in main_loop () at vl.c:1898
-> >>> > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4709
-> >>> >
-> >>> > (gdb) p ioc->features
-> >>> > $1 = 6
-> >>> > (gdb) p ioc->name
-> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
-> >>> >
-> >>> > May be socket_accept_incoming_migration should
-> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)??
-> >>> >
-> >>> > thank you.
-> >>> >
-> >>> > Original Mail
-> >>> > address@hidden
-> >>> > address@hidden
-> >>> > address@hidden@huawei.com>
-> >>> > Date: 2017-03-16 14:46
-> >>> > Subject: Re: [Qemu-devel] COLO failover hang
-> >>> >
-> >>> > On 03/15/2017 05:06 PM, wangguang wrote:
-> >>> > > am testing QEMU COLO feature described here [QEMU Wiki](http://wiki.qemu-project.org/Features/COLO).
-> >>> > >
-> >>> > > When the Primary Node panic, the Secondary Node qemu hang.
-> >>> > > hang at recvmsg in qio_channel_socket_readv.
-> >>> > > And I run { 'execute': 'nbd-server-stop' } and { "execute": "x-colo-lost-heartbeat" }
-> >>> > > in Secondary VM's monitor, the Secondary Node qemu still hang at recvmsg.
-> >>> > >
-> >>> > > I found that the colo in qemu is not complete yet.
-> >>> > > Do the colo have any plan for development?
-> >>> >
-> >>> > Yes, We are developing. You can see some of patch we pushing.
-> >>> >
-> >>> > > Has anyone ever run it successfully? Any help is appreciated!
-> >>> >
-> >>> > In our internal version can run it successfully,
-> >>> > The failover detail you can ask Zhanghailiang for help.
-> >>> > Next time if you have some question about COLO,
-> >>> > please cc me and zhanghailiang address@hidden
-> >>> >
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>> > >
-> >>> > > centos7.2+qemu2.7.50
-> >>> > > (gdb) bt
-> >>> > > #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
-> >>> > > #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>,
-> >>> > >     iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:497
-> >>> > > #2  0x00007f3e03329472 in qio_channel_read (address@hidden, address@hidden "",
-> >>> > >     address@hidden, address@hidden) at io/channel.c:97
-> >>> > > #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
-> >>> > >     buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at migration/qemu-file-channel.c:78
-> >>> > > #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at migration/qemu-file.c:257
-> >>> > > #5  0x00007f3e03274a41 in qemu_peek_byte (address@hidden, address@hidden) at migration/qemu-file.c:510
-> >>> > > #6  0x00007f3e03274aab in qemu_get_byte (address@hidden) at migration/qemu-file.c:523
-> >>> > > #7  0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at migration/qemu-file.c:603
-> >>> > > #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, address@hidden) at migration/colo.c:215
-> >>> > > #9  0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48,
-> >>> > >     checkpoint_request=<synthetic pointer>, f=<optimized out>) at migration/colo.c:546
-> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at migration/colo.c:649
-> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
-> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
-> >>> > >
-> >>> > > --
-> >>> > > View this message in context: http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
-> >>> > > Sent from the Developer mailing list archive at Nabble.com.
-> >>> >
-> >>> > --
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>
-> > --
-> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
-> >
-> > .
-> >
-
-On 2017/3/22 16:09, address@hidden wrote:
-hi:
-
-yes.it is better.
-
-And should we delete
-Yes, you are right.
-#ifdef WIN32
-
- QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL)
-
-#endif
-
-
-
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already have it.
-
-
-
-
-
-
-
-
-
-
-
-
-原始邮件
-
-
-
-发件人: address@hidden
-收件人:王广10165992
-抄送人: address@hidden address@hidden address@hidden address@hidden
-日 期 :2017年03月22日 15:03
-主 题 :Re: [Qemu-devel] 答复: Re: 答复: Re: 答复: Re: [BUG]COLO failover hang
-
-
-
-
-
-Hi,
-
-On 2017/3/22 9:42, address@hidden wrote:
-> diff --git a/migration/socket.c b/migration/socket.c
->
->
-> index 13966f1..d65a0ea 100644
->
->
-> --- a/migration/socket.c
->
->
-> +++ b/migration/socket.c
->
->
-> @@ -147,8 +147,9 @@ static gboolean
-socket_accept_incoming_migration(QIOChannel *ioc,
->
->
-> }
->
->
->
->
->
-> trace_migration_socket_incoming_accepted()
->
->
->
->
->
-> qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming")
->
->
-> + qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN)
->
->
-> migration_channel_process_incoming(migrate_get_current(),
->
->
-> QIO_CHANNEL(sioc))
->
->
-> object_unref(OBJECT(sioc))
->
->
->
->
-> Is this patch ok?
->
-
-Yes, i think this works, but a better way maybe to call
-qio_channel_set_feature()
-in qio_channel_socket_accept(), we didn't set the SHUTDOWN feature for the
-socket accept fd,
-Or fix it by this:
-
-diff --git a/io/channel-socket.c b/io/channel-socket.c
-index f546c68..ce6894c 100644
---- a/io/channel-socket.c
-+++ b/io/channel-socket.c
-@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
- Error **errp)
- {
- QIOChannelSocket *cioc
--
-- cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET))
-- cioc->fd = -1
-+
-+ cioc = qio_channel_socket_new()
- cioc->remoteAddrLen = sizeof(ioc->remoteAddr)
- cioc->localAddrLen = sizeof(ioc->localAddr)
-
-
-Thanks,
-Hailiang
-
-> I have test it . The test could not hang any more.
->
->
->
->
->
->
->
->
->
->
->
->
-> 原始邮件
->
->
->
-> 发件人: address@hidden
-> 收件人: address@hidden address@hidden
-> 抄送人: address@hidden address@hidden address@hidden
-> 日 期 :2017年03月22日 09:11
-> 主 题 :Re: [Qemu-devel] 答复: Re: 答复: Re: [BUG]COLO failover hang
->
->
->
->
->
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
-> > * Hailiang Zhang (address@hidden) wrote:
-> >> Hi,
-> >>
-> >> Thanks for reporting this, and i confirmed it in my test, and it is a bug.
-> >>
-> >> Though we tried to call qemu_file_shutdown() to shutdown the related fd, in
-> >> case COLO thread/incoming thread is stuck in read/write() while do
-failover,
-> >> but it didn't take effect, because all the fd used by COLO (also migration)
-> >> has been wrapped by qio channel, and it will not call the shutdown API if
-> >> we didn't qio_channel_set_feature(QIO_CHANNEL(sioc),
-QIO_CHANNEL_FEATURE_SHUTDOWN).
-> >>
-> >> Cc: Dr. David Alan Gilbert address@hidden
-> >>
-> >> I doubted migration cancel has the same problem, it may be stuck in write()
-> >> if we tried to cancel migration.
-> >>
-> >> void fd_start_outgoing_migration(MigrationState *s, const char *fdname,
-Error **errp)
-> >> {
-> >> qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing")
-> >> migration_channel_connect(s, ioc, NULL)
-> >> ... ...
-> >> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc),
-QIO_CHANNEL_FEATURE_SHUTDOWN) above,
-> >> and the
-> >> migrate_fd_cancel()
-> >> {
-> >> ... ...
-> >> if (s->state == MIGRATION_STATUS_CANCELLING && f) {
-> >> qemu_file_shutdown(f) --> This will not take effect. No ?
-> >> }
-> >> }
-> >
-> > (cc'd in Daniel Berrange).
-> > I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)
-> > at the top of qio_channel_socket_new, so I think that's safe, isn't it?
-> >
->
-> Hmm, you are right, this problem only exists for the migration incoming fd,
-> thanks.
->
-> > Dave
-> >
-> >> Thanks,
-> >> Hailiang
-> >>
-> >> On 2017/3/21 16:10, address@hidden wrote:
-> >>> Thank you.
-> >>>
-> >>> I have tested it already.
-> >>>
-> >>> When the Primary Node panics, the Secondary Node qemu hangs at the same place.
-> >>>
-> >>> According to http://wiki.qemu-project.org/Features/COLO, killing the Primary
-> >>> Node qemu does not produce the problem, but a Primary Node panic does.
-> >>>
-> >>> I think this is because the channel does not support QIO_CHANNEL_FEATURE_SHUTDOWN,
-> >>>
-> >>> so when doing failover, channel_shutdown could not shut down the channel,
-> >>>
-> >>> and colo_process_incoming_thread will hang at recvmsg.
-> >>>
-> >>> I tested a patch:
-> >>>
-> >>> diff --git a/migration/socket.c b/migration/socket.c
-> >>> index 13966f1..d65a0ea 100644
-> >>> --- a/migration/socket.c
-> >>> +++ b/migration/socket.c
-> >>> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
-> >>>      }
-> >>>
-> >>>      trace_migration_socket_incoming_accepted()
-> >>>
-> >>>      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming")
-> >>> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN)
-> >>>      migration_channel_process_incoming(migrate_get_current(),
-> >>>                                         QIO_CHANNEL(sioc))
-> >>>      object_unref(OBJECT(sioc))
-> >>>
-> >>> My test does not hang any more.
-> >>>
-> >>> Original Mail
-> >>>
-> >>> From: address@hidden
-> >>> To: Wang Guang 10165992 address@hidden
-> >>> Cc: address@hidden address@hidden
-> >>> Date: 2017-03-21 15:58
-> >>> Subject: Re: [Qemu-devel] Reply: Re: [BUG] COLO failover hang
-> >>>
-> >>> Hi, Wang.
-> >>>
-> >>> You can test this branch:
-> >>> https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
-> >>>
-> >>> and please follow the wiki to make sure your configuration is correct:
-> >>> http://wiki.qemu-project.org/Features/COLO
-> >>>
-> >>> Thanks
-> >>> Zhang Chen
-> >>>
-> >>> On 03/21/2017 03:27 PM, address@hidden wrote:
-> >>> >
-> >>> > hi.
-> >>> >
-> >>> > I tested git qemu master and it has the same problem.
-> >>> >
-> >>> > (gdb) bt
-> >>> >
-> >>> > #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
-> >>> > niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
-> >>> >
-> >>> > #1 0x00007f658e4aa0c2 in qio_channel_read
-> >>> > (address@hidden, address@hidden "",
-> >>> > address@hidden, address@hidden) at io/channel.c:114
-> >>> >
-> >>> > #2 0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
-> >>> > buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at
-> >>> > migration/qemu-file-channel.c:78
-> >>> >
-> >>> > #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at
-> >>> > migration/qemu-file.c:295
-> >>> >
-> >>> > #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden,
-> >>> > address@hidden) at migration/qemu-file.c:555
-> >>> >
-> >>> > #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at
-> >>> > migration/qemu-file.c:568
-> >>> >
-> >>> > #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at
-> >>> > migration/qemu-file.c:648
-> >>> >
-> >>> > #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800,
-> >>> > address@hidden) at migration/colo.c:244
-> >>> >
-> >>> > #8 0x00007f658e3e681e in colo_receive_check_message (f=<optimized
-> >>> > out>, address@hidden,
-> >>> > address@hidden)
-> >>> >
-> >>> > at migration/colo.c:264
-> >>> >
-> >>> > #9 0x00007f658e3e740e in colo_process_incoming_thread
-> >>> > (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
-> >>> >
-> >>> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
-> >>> >
-> >>> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> >
-> >>> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
-> >>> >
-> >>> > (gdb) p ioc->features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN
-> >>> >
-> >>> > $3 = 0
-> >>> >
-> >>> >
-> >>> > (gdb) bt
-> >>> >
-> >>> > #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90,
-> >>> > condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
-> >>> >
-> >>> > #1 0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at
-> >>> > gmain.c:3054
-> >>> >
-> >>> > #2 g_main_context_dispatch (context=<optimized out>,
-> >>> > address@hidden) at gmain.c:3630
-> >>> >
-> >>> > #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
-> >>> >
-> >>> > #4 os_host_main_loop_wait (timeout=<optimized out>) at
-> >>> > util/main-loop.c:258
-> >>> >
-> >>> > #5 main_loop_wait (address@hidden) at
-> >>> > util/main-loop.c:506
-> >>> >
-> >>> > #6 0x00007fdccb526187 in main_loop () at vl.c:1898
-> >>> >
-> >>> > #7 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
-> >>> > out>) at vl.c:4709
-> >>> >
-> >>> > (gdb) p ioc->features
-> >>> >
-> >>> > $1 = 6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> >
-> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
-> >>> >
-> >>> >
-> >>> > Maybe socket_accept_incoming_migration should
-> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?
-> >>> >
-> >>> >
-> >>> > thank you.
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> > Original Mail
-> >>> > address@hidden
-> >>> > address@hidden
-> >>> > address@hidden@huawei.com>
-> >>> > *Date:* 2017-03-16 14:46
-> >>> > *Subject:* *Re: [Qemu-devel] COLO failover hang*
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> > On 03/15/2017 05:06 PM, wangguang wrote:
-> >>> > > I am testing the QEMU COLO feature described here
-> >>> > > [QEMU Wiki](http://wiki.qemu-project.org/Features/COLO).
-> >>> > >
-> >>> > > When the Primary Node panics, the Secondary Node qemu hangs,
-> >>> > > at recvmsg in qio_channel_socket_readv.
-> >>> > > And when I run { 'execute': 'nbd-server-stop' } and { "execute":
-> >>> > > "x-colo-lost-heartbeat" } in the Secondary VM's
-> >>> > > monitor, the Secondary Node qemu still hangs at recvmsg.
-> >>> > >
-> >>> > > I found that COLO in qemu is not complete yet.
-> >>> > > Is there any development plan for COLO?
-> >>> >
-> >>> > Yes, we are developing it. You can see some of the patches we are pushing.
-> >>> >
-> >>> > > Has anyone ever run it successfully? Any help is appreciated!
-> >>> >
-> >>> > Our internal version can run it successfully.
-> >>> > For failover details you can ask Zhanghailiang for help.
-> >>> > Next time, if you have questions about COLO,
-> >>> > please cc me and zhanghailiang address@hidden
-> >>> >
-> >>> >
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>> >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > > centos7.2+qemu2.7.50
-> >>> > > (gdb) bt
-> >>> > > #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
-> >>> > > #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized
-out>,
-> >>> > > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
-errp=0x0) at
-> >>> > > io/channel-socket.c:497
-> >>> > > #2 0x00007f3e03329472 in qio_channel_read (address@hidden,
-> >>> > > address@hidden "", address@hidden,
-> >>> > > address@hidden) at io/channel.c:97
-> >>> > > #3 0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
-> >>> > > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at
-> >>> > > migration/qemu-file-channel.c:78
-> >>> > > #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at
-> >>> > > migration/qemu-file.c:257
-> >>> > > #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden,
-> >>> > > address@hidden) at migration/qemu-file.c:510
-> >>> > > #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at
-> >>> > > migration/qemu-file.c:523
-> >>> > > #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at
-> >>> > > migration/qemu-file.c:603
-> >>> > > #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00,
-> >>> > > address@hidden) at migration/colo.c:215
-> >>> > > #9 0x00007f3e0327250d in colo_wait_handle_message
-(errp=0x7f3d62bfaa48,
-> >>> > > checkpoint_request=<synthetic pointer>, f=<optimized out>) at
-> >>> > > migration/colo.c:546
-> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at
-> >>> > > migration/colo.c:649
-> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
-> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > > --
-> >>> > > View this message in context:
-http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
-> >>> > > Sent from the Developer mailing list archive at Nabble.com.
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> > >
-> >>> >
-> >>> > --
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>> >
-> >>>
-> >>
-> > --
-> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
-> >
-> > .
-> >
->
-
diff --git a/classification_output/05/mistranslation/74466963 b/classification_output/05/mistranslation/74466963
deleted file mode 100644
index ceba0270..00000000
--- a/classification_output/05/mistranslation/74466963
+++ /dev/null
@@ -1,1886 +0,0 @@
-mistranslation: 0.927
-assembly: 0.910
-device: 0.909
-instruction: 0.903
-KVM: 0.903
-graphic: 0.895
-boot: 0.894
-semantic: 0.891
-socket: 0.879
-vnc: 0.878
-other: 0.877
-network: 0.871
-
-[Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
-
-Hi all,
-
-Does anybody remember the similar issue posted by hailiang months ago:
-http://patchwork.ozlabs.org/patch/454322/
-At least two bugs about migration have been fixed since then.
-And now we have found the same issue with a TCG VM (KVM is fine): after
-migration, the content of the VM's memory is inconsistent.
-We added a patch to check the memory content; you can find it in the appendix.
-
-Steps to reproduce:
-1) apply the patch and re-build qemu
-2) prepare the ubuntu guest and run memtest in grub.
-source side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off
-destination side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
-3) start migration
-with 1000M NIC, migration will finish within 3 min.
-
-at source:
-(qemu) migrate tcp:192.168.2.66:8881
-after saving ram complete
-e9e725df678d392b1a83b3a917f332bb
-qemu-system-x86_64: end ram md5
-(qemu)
-
-at destination:
-...skip...
-Completed load of VM with exit code 0 seq iteration 1264
-Completed load of VM with exit code 0 seq iteration 1265
-Completed load of VM with exit code 0 seq iteration 1266
-qemu-system-x86_64: after loading state section id 2(ram)
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
-
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-
-This occurs occasionally and only on a TCG machine. It seems that
-some pages dirtied on the source side are not transferred to the destination.
-This problem can be reproduced even if we disable virtio.
-Is it acceptable for some pages not to be transferred to the destination during
-migration? Or is it a bug?
-Any idea...
-
-=================md5 check patch=============================
-
-diff --git a/Makefile.target b/Makefile.target
-index 962d004..e2cb8e9 100644
---- a/Makefile.target
-+++ b/Makefile.target
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
- obj-y += memory_mapping.o
- obj-y += dump.o
- obj-y += migration/ram.o migration/savevm.o
--LIBS := $(libs_softmmu) $(LIBS)
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
-
- # xen support
- obj-$(CONFIG_XEN) += xen-common.o
-diff --git a/migration/ram.c b/migration/ram.c
-index 1eb155a..3b7a09d 100644
---- a/migration/ram.c
-+++ b/migration/ram.c
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
-version_id)
-}
-
- rcu_read_unlock();
-- DPRINTF("Completed load of VM with exit code %d seq iteration "
-+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
- "%" PRIu64 "\n", ret, seq_iter);
- return ret;
- }
-diff --git a/migration/savevm.c b/migration/savevm.c
-index 0ad1b93..3feaa61 100644
---- a/migration/savevm.c
-+++ b/migration/savevm.c
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
-
- }
-
-+#include "exec/ram_addr.h"
-+#include "qemu/rcu_queue.h"
-+#include <clplumbing/md5.h>
-+#ifndef MD5_DIGEST_LENGTH
-+#define MD5_DIGEST_LENGTH 16
-+#endif
-+
-+static void check_host_md5(void)
-+{
-+ int i;
-+ unsigned char md[MD5_DIGEST_LENGTH];
-+ rcu_read_lock();
-+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
-'pc.ram' block */
-+ rcu_read_unlock();
-+
-+ MD5(block->host, block->used_length, md);
-+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
-+ fprintf(stderr, "%02x", md[i]);
-+ }
-+ fprintf(stderr, "\n");
-+ error_report("end ram md5");
-+}
-+
- void qemu_savevm_state_begin(QEMUFile *f,
- const MigrationParams *params)
- {
-@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile
-*f, bool iterable_only)
-save_section_header(f, se, QEMU_VM_SECTION_END);
-
- ret = se->ops->save_live_complete_precopy(f, se->opaque);
-+
-+ fprintf(stderr, "after saving %s complete\n", se->idstr);
-+ check_host_md5();
-+
- trace_savevm_section_end(se->idstr, se->section_id, ret);
- save_section_footer(f, se);
- if (ret < 0) {
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
-MigrationIncomingState *mis)
-section_id, le->se->idstr);
- return ret;
- }
-+ if (section_type == QEMU_VM_SECTION_END) {
-+ error_report("after loading state section id %d(%s)",
-+ section_id, le->se->idstr);
-+ check_host_md5();
-+ }
- if (!check_section_footer(f, le)) {
- return -EINVAL;
- }
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
- }
-
- cpu_synchronize_all_post_init();
-+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
-+ check_host_md5();
-
- return ret;
- }
-
-* Li Zhijian (address@hidden) wrote:
-> Hi all,
->
-> Does anybody remember the similar issue posted by hailiang months ago:
-> http://patchwork.ozlabs.org/patch/454322/
-> At least two bugs about migration have been fixed since then.
-
-Yes, I wondered what happened to that.
-
-> And now we found the same issue with a TCG VM (KVM is fine): after migration,
-> the content of the VM's memory is inconsistent.
-
-Hmm, TCG only - I don't know much about that; but I guess something must
-be accessing memory without using the proper macros/functions so
-it doesn't mark it as dirty.
-
-> we added a patch to check the memory content, you can find it in the appendix
->
-> steps to reproduce:
-> 1) apply the patch and re-build qemu
-> 2) prepare the ubuntu guest and run memtest in grub.
-> source side:
-> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-> pc-i440fx-2.3,accel=tcg,usb=off
->
-> destination side:
-> x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-> e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-> if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-> -vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-> tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-> pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
->
-> 3) start migration
-> with 1000M NIC, migration will finish within 3 min.
->
-> at source:
-> (qemu) migrate tcp:192.168.2.66:8881
-> after saving ram complete
-> e9e725df678d392b1a83b3a917f332bb
-> qemu-system-x86_64: end ram md5
-> (qemu)
->
-> at destination:
-> ...skip...
-> Completed load of VM with exit code 0 seq iteration 1264
-> Completed load of VM with exit code 0 seq iteration 1265
-> Completed load of VM with exit code 0 seq iteration 1266
-> qemu-system-x86_64: after loading state section id 2(ram)
-> 49c2dac7bde0e5e22db7280dcb3824f9
-> qemu-system-x86_64: end ram md5
-> qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
->
-> 49c2dac7bde0e5e22db7280dcb3824f9
-> qemu-system-x86_64: end ram md5
->
-> This occurs occasionally and only on a TCG machine. It seems that
-> some pages dirtied on the source side are not transferred to the destination.
-> This problem can be reproduced even if we disable virtio.
->
-> Is it acceptable for some pages not to be transferred to the destination
-> during migration? Or is it a bug?
-
-I'm pretty sure that means it's a bug. Hard to find though, I guess
-at least memtest is smaller than a big OS. I think I'd dump the whole
-of memory on both sides, hexdump and diff them - I'd guess it would
-just be one byte/word different, maybe that would offer some idea what
-wrote it.
-
-Dave
-
->
-Any idea...
->
->
-=================md5 check patch=============================
->
->
-diff --git a/Makefile.target b/Makefile.target
->
-index 962d004..e2cb8e9 100644
->
---- a/Makefile.target
->
-+++ b/Makefile.target
->
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
->
-obj-y += memory_mapping.o
->
-obj-y += dump.o
->
-obj-y += migration/ram.o migration/savevm.o
->
--LIBS := $(libs_softmmu) $(LIBS)
->
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
->
->
-# xen support
->
-obj-$(CONFIG_XEN) += xen-common.o
->
-diff --git a/migration/ram.c b/migration/ram.c
->
-index 1eb155a..3b7a09d 100644
->
---- a/migration/ram.c
->
-+++ b/migration/ram.c
->
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
->
-version_id)
->
-}
->
->
-rcu_read_unlock();
->
-- DPRINTF("Completed load of VM with exit code %d seq iteration "
->
-+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
->
-"%" PRIu64 "\n", ret, seq_iter);
->
-return ret;
->
-}
->
-diff --git a/migration/savevm.c b/migration/savevm.c
->
-index 0ad1b93..3feaa61 100644
->
---- a/migration/savevm.c
->
-+++ b/migration/savevm.c
->
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
->
->
-}
->
->
-+#include "exec/ram_addr.h"
->
-+#include "qemu/rcu_queue.h"
->
-+#include <clplumbing/md5.h>
->
-+#ifndef MD5_DIGEST_LENGTH
->
-+#define MD5_DIGEST_LENGTH 16
->
-+#endif
->
-+
->
-+static void check_host_md5(void)
->
-+{
->
-+ int i;
->
-+ unsigned char md[MD5_DIGEST_LENGTH];
->
-+ rcu_read_lock();
->
-+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
->
-'pc.ram' block */
->
-+ rcu_read_unlock();
->
-+
->
-+ MD5(block->host, block->used_length, md);
->
-+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
->
-+ fprintf(stderr, "%02x", md[i]);
->
-+ }
->
-+ fprintf(stderr, "\n");
->
-+ error_report("end ram md5");
->
-+}
->
-+
->
-void qemu_savevm_state_begin(QEMUFile *f,
->
-const MigrationParams *params)
->
-{
->
-@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
->
-bool iterable_only)
->
-save_section_header(f, se, QEMU_VM_SECTION_END);
->
->
-ret = se->ops->save_live_complete_precopy(f, se->opaque);
->
-+
->
-+ fprintf(stderr, "after saving %s complete\n", se->idstr);
->
-+ check_host_md5();
->
-+
->
-trace_savevm_section_end(se->idstr, se->section_id, ret);
->
-save_section_footer(f, se);
->
-if (ret < 0) {
->
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
->
-MigrationIncomingState *mis)
->
-section_id, le->se->idstr);
->
-return ret;
->
-}
->
-+ if (section_type == QEMU_VM_SECTION_END) {
->
-+ error_report("after loading state section id %d(%s)",
->
-+ section_id, le->se->idstr);
->
-+ check_host_md5();
->
-+ }
->
-if (!check_section_footer(f, le)) {
->
-return -EINVAL;
->
-}
->
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
->
-}
->
->
-cpu_synchronize_all_post_init();
->
-+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
->
-+ check_host_md5();
->
->
-return ret;
->
-}
->
->
->
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
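-To illustrate the failure mode suggested above (guest RAM written through a raw
-host pointer without the migration dirty bitmap being updated), here is a
-minimal sketch. It uses names from QEMU's memory API, but signatures vary
-between versions, so treat it as an illustration rather than a quote of real
-QEMU code:
-
-#include "qemu/osdep.h"
-#include "exec/memory.h"
-
-static void write_guest_ram(MemoryRegion *ram_mr, hwaddr offset,
-                            const uint8_t *data, hwaddr len)
-{
-    uint8_t *host = memory_region_get_ram_ptr(ram_mr);
-
-    /* Raw write: migration never learns that this page changed... */
-    memcpy(host + offset, data, len);
-
-    /*
-     * ...unless the page is marked dirty explicitly.  Forgetting a call
-     * like this is exactly the kind of bug that leaves stale pages on
-     * the destination after migration.
-     */
-    memory_region_set_dirty(ram_mr, offset, len);
-}
-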
-On 2015/12/3 17:24, Dr. David Alan Gilbert wrote:
-* Li Zhijian (address@hidden) wrote:
-Hi all,
-
-Does anyboday remember the similar issue post by hailiang months ago
-http://patchwork.ozlabs.org/patch/454322/
-At least tow bugs about migration had been fixed since that.
-Yes, I wondered what happened to that.
-And now we found the same issue at the tcg vm(kvm is fine), after migration,
-the content VM's memory is inconsistent.
-Hmm, TCG only - I don't know much about that; but I guess something must
-be accessing memory without using the proper macros/functions so
-it doesn't mark it as dirty.
-we add a patch to check memory content, you can find it from affix
-
-steps to reporduce:
-1) apply the patch and re-build qemu
-2) prepare the ubuntu guest and run memtest in grub.
-soruce side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off
-
-destination side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
-
-3) start migration
-with 1000M NIC, migration will finish within 3 min.
-
-at source:
-(qemu) migrate tcp:192.168.2.66:8881
-after saving ram complete
-e9e725df678d392b1a83b3a917f332bb
-qemu-system-x86_64: end ram md5
-(qemu)
-
-at destination:
-...skip...
-Completed load of VM with exit code 0 seq iteration 1264
-Completed load of VM with exit code 0 seq iteration 1265
-Completed load of VM with exit code 0 seq iteration 1266
-qemu-system-x86_64: after loading state section id 2(ram)
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
-
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-
-This occurs occasionally and only at tcg machine. It seems that
-some pages dirtied in source side don't transferred to destination.
-This problem can be reproduced even if we disable virtio.
-
-Is it OK for some pages that not transferred to destination when do
-migration ? Or is it a bug?
-I'm pretty sure that means it's a bug. Hard to find though, I guess
-at least memtest is smaller than a big OS. I think I'd dump the whole
-of memory on both sides, hexdump and diff them - I'd guess it would
-just be one byte/word different, maybe that would offer some idea what
-wrote it.
-Maybe a better way to do that is with the help of userfaultfd's write-protect
-capability. It is still in development by Andrea Arcangeli, but there
-is an RFC version available; please refer to
-http://www.spinics.net/lists/linux-mm/msg97422.html
-(I'm developing live memory snapshot based on it; maybe this is another
-scenario where we can use userfaultfd's WP ;) ).
-Dave
-Any idea...
-
-=================md5 check patch=============================
-
-diff --git a/Makefile.target b/Makefile.target
-index 962d004..e2cb8e9 100644
---- a/Makefile.target
-+++ b/Makefile.target
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
- obj-y += memory_mapping.o
- obj-y += dump.o
- obj-y += migration/ram.o migration/savevm.o
--LIBS := $(libs_softmmu) $(LIBS)
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
-
- # xen support
- obj-$(CONFIG_XEN) += xen-common.o
-diff --git a/migration/ram.c b/migration/ram.c
-index 1eb155a..3b7a09d 100644
---- a/migration/ram.c
-+++ b/migration/ram.c
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
-version_id)
- }
-
- rcu_read_unlock();
-- DPRINTF("Completed load of VM with exit code %d seq iteration "
-+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
- "%" PRIu64 "\n", ret, seq_iter);
- return ret;
- }
-diff --git a/migration/savevm.c b/migration/savevm.c
-index 0ad1b93..3feaa61 100644
---- a/migration/savevm.c
-+++ b/migration/savevm.c
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
-
- }
-
-+#include "exec/ram_addr.h"
-+#include "qemu/rcu_queue.h"
-+#include <clplumbing/md5.h>
-+#ifndef MD5_DIGEST_LENGTH
-+#define MD5_DIGEST_LENGTH 16
-+#endif
-+
-+static void check_host_md5(void)
-+{
-+ int i;
-+ unsigned char md[MD5_DIGEST_LENGTH];
-+ rcu_read_lock();
-+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
-'pc.ram' block */
-+ rcu_read_unlock();
-+
-+ MD5(block->host, block->used_length, md);
-+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
-+ fprintf(stderr, "%02x", md[i]);
-+ }
-+ fprintf(stderr, "\n");
-+ error_report("end ram md5");
-+}
-+
- void qemu_savevm_state_begin(QEMUFile *f,
- const MigrationParams *params)
- {
-@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
-bool iterable_only)
- save_section_header(f, se, QEMU_VM_SECTION_END);
-
- ret = se->ops->save_live_complete_precopy(f, se->opaque);
-+
-+ fprintf(stderr, "after saving %s complete\n", se->idstr);
-+ check_host_md5();
-+
- trace_savevm_section_end(se->idstr, se->section_id, ret);
- save_section_footer(f, se);
- if (ret < 0) {
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
-MigrationIncomingState *mis)
- section_id, le->se->idstr);
- return ret;
- }
-+ if (section_type == QEMU_VM_SECTION_END) {
-+ error_report("after loading state section id %d(%s)",
-+ section_id, le->se->idstr);
-+ check_host_md5();
-+ }
- if (!check_section_footer(f, le)) {
- return -EINVAL;
- }
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
- }
-
- cpu_synchronize_all_post_init();
-+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
-+ check_host_md5();
-
- return ret;
- }
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-.
-
-On 12/03/2015 05:37 PM, Hailiang Zhang wrote:
-On 2015/12/3 17:24, Dr. David Alan Gilbert wrote:
-* Li Zhijian (address@hidden) wrote:
-Hi all,
-
-Does anyboday remember the similar issue post by hailiang months ago
-http://patchwork.ozlabs.org/patch/454322/
-At least tow bugs about migration had been fixed since that.
-Yes, I wondered what happened to that.
-And now we found the same issue at the tcg vm(kvm is fine), after
-migration,
-the content VM's memory is inconsistent.
-Hmm, TCG only - I don't know much about that; but I guess something must
-be accessing memory without using the proper macros/functions so
-it doesn't mark it as dirty.
-we add a patch to check memory content, you can find it from affix
-
-steps to reporduce:
-1) apply the patch and re-build qemu
-2) prepare the ubuntu guest and run memtest in grub.
-soruce side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off
-
-destination side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
-
-3) start migration
-with 1000M NIC, migration will finish within 3 min.
-
-at source:
-(qemu) migrate tcp:192.168.2.66:8881
-after saving ram complete
-e9e725df678d392b1a83b3a917f332bb
-qemu-system-x86_64: end ram md5
-(qemu)
-
-at destination:
-...skip...
-Completed load of VM with exit code 0 seq iteration 1264
-Completed load of VM with exit code 0 seq iteration 1265
-Completed load of VM with exit code 0 seq iteration 1266
-qemu-system-x86_64: after loading state section id 2(ram)
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-qemu-system-x86_64: qemu_loadvm_state: after
-cpu_synchronize_all_post_init
-
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-
-This occurs occasionally and only at tcg machine. It seems that
-some pages dirtied in source side don't transferred to destination.
-This problem can be reproduced even if we disable virtio.
-
-Is it OK for some pages that not transferred to destination when do
-migration ? Or is it a bug?
-I'm pretty sure that means it's a bug. Hard to find though, I guess
-at least memtest is smaller than a big OS. I think I'd dump the whole
-of memory on both sides, hexdump and diff them - I'd guess it would
-just be one byte/word different, maybe that would offer some idea what
-wrote it.
-Maybe one better way to do that is with the help of userfaultfd's
-write-protect
-capability. It is still in the development by Andrea Arcangeli, but there
-is a RFC version available, please refer to
-http://www.spinics.net/lists/linux-mm/msg97422.html
-(I'm developing live memory snapshot which based on it, maybe this is
-another scene where we
-can use userfaultfd's WP ;) ).
-sounds good.
-
-thanks
-Li
-Dave
-Any idea...
-
-=================md5 check patch=============================
-
-diff --git a/Makefile.target b/Makefile.target
-index 962d004..e2cb8e9 100644
---- a/Makefile.target
-+++ b/Makefile.target
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
- obj-y += memory_mapping.o
- obj-y += dump.o
- obj-y += migration/ram.o migration/savevm.o
--LIBS := $(libs_softmmu) $(LIBS)
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
-
- # xen support
- obj-$(CONFIG_XEN) += xen-common.o
-diff --git a/migration/ram.c b/migration/ram.c
-index 1eb155a..3b7a09d 100644
---- a/migration/ram.c
-+++ b/migration/ram.c
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
-version_id)
- }
-
- rcu_read_unlock();
-- DPRINTF("Completed load of VM with exit code %d seq iteration "
-+ fprintf(stderr, "Completed load of VM with exit code %d seq
-iteration "
- "%" PRIu64 "\n", ret, seq_iter);
- return ret;
- }
-diff --git a/migration/savevm.c b/migration/savevm.c
-index 0ad1b93..3feaa61 100644
---- a/migration/savevm.c
-+++ b/migration/savevm.c
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
-
- }
-
-+#include "exec/ram_addr.h"
-+#include "qemu/rcu_queue.h"
-+#include <clplumbing/md5.h>
-+#ifndef MD5_DIGEST_LENGTH
-+#define MD5_DIGEST_LENGTH 16
-+#endif
-+
-+static void check_host_md5(void)
-+{
-+ int i;
-+ unsigned char md[MD5_DIGEST_LENGTH];
-+ rcu_read_lock();
-+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
-'pc.ram' block */
-+ rcu_read_unlock();
-+
-+ MD5(block->host, block->used_length, md);
-+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
-+ fprintf(stderr, "%02x", md[i]);
-+ }
-+ fprintf(stderr, "\n");
-+ error_report("end ram md5");
-+}
-+
- void qemu_savevm_state_begin(QEMUFile *f,
- const MigrationParams *params)
- {
-@@ -1056,6 +1079,10 @@ void
-qemu_savevm_state_complete_precopy(QEMUFile *f,
-bool iterable_only)
- save_section_header(f, se, QEMU_VM_SECTION_END);
-
- ret = se->ops->save_live_complete_precopy(f, se->opaque);
-+
-+ fprintf(stderr, "after saving %s complete\n", se->idstr);
-+ check_host_md5();
-+
- trace_savevm_section_end(se->idstr, se->section_id, ret);
- save_section_footer(f, se);
- if (ret < 0) {
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
-MigrationIncomingState *mis)
- section_id, le->se->idstr);
- return ret;
- }
-+ if (section_type == QEMU_VM_SECTION_END) {
-+ error_report("after loading state section id %d(%s)",
-+ section_id, le->se->idstr);
-+ check_host_md5();
-+ }
- if (!check_section_footer(f, le)) {
- return -EINVAL;
- }
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
- }
-
- cpu_synchronize_all_post_init();
-+ error_report("%s: after cpu_synchronize_all_post_init\n",
-__func__);
-+ check_host_md5();
-
- return ret;
- }
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-.
-.
---
-Best regards.
-Li Zhijian (8555)
-
-On 12/03/2015 05:24 PM, Dr. David Alan Gilbert wrote:
-* Li Zhijian (address@hidden) wrote:
-Hi all,
-
-Does anyboday remember the similar issue post by hailiang months ago
-http://patchwork.ozlabs.org/patch/454322/
-At least tow bugs about migration had been fixed since that.
-Yes, I wondered what happened to that.
-And now we found the same issue at the tcg vm(kvm is fine), after migration,
-the content VM's memory is inconsistent.
-Hmm, TCG only - I don't know much about that; but I guess something must
-be accessing memory without using the proper macros/functions so
-it doesn't mark it as dirty.
-we add a patch to check memory content, you can find it from affix
-
-steps to reporduce:
-1) apply the patch and re-build qemu
-2) prepare the ubuntu guest and run memtest in grub.
-soruce side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off
-
-destination side:
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
-
-3) start migration
-with 1000M NIC, migration will finish within 3 min.
-
-at source:
-(qemu) migrate tcp:192.168.2.66:8881
-after saving ram complete
-e9e725df678d392b1a83b3a917f332bb
-qemu-system-x86_64: end ram md5
-(qemu)
-
-at destination:
-...skip...
-Completed load of VM with exit code 0 seq iteration 1264
-Completed load of VM with exit code 0 seq iteration 1265
-Completed load of VM with exit code 0 seq iteration 1266
-qemu-system-x86_64: after loading state section id 2(ram)
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
-
-49c2dac7bde0e5e22db7280dcb3824f9
-qemu-system-x86_64: end ram md5
-
-This occurs occasionally and only at tcg machine. It seems that
-some pages dirtied in source side don't transferred to destination.
-This problem can be reproduced even if we disable virtio.
-
-Is it OK for some pages that not transferred to destination when do
-migration ? Or is it a bug?
-I'm pretty sure that means it's a bug. Hard to find though, I guess
-at least memtest is smaller than a big OS. I think I'd dump the whole
-of memory on both sides, hexdump and diff them - I'd guess it would
-just be one byte/word different, maybe that would offer some idea what
-wrote it.
-I tried to dump and compare them; more than 10 pages are different.
-On the source side they are random values, rather than always 'FF' 'FB' 'EF'
-'BF'... as on the destination.
-And not all of the differing pages are contiguous.
-
-thanks
-Li
-Dave
-Any idea...
-
-=================md5 check patch=============================
-
-diff --git a/Makefile.target b/Makefile.target
-index 962d004..e2cb8e9 100644
---- a/Makefile.target
-+++ b/Makefile.target
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
- obj-y += memory_mapping.o
- obj-y += dump.o
- obj-y += migration/ram.o migration/savevm.o
--LIBS := $(libs_softmmu) $(LIBS)
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
-
- # xen support
- obj-$(CONFIG_XEN) += xen-common.o
-diff --git a/migration/ram.c b/migration/ram.c
-index 1eb155a..3b7a09d 100644
---- a/migration/ram.c
-+++ b/migration/ram.c
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
-version_id)
- }
-
- rcu_read_unlock();
-- DPRINTF("Completed load of VM with exit code %d seq iteration "
-+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
- "%" PRIu64 "\n", ret, seq_iter);
- return ret;
- }
-diff --git a/migration/savevm.c b/migration/savevm.c
-index 0ad1b93..3feaa61 100644
---- a/migration/savevm.c
-+++ b/migration/savevm.c
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
-
- }
-
-+#include "exec/ram_addr.h"
-+#include "qemu/rcu_queue.h"
-+#include <clplumbing/md5.h>
-+#ifndef MD5_DIGEST_LENGTH
-+#define MD5_DIGEST_LENGTH 16
-+#endif
-+
-+static void check_host_md5(void)
-+{
-+ int i;
-+ unsigned char md[MD5_DIGEST_LENGTH];
-+ rcu_read_lock();
-+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
-'pc.ram' block */
-+ rcu_read_unlock();
-+
-+ MD5(block->host, block->used_length, md);
-+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
-+ fprintf(stderr, "%02x", md[i]);
-+ }
-+ fprintf(stderr, "\n");
-+ error_report("end ram md5");
-+}
-+
- void qemu_savevm_state_begin(QEMUFile *f,
- const MigrationParams *params)
- {
-@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
-bool iterable_only)
- save_section_header(f, se, QEMU_VM_SECTION_END);
-
- ret = se->ops->save_live_complete_precopy(f, se->opaque);
-+
-+ fprintf(stderr, "after saving %s complete\n", se->idstr);
-+ check_host_md5();
-+
- trace_savevm_section_end(se->idstr, se->section_id, ret);
- save_section_footer(f, se);
- if (ret < 0) {
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
-MigrationIncomingState *mis)
- section_id, le->se->idstr);
- return ret;
- }
-+ if (section_type == QEMU_VM_SECTION_END) {
-+ error_report("after loading state section id %d(%s)",
-+ section_id, le->se->idstr);
-+ check_host_md5();
-+ }
- if (!check_section_footer(f, le)) {
- return -EINVAL;
- }
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
- }
-
- cpu_synchronize_all_post_init();
-+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
-+ check_host_md5();
-
- return ret;
- }
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-
-.
---
-Best regards.
-Li Zhijian (8555)
-
-* Li Zhijian (address@hidden) wrote:
->
->
->
-On 12/03/2015 05:24 PM, Dr. David Alan Gilbert wrote:
->
->* Li Zhijian (address@hidden) wrote:
->
->>Hi all,
->
->>
->
->>Does anyboday remember the similar issue post by hailiang months ago
->
->>
-http://patchwork.ozlabs.org/patch/454322/
->
->>At least tow bugs about migration had been fixed since that.
->
->
->
->Yes, I wondered what happened to that.
->
->
->
->>And now we found the same issue at the tcg vm(kvm is fine), after migration,
->
->>the content VM's memory is inconsistent.
->
->
->
->Hmm, TCG only - I don't know much about that; but I guess something must
->
->be accessing memory without using the proper macros/functions so
->
->it doesn't mark it as dirty.
->
->
->
->>we add a patch to check memory content, you can find it from affix
->
->>
->
->>steps to reporduce:
->
->>1) apply the patch and re-build qemu
->
->>2) prepare the ubuntu guest and run memtest in grub.
->
->>soruce side:
->
->>x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
->>e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
->>if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
->>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
->>-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
->>tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
->>pc-i440fx-2.3,accel=tcg,usb=off
->
->>
->
->>destination side:
->
->>x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
->>e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
->>if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
->>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
->>-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
->>tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
->>pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
->
->>
->
->>3) start migration
->
->>with 1000M NIC, migration will finish within 3 min.
->
->>
->
->>at source:
->
->>(qemu) migrate tcp:192.168.2.66:8881
->
->>after saving ram complete
->
->>e9e725df678d392b1a83b3a917f332bb
->
->>qemu-system-x86_64: end ram md5
->
->>(qemu)
->
->>
->
->>at destination:
->
->>...skip...
->
->>Completed load of VM with exit code 0 seq iteration 1264
->
->>Completed load of VM with exit code 0 seq iteration 1265
->
->>Completed load of VM with exit code 0 seq iteration 1266
->
->>qemu-system-x86_64: after loading state section id 2(ram)
->
->>49c2dac7bde0e5e22db7280dcb3824f9
->
->>qemu-system-x86_64: end ram md5
->
->>qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
->
->>
->
->>49c2dac7bde0e5e22db7280dcb3824f9
->
->>qemu-system-x86_64: end ram md5
->
->>
->
->>This occurs occasionally and only at tcg machine. It seems that
->
->>some pages dirtied in source side don't transferred to destination.
->
->>This problem can be reproduced even if we disable virtio.
->
->>
->
->>Is it OK for some pages that not transferred to destination when do
->
->>migration ? Or is it a bug?
->
->
->
->I'm pretty sure that means it's a bug. Hard to find though, I guess
->
->at least memtest is smaller than a big OS. I think I'd dump the whole
->
->of memory on both sides, hexdump and diff them - I'd guess it would
->
->just be one byte/word different, maybe that would offer some idea what
->
->wrote it.
->
-> I tried to dump and compare them; more than 10 pages are different.
-> On the source side they are random values, rather than always 'FF' 'FB' 'EF'
-> 'BF'... as on the destination.
-> And not all of the differing pages are contiguous.
-
-I wonder if it happens on all of memtest's different test patterns;
-perhaps it might be possible to narrow it down if you tell memtest
-to only run one test at a time.
-
-Dave
-
->
->
-thanks
->
-Li
->
->
->
->
->
->Dave
->
->
->
->>Any idea...
->
->>
->
->>=================md5 check patch=============================
->
->>
->
->>diff --git a/Makefile.target b/Makefile.target
->
->>index 962d004..e2cb8e9 100644
->
->>--- a/Makefile.target
->
->>+++ b/Makefile.target
->
->>@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
->
->> obj-y += memory_mapping.o
->
->> obj-y += dump.o
->
->> obj-y += migration/ram.o migration/savevm.o
->
->>-LIBS := $(libs_softmmu) $(LIBS)
->
->>+LIBS := $(libs_softmmu) $(LIBS) -lplumb
->
->>
->
->> # xen support
->
->> obj-$(CONFIG_XEN) += xen-common.o
->
->>diff --git a/migration/ram.c b/migration/ram.c
->
->>index 1eb155a..3b7a09d 100644
->
->>--- a/migration/ram.c
->
->>+++ b/migration/ram.c
->
->>@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int
->
->>version_id)
->
->> }
->
->>
->
->> rcu_read_unlock();
->
->>- DPRINTF("Completed load of VM with exit code %d seq iteration "
->
->>+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
->
->> "%" PRIu64 "\n", ret, seq_iter);
->
->> return ret;
->
->> }
->
->>diff --git a/migration/savevm.c b/migration/savevm.c
->
->>index 0ad1b93..3feaa61 100644
->
->>--- a/migration/savevm.c
->
->>+++ b/migration/savevm.c
->
->>@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
->
->>
->
->> }
->
->>
->
->>+#include "exec/ram_addr.h"
->
->>+#include "qemu/rcu_queue.h"
->
->>+#include <clplumbing/md5.h>
->
->>+#ifndef MD5_DIGEST_LENGTH
->
->>+#define MD5_DIGEST_LENGTH 16
->
->>+#endif
->
->>+
->
->>+static void check_host_md5(void)
->
->>+{
->
->>+ int i;
->
->>+ unsigned char md[MD5_DIGEST_LENGTH];
->
->>+ rcu_read_lock();
->
->>+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
->
->>'pc.ram' block */
->
->>+ rcu_read_unlock();
->
->>+
->
->>+ MD5(block->host, block->used_length, md);
->
->>+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
->
->>+ fprintf(stderr, "%02x", md[i]);
->
->>+ }
->
->>+ fprintf(stderr, "\n");
->
->>+ error_report("end ram md5");
->
->>+}
->
->>+
->
->> void qemu_savevm_state_begin(QEMUFile *f,
->
->> const MigrationParams *params)
->
->> {
->
->>@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f,
->
->>bool iterable_only)
->
->> save_section_header(f, se, QEMU_VM_SECTION_END);
->
->>
->
->> ret = se->ops->save_live_complete_precopy(f, se->opaque);
->
->>+
->
->>+ fprintf(stderr, "after saving %s complete\n", se->idstr);
->
->>+ check_host_md5();
->
->>+
->
->> trace_savevm_section_end(se->idstr, se->section_id, ret);
->
->> save_section_footer(f, se);
->
->> if (ret < 0) {
->
->>@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
->
->>MigrationIncomingState *mis)
->
->> section_id, le->se->idstr);
->
->> return ret;
->
->> }
->
->>+ if (section_type == QEMU_VM_SECTION_END) {
->
->>+ error_report("after loading state section id %d(%s)",
->
->>+ section_id, le->se->idstr);
->
->>+ check_host_md5();
->
->>+ }
->
->> if (!check_section_footer(f, le)) {
->
->> return -EINVAL;
->
->> }
->
->>@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
->
->> }
->
->>
->
->> cpu_synchronize_all_post_init();
->
->>+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
->
->>+ check_host_md5();
->
->>
->
->> return ret;
->
->> }
->
->>
->
->>
->
->>
->
->--
->
->Dr. David Alan Gilbert / address@hidden / Manchester, UK
->
->
->
->
->
->.
->
->
->
->
---
->
-Best regards.
->
-Li Zhijian (8555)
->
->
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-Li Zhijian <address@hidden> wrote:
->
-Hi all,
->
->
-Does anyboday remember the similar issue post by hailiang months ago
->
-http://patchwork.ozlabs.org/patch/454322/
->
-At least tow bugs about migration had been fixed since that.
->
->
-And now we found the same issue at the tcg vm(kvm is fine), after
->
-migration, the content VM's memory is inconsistent.
->
->
-we add a patch to check memory content, you can find it from affix
->
->
-steps to reporduce:
->
-1) apply the patch and re-build qemu
->
-2) prepare the ubuntu guest and run memtest in grub.
->
-soruce side:
->
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
-pc-i440fx-2.3,accel=tcg,usb=off
->
->
-destination side:
->
-x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device
->
-e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive
->
-if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device
->
-virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
->
--vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp
->
-tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine
->
-pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881
->
->
-3) start migration
->
-with 1000M NIC, migration will finish within 3 min.
->
->
-at source:
->
-(qemu) migrate tcp:192.168.2.66:8881
->
-after saving ram complete
->
-e9e725df678d392b1a83b3a917f332bb
->
-qemu-system-x86_64: end ram md5
->
-(qemu)
->
->
-at destination:
->
-...skip...
->
-Completed load of VM with exit code 0 seq iteration 1264
->
-Completed load of VM with exit code 0 seq iteration 1265
->
-Completed load of VM with exit code 0 seq iteration 1266
->
-qemu-system-x86_64: after loading state section id 2(ram)
->
-49c2dac7bde0e5e22db7280dcb3824f9
->
-qemu-system-x86_64: end ram md5
->
-qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init
->
->
-49c2dac7bde0e5e22db7280dcb3824f9
->
-qemu-system-x86_64: end ram md5
->
->
-This occurs occasionally and only at tcg machine. It seems that
->
-some pages dirtied in source side don't transferred to destination.
->
-This problem can be reproduced even if we disable virtio.
->
->
-Is it OK for some pages that not transferred to destination when do
->
-migration ? Or is it a bug?
->
->
-Any idea...
-Thanks for describing how to reproduce the bug.
-If some pages are not transferred to the destination then it is a bug, so we
-need to know what the problem is. Notice that the problem can be that
-TCG is not marking some page dirty, that the migration code "forgets" about
-that page, or anything else altogether; that is what we need to find.
-
-There are more possibilities: I am not sure whether memtest runs in 32-bit
-mode, and it is within possibility that we are missing some state when we
-are in real mode.
-
-Will try to take a look at this.
-
-Thanks again.
-
-
->
->
-=================md5 check patch=============================
->
->
-diff --git a/Makefile.target b/Makefile.target
->
-index 962d004..e2cb8e9 100644
->
---- a/Makefile.target
->
-+++ b/Makefile.target
->
-@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o
->
-obj-y += memory_mapping.o
->
-obj-y += dump.o
->
-obj-y += migration/ram.o migration/savevm.o
->
--LIBS := $(libs_softmmu) $(LIBS)
->
-+LIBS := $(libs_softmmu) $(LIBS) -lplumb
->
->
-# xen support
->
-obj-$(CONFIG_XEN) += xen-common.o
->
-diff --git a/migration/ram.c b/migration/ram.c
->
-index 1eb155a..3b7a09d 100644
->
---- a/migration/ram.c
->
-+++ b/migration/ram.c
->
-@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque,
->
-int version_id)
->
-}
->
->
-rcu_read_unlock();
->
-- DPRINTF("Completed load of VM with exit code %d seq iteration "
->
-+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration "
->
-"%" PRIu64 "\n", ret, seq_iter);
->
-return ret;
->
-}
->
-diff --git a/migration/savevm.c b/migration/savevm.c
->
-index 0ad1b93..3feaa61 100644
->
---- a/migration/savevm.c
->
-+++ b/migration/savevm.c
->
-@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f)
->
->
-}
->
->
-+#include "exec/ram_addr.h"
->
-+#include "qemu/rcu_queue.h"
->
-+#include <clplumbing/md5.h>
->
-+#ifndef MD5_DIGEST_LENGTH
->
-+#define MD5_DIGEST_LENGTH 16
->
-+#endif
->
-+
->
-+static void check_host_md5(void)
->
-+{
->
-+ int i;
->
-+ unsigned char md[MD5_DIGEST_LENGTH];
->
-+ rcu_read_lock();
->
-+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check
->
-'pc.ram' block */
->
-+ rcu_read_unlock();
->
-+
->
-+ MD5(block->host, block->used_length, md);
->
-+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) {
->
-+ fprintf(stderr, "%02x", md[i]);
->
-+ }
->
-+ fprintf(stderr, "\n");
->
-+ error_report("end ram md5");
->
-+}
->
-+
->
-void qemu_savevm_state_begin(QEMUFile *f,
->
-const MigrationParams *params)
->
-{
->
-@@ -1056,6 +1079,10 @@ void
->
-qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only)
->
-save_section_header(f, se, QEMU_VM_SECTION_END);
->
->
-ret = se->ops->save_live_complete_precopy(f, se->opaque);
->
-+
->
-+ fprintf(stderr, "after saving %s complete\n", se->idstr);
->
-+ check_host_md5();
->
-+
->
-trace_savevm_section_end(se->idstr, se->section_id, ret);
->
-save_section_footer(f, se);
->
-if (ret < 0) {
->
-@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f,
->
-MigrationIncomingState *mis)
->
-section_id, le->se->idstr);
->
-return ret;
->
-}
->
-+ if (section_type == QEMU_VM_SECTION_END) {
->
-+ error_report("after loading state section id %d(%s)",
->
-+ section_id, le->se->idstr);
->
-+ check_host_md5();
->
-+ }
->
-if (!check_section_footer(f, le)) {
->
-return -EINVAL;
->
-}
->
-@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f)
->
-}
->
->
-cpu_synchronize_all_post_init();
->
-+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__);
->
-+ check_host_md5();
->
->
-return ret;
->
-}
-
->
->
-Thanks for describing how to reproduce the bug.
->
-If some pages are not transferred to destination then it is a bug, so we need
->
-to know what the problem is, notice that the problem can be that TCG is not
->
-marking dirty some page, that Migration code "forgets" about that page, or
->
-anything eles altogether, that is what we need to find.
->
->
-There are more posibilities, I am not sure that memtest is on 32bit mode, and
->
-it is inside posibility that we are missing some state when we are on real
->
-mode.
->
->
-Will try to take a look at this.
->
->
-THanks, again.
->
-Hi Juan & Amit
-
- Do you think we should add a mechanism to check data integrity during live
-migration (LM), like Zhijian's patch does? It may be very helpful for developers.
- Actually, I did a similar thing before in order to make sure that I did the
-right thing when I changed code related to LM.
-
-Liang
-
-On (Fri) 04 Dec 2015 [01:43:07], Li, Liang Z wrote:
->
->
->
-> Thanks for describing how to reproduce the bug.
->
-> If some pages are not transferred to destination then it is a bug, so we
->
-> need
->
-> to know what the problem is, notice that the problem can be that TCG is not
->
-> marking dirty some page, that Migration code "forgets" about that page, or
->
-> anything eles altogether, that is what we need to find.
->
->
->
-> There are more posibilities, I am not sure that memtest is on 32bit mode,
->
-> and
->
-> it is inside posibility that we are missing some state when we are on real
->
-> mode.
->
->
->
-> Will try to take a look at this.
->
->
->
-> THanks, again.
->
->
->
->
-Hi Juan & Amit
->
->
-Do you think we should add a mechanism to check the data integrity during LM
->
-like Zhijian's patch did? it may be very helpful for developers.
->
-Actually, I did the similar thing before in order to make sure that I did
->
-the right thing we I change the code related to LM.
-If you mean for debugging, something that's not always on, then I'm
-fine with it.
-
-A script that goes along with it and shows the result of the comparison
-would be helpful too - something that shows how many pages are
-different, how many bytes per page on average, and so on.
-
- Amit
-
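-For illustration, a rough sketch of such a comparison tool - assuming two raw
-RAM dumps of equal size, 4 KiB pages, and made-up file names - could look like
-this:
-
-/* pagediff.c: report how many 4 KiB pages differ between two RAM dumps,
- * and how many bytes differ per differing page on average. */
-#include <stdio.h>
-#include <stdlib.h>
-
-#define PAGE_SIZE 4096
-
-int main(int argc, char **argv)
-{
-    if (argc != 3) {
-        fprintf(stderr, "usage: %s src.dump dst.dump\n", argv[0]);
-        return EXIT_FAILURE;
-    }
-
-    FILE *a = fopen(argv[1], "rb");
-    FILE *b = fopen(argv[2], "rb");
-    if (!a || !b) {
-        perror("fopen");
-        return EXIT_FAILURE;
-    }
-
-    unsigned char pa[PAGE_SIZE], pb[PAGE_SIZE];
-    unsigned long page = 0, diff_pages = 0, diff_bytes = 0;
-    size_t ra, rb;
-
-    /* Walk both dumps page by page and count differing bytes per page. */
-    while ((ra = fread(pa, 1, PAGE_SIZE, a)) > 0 &&
-           (rb = fread(pb, 1, PAGE_SIZE, b)) == ra) {
-        unsigned long bytes = 0;
-        for (size_t i = 0; i < ra; i++) {
-            bytes += (pa[i] != pb[i]);
-        }
-        if (bytes) {
-            printf("page %lu differs (%lu bytes)\n", page, bytes);
-            diff_pages++;
-            diff_bytes += bytes;
-        }
-        page++;
-    }
-
-    printf("%lu differing pages, %.1f differing bytes per page on average\n",
-           diff_pages, diff_pages ? (double)diff_bytes / diff_pages : 0.0);
-    fclose(a);
-    fclose(b);
-    return EXIT_SUCCESS;
-}
-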
diff --git a/classification_output/05/mistranslation/74545755 b/classification_output/05/mistranslation/74545755
deleted file mode 100644
index 7f5ace50..00000000
--- a/classification_output/05/mistranslation/74545755
+++ /dev/null
@@ -1,352 +0,0 @@
-mistranslation: 0.752
-device: 0.720
-instruction: 0.700
-other: 0.683
-semantic: 0.669
-KVM: 0.661
-graphic: 0.660
-vnc: 0.650
-assembly: 0.648
-boot: 0.607
-network: 0.550
-socket: 0.549
-
-[Bug Report][RFC PATCH 0/1] block: fix failing assert on paused VM migration
-
-There's a bug (failing assert) which is reproduced during migration of
-a paused VM. I am able to reproduce it on a stand with 2 nodes and a common
-NFS share, with VM's disk on that share.
-
-root@fedora40-1-vm:~# virsh domblklist alma8-vm
- Target Source
-------------------------------------------
- sda /mnt/shared/images/alma8.qcow2
-
-root@fedora40-1-vm:~# df -Th /mnt/shared
-Filesystem Type Size Used Avail Use% Mounted on
-127.0.0.1:/srv/nfsd nfs4 63G 16G 48G 25% /mnt/shared
-
-On the 1st node:
-
-root@fedora40-1-vm:~# virsh start alma8-vm ; virsh suspend alma8-vm
-root@fedora40-1-vm:~# virsh migrate --compressed --p2p --persistent
---undefinesource --live alma8-vm qemu+ssh://fedora40-2-vm/system
-
-Then on the 2nd node:
-
-root@fedora40-2-vm:~# virsh migrate --compressed --p2p --persistent
---undefinesource --live alma8-vm qemu+ssh://fedora40-1-vm/system
-error: operation failed: domain is not running
-
-root@fedora40-2-vm:~# tail -3 /var/log/libvirt/qemu/alma8-vm.log
-2024-09-19 13:53:33.336+0000: initiating migration
-qemu-system-x86_64: ../block.c:6976: int
-bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags &
-BDRV_O_INACTIVE)' failed.
-2024-09-19 13:53:42.991+0000: shutting down, reason=crashed
-
-Backtrace:
-
-(gdb) bt
-#0 0x00007f7eaa2f1664 in __pthread_kill_implementation () at /lib64/libc.so.6
-#1 0x00007f7eaa298c4e in raise () at /lib64/libc.so.6
-#2 0x00007f7eaa280902 in abort () at /lib64/libc.so.6
-#3 0x00007f7eaa28081e in __assert_fail_base.cold () at /lib64/libc.so.6
-#4 0x00007f7eaa290d87 in __assert_fail () at /lib64/libc.so.6
-#5 0x0000563c38b95eb8 in bdrv_inactivate_recurse (bs=0x563c3b6c60c0) at
-../block.c:6976
-#6 0x0000563c38b95aeb in bdrv_inactivate_all () at ../block.c:7038
-#7 0x0000563c3884d354 in qemu_savevm_state_complete_precopy_non_iterable
-(f=0x563c3b700c20, in_postcopy=false, inactivate_disks=true)
- at ../migration/savevm.c:1571
-#8 0x0000563c3884dc1a in qemu_savevm_state_complete_precopy (f=0x563c3b700c20,
-iterable_only=false, inactivate_disks=true) at ../migration/savevm.c:1631
-#9 0x0000563c3883a340 in migration_completion_precopy (s=0x563c3b4d51f0,
-current_active_state=<optimized out>) at ../migration/migration.c:2780
-#10 migration_completion (s=0x563c3b4d51f0) at ../migration/migration.c:2844
-#11 migration_iteration_run (s=0x563c3b4d51f0) at ../migration/migration.c:3270
-#12 migration_thread (opaque=0x563c3b4d51f0) at ../migration/migration.c:3536
-#13 0x0000563c38dbcf14 in qemu_thread_start (args=0x563c3c2d5bf0) at
-../util/qemu-thread-posix.c:541
-#14 0x00007f7eaa2ef6d7 in start_thread () at /lib64/libc.so.6
-#15 0x00007f7eaa373414 in clone () at /lib64/libc.so.6
-
-What happens here is that after the 1st migration the BDS related to the HDD
-remains inactive, as the VM is still paused. Then, when we initiate the 2nd
-migration, bdrv_inactivate_all() attempts to set the BDRV_O_INACTIVE flag on
-that node, but the flag is already set, so the assert fails.
-
-The attached patch, which simply skips setting the flag if it's already set,
-is more of a kludge than a clean solution. Should we use more sophisticated
-logic which allows some of the nodes to be in an inactive state prior to the
-migration, and takes them into account during bdrv_inactivate_all()? Comments
-would be appreciated.
-
-Andrey
-
-Andrey Drobyshev (1):
- block: do not fail when inactivating node which is inactive
-
- block.c | 10 +++++++++-
- 1 file changed, 9 insertions(+), 1 deletion(-)
-
---
-2.39.3
-
-Instead of triggering the assert, let's just ignore that the flag is already
-set and return. We assume that this is safe to ignore. Otherwise the assert
-fails when migrating a paused VM back and forth.
-
-Ideally we'd like to have a more sophisticated solution, e.g. not even
-scanning the nodes which should be inactive at this point.
-
-Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
----
- block.c | 10 +++++++++-
- 1 file changed, 9 insertions(+), 1 deletion(-)
-
-diff --git a/block.c b/block.c
-index 7d90007cae..c1dcf906d1 100644
---- a/block.c
-+++ b/block.c
-@@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK
-bdrv_inactivate_recurse(BlockDriverState *bs)
- return 0;
- }
-
-- assert(!(bs->open_flags & BDRV_O_INACTIVE));
-+ if (bs->open_flags & BDRV_O_INACTIVE) {
-+ /*
-+ * Return here instead of throwing assert as a workaround to
-+ * prevent failure on migrating paused VM.
-+ * Here we assume that if we're trying to inactivate BDS that's
-+ * already inactive, it's safe to just ignore it.
-+ */
-+ return 0;
-+ }
-
- /* Inactivate this node */
- if (bs->drv->bdrv_inactivate) {
---
-2.39.3
-
-[add migration maintainers]
-
-On 24.09.24 15:56, Andrey Drobyshev wrote:
-Instead of triggering the assert, let's just ignore that the flag is already
-set and return. We assume that this is safe to ignore. Otherwise the assert
-fails when migrating a paused VM back and forth.
-
-Ideally we'd like to have a more sophisticated solution, e.g. not even
-scanning the nodes which should be inactive at this point.
-
-Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
----
- block.c | 10 +++++++++-
- 1 file changed, 9 insertions(+), 1 deletion(-)
-
-diff --git a/block.c b/block.c
-index 7d90007cae..c1dcf906d1 100644
---- a/block.c
-+++ b/block.c
-@@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK
-bdrv_inactivate_recurse(BlockDriverState *bs)
- return 0;
- }
-- assert(!(bs->open_flags & BDRV_O_INACTIVE));
-+ if (bs->open_flags & BDRV_O_INACTIVE) {
-+ /*
-+ * Return here instead of throwing assert as a workaround to
-+ * prevent failure on migrating paused VM.
-+ * Here we assume that if we're trying to inactivate BDS that's
-+ * already inactive, it's safe to just ignore it.
-+ */
-+ return 0;
-+ }
-/* Inactivate this node */
-if (bs->drv->bdrv_inactivate) {
-I doubt that this is the correct way to go.
-
-As far as I understand, "inactive" actually means that "the storage does not
-belong to qemu, but to someone else (another qemu process, for example), and
-may be changed transparently". In turn this means that QEMU should do nothing
-with inactive disks. So the problem is that nobody called bdrv_activate_all()
-on the target, and we shouldn't ignore that.
-
-Hmm, I see that in process_incoming_migration_bh() we do call
-bdrv_activate_all(), but only in some scenarios. Maybe the condition should be
-less strict here.
-
-Why do we need any condition here at all? Don't we want to activate the block
-layer on the target after migration anyway?
-
---
-Best regards,
-Vladimir
-
-On 9/30/24 12:25 PM, Vladimir Sementsov-Ogievskiy wrote:
->
-> [add migration maintainers]
->
-> On 24.09.24 15:56, Andrey Drobyshev wrote:
-> > [...]
->
-> I doubt that this is the correct way to go.
->
-> As far as I understand, "inactive" actually means that "the storage does not
-> belong to qemu, but to someone else (another qemu process, for example), and
-> may be changed transparently". In turn this means that QEMU should do nothing
-> with inactive disks. So the problem is that nobody called bdrv_activate_all()
-> on the target, and we shouldn't ignore that.
->
-> Hmm, I see that in process_incoming_migration_bh() we do call
-> bdrv_activate_all(), but only in some scenarios. Maybe the condition should
-> be less strict here.
->
-> Why do we need any condition here at all? Don't we want to activate the block
-> layer on the target after migration anyway?
->
-Hmm, I'm not sure about the unconditional activation, since we at least
-have to honor the LATE_BLOCK_ACTIVATE cap if it's set (and probably delay the
-activation in such a case). In current libvirt upstream I see the following code:
-
-> /* Migration capabilities which should always be enabled as long as they
->  * are supported by QEMU. If the capability is supposed to be enabled on both
->  * sides of migration, it won't be enabled unless both sides support it.
->  */
-> static const qemuMigrationParamsAlwaysOnItem qemuMigrationParamsAlwaysOn[] = {
->     {QEMU_MIGRATION_CAP_PAUSE_BEFORE_SWITCHOVER,
->      QEMU_MIGRATION_SOURCE},
->
->     {QEMU_MIGRATION_CAP_LATE_BLOCK_ACTIVATE,
->      QEMU_MIGRATION_DESTINATION},
-> };
-which means that libvirt always wants LATE_BLOCK_ACTIVATE to be set.
-
-The code from process_incoming_migration_bh() you're referring to:
-
->     /* If capability late_block_activate is set:
->      *    Only fire up the block code now if we're going to restart the
->      *    VM, else 'cont' will do it.
->      *    This causes file locking to happen; so we don't want it to happen
->      *    unless we really are starting the VM.
->      */
->     if (!migrate_late_block_activate() ||
->         (autostart && (!global_state_received() ||
->                        runstate_is_live(global_state_get_runstate())))) {
->         /* Make sure all file formats throw away their mutable metadata.
->          * If we get an error here, just don't restart the VM yet. */
->         bdrv_activate_all(&local_err);
->         if (local_err) {
->             error_report_err(local_err);
->             local_err = NULL;
->             autostart = false;
->         }
->     }
-It states explicitly that we're either going to start the VM right at this
-point if (autostart == true), or we wait until the "cont" command happens.
-None of this is going to happen if we start another migration while
-still being in the PAUSED state. So I think it seems reasonable to take
-such a case into account. For instance, this patch does prevent the crash:
-
-> diff --git a/migration/migration.c b/migration/migration.c
-> index ae2be31557..3222f6745b 100644
-> --- a/migration/migration.c
-> +++ b/migration/migration.c
-> @@ -733,7 +733,8 @@ static void process_incoming_migration_bh(void *opaque)
->       */
->      if (!migrate_late_block_activate() ||
->          (autostart && (!global_state_received() ||
-> -                        runstate_is_live(global_state_get_runstate())))) {
-> +                        runstate_is_live(global_state_get_runstate()))) ||
-> +         (!autostart && global_state_get_runstate() == RUN_STATE_PAUSED)) {
->          /* Make sure all file formats throw away their mutable metadata.
->           * If we get an error here, just don't restart the VM yet. */
->          bdrv_activate_all(&local_err);
-What are your thoughts on it?
-
-Andrey
-
diff --git a/classification_output/05/mistranslation/80604314 b/classification_output/05/mistranslation/80604314
deleted file mode 100644
index cb64e7d6..00000000
--- a/classification_output/05/mistranslation/80604314
+++ /dev/null
@@ -1,1488 +0,0 @@
-mistranslation: 0.922
-device: 0.917
-graphic: 0.901
-other: 0.898
-KVM: 0.891
-semantic: 0.890
-assembly: 0.886
-socket: 0.884
-vnc: 0.881
-instruction: 0.877
-network: 0.865
-boot: 0.860
-
-[BUG] vhost-vdpa: qemu-system-s390x crashes with second virtio-net-ccw device
-
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
- config=0x55d6ad9e3f80 "RT") at /home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-
-Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
-the autogenerated virtio-net-ccw device is present) works. Specifying
-several "-device virtio-net-pci" works as well.
-
-Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
-client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
-works (in-between state does not compile).
-
-This is reproducible with tcg as well. Same problem both with
---enable-vhost-vdpa and --disable-vhost-vdpa.
-
-Have not yet tried to figure out what might be special with
-virtio-ccw... anyone have an idea?
-
-[This should probably be considered a blocker?]
-
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-When I start qemu with a second virtio-net-ccw device (i.e. adding
->
--device virtio-net-ccw in addition to the autogenerated device), I get
->
-a segfault. gdb points to
->
->
-#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
-config=0x55d6ad9e3f80 "RT") at
->
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
->
-(backtrace doesn't go further)
->
->
-Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
->
-the autogenerated virtio-net-ccw device is present) works. Specifying
->
-several "-device virtio-net-pci" works as well.
->
->
-Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
->
-client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
->
-works (in-between state does not compile).
-Ouch. I didn't test all in-between states :(
-But I wish we had a 0-day infrastructure like the kernel has,
-that catches things like that.
-
->
-This is reproducible with tcg as well. Same problem both with
->
---enable-vhost-vdpa and --disable-vhost-vdpa.
->
->
-Have not yet tried to figure out what might be special with
->
-virtio-ccw... anyone have an idea?
->
->
-[This should probably be considered a blocker?]
-
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin" <mst@redhat.com> wrote:
-
->
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-> When I start qemu with a second virtio-net-ccw device (i.e. adding
->
-> -device virtio-net-ccw in addition to the autogenerated device), I get
->
-> a segfault. gdb points to
->
->
->
-> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
-> config=0x55d6ad9e3f80 "RT") at
->
-> /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
->
->
-> (backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-
->
->
->
-> Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
->
-> the autogenerated virtio-net-ccw device is present) works. Specifying
->
-> several "-device virtio-net-pci" works as well.
->
->
->
-> Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
->
-> client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
->
-> works (in-between state does not compile).
->
->
-Ouch. I didn't test all in-between states :(
->
-But I wish we had a 0-day instrastructure like kernel has,
->
-that catches things like that.
-Yep, that would be useful... so patchew only builds the complete series?
-
->
->
-> This is reproducible with tcg as well. Same problem both with
->
-> --enable-vhost-vdpa and --disable-vhost-vdpa.
->
->
->
-> Have not yet tried to figure out what might be special with
->
-> virtio-ccw... anyone have an idea?
->
->
->
-> [This should probably be considered a blocker?]
-I think so, as it makes s390x unusable with more than one
-virtio-net-ccw device, and I don't even see a workaround.
-
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
-On Fri, 24 Jul 2020 09:30:58 -0400
->
-"Michael S. Tsirkin" <mst@redhat.com> wrote:
->
->
-> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-> > When I start qemu with a second virtio-net-ccw device (i.e. adding
->
-> > -device virtio-net-ccw in addition to the autogenerated device), I get
->
-> > a segfault. gdb points to
->
-> >
->
-> > #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
-> > config=0x55d6ad9e3f80 "RT") at
->
-> > /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> > 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
-> >
->
-> > (backtrace doesn't go further)
->
->
-The core was incomplete, but running under gdb directly shows that it
->
-is just a bog-standard config space access (first for that device).
->
->
-The cause of the crash is that nc->peer is not set... no idea how that
->
-can happen, not that familiar with that part of QEMU. (Should the code
->
-check, or is that really something that should not happen?)
->
->
-What I don't understand is why it is set correctly for the first,
->
-autogenerated virtio-net-ccw device, but not for the second one, and
->
-why virtio-net-pci doesn't show these problems. The only difference
->
-between -ccw and -pci that comes to my mind here is that config space
->
-accesses for ccw are done via an asynchronous operation, so timing
->
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-
->
-> >
->
-> > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
->
-> > the autogenerated virtio-net-ccw device is present) works. Specifying
->
-> > several "-device virtio-net-pci" works as well.
->
-> >
->
-> > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
->
-> > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
->
-> > works (in-between state does not compile).
->
->
->
-> Ouch. I didn't test all in-between states :(
->
-> But I wish we had a 0-day instrastructure like kernel has,
->
-> that catches things like that.
->
->
-Yep, that would be useful... so patchew only builds the complete series?
->
->
->
->
-> > This is reproducible with tcg as well. Same problem both with
->
-> > --enable-vhost-vdpa and --disable-vhost-vdpa.
->
-> >
->
-> > Have not yet tried to figure out what might be special with
->
-> > virtio-ccw... anyone have an idea?
->
-> >
->
-> > [This should probably be considered a blocker?]
->
->
-I think so, as it makes s390x unusable with more that one
->
-virtio-net-ccw device, and I don't even see a workaround.
-
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin" <mst@redhat.com> wrote:
-
->
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
-> On Fri, 24 Jul 2020 09:30:58 -0400
->
-> "Michael S. Tsirkin" <mst@redhat.com> wrote:
->
->
->
-> > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-> > > When I start qemu with a second virtio-net-ccw device (i.e. adding
->
-> > > -device virtio-net-ccw in addition to the autogenerated device), I get
->
-> > > a segfault. gdb points to
->
-> > >
->
-> > > #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
-> > > config=0x55d6ad9e3f80 "RT") at
->
-> > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> > > 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
-> > >
->
-> > > (backtrace doesn't go further)
->
->
->
-> The core was incomplete, but running under gdb directly shows that it
->
-> is just a bog-standard config space access (first for that device).
->
->
->
-> The cause of the crash is that nc->peer is not set... no idea how that
->
-> can happen, not that familiar with that part of QEMU. (Should the code
->
-> check, or is that really something that should not happen?)
->
->
->
-> What I don't understand is why it is set correctly for the first,
->
-> autogenerated virtio-net-ccw device, but not for the second one, and
->
-> why virtio-net-pci doesn't show these problems. The only difference
->
-> between -ccw and -pci that comes to my mind here is that config space
->
-> accesses for ccw are done via an asynchronous operation, so timing
->
-> might be different.
->
->
-Hopefully Jason has an idea. Could you post a full command line
->
-please? Do you need a working guest to trigger this? Does this trigger
->
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
-
--device virtio-net-ccw
-
-It seems the guest actually needs to be doing something with the NICs; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-
->
->
-> > >
->
-> > > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only
->
-> > > the autogenerated virtio-net-ccw device is present) works. Specifying
->
-> > > several "-device virtio-net-pci" works as well.
->
-> > >
->
-> > > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net
->
-> > > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config")
->
-> > > works (in-between state does not compile).
->
-> >
->
-> > Ouch. I didn't test all in-between states :(
->
-> > But I wish we had a 0-day instrastructure like kernel has,
->
-> > that catches things like that.
->
->
->
-> Yep, that would be useful... so patchew only builds the complete series?
->
->
->
-> >
->
-> > > This is reproducible with tcg as well. Same problem both with
->
-> > > --enable-vhost-vdpa and --disable-vhost-vdpa.
->
-> > >
->
-> > > Have not yet tried to figure out what might be special with
->
-> > > virtio-ccw... anyone have an idea?
->
-> > >
->
-> > > [This should probably be considered a blocker?]
->
->
->
-> I think so, as it makes s390x unusable with more that one
->
-> virtio-net-ccw device, and I don't even see a workaround.
->
-
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
- config=0x55d6ad9e3f80 "RT") at
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me like we forgot to check the existence of the peer.
-
-Please try the attached patch to see if it works.
-
-Thanks
-0001-virtio-net-check-the-existence-of-peer-before-accesi.patch
-Description:
-Text Data
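-
-The attached patch is not inlined in this archive; as a minimal sketch, the
-kind of guard being discussed could be expressed as a small helper (the helper
-name is illustrative, not from the patch), assuming QEMU's NetClientState from
-"net/net.h":
-
-#include "qemu/osdep.h"
-#include "net/net.h"    /* NetClientState, NET_CLIENT_DRIVER_VHOST_VDPA */
-
-static bool net_peer_is_vhost_vdpa(const NetClientState *nc)
-{
-    /* With "-device virtio-net-*" and no netdev, nc->peer is NULL, so treat
-     * that as "not a vhost-vdpa backend" instead of dereferencing a NULL
-     * pointer the way virtio_net_get_config() currently does. */
-    return nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA;
-}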
-
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-
->
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
->
-> On Fri, 24 Jul 2020 11:17:57 -0400
->
-> "Michael S. Tsirkin"<mst@redhat.com> wrote:
->
->
->
->> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
->>> On Fri, 24 Jul 2020 09:30:58 -0400
->
->>> "Michael S. Tsirkin"<mst@redhat.com> wrote:
->
->>>
->
->>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
->>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding
->
->>>>> -device virtio-net-ccw in addition to the autogenerated device), I get
->
->>>>> a segfault. gdb points to
->
->>>>>
->
->>>>> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
->>>>> config=0x55d6ad9e3f80 "RT") at
->
->>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
->>>>> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
->>>>>
->
->>>>> (backtrace doesn't go further)
->
->>> The core was incomplete, but running under gdb directly shows that it
->
->>> is just a bog-standard config space access (first for that device).
->
->>>
->
->>> The cause of the crash is that nc->peer is not set... no idea how that
->
->>> can happen, not that familiar with that part of QEMU. (Should the code
->
->>> check, or is that really something that should not happen?)
->
->>>
->
->>> What I don't understand is why it is set correctly for the first,
->
->>> autogenerated virtio-net-ccw device, but not for the second one, and
->
->>> why virtio-net-pci doesn't show these problems. The only difference
->
->>> between -ccw and -pci that comes to my mind here is that config space
->
->>> accesses for ccw are done via an asynchronous operation, so timing
->
->>> might be different.
->
->> Hopefully Jason has an idea. Could you post a full command line
->
->> please? Do you need a working guest to trigger this? Does this trigger
->
->> on an x86 host?
->
-> Yes, it does trigger with tcg-on-x86 as well. I've been using
->
->
->
-> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu
->
-> qemu,zpci=on
->
-> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
->
-> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
->
-> -device
->
-> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
->
-> -device virtio-net-ccw
->
->
->
-> It seems it needs the guest actually doing something with the nics; I
->
-> cannot reproduce the crash if I use the old advent calendar moon buggy
->
-> image and just add a virtio-net-ccw device.
->
->
->
-> (I don't think it's a problem with my local build, as I see the problem
->
-> both on my laptop and on an LPAR.)
->
->
->
-It looks to me we forget the check the existence of peer.
->
->
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck <cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
- config=0x55d6ad9e3f80 "RT") at
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me like we forgot to check the existence of the peer.
-
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck <cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-It can be hit with virtio-net-pci as well (just start without a peer).
-For the autogenerated virtio-net-ccw device, I think the reason is that it
-already has a peer set.
-Thanks
-
-On Mon, 27 Jul 2020 15:38:12 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-
->
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
->
-> On Sat, 25 Jul 2020 08:40:07 +0800
->
-> Jason Wang <jasowang@redhat.com> wrote:
->
->
->
->> On 2020/7/24 下午11:34, Cornelia Huck wrote:
->
->>> On Fri, 24 Jul 2020 11:17:57 -0400
->
->>> "Michael S. Tsirkin"<mst@redhat.com> wrote:
->
->>>
->
->>>> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
->>>>> On Fri, 24 Jul 2020 09:30:58 -0400
->
->>>>> "Michael S. Tsirkin"<mst@redhat.com> wrote:
->
->>>>>
->
->>>>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
->>>>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding
->
->>>>>>> -device virtio-net-ccw in addition to the autogenerated device), I get
->
->>>>>>> a segfault. gdb points to
->
->>>>>>>
->
->>>>>>> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
->
->>>>>>> config=0x55d6ad9e3f80 "RT") at
->
->>>>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
->>>>>>> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
->
->>>>>>>
->
->>>>>>> (backtrace doesn't go further)
->
->>>>> The core was incomplete, but running under gdb directly shows that it
->
->>>>> is just a bog-standard config space access (first for that device).
->
->>>>>
->
->>>>> The cause of the crash is that nc->peer is not set... no idea how that
->
->>>>> can happen, not that familiar with that part of QEMU. (Should the code
->
->>>>> check, or is that really something that should not happen?)
->
->>>>>
->
->>>>> What I don't understand is why it is set correctly for the first,
->
->>>>> autogenerated virtio-net-ccw device, but not for the second one, and
->
->>>>> why virtio-net-pci doesn't show these problems. The only difference
->
->>>>> between -ccw and -pci that comes to my mind here is that config space
->
->>>>> accesses for ccw are done via an asynchronous operation, so timing
->
->>>>> might be different.
->
->>>> Hopefully Jason has an idea. Could you post a full command line
->
->>>> please? Do you need a working guest to trigger this? Does this trigger
->
->>>> on an x86 host?
->
->>> Yes, it does trigger with tcg-on-x86 as well. I've been using
->
->>>
->
->>> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu
->
->>> qemu,zpci=on
->
->>> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
->
->>> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
->
->>> -device
->
->>> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
->
->>> -device virtio-net-ccw
->
->>>
->
->>> It seems it needs the guest actually doing something with the nics; I
->
->>> cannot reproduce the crash if I use the old advent calendar moon buggy
->
->>> image and just add a virtio-net-ccw device.
->
->>>
->
->>> (I don't think it's a problem with my local build, as I see the problem
->
->>> both on my laptop and on an LPAR.)
->
->>
->
->> It looks to me we forget the check the existence of peer.
->
->>
->
->> Please try the attached patch to see if it works.
->
-> Thanks, that patch gets my guest up and running again. So, FWIW,
->
->
->
-> Tested-by: Cornelia Huck <cohuck@redhat.com>
->
->
->
-> Any idea why this did not hit with virtio-net-pci (or the autogenerated
->
-> virtio-net-ccw device)?
->
->
->
-It can be hit with virtio-net-pci as well (just start without peer).
-Hm, I had not been able to reproduce the crash with a 'naked' -device
-virtio-net-pci. But checking seems to be the right idea anyway.
-
->
->
-For autogenerated virtio-net-cww, I think the reason is that it has
->
-already had a peer set.
-Ok, that might well be.
-
-On 2020/7/27 下午4:41, Cornelia Huck wrote:
-On Mon, 27 Jul 2020 15:38:12 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang <jasowang@redhat.com> wrote:
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
- config=0x55d6ad9e3f80 "RT") at
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me like we forgot to check the existence of the peer.
-
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck <cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-It can be hit with virtio-net-pci as well (just start without peer).
-Hm, I had not been able to reproduce the crash with a 'naked' -device
-virtio-net-pci. But checking seems to be the right idea anyway.
-Sorry for being unclear, I meant for the networking part: you just need to
-start without a peer, and you need a real guest (any Linux) that is trying
-to access the config space of virtio-net.
-Thanks
-For the autogenerated virtio-net-ccw device, I think the reason is that it
-already has a peer set.
-Ok, that might well be.
-
-On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote:
->
->
-On 2020/7/27 下午4:41, Cornelia Huck wrote:
->
-> On Mon, 27 Jul 2020 15:38:12 +0800
->
-> Jason Wang <jasowang@redhat.com> wrote:
->
->
->
-> > On 2020/7/27 下午2:43, Cornelia Huck wrote:
->
-> > > On Sat, 25 Jul 2020 08:40:07 +0800
->
-> > > Jason Wang <jasowang@redhat.com> wrote:
->
-> > > > On 2020/7/24 下午11:34, Cornelia Huck wrote:
->
-> > > > > On Fri, 24 Jul 2020 11:17:57 -0400
->
-> > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote:
->
-> > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
-> > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400
->
-> > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote:
->
-> > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
->
-> > > > > > > > > When I start qemu with a second virtio-net-ccw device (i.e.
->
-> > > > > > > > > adding
->
-> > > > > > > > > -device virtio-net-ccw in addition to the autogenerated
->
-> > > > > > > > > device), I get
->
-> > > > > > > > > a segfault. gdb points to
->
-> > > > > > > > >
->
-> > > > > > > > > #0 0x000055d6ab52681d in virtio_net_get_config
->
-> > > > > > > > > (vdev=<optimized out>,
->
-> > > > > > > > > config=0x55d6ad9e3f80 "RT") at
->
-> > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> > > > > > > > > 146 if (nc->peer->info->type ==
->
-> > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) {
->
-> > > > > > > > >
->
-> > > > > > > > > (backtrace doesn't go further)
->
-> > > > > > > The core was incomplete, but running under gdb directly shows
->
-> > > > > > > that it
->
-> > > > > > > is just a bog-standard config space access (first for that
->
-> > > > > > > device).
->
-> > > > > > >
->
-> > > > > > > The cause of the crash is that nc->peer is not set... no idea
->
-> > > > > > > how that
->
-> > > > > > > can happen, not that familiar with that part of QEMU. (Should
->
-> > > > > > > the code
->
-> > > > > > > check, or is that really something that should not happen?)
->
-> > > > > > >
->
-> > > > > > > What I don't understand is why it is set correctly for the
->
-> > > > > > > first,
->
-> > > > > > > autogenerated virtio-net-ccw device, but not for the second
->
-> > > > > > > one, and
->
-> > > > > > > why virtio-net-pci doesn't show these problems. The only
->
-> > > > > > > difference
->
-> > > > > > > between -ccw and -pci that comes to my mind here is that config
->
-> > > > > > > space
->
-> > > > > > > accesses for ccw are done via an asynchronous operation, so
->
-> > > > > > > timing
->
-> > > > > > > might be different.
->
-> > > > > > Hopefully Jason has an idea. Could you post a full command line
->
-> > > > > > please? Do you need a working guest to trigger this? Does this
->
-> > > > > > trigger
->
-> > > > > > on an x86 host?
->
-> > > > > Yes, it does trigger with tcg-on-x86 as well. I've been using
->
-> > > > >
->
-> > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu
->
-> > > > > qemu,zpci=on
->
-> > > > > -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
->
-> > > > > -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
->
-> > > > > -device
->
-> > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
->
-> > > > > -device virtio-net-ccw
->
-> > > > >
->
-> > > > > It seems it needs the guest actually doing something with the nics;
->
-> > > > > I
->
-> > > > > cannot reproduce the crash if I use the old advent calendar moon
->
-> > > > > buggy
->
-> > > > > image and just add a virtio-net-ccw device.
->
-> > > > >
->
-> > > > > (I don't think it's a problem with my local build, as I see the
->
-> > > > > problem
->
-> > > > > both on my laptop and on an LPAR.)
->
-> > > > It looks to me we forget the check the existence of peer.
->
-> > > >
->
-> > > > Please try the attached patch to see if it works.
->
-> > > Thanks, that patch gets my guest up and running again. So, FWIW,
->
-> > >
->
-> > > Tested-by: Cornelia Huck <cohuck@redhat.com>
->
-> > >
->
-> > > Any idea why this did not hit with virtio-net-pci (or the autogenerated
->
-> > > virtio-net-ccw device)?
->
-> >
->
-> > It can be hit with virtio-net-pci as well (just start without peer).
->
-> Hm, I had not been able to reproduce the crash with a 'naked' -device
->
-> virtio-net-pci. But checking seems to be the right idea anyway.
->
->
->
-Sorry for being unclear, I meant for networking part, you just need start
->
-without peer, and you need a real guest (any Linux) that is trying to access
->
-the config space of virtio-net.
->
->
-Thanks
-A pxe guest will do it, but that doesn't support ccw, right?
-
-I'm still unclear why this triggers with ccw but not pci -
-any idea?
-
->
->
->
->
-> > For autogenerated virtio-net-cww, I think the reason is that it has
->
-> > already had a peer set.
->
-> Ok, that might well be.
->
->
->
->
-
-On 2020/7/27 下午7:43, Michael S. Tsirkin wrote:
-On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote:
-On 2020/7/27 下午4:41, Cornelia Huck wrote:
-On Mon, 27 Jul 2020 15:38:12 +0800
-Jason Wang<jasowang@redhat.com> wrote:
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang<jasowang@redhat.com> wrote:
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
- config=0x55d6ad9e3f80 "RT") at
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me like we forgot to check the existence of the peer.
-
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck<cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-It can be hit with virtio-net-pci as well (just start without peer).
-Hm, I had not been able to reproduce the crash with a 'naked' -device
-virtio-net-pci. But checking seems to be the right idea anyway.
-Sorry for being unclear, I meant for the networking part: you just need to
-start without a peer, and you need a real guest (any Linux) that is trying to
-access the config space of virtio-net.
-
-Thanks
-A pxe guest will do it, but that doesn't support ccw, right?
-Yes, it depends on the cli actually.
-I'm still unclear why this triggers with ccw but not pci -
-any idea?
-I didn't test pxe but I can reproduce this with pci (just start a linux
-guest without a peer).
-Thanks
-
-On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote:
->
->
-On 2020/7/27 下午7:43, Michael S. Tsirkin wrote:
->
-> On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote:
->
-> > On 2020/7/27 下午4:41, Cornelia Huck wrote:
->
-> > > On Mon, 27 Jul 2020 15:38:12 +0800
->
-> > > Jason Wang<jasowang@redhat.com> wrote:
->
-> > >
->
-> > > > On 2020/7/27 下午2:43, Cornelia Huck wrote:
->
-> > > > > On Sat, 25 Jul 2020 08:40:07 +0800
->
-> > > > > Jason Wang<jasowang@redhat.com> wrote:
->
-> > > > > > On 2020/7/24 下午11:34, Cornelia Huck wrote:
->
-> > > > > > > On Fri, 24 Jul 2020 11:17:57 -0400
->
-> > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote:
->
-> > > > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
->
-> > > > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400
->
-> > > > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote:
->
-> > > > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck
->
-> > > > > > > > > > wrote:
->
-> > > > > > > > > > > When I start qemu with a second virtio-net-ccw device
->
-> > > > > > > > > > > (i.e. adding
->
-> > > > > > > > > > > -device virtio-net-ccw in addition to the autogenerated
->
-> > > > > > > > > > > device), I get
->
-> > > > > > > > > > > a segfault. gdb points to
->
-> > > > > > > > > > >
->
-> > > > > > > > > > > #0 0x000055d6ab52681d in virtio_net_get_config
->
-> > > > > > > > > > > (vdev=<optimized out>,
->
-> > > > > > > > > > > config=0x55d6ad9e3f80 "RT") at
->
-> > > > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146
->
-> > > > > > > > > > > 146 if (nc->peer->info->type ==
->
-> > > > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) {
->
-> > > > > > > > > > >
->
-> > > > > > > > > > > (backtrace doesn't go further)
->
-> > > > > > > > > The core was incomplete, but running under gdb directly
->
-> > > > > > > > > shows that it
->
-> > > > > > > > > is just a bog-standard config space access (first for that
->
-> > > > > > > > > device).
->
-> > > > > > > > >
->
-> > > > > > > > > The cause of the crash is that nc->peer is not set... no
->
-> > > > > > > > > idea how that
->
-> > > > > > > > > can happen, not that familiar with that part of QEMU.
->
-> > > > > > > > > (Should the code
->
-> > > > > > > > > check, or is that really something that should not happen?)
->
-> > > > > > > > >
->
-> > > > > > > > > What I don't understand is why it is set correctly for the
->
-> > > > > > > > > first,
->
-> > > > > > > > > autogenerated virtio-net-ccw device, but not for the second
->
-> > > > > > > > > one, and
->
-> > > > > > > > > why virtio-net-pci doesn't show these problems. The only
->
-> > > > > > > > > difference
->
-> > > > > > > > > between -ccw and -pci that comes to my mind here is that
->
-> > > > > > > > > config space
->
-> > > > > > > > > accesses for ccw are done via an asynchronous operation, so
->
-> > > > > > > > > timing
->
-> > > > > > > > > might be different.
->
-> > > > > > > > Hopefully Jason has an idea. Could you post a full command
->
-> > > > > > > > line
->
-> > > > > > > > please? Do you need a working guest to trigger this? Does
->
-> > > > > > > > this trigger
->
-> > > > > > > > on an x86 host?
->
-> > > > > > > Yes, it does trigger with tcg-on-x86 as well. I've been using
->
-> > > > > > >
->
-> > > > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg
->
-> > > > > > > -cpu qemu,zpci=on
->
-> > > > > > > -m 1024 -nographic -device
->
-> > > > > > > virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
->
-> > > > > > > -drive
->
-> > > > > > > file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
->
-> > > > > > > -device
->
-> > > > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
->
-> > > > > > > -device virtio-net-ccw
->
-> > > > > > >
->
-> > > > > > > It seems it needs the guest actually doing something with the
->
-> > > > > > > nics; I
->
-> > > > > > > cannot reproduce the crash if I use the old advent calendar
->
-> > > > > > > moon buggy
->
-> > > > > > > image and just add a virtio-net-ccw device.
->
-> > > > > > >
->
-> > > > > > > (I don't think it's a problem with my local build, as I see the
->
-> > > > > > > problem
->
-> > > > > > > both on my laptop and on an LPAR.)
->
-> > > > > > It looks to me we forget the check the existence of peer.
->
-> > > > > >
->
-> > > > > > Please try the attached patch to see if it works.
->
-> > > > > Thanks, that patch gets my guest up and running again. So, FWIW,
->
-> > > > >
->
-> > > > > Tested-by: Cornelia Huck<cohuck@redhat.com>
->
-> > > > >
->
-> > > > > Any idea why this did not hit with virtio-net-pci (or the
->
-> > > > > autogenerated
->
-> > > > > virtio-net-ccw device)?
->
-> > > > It can be hit with virtio-net-pci as well (just start without peer).
->
-> > > Hm, I had not been able to reproduce the crash with a 'naked' -device
->
-> > > virtio-net-pci. But checking seems to be the right idea anyway.
->
-> > Sorry for being unclear, I meant for networking part, you just need start
->
-> > without peer, and you need a real guest (any Linux) that is trying to
->
-> > access
->
-> > the config space of virtio-net.
->
-> >
->
-> > Thanks
->
-> A pxe guest will do it, but that doesn't support ccw, right?
->
->
->
-Yes, it depends on the cli actually.
->
->
->
->
->
-> I'm still unclear why this triggers with ccw but not pci -
->
-> any idea?
->
->
->
-I don't test pxe but I can reproduce this with pci (just start a linux guest
->
-without a peer).
->
->
-Thanks
->
-Might be a good addition to a unit test. Not sure what the test would
-do exactly: just make sure the guest runs? Looks like a lot of work
-for an empty test ... maybe we can poke at the guest config with
-qtest commands at least.
-
---
-MST
-
-On 2020/7/27 下午9:16, Michael S. Tsirkin wrote:
-On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote:
-On 2020/7/27 下午7:43, Michael S. Tsirkin wrote:
-On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote:
-On 2020/7/27 下午4:41, Cornelia Huck wrote:
-On Mon, 27 Jul 2020 15:38:12 +0800
-Jason Wang<jasowang@redhat.com> wrote:
-On 2020/7/27 下午2:43, Cornelia Huck wrote:
-On Sat, 25 Jul 2020 08:40:07 +0800
-Jason Wang<jasowang@redhat.com> wrote:
-On 2020/7/24 下午11:34, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 11:17:57 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote:
-On Fri, 24 Jul 2020 09:30:58 -0400
-"Michael S. Tsirkin"<mst@redhat.com> wrote:
-On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote:
-When I start qemu with a second virtio-net-ccw device (i.e. adding
--device virtio-net-ccw in addition to the autogenerated device), I get
-a segfault. gdb points to
-
-#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>,
- config=0x55d6ad9e3f80 "RT") at
-/home/cohuck/git/qemu/hw/net/virtio-net.c:146
-146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) {
-
-(backtrace doesn't go further)
-The core was incomplete, but running under gdb directly shows that it
-is just a bog-standard config space access (first for that device).
-
-The cause of the crash is that nc->peer is not set... no idea how that
-can happen, not that familiar with that part of QEMU. (Should the code
-check, or is that really something that should not happen?)
-
-What I don't understand is why it is set correctly for the first,
-autogenerated virtio-net-ccw device, but not for the second one, and
-why virtio-net-pci doesn't show these problems. The only difference
-between -ccw and -pci that comes to my mind here is that config space
-accesses for ccw are done via an asynchronous operation, so timing
-might be different.
-Hopefully Jason has an idea. Could you post a full command line
-please? Do you need a working guest to trigger this? Does this trigger
-on an x86 host?
-Yes, it does trigger with tcg-on-x86 as well. I've been using
-
-s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on
--m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001
--drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0
--device
-scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
--device virtio-net-ccw
-
-It seems it needs the guest actually doing something with the nics; I
-cannot reproduce the crash if I use the old advent calendar moon buggy
-image and just add a virtio-net-ccw device.
-
-(I don't think it's a problem with my local build, as I see the problem
-both on my laptop and on an LPAR.)
-It looks to me like we forgot to check the existence of the peer.
-
-Please try the attached patch to see if it works.
-Thanks, that patch gets my guest up and running again. So, FWIW,
-
-Tested-by: Cornelia Huck<cohuck@redhat.com>
-
-Any idea why this did not hit with virtio-net-pci (or the autogenerated
-virtio-net-ccw device)?
-It can be hit with virtio-net-pci as well (just start without peer).
-Hm, I had not been able to reproduce the crash with a 'naked' -device
-virtio-net-pci. But checking seems to be the right idea anyway.
-Sorry for being unclear, I meant for the networking part: you just need to
-start without a peer, and you need a real guest (any Linux) that is trying to
-access the config space of virtio-net.
-
-Thanks
-A pxe guest will do it, but that doesn't support ccw, right?
-Yes, it depends on the cli actually.
-I'm still unclear why this triggers with ccw but not pci -
-any idea?
-I don't test pxe but I can reproduce this with pci (just start a linux guest
-without a peer).
-
-Thanks
-Might be a good addition to a unit test. Not sure what would the
-test do exactly: just make sure guest runs? Looks like a lot of work
-for an empty test ... maybe we can poke at the guest config with
-qtest commands at least.
-That should work, or we can simply extend the existing virtio-net qtest to
-do that.
-Thanks
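-
-As a rough sketch of the "poke at the guest config" idea, one could add a test
-to the qos-graph virtio-net suite in tests/qtest/; the struct member and helper
-names below follow the existing virtio-net-test.c from memory and should be
-treated as assumptions, and note that the default graph setup attaches a
-netdev, so reproducing the original NULL-peer crash would additionally require
-instantiating the device without one:
-
-#include "qemu/osdep.h"
-#include "libqos/qgraph.h"
-#include "libqos/virtio.h"
-#include "libqos/virtio-net.h"
-
-static void config_read_test(void *obj, void *data, QGuestAllocator *t_alloc)
-{
-    QVirtioNet *net_if = obj;
-    QVirtioDevice *dev = net_if->vdev;
-    int i;
-
-    /* Reading the MAC bytes from device-specific config space goes through
-     * virtio_net_get_config() in QEMU, the function that crashed here. */
-    for (i = 0; i < 6; i++) {
-        qvirtio_config_readb(dev, i);
-    }
-}
-
-static void register_config_read_test(void)
-{
-    qos_add_test("config-read", "virtio-net", config_read_test, NULL);
-}
-libqos_init(register_config_read_test);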
-