| author | Christian Krinitsin <mail@krinitsin.com> | 2025-06-01 21:35:14 +0200 |
|---|---|---|
| committer | Christian Krinitsin <mail@krinitsin.com> | 2025-06-01 21:35:14 +0200 |
| commit | 3e4c5a6261770bced301b5e74233e7866166ea5b | |
| tree | 9379fddaba693ef8a045da06efee8529baa5f6f4 /classification_output/05/mistranslation | |
| parent | e5634e2806195bee44407853c4bf8776f7abfa4f | |
clean up repository
Diffstat (limited to 'classification_output/05/mistranslation')
| -rw-r--r-- | classification_output/05/mistranslation/14887122 | 266 |
| -rw-r--r-- | classification_output/05/mistranslation/23270873 | 700 |
| -rw-r--r-- | classification_output/05/mistranslation/25842545 | 210 |
| -rw-r--r-- | classification_output/05/mistranslation/64322995 | 62 |
| -rw-r--r-- | classification_output/05/mistranslation/70294255 | 1069 |
| -rw-r--r-- | classification_output/05/mistranslation/74466963 | 1886 |
| -rw-r--r-- | classification_output/05/mistranslation/74545755 | 352 |
| -rw-r--r-- | classification_output/05/mistranslation/80604314 | 1488 |
8 files changed, 0 insertions, 6033 deletions
diff --git a/classification_output/05/mistranslation/14887122 b/classification_output/05/mistranslation/14887122 deleted file mode 100644 index 1a87937b..00000000 --- a/classification_output/05/mistranslation/14887122 +++ /dev/null @@ -1,266 +0,0 @@ -mistranslation: 0.930 -semantic: 0.928 -device: 0.919 -assembly: 0.918 -socket: 0.914 -graphic: 0.910 -instruction: 0.905 -other: 0.890 -vnc: 0.871 -network: 0.855 -boot: 0.831 -KVM: 0.814 - -[BUG][RFC] CPR transfer Issues: Socket permissions and PID files - -Hello, - -While testing CPR transfer I encountered two issues. The first is that the -transfer fails when running with pidfiles due to the destination qemu process -attempting to create the pidfile while it is still locked by the source -process. The second is that the transfer fails when running with the -run-with -user=$USERID parameter. This is because the destination qemu process creates -the UNIX sockets used for the CPR transfer before dropping to the lower -permissioned user, which causes them to be owned by the original user. The -source qemu process then does not have permission to connect to it because it -is already running as the lesser permissioned user. - -Reproducing the first issue: - -Create a source and destination qemu instance associated with the same VM where -both processes have the -pidfile parameter passed on the command line. You -should see the following error on the command line of the second process: - -qemu-system-x86_64: cannot create PID file: Cannot lock pid file: Resource -temporarily unavailable - -Reproducing the second issue: - -Create a source and destination qemu instance associated with the same VM where -both processes have -run-with user=$USERID passed on the command line, where -$USERID is a different user from the one launching the processes. Then attempt -a CPR transfer using UNIX sockets for the main and cpr sockets. 
You should -receive the following error via QMP: -{"error": {"class": "GenericError", "desc": "Failed to connect to 'cpr.sock': -Permission denied"}} - -I provided a minimal patch that works around the second issue. - -Thank you, -Ben Chaney - ---- -include/system/os-posix.h | 4 ++++ -os-posix.c | 8 -------- -util/qemu-sockets.c | 21 +++++++++++++++++++++ -3 files changed, 25 insertions(+), 8 deletions(-) - -diff --git a/include/system/os-posix.h b/include/system/os-posix.h -index ce5b3bccf8..2a414a914a 100644 ---- a/include/system/os-posix.h -+++ b/include/system/os-posix.h -@@ -55,6 +55,10 @@ void os_setup_limits(void); -void os_setup_post(void); -int os_mlock(bool on_fault); - -+extern struct passwd *user_pwd; -+extern uid_t user_uid; -+extern gid_t user_gid; -+ -/** -* qemu_alloc_stack: -* @sz: pointer to a size_t holding the requested usable stack size -diff --git a/os-posix.c b/os-posix.c -index 52925c23d3..9369b312a0 100644 ---- a/os-posix.c -+++ b/os-posix.c -@@ -86,14 +86,6 @@ void os_set_proc_name(const char *s) -} - - --/* -- * Must set all three of these at once. -- * Legal combinations are unset by name by uid -- */ --static struct passwd *user_pwd; /* NULL non-NULL NULL */ --static uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */ --static gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */ -- -/* -* Prepare to change user ID. user_id can be one of 3 forms: -* - a username, in which case user ID will be changed to its uid, -diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c -index 77477c1cd5..987977ead9 100644 ---- a/util/qemu-sockets.c -+++ b/util/qemu-sockets.c -@@ -871,6 +871,14 @@ static bool saddr_is_tight(UnixSocketAddress *saddr) -#endif -} - -+/* -+ * Must set all three of these at once. 
-+ * Legal combinations are unset by name by uid -+ */ -+struct passwd *user_pwd; /* NULL non-NULL NULL */ -+uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */ -+gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */ -+ -static int unix_listen_saddr(UnixSocketAddress *saddr, -int num, -Error **errp) -@@ -947,6 +955,19 @@ static int unix_listen_saddr(UnixSocketAddress *saddr, -error_setg_errno(errp, errno, "Failed to bind socket to %s", path); -goto err; -} -+ if (user_pwd) { -+ if (chown(un.sun_path, user_pwd->pw_uid, user_pwd->pw_gid) < 0) { -+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s", -path); -+ goto err; -+ } -+ } -+ else if (user_uid != -1 && user_gid != -1) { -+ if (chown(un.sun_path, user_uid, user_gid) < 0) { -+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s", -path); -+ goto err; -+ } -+ } -+ -if (listen(sock, num) < 0) { -error_setg_errno(errp, errno, "Failed to listen on socket"); -goto err; --- -2.40.1 - -Thank you Ben. I appreciate you testing CPR and shaking out the bugs. -I will study these and propose patches. - -My initial reaction to the pidfile issue is that the orchestration layer must -pass a different filename when starting the destination qemu instance. When -using live update without containers, these types of resource conflicts in the -global namespaces are a known issue. - -- Steve - -On 3/14/2025 2:33 PM, Chaney, Ben wrote: -Hello, - -While testing CPR transfer I encountered two issues. The first is that the -transfer fails when running with pidfiles due to the destination qemu process -attempting to create the pidfile while it is still locked by the source -process. The second is that the transfer fails when running with the -run-with -user=$USERID parameter. This is because the destination qemu process creates -the UNIX sockets used for the CPR transfer before dropping to the lower -permissioned user, which causes them to be owned by the original user. 
The -source qemu process then does not have permission to connect to it because it -is already running as the lesser permissioned user. - -Reproducing the first issue: - -Create a source and destination qemu instance associated with the same VM where -both processes have the -pidfile parameter passed on the command line. You -should see the following error on the command line of the second process: - -qemu-system-x86_64: cannot create PID file: Cannot lock pid file: Resource -temporarily unavailable - -Reproducing the second issue: - -Create a source and destination qemu instance associated with the same VM where -both processes have -run-with user=$USERID passed on the command line, where -$USERID is a different user from the one launching the processes. Then attempt -a CPR transfer using UNIX sockets for the main and cpr sockets. You should -receive the following error via QMP: -{"error": {"class": "GenericError", "desc": "Failed to connect to 'cpr.sock': -Permission denied"}} - -I provided a minimal patch that works around the second issue. - -Thank you, -Ben Chaney - ---- -include/system/os-posix.h | 4 ++++ -os-posix.c | 8 -------- -util/qemu-sockets.c | 21 +++++++++++++++++++++ -3 files changed, 25 insertions(+), 8 deletions(-) - -diff --git a/include/system/os-posix.h b/include/system/os-posix.h -index ce5b3bccf8..2a414a914a 100644 ---- a/include/system/os-posix.h -+++ b/include/system/os-posix.h -@@ -55,6 +55,10 @@ void os_setup_limits(void); -void os_setup_post(void); -int os_mlock(bool on_fault); - -+extern struct passwd *user_pwd; -+extern uid_t user_uid; -+extern gid_t user_gid; -+ -/** -* qemu_alloc_stack: -* @sz: pointer to a size_t holding the requested usable stack size -diff --git a/os-posix.c b/os-posix.c -index 52925c23d3..9369b312a0 100644 ---- a/os-posix.c -+++ b/os-posix.c -@@ -86,14 +86,6 @@ void os_set_proc_name(const char *s) -} - - --/* -- * Must set all three of these at once. 
-- * Legal combinations are unset by name by uid -- */ --static struct passwd *user_pwd; /* NULL non-NULL NULL */ --static uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */ --static gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */ -- -/* -* Prepare to change user ID. user_id can be one of 3 forms: -* - a username, in which case user ID will be changed to its uid, -diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c -index 77477c1cd5..987977ead9 100644 ---- a/util/qemu-sockets.c -+++ b/util/qemu-sockets.c -@@ -871,6 +871,14 @@ static bool saddr_is_tight(UnixSocketAddress *saddr) -#endif -} - -+/* -+ * Must set all three of these at once. -+ * Legal combinations are unset by name by uid -+ */ -+struct passwd *user_pwd; /* NULL non-NULL NULL */ -+uid_t user_uid = (uid_t)-1; /* -1 -1 >=0 */ -+gid_t user_gid = (gid_t)-1; /* -1 -1 >=0 */ -+ -static int unix_listen_saddr(UnixSocketAddress *saddr, -int num, -Error **errp) -@@ -947,6 +955,19 @@ static int unix_listen_saddr(UnixSocketAddress *saddr, -error_setg_errno(errp, errno, "Failed to bind socket to %s", path); -goto err; -} -+ if (user_pwd) { -+ if (chown(un.sun_path, user_pwd->pw_uid, user_pwd->pw_gid) < 0) { -+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s", -path); -+ goto err; -+ } -+ } -+ else if (user_uid != -1 && user_gid != -1) { -+ if (chown(un.sun_path, user_uid, user_gid) < 0) { -+ error_setg_errno(errp, errno, "Failed to change permissions on socket %s", -path); -+ goto err; -+ } -+ } -+ -if (listen(sock, num) < 0) { -error_setg_errno(errp, errno, "Failed to listen on socket"); -goto err; --- -2.40.1 - diff --git a/classification_output/05/mistranslation/23270873 b/classification_output/05/mistranslation/23270873 deleted file mode 100644 index 4d8b927f..00000000 --- a/classification_output/05/mistranslation/23270873 +++ /dev/null @@ -1,700 +0,0 @@ -mistranslation: 0.881 -other: 0.839 -boot: 0.830 -vnc: 0.820 -device: 0.810 -KVM: 0.803 -assembly: 0.768 -network: 0.768 -graphic: 0.763 
-socket: 0.758 -instruction: 0.755 -semantic: 0.752 - -[Qemu-devel] [BUG?] aio_get_linux_aio: Assertion `ctx->linux_aio' failed - -Hi, - -I am seeing some strange QEMU assertion failures for qemu on s390x, -which prevents a guest from starting. - -Git bisecting points to the following commit as the source of the error. - -commit ed6e2161715c527330f936d44af4c547f25f687e -Author: Nishanth Aravamudan <address@hidden> -Date: Fri Jun 22 12:37:00 2018 -0700 - - linux-aio: properly bubble up errors from initialization - - laio_init() can fail for a couple of reasons, which will lead to a NULL - pointer dereference in laio_attach_aio_context(). - - To solve this, add a aio_setup_linux_aio() function which is called - early in raw_open_common. If this fails, propagate the error up. The - signature of aio_get_linux_aio() was not modified, because it seems - preferable to return the actual errno from the possible failing - initialization calls. - - Additionally, when the AioContext changes, we need to associate a - LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context - callback and call the new aio_setup_linux_aio(), which will allocate a -new AioContext if needed, and return errors on failures. If it -fails for -any reason, fallback to threaded AIO with an error message, as the - device is already in-use by the guest. - - Add an assert that aio_get_linux_aio() cannot return NULL. - - Signed-off-by: Nishanth Aravamudan <address@hidden> - Message-id: address@hidden - Signed-off-by: Stefan Hajnoczi <address@hidden> -Not sure what is causing this assertion to fail. 
Here is the qemu -command line of the guest, from qemu log, which throws this error: -LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin -QEMU_AUDIO_DRV=none /usr/local/bin/qemu-system-s390x -name -guest=rt_vm1,debug-threads=on -S -object -secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-21-rt_vm1/master-key.aes --machine s390-ccw-virtio-2.12,accel=kvm,usb=off,dump-guest-core=off -m -1024 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -object -iothread,id=iothread1 -uuid 0cde16cd-091d-41bd-9ac2-5243df5c9a0d --display none -no-user-config -nodefaults -chardev -socket,id=charmonitor,fd=28,server,nowait -mon -chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown --boot strict=on -drive -file=/dev/mapper/360050763998b0883980000002a000031,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native --device -virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0001,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on --netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device -virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:3a:c8:67:95:84,devno=fe.0.0000 --netdev tap,fd=32,id=hostnet1,vhost=on,vhostfd=33 -device -virtio-net-ccw,netdev=hostnet1,id=net1,mac=52:54:00:2a:e5:08,devno=fe.0.0002 --chardev pty,id=charconsole0 -device -sclpconsole,chardev=charconsole0,id=console0 -device -virtio-balloon-ccw,id=balloon0,devno=fe.3.ffba -sandbox -on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny --msg timestamp=on -2018-07-17 15:48:42.252+0000: Domain id=21 is tainted: high-privileges -2018-07-17T15:48:42.279380Z qemu-system-s390x: -chardev -pty,id=charconsole0: char device redirected to /dev/pts/3 (label -charconsole0) -qemu-system-s390x: util/async.c:339: aio_get_linux_aio: Assertion -`ctx->linux_aio' failed. -2018-07-17 15:48:43.309+0000: shutting down, reason=failed - - -Any help debugging this would be greatly appreciated. 
- -Thank you -Farhan - -On 17.07.2018 [13:25:53 -0400], Farhan Ali wrote: -> -Hi, -> -> -I am seeing some strange QEMU assertion failures for qemu on s390x, -> -which prevents a guest from starting. -> -> -Git bisecting points to the following commit as the source of the error. -> -> -commit ed6e2161715c527330f936d44af4c547f25f687e -> -Author: Nishanth Aravamudan <address@hidden> -> -Date: Fri Jun 22 12:37:00 2018 -0700 -> -> -linux-aio: properly bubble up errors from initialization -> -> -laio_init() can fail for a couple of reasons, which will lead to a NULL -> -pointer dereference in laio_attach_aio_context(). -> -> -To solve this, add a aio_setup_linux_aio() function which is called -> -early in raw_open_common. If this fails, propagate the error up. The -> -signature of aio_get_linux_aio() was not modified, because it seems -> -preferable to return the actual errno from the possible failing -> -initialization calls. -> -> -Additionally, when the AioContext changes, we need to associate a -> -LinuxAioState with the new AioContext. Use the bdrv_attach_aio_context -> -callback and call the new aio_setup_linux_aio(), which will allocate a -> -new AioContext if needed, and return errors on failures. If it fails for -> -any reason, fallback to threaded AIO with an error message, as the -> -device is already in-use by the guest. -> -> -Add an assert that aio_get_linux_aio() cannot return NULL. -> -> -Signed-off-by: Nishanth Aravamudan <address@hidden> -> -Message-id: address@hidden -> -Signed-off-by: Stefan Hajnoczi <address@hidden> -> -> -> -Not sure what is causing this assertion to fail. 
Here is the qemu command -> -line of the guest, from qemu log, which throws this error: -> -> -> -LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin -> -QEMU_AUDIO_DRV=none /usr/local/bin/qemu-system-s390x -name -> -guest=rt_vm1,debug-threads=on -S -object -> -secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-21-rt_vm1/master-key.aes -> --machine s390-ccw-virtio-2.12,accel=kvm,usb=off,dump-guest-core=off -m 1024 -> --realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -object -> -iothread,id=iothread1 -uuid 0cde16cd-091d-41bd-9ac2-5243df5c9a0d -display -> -none -no-user-config -nodefaults -chardev -> -socket,id=charmonitor,fd=28,server,nowait -mon -> -chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot -> -strict=on -drive -> -file=/dev/mapper/360050763998b0883980000002a000031,format=raw,if=none,id=drive-virtio-disk0,cache=none,aio=native -> --device -> -virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0001,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=on -> --netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=31 -device -> -virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:3a:c8:67:95:84,devno=fe.0.0000 -> --netdev tap,fd=32,id=hostnet1,vhost=on,vhostfd=33 -device -> -virtio-net-ccw,netdev=hostnet1,id=net1,mac=52:54:00:2a:e5:08,devno=fe.0.0002 -> --chardev pty,id=charconsole0 -device -> -sclpconsole,chardev=charconsole0,id=console0 -device -> -virtio-balloon-ccw,id=balloon0,devno=fe.3.ffba -sandbox -> -on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg -> -timestamp=on -> -> -> -> -2018-07-17 15:48:42.252+0000: Domain id=21 is tainted: high-privileges -> -2018-07-17T15:48:42.279380Z qemu-system-s390x: -chardev pty,id=charconsole0: -> -char device redirected to /dev/pts/3 (label charconsole0) -> -qemu-system-s390x: util/async.c:339: aio_get_linux_aio: Assertion -> -`ctx->linux_aio' failed. 
-> -2018-07-17 15:48:43.309+0000: shutting down, reason=failed -> -> -> -Any help debugging this would be greatly appreciated. -iiuc, this possibly implies AIO was not actually used previously on this -guest (it might have silently been falling back to threaded IO?). I -don't have access to s390x, but would it be possible to run qemu under -gdb and see if aio_setup_linux_aio is being called at all (I think it -might not be, but I'm not sure why), and if so, if it's for the context -in question? - -If it's not being called first, could you see what callpath is calling -aio_get_linux_aio when this assertion trips? - -Thanks! --Nish - -On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote: -iiuc, this possibly implies AIO was not actually used previously on this -guest (it might have silently been falling back to threaded IO?). I -don't have access to s390x, but would it be possible to run qemu under -gdb and see if aio_setup_linux_aio is being called at all (I think it -might not be, but I'm not sure why), and if so, if it's for the context -in question? - -If it's not being called first, could you see what callpath is calling -aio_get_linux_aio when this assertion trips? - -Thanks! 
--Nish -Hi Nishant, -From the coredump of the guest this is the call trace that calls -aio_get_linux_aio: -Stack trace of thread 145158: -#0 0x000003ff94dbe274 raise (libc.so.6) -#1 0x000003ff94da39a8 abort (libc.so.6) -#2 0x000003ff94db62ce __assert_fail_base (libc.so.6) -#3 0x000003ff94db634c __assert_fail (libc.so.6) -#4 0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x) -#5 0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x) -#6 0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x) -#7 0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x) -#8 0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x) -#9 0x000002aa20db3c34 aio_poll (qemu-system-s390x) -#10 0x000002aa20be32a2 iothread_run (qemu-system-s390x) -#11 0x000003ff94f879a8 start_thread (libpthread.so.0) -#12 0x000003ff94e797ee thread_start (libc.so.6) - - -Thanks for taking a look and responding. - -Thanks -Farhan - -On 07/18/2018 09:42 AM, Farhan Ali wrote: -On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote: -iiuc, this possibly implies AIO was not actually used previously on this -guest (it might have silently been falling back to threaded IO?). I -don't have access to s390x, but would it be possible to run qemu under -gdb and see if aio_setup_linux_aio is being called at all (I think it -might not be, but I'm not sure why), and if so, if it's for the context -in question? - -If it's not being called first, could you see what callpath is calling -aio_get_linux_aio when this assertion trips? - -Thanks! 
--Nish -Hi Nishant, -From the coredump of the guest this is the call trace that calls -aio_get_linux_aio: -Stack trace of thread 145158: -#0 0x000003ff94dbe274 raise (libc.so.6) -#1 0x000003ff94da39a8 abort (libc.so.6) -#2 0x000003ff94db62ce __assert_fail_base (libc.so.6) -#3 0x000003ff94db634c __assert_fail (libc.so.6) -#4 0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x) -#5 0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x) -#6 0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x) -#7 0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x) -#8 0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x) -#9 0x000002aa20db3c34 aio_poll (qemu-system-s390x) -#10 0x000002aa20be32a2 iothread_run (qemu-system-s390x) -#11 0x000003ff94f879a8 start_thread (libpthread.so.0) -#12 0x000003ff94e797ee thread_start (libc.so.6) - - -Thanks for taking a look and responding. - -Thanks -Farhan -Trying to debug a little further, the block device in this case is a -"host device". And looking at your commit carefully you use the -bdrv_attach_aio_context callback to setup a Linux AioContext. -For some reason the "host device" struct (BlockDriver bdrv_host_device -in block/file-posix.c) does not have a bdrv_attach_aio_context defined. -So a simple change of adding the callback to the struct solves the issue -and the guest starts fine. -diff --git a/block/file-posix.c b/block/file-posix.c -index 28824aa..b8d59fb 100644 ---- a/block/file-posix.c -+++ b/block/file-posix.c -@@ -3135,6 +3135,7 @@ static BlockDriver bdrv_host_device = { - .bdrv_refresh_limits = raw_refresh_limits, - .bdrv_io_plug = raw_aio_plug, - .bdrv_io_unplug = raw_aio_unplug, -+ .bdrv_attach_aio_context = raw_aio_attach_aio_context, - - .bdrv_co_truncate = raw_co_truncate, - .bdrv_getlength = raw_getlength, -I am not too familiar with block device code in QEMU, so not sure if -this is the right fix or if there are some underlying problems. 
-Thanks -Farhan - -On 18.07.2018 [11:10:27 -0400], Farhan Ali wrote: -> -> -> -On 07/18/2018 09:42 AM, Farhan Ali wrote: -> -> -> -> -> -> On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote: -> -> > iiuc, this possibly implies AIO was not actually used previously on this -> -> > guest (it might have silently been falling back to threaded IO?). I -> -> > don't have access to s390x, but would it be possible to run qemu under -> -> > gdb and see if aio_setup_linux_aio is being called at all (I think it -> -> > might not be, but I'm not sure why), and if so, if it's for the context -> -> > in question? -> -> > -> -> > If it's not being called first, could you see what callpath is calling -> -> > aio_get_linux_aio when this assertion trips? -> -> > -> -> > Thanks! -> -> > -Nish -> -> -> -> -> -> Hi Nishant, -> -> -> -> From the coredump of the guest this is the call trace that calls -> -> aio_get_linux_aio: -> -> -> -> -> -> Stack trace of thread 145158: -> -> #0 0x000003ff94dbe274 raise (libc.so.6) -> -> #1 0x000003ff94da39a8 abort (libc.so.6) -> -> #2 0x000003ff94db62ce __assert_fail_base (libc.so.6) -> -> #3 0x000003ff94db634c __assert_fail (libc.so.6) -> -> #4 0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x) -> -> #5 0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x) -> -> #6 0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x) -> -> #7 0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x) -> -> #8 0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x) -> -> #9 0x000002aa20db3c34 aio_poll (qemu-system-s390x) -> -> #10 0x000002aa20be32a2 iothread_run (qemu-system-s390x) -> -> #11 0x000003ff94f879a8 start_thread (libpthread.so.0) -> -> #12 0x000003ff94e797ee thread_start (libc.so.6) -> -> -> -> -> -> Thanks for taking a look and responding. -> -> -> -> Thanks -> -> Farhan -> -> -> -> -> -> -> -> -Trying to debug a little further, the block device in this case is a "host -> -device". 
And looking at your commit carefully you use the -> -bdrv_attach_aio_context callback to setup a Linux AioContext. -> -> -For some reason the "host device" struct (BlockDriver bdrv_host_device in -> -block/file-posix.c) does not have a bdrv_attach_aio_context defined. -> -So a simple change of adding the callback to the struct solves the issue and -> -the guest starts fine. -> -> -> -diff --git a/block/file-posix.c b/block/file-posix.c -> -index 28824aa..b8d59fb 100644 -> ---- a/block/file-posix.c -> -+++ b/block/file-posix.c -> -@@ -3135,6 +3135,7 @@ static BlockDriver bdrv_host_device = { -> -.bdrv_refresh_limits = raw_refresh_limits, -> -.bdrv_io_plug = raw_aio_plug, -> -.bdrv_io_unplug = raw_aio_unplug, -> -+ .bdrv_attach_aio_context = raw_aio_attach_aio_context, -> -> -.bdrv_co_truncate = raw_co_truncate, -> -.bdrv_getlength = raw_getlength, -> -> -> -> -I am not too familiar with block device code in QEMU, so not sure if -> -this is the right fix or if there are some underlying problems. -Oh this is quite embarassing! I only added the bdrv_attach_aio_context -callback for the file-backed device. Your fix is definitely corect for -host device. Let me make sure there weren't any others missed and I will -send out a properly formatted patch. Thank you for the quick testing and -turnaround! - --Nish - -On 07/18/2018 08:52 PM, Nishanth Aravamudan wrote: -> -On 18.07.2018 [11:10:27 -0400], Farhan Ali wrote: -> -> -> -> -> -> On 07/18/2018 09:42 AM, Farhan Ali wrote: -> ->> -> ->> -> ->> On 07/17/2018 04:52 PM, Nishanth Aravamudan wrote: -> ->>> iiuc, this possibly implies AIO was not actually used previously on this -> ->>> guest (it might have silently been falling back to threaded IO?). I -> ->>> don't have access to s390x, but would it be possible to run qemu under -> ->>> gdb and see if aio_setup_linux_aio is being called at all (I think it -> ->>> might not be, but I'm not sure why), and if so, if it's for the context -> ->>> in question? 
-> ->>> -> ->>> If it's not being called first, could you see what callpath is calling -> ->>> aio_get_linux_aio when this assertion trips? -> ->>> -> ->>> Thanks! -> ->>> -Nish -> ->> -> ->> -> ->> Hi Nishant, -> ->> -> ->> From the coredump of the guest this is the call trace that calls -> ->> aio_get_linux_aio: -> ->> -> ->> -> ->> Stack trace of thread 145158: -> ->> #0 0x000003ff94dbe274 raise (libc.so.6) -> ->> #1 0x000003ff94da39a8 abort (libc.so.6) -> ->> #2 0x000003ff94db62ce __assert_fail_base (libc.so.6) -> ->> #3 0x000003ff94db634c __assert_fail (libc.so.6) -> ->> #4 0x000002aa20db067a aio_get_linux_aio (qemu-system-s390x) -> ->> #5 0x000002aa20d229a8 raw_aio_plug (qemu-system-s390x) -> ->> #6 0x000002aa20d309ee bdrv_io_plug (qemu-system-s390x) -> ->> #7 0x000002aa20b5a8ea virtio_blk_handle_vq (qemu-system-s390x) -> ->> #8 0x000002aa20db2f6e aio_dispatch_handlers (qemu-system-s390x) -> ->> #9 0x000002aa20db3c34 aio_poll (qemu-system-s390x) -> ->> #10 0x000002aa20be32a2 iothread_run (qemu-system-s390x) -> ->> #11 0x000003ff94f879a8 start_thread (libpthread.so.0) -> ->> #12 0x000003ff94e797ee thread_start (libc.so.6) -> ->> -> ->> -> ->> Thanks for taking a look and responding. -> ->> -> ->> Thanks -> ->> Farhan -> ->> -> ->> -> ->> -> -> -> -> Trying to debug a little further, the block device in this case is a "host -> -> device". And looking at your commit carefully you use the -> -> bdrv_attach_aio_context callback to setup a Linux AioContext. -> -> -> -> For some reason the "host device" struct (BlockDriver bdrv_host_device in -> -> block/file-posix.c) does not have a bdrv_attach_aio_context defined. -> -> So a simple change of adding the callback to the struct solves the issue and -> -> the guest starts fine. 
-> -> -> -> -> -> diff --git a/block/file-posix.c b/block/file-posix.c -> -> index 28824aa..b8d59fb 100644 -> -> --- a/block/file-posix.c -> -> +++ b/block/file-posix.c -> -> @@ -3135,6 +3135,7 @@ static BlockDriver bdrv_host_device = { -> -> .bdrv_refresh_limits = raw_refresh_limits, -> -> .bdrv_io_plug = raw_aio_plug, -> -> .bdrv_io_unplug = raw_aio_unplug, -> -> + .bdrv_attach_aio_context = raw_aio_attach_aio_context, -> -> -> -> .bdrv_co_truncate = raw_co_truncate, -> -> .bdrv_getlength = raw_getlength, -> -> -> -> -> -> -> -> I am not too familiar with block device code in QEMU, so not sure if -> -> this is the right fix or if there are some underlying problems. -> -> -Oh this is quite embarassing! I only added the bdrv_attach_aio_context -> -callback for the file-backed device. Your fix is definitely corect for -> -host device. Let me make sure there weren't any others missed and I will -> -send out a properly formatted patch. Thank you for the quick testing and -> -turnaround! -Farhan, can you respin your patch with proper sign-off and patch description? -Adding qemu-block. - -Hi Christian, - -On 19.07.2018 [08:55:20 +0200], Christian Borntraeger wrote: -> -> -> -On 07/18/2018 08:52 PM, Nishanth Aravamudan wrote: -> -> On 18.07.2018 [11:10:27 -0400], Farhan Ali wrote: -> ->> -> ->> -> ->> On 07/18/2018 09:42 AM, Farhan Ali wrote: -<snip> - -> ->> I am not too familiar with block device code in QEMU, so not sure if -> ->> this is the right fix or if there are some underlying problems. -> -> -> -> Oh this is quite embarassing! I only added the bdrv_attach_aio_context -> -> callback for the file-backed device. Your fix is definitely corect for -> -> host device. Let me make sure there weren't any others missed and I will -> -> send out a properly formatted patch. Thank you for the quick testing and -> -> turnaround! -> -> -Farhan, can you respin your patch with proper sign-off and patch description? -> -Adding qemu-block. 
-I sent it yesterday, sorry I didn't cc everyone from this e-mail:
-http://lists.nongnu.org/archive/html/qemu-block/2018-07/msg00516.html
-Thanks,
-Nish
-
diff --git a/classification_output/05/mistranslation/25842545 b/classification_output/05/mistranslation/25842545
deleted file mode 100644
index 088ed7a1..00000000
--- a/classification_output/05/mistranslation/25842545
+++ /dev/null
@@ -1,210 +0,0 @@
-mistranslation: 0.928
-other: 0.912
-KVM: 0.867
-vnc: 0.862
-device: 0.847
-instruction: 0.835
-semantic: 0.829
-boot: 0.824
-assembly: 0.824
-graphic: 0.822
-socket: 0.808
-network: 0.796
-
-[Qemu-devel] [Bug?] Guest pause because VMPTRLD failed in KVM
-
-Hello,
-
-  We encountered a problem where a guest paused because the KMOD reported that
-VMPTRLD failed.
-
-The related information is as follows:
-
-1) Qemu command:
-   /usr/bin/qemu-kvm -name omu1 -S
-     -machine pc-i440fx-2.3,accel=kvm,usb=off
-     -cpu host -m 15625 -realtime mlock=off
-     -smp 8,sockets=1,cores=8,threads=1
-     -uuid a2aacfff-6583-48b4-b6a4-e6830e519931
-     -no-user-config -nodefaults
-     -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/omu1.monitor,server,nowait
-     -mon chardev=charmonitor,id=monitor,mode=control
-     -rtc base=utc -no-shutdown -boot strict=on
-     -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
-     -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
-     -drive file=/home/env/guest1.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native
-     -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0
-     -drive file=/home/env/guest_300G.img,if=none,id=drive-virtio-disk1,format=raw,cache=none,aio=native
-     -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1
-     -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26
-     -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:00:80:05:00:00,bus=pci.0,addr=0x3
-     -netdev tap,fd=27,id=hostnet1,vhost=on,vhostfd=28
-     -device virtio-net-pci,netdev=hostnet1,id=net1,mac=00:00:80:05:00:01,bus=pci.0,addr=0x4
-     -chardev pty,id=charserial0
-     -device isa-serial,chardev=charserial0,id=serial0
-     -device usb-tablet,id=input0 -vnc 0.0.0.0:0
-     -device cirrus-vga,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2
-     -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8
-     -msg timestamp=on
-
-   2) Qemu log:
-   KVM: entry failed, hardware error 0x4
-   RAX=00000000ffffffed RBX=ffff8803fa2d7fd8 RCX=0100000000000000 RDX=0000000000000000
-   RSI=0000000000000000 RDI=0000000000000046 RBP=ffff8803fa2d7e90 RSP=ffff8803fa2efe90
-   R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=000000000000b69a
-   R12=0000000000000001 R13=ffffffff81a25b40 R14=0000000000000000 R15=ffff8803fa2d7fd8
-   RIP=ffffffff81053e16 RFL=00000286 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
-   ES =0000 0000000000000000 ffffffff 00c00000
-   CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
-   SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
-   DS =0000 0000000000000000 ffffffff 00c00000
-   FS =0000 0000000000000000 ffffffff 00c00000
-   GS =0000 ffff88040f540000 ffffffff 00c00000
-   LDT=0000 0000000000000000 ffffffff 00c00000
-   TR =0040 ffff88040f550a40 00002087 00008b00 DPL=0 TSS64-busy
-   GDT= ffff88040f549000 0000007f
-   IDT= ffffffffff529000 00000fff
-   CR0=80050033 CR2=00007f81ca0c5000 CR3=00000003f5081000 CR4=000407e0
-   DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
-   DR6=00000000ffff0ff0 DR7=0000000000000400
-   EFER=0000000000000d01
-   Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
-
-   3) Dmesg
-   [347315.028339] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
-   klogd 1.4.1, ---------- state change ----------
-   [347315.039506] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
-   [347315.051728] kvm: vmptrld ffff8817ec5f0000/17ec5f0000 failed
-   [347315.057472] vmwrite error: reg 6c0a value ffff88307e66e480 (err -2120672384)
-   [347315.064567] Pid: 69523, comm: qemu-kvm Tainted: GF X 3.0.93-0.8-default #1
-   [347315.064569] Call Trace:
-   [347315.064587]  [<ffffffff810049d5>] dump_trace+0x75/0x300
-   [347315.064595]  [<ffffffff8145e3e3>] dump_stack+0x69/0x6f
-   [347315.064617]  [<ffffffffa03738de>] vmx_vcpu_load+0x11e/0x1d0 [kvm_intel]
-   [347315.064647]  [<ffffffffa029a204>] kvm_arch_vcpu_load+0x44/0x1d0 [kvm]
-   [347315.064669]  [<ffffffff81054ee1>] finish_task_switch+0x81/0xe0
-   [347315.064676]  [<ffffffff8145f0b4>] thread_return+0x3b/0x2a7
-   [347315.064687]  [<ffffffffa028d9b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
-   [347315.064703]  [<ffffffffa02a16d1>] __vcpu_run+0xd1/0x260 [kvm]
-   [347315.064732]  [<ffffffffa02a2418>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
-   [347315.064759]  [<ffffffffa028ecee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
-   [347315.064771]  [<ffffffff8116bdfb>] do_vfs_ioctl+0x8b/0x3b0
-   [347315.064776]  [<ffffffff8116c1c1>] sys_ioctl+0xa1/0xb0
-   [347315.064783]  [<ffffffff81469272>] system_call_fastpath+0x16/0x1b
-   [347315.064797]  [<00007fee51969ce7>] 0x7fee51969ce6
-   [347315.064799] vmwrite error: reg 6c0c value ffff88307e664000 (err -2120630272)
-   [347315.064802] Pid: 69523, comm: qemu-kvm Tainted: GF X 3.0.93-0.8-default #1
-   [347315.064803] Call Trace:
-   [347315.064807]  [<ffffffff810049d5>] dump_trace+0x75/0x300
-   [347315.064811]  [<ffffffff8145e3e3>] dump_stack+0x69/0x6f
-   [347315.064817]  [<ffffffffa03738ec>] vmx_vcpu_load+0x12c/0x1d0 [kvm_intel]
-   [347315.064832]  [<ffffffffa029a204>] kvm_arch_vcpu_load+0x44/0x1d0 [kvm]
-   [347315.064851]  [<ffffffff81054ee1>] finish_task_switch+0x81/0xe0
-   [347315.064855]  [<ffffffff8145f0b4>] thread_return+0x3b/0x2a7
-   [347315.064865]  [<ffffffffa028d9b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
-   [347315.064880]  [<ffffffffa02a16d1>] __vcpu_run+0xd1/0x260 [kvm]
-   [347315.064907]  [<ffffffffa02a2418>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
-   [347315.064933]  [<ffffffffa028ecee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
-   [347315.064943]  [<ffffffff8116bdfb>] do_vfs_ioctl+0x8b/0x3b0
-   [347315.064947]  [<ffffffff8116c1c1>] sys_ioctl+0xa1/0xb0
-   [347315.064951]  [<ffffffff81469272>] system_call_fastpath+0x16/0x1b
-   [347315.064957]  [<00007fee51969ce7>] 0x7fee51969ce6
-   [347315.064959] vmwrite error: reg 6c10 value 0 (err 0)
-
-   4) The issue can't be reproduced. I searched the Intel VMX spec for the
-reasons VMPTRLD can fail:
-   The instruction fails if its operand is not properly aligned, sets
-   unsupported physical-address bits, or is equal to the VMXON
-   pointer. In addition, the instruction fails if the 32 bits in memory
-   referenced by the operand do not match the VMCS
-   revision identifier supported by this processor.
-
-   But I can't find any clues in the KVM source code. Each of these
-   error conditions seems impossible in theory. :(
-
-Any suggestions will be appreciated! Paolo?
-
---- 
-Regards,
--Gonglei
-
-On 10/11/2016 15:10, gong lei wrote:
-> 4) The issue can't be reproduced. I searched the Intel VMX spec for the
-> reasons VMPTRLD can fail:
-> The instruction fails if its operand is not properly aligned, sets
-> unsupported physical-address bits, or is equal to the VMXON
-> pointer. In addition, the instruction fails if the 32 bits in memory
-> referenced by the operand do not match the VMCS
-> revision identifier supported by this processor.
->
-> But I can't find any clues in the KVM source code. Each of these
-> error conditions seems impossible in theory. :(
-Yes, it should not happen. :(
-
-If it's not reproducible, it's really hard to say what it was, except a
-random memory corruption elsewhere or even a bit flip (!).
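The four failure conditions quoted from the Intel spec above can be written down as explicit checks. The following is only a sketch for reasoning about the report, not KVM code: the function name, the result codes, and the parameters are hypothetical, "properly aligned" is taken to mean 4 KiB alignment, and the physical-address width is passed in by the caller as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; not KVM or hardware names. */
enum vmptrld_check {
    VMPTRLD_OK = 0,
    VMPTRLD_MISALIGNED,
    VMPTRLD_BAD_ADDR_BITS,
    VMPTRLD_IS_VMXON_PTR,
    VMPTRLD_BAD_REVISION,
};

/* Checks the documented VMPTRLD failure conditions in the order they
 * are listed in the quoted text.
 *   vmcs_pa       physical address passed to VMPTRLD
 *   vmxon_pa      physical address of the VMXON region
 *   phys_bits     physical-address width of the CPU (caller's assumption)
 *   rev_in_memory first 32 bits stored at vmcs_pa
 *   cpu_rev       VMCS revision identifier reported by the CPU */
static enum vmptrld_check
check_vmptrld_operand(uint64_t vmcs_pa, uint64_t vmxon_pa,
                      unsigned phys_bits, uint32_t rev_in_memory,
                      uint32_t cpu_rev)
{
    assert(phys_bits > 12 && phys_bits < 64);

    if (vmcs_pa & 0xfffULL) {           /* assumes 4 KiB alignment */
        return VMPTRLD_MISALIGNED;
    }
    if (vmcs_pa >> phys_bits) {         /* bits beyond the supported width */
        return VMPTRLD_BAD_ADDR_BITS;
    }
    if (vmcs_pa == vmxon_pa) {
        return VMPTRLD_IS_VMXON_PTR;
    }
    if (rev_in_memory != cpu_rev) {
        return VMPTRLD_BAD_REVISION;
    }
    return VMPTRLD_OK;
}
```

For the physical address in the dmesg output, 0x17ec5f0000, the low 12 bits are zero and the value fits in well under 46 bits, so the first two checks pass under these assumptions. That matches the thread's observation that none of the documented conditions obviously applies.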
-
-Paolo
-
-On 2016/11/17 20:39, Paolo Bonzini wrote:
->
-> On 10/11/2016 15:10, gong lei wrote:
-> > 4) The issue can't be reproduced. I searched the Intel VMX spec for the
-> > reasons VMPTRLD can fail:
-> > The instruction fails if its operand is not properly aligned, sets
-> > unsupported physical-address bits, or is equal to the VMXON
-> > pointer. In addition, the instruction fails if the 32 bits in memory
-> > referenced by the operand do not match the VMCS
-> > revision identifier supported by this processor.
-> >
-> > But I can't find any clues in the KVM source code. Each of these
-> > error conditions seems impossible in theory. :(
-> Yes, it should not happen. :(
->
-> If it's not reproducible, it's really hard to say what it was, except a
-> random memory corruption elsewhere or even a bit flip (!).
->
-> Paolo
-Thanks for your reply, Paolo :)
-
---- 
-Regards,
--Gonglei
-
diff --git a/classification_output/05/mistranslation/64322995 b/classification_output/05/mistranslation/64322995
deleted file mode 100644
index 7330769b..00000000
--- a/classification_output/05/mistranslation/64322995
+++ /dev/null
@@ -1,62 +0,0 @@
-mistranslation: 0.936
-device: 0.915
-network: 0.914
-semantic: 0.906
-graphic: 0.904
-other: 0.881
-socket: 0.866
-instruction: 0.864
-vnc: 0.801
-boot: 0.780
-KVM: 0.742
-assembly: 0.653
-
-[Qemu-devel] [BUG] trace: QEMU hangs on initialization with the "simple" backend
-
-While starting the softmmu version of QEMU, the simple backend waits for the
-writeout thread to signal a condition variable when initializing the output file
-path. But since the writeout thread has not been created, it just waits forever.
-
-Thanks,
-  Lluis
-
-On Tue, Feb 09, 2016 at 09:24:04PM +0100, Lluís Vilanova wrote:
-> While starting the softmmu version of QEMU, the simple backend waits for the
-> writeout thread to signal a condition variable when initializing the output
-> file path. But since the writeout thread has not been created, it just waits
-> forever.
-Denis Lunev posted a fix:
-https://patchwork.ozlabs.org/patch/580968/
-Stefan
-
-Stefan Hajnoczi writes:
-
-> On Tue, Feb 09, 2016 at 09:24:04PM +0100, Lluís Vilanova wrote:
-> > While starting the softmmu version of QEMU, the simple backend waits for the
-> > writeout thread to signal a condition variable when initializing the output
-> > file path. But since the writeout thread has not been created, it just waits
-> > forever.
-> Denis Lunev posted a fix:
-> https://patchwork.ozlabs.org/patch/580968/
-Great, thanks.
-
-Lluis
-
diff --git a/classification_output/05/mistranslation/70294255 b/classification_output/05/mistranslation/70294255
deleted file mode 100644
index 2f154bf2..00000000
--- a/classification_output/05/mistranslation/70294255
+++ /dev/null
@@ -1,1069 +0,0 @@
-mistranslation: 0.862
-assembly: 0.861
-semantic: 0.858
-socket: 0.858
-device: 0.857
-graphic: 0.857
-instruction: 0.856
-other: 0.852
-network: 0.846
-vnc: 0.837
-boot: 0.811
-KVM: 0.806
-
-[Qemu-devel] Reply: Re: Reply: Re: Reply: Re: Reply: Re: [BUG]COLO failover hang
-
-hi:
-
-Yes, it is better.
-
-And should we delete
-
-#ifdef WIN32
-    QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL);
-#endif
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already has it.
-
-Original mail
-
-From: address@hidden
-To: 王广10165992
-Cc: address@hidden address@hidden address@hidden address@hidden
-Date: 2017-03-22 15:03
-Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: Reply: Re: [BUG]COLO failover hang
-
-Hi,
-
-On 2017/3/22 9:42, address@hidden wrote:
-> diff --git a/migration/socket.c b/migration/socket.c
-> index 13966f1..d65a0ea 100644
-> --- a/migration/socket.c
-> +++ b/migration/socket.c
-> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
->      }
->
->      trace_migration_socket_incoming_accepted();
->
->      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
-> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
->      migration_channel_process_incoming(migrate_get_current(),
->                                         QIO_CHANNEL(sioc));
->      object_unref(OBJECT(sioc));
->
-> Is this patch ok?
->
-
-Yes, I think this works, but a better way may be to call
-qio_channel_set_feature() in qio_channel_socket_accept(), since we didn't set
-the SHUTDOWN feature for the accepted socket fd.
-Or fix it like this:
-
-diff --git a/io/channel-socket.c b/io/channel-socket.c
-index f546c68..ce6894c 100644
---- a/io/channel-socket.c
-+++ b/io/channel-socket.c
-@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
-                           Error **errp)
- {
-     QIOChannelSocket *cioc;
--
--    cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
--    cioc->fd = -1;
-+
-+    cioc = qio_channel_socket_new();
-     cioc->remoteAddrLen = sizeof(ioc->remoteAddr);
-     cioc->localAddrLen = sizeof(ioc->localAddr);
-
-
-Thanks,
-Hailiang
-
-> I have tested it. The test does not hang any more.
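The two patches above fix the same gap in different places. The following toy model, which is not QEMU code, sketches why routing the accept path through the shared constructor is enough: `ToyChannel`, `TOY_FEATURE_SHUTDOWN`, and all `toy_*` functions are hypothetical names, and the rule that a shutdown request is ignored unless the channel advertises the feature is taken from the description in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature bit for this sketch; the value is illustrative. */
enum { TOY_FEATURE_SHUTDOWN = 1u << 1 };

typedef struct ToyChannel {
    int fd;
    unsigned features;   /* bitmask of TOY_FEATURE_* */
    bool is_shut_down;
} ToyChannel;

/* Bare construction: no features are advertised. */
static ToyChannel toy_channel_raw(void)
{
    ToyChannel c = { .fd = -1, .features = 0, .is_shut_down = false };
    return c;
}

/* Shared constructor: every channel built here advertises shutdown. */
static ToyChannel toy_channel_new(void)
{
    ToyChannel c = toy_channel_raw();
    c.features |= TOY_FEATURE_SHUTDOWN;
    return c;
}

/* A shutdown request only takes effect if the feature is advertised. */
static bool toy_channel_shutdown(ToyChannel *c)
{
    assert(c != NULL);
    if (!(c->features & TOY_FEATURE_SHUTDOWN)) {
        return false;    /* request is silently ignored */
    }
    c->is_shut_down = true;
    return true;
}

/* Accept path before the fix: builds the channel by hand. */
static ToyChannel toy_accept_before(int fd)
{
    ToyChannel c = toy_channel_raw();
    c.fd = fd;
    return c;
}

/* Accept path after the fix: reuses the shared constructor. */
static ToyChannel toy_accept_after(int fd)
{
    ToyChannel c = toy_channel_new();
    c.fd = fd;
    return c;
}
```

In this model, a channel returned by `toy_accept_before` can never be shut down, so a reader blocked on it would stay blocked during failover. Reusing the constructor also means any other per-channel setup done there is not duplicated in the accept path, which is the point raised about the `WIN32` event at the top of the thread.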
->
-> Original mail
->
-> From: address@hidden
-> To: address@hidden address@hidden
-> Cc: address@hidden address@hidden address@hidden
-> Date: 2017-03-22 09:11
-> Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: [BUG]COLO failover hang
->
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
-> > * Hailiang Zhang (address@hidden) wrote:
-> >> Hi,
-> >>
-> >> Thanks for reporting this. I confirmed it in my test, and it is a bug.
-> >>
-> >> Though we tried to call qemu_file_shutdown() to shut down the related fd, in
-> >> case the COLO thread/incoming thread is stuck in read/write() while doing
-> >> failover, it didn't take effect, because all the fds used by COLO (and
-> >> migration) have been wrapped by qio channel, and it will not call the
-> >> shutdown API if we didn't call qio_channel_set_feature(QIO_CHANNEL(sioc),
-> >> QIO_CHANNEL_FEATURE_SHUTDOWN).
-> >>
-> >> Cc: Dr. David Alan Gilbert address@hidden
-> >>
-> >> I suspect migration cancel has the same problem; it may be stuck in write()
-> >> if we try to cancel migration.
-> >>
-> >> void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error **errp)
-> >> {
-> >>     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
-> >>     migration_channel_connect(s, ioc, NULL);
-> >>     ... ...
-> >> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc),
-> >> QIO_CHANNEL_FEATURE_SHUTDOWN) above, and the
-> >> migrate_fd_cancel()
-> >> {
-> >>     ... ...
-> >>     if (s->state == MIGRATION_STATUS_CANCELLING && f) {
-> >>         qemu_file_shutdown(f);  --> This will not take effect. No ?
-> >>     }
-> >> }
-> >
-> > (cc'd in Daniel Berrange).
-> > I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)
-> > at the top of qio_channel_socket_new so I think that's safe, isn't it?
-> >
->
-> Hmm, you are right, this problem only exists for the migration incoming fd,
-> thanks.
->
-> > Dave
-> >
-> >> Thanks,
-> >> Hailiang
-> >>
-> >> On 2017/3/21 16:10, address@hidden wrote:
-> >>> Thank you.
-> >>>
-> >>> I have tested it already.
-> >>>
-> >>> When the Primary Node panics, the Secondary Node qemu hangs at the same
-> >>> place.
-> >>>
-> >>> According to http://wiki.qemu-project.org/Features/COLO , killing the
-> >>> Primary Node qemu will not reproduce the problem, but a Primary Node panic
-> >>> can.
-> >>>
-> >>> I think this is because the channel does not support
-> >>> QIO_CHANNEL_FEATURE_SHUTDOWN.
-> >>>
-> >>> On failover, channel_shutdown could not shut down the channel,
-> >>> so colo_process_incoming_thread hangs at recvmsg.
-> >>>
-> >>> I tested a patch:
-> >>>
-> >>> diff --git a/migration/socket.c b/migration/socket.c
-> >>> index 13966f1..d65a0ea 100644
-> >>> --- a/migration/socket.c
-> >>> +++ b/migration/socket.c
-> >>> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
-> >>>      }
-> >>>
-> >>>      trace_migration_socket_incoming_accepted();
-> >>>
-> >>>      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
-> >>> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
-> >>>      migration_channel_process_incoming(migrate_get_current(),
-> >>>                                         QIO_CHANNEL(sioc));
-> >>>      object_unref(OBJECT(sioc));
-> >>>
-> >>> My test does not hang any more.
-> >>>
-> >>> Original mail
-> >>>
-> >>> From: address@hidden
-> >>> To: 王广10165992 address@hidden
-> >>> Cc: address@hidden address@hidden
-> >>> Date: 2017-03-21 15:58
-> >>> Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang
-> >>>
-> >>> Hi, Wang.
-> >>>
-> >>> You can test this branch:
-> >>>
-> >>> https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
-> >>>
-> >>> and please follow the wiki to make sure your own configuration is correct.
-> >>>
-> >>> http://wiki.qemu-project.org/Features/COLO
-> >>>
-> >>> Thanks
-> >>>
-> >>> Zhang Chen
-> >>>
-> >>> On 03/21/2017 03:27 PM, address@hidden wrote:
-> >>> >
-> >>> > hi.
-> >>> >
-> >>> > I tested git qemu master and it has the same problem.
-> >>> >
-> >>> > (gdb) bt
-> >>> > #0  qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
-> >>> >     niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
-> >>> > #1  0x00007f658e4aa0c2 in qio_channel_read
-> >>> >     (address@hidden, address@hidden "",
-> >>> >     address@hidden, address@hidden) at io/channel.c:114
-> >>> > #2  0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
-> >>> >     buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at
-> >>> >     migration/qemu-file-channel.c:78
-> >>> > #3  0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at
-> >>> >     migration/qemu-file.c:295
-> >>> > #4  0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden,
-> >>> >     address@hidden) at migration/qemu-file.c:555
-> >>> > #5  0x00007f658e3ea34b in qemu_get_byte (address@hidden) at
-> >>> >     migration/qemu-file.c:568
-> >>> > #6  0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at
-> >>> >     migration/qemu-file.c:648
-> >>> > #7  0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800,
-> >>> >     address@hidden) at migration/colo.c:244
-> >>> > #8  0x00007f658e3e681e in colo_receive_check_message (f=<optimized out>,
-> >>> >     address@hidden, address@hidden)
-> >>> >     at migration/colo.c:264
-> >>> > #9  0x00007f658e3e740e in colo_process_incoming_thread
-> >>> >     (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
-> >>> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
-> >>> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
-> >>> >
-> >>> > (gdb) p ioc->features    # does not support QIO_CHANNEL_FEATURE_SHUTDOWN
-> >>> > $3 = 0
-> >>> >
-> >>> > (gdb) bt
-> >>> > #0  socket_accept_incoming_migration (ioc=0x7fdcceeafa90,
-> >>> >     condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
-> >>> > #1  0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at
-> >>> >     gmain.c:3054
-> >>> > #2  g_main_context_dispatch (context=<optimized out>,
-> >>> >     address@hidden) at gmain.c:3630
-> >>> > #3  0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
-> >>> > #4  os_host_main_loop_wait (timeout=<optimized out>) at
-> >>> >     util/main-loop.c:258
-> >>> > #5  main_loop_wait (address@hidden) at
-> >>> >     util/main-loop.c:506
-> >>> > #6  0x00007fdccb526187 in main_loop () at vl.c:1898
-> >>> > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
-> >>> >     out>) at vl.c:4709
-> >>> >
-> >>> > (gdb) p ioc->features
-> >>> > $1 = 6
-> >>> >
-> >>> > (gdb) p ioc->name
-> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
-> >>> >
-> >>> > Maybe socket_accept_incoming_migration should
-> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?
-> >>> >
-> >>> > thank you.
-> >>> >
-> >>> > Original mail
-> >>> > address@hidden
-> >>> > address@hidden
-> >>> > address@hidden@huawei.com>
-> >>> > *Date:* 2017-03-16 14:46
-> >>> > *Subject:* *Re: [Qemu-devel] COLO failover hang*
-> >>> >
-> >>> > On 03/15/2017 05:06 PM, wangguang wrote:
-> >>> > > I am testing the QEMU COLO feature described here [QEMU
-> >>> > > Wiki](http://wiki.qemu-project.org/Features/COLO).
-> >>> > >
-> >>> > > When the Primary Node panics, the Secondary Node qemu hangs.
-> >>> > > It hangs at recvmsg in qio_channel_socket_readv.
-> >>> > > I then ran { 'execute': 'nbd-server-stop' } and { "execute":
-> >>> > > "x-colo-lost-heartbeat" } in the Secondary VM's
-> >>> > > monitor, but the Secondary Node qemu still hangs at recvmsg.
-> >>> > >
-> >>> > > I found that COLO in qemu is not complete yet.
-> >>> > > Does COLO have any plan for development?
-> >>> >
-> >>> > Yes, we are developing it. You can see some of the patches we are pushing.
-> >>> >
-> >>> > > Has anyone ever run it successfully? Any help is appreciated!
-> >>> >
-> >>> > Our internal version can run it successfully.
-> >>> > For the failover details, you can ask Zhanghailiang for help.
-> >>> > Next time, if you have any questions about COLO,
-> >>> > please cc me and zhanghailiang address@hidden
-> >>> >
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>> > >
-> >>> > > centos7.2+qemu2.7.50
-> >>> > > (gdb) bt
-> >>> > > #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
-> >>> > > #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>,
-> >>> > >     iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at
-> >>> > >     io/channel-socket.c:497
-> >>> > > #2  0x00007f3e03329472 in qio_channel_read (address@hidden,
-> >>> > >     address@hidden "", address@hidden,
-> >>> > >     address@hidden) at io/channel.c:97
-> >>> > > #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
-> >>> > >     buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at
-> >>> > >     migration/qemu-file-channel.c:78
-> >>> > > #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at
-> >>> > >     migration/qemu-file.c:257
-> >>> > > #5  0x00007f3e03274a41 in qemu_peek_byte (address@hidden,
-> >>> > >     address@hidden) at migration/qemu-file.c:510
-> >>> > > #6  0x00007f3e03274aab in qemu_get_byte (address@hidden) at
-> >>> > >     migration/qemu-file.c:523
-> >>> > > #7  0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at
-> >>> > >     migration/qemu-file.c:603
-> >>> > > #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00,
-> >>> > >     address@hidden) at migration/colo.c:215
-> >>> > > #9  0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48,
-> >>> > >     checkpoint_request=<synthetic pointer>, f=<optimized out>) at
-> >>> > >     migration/colo.c:546
-> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at
-> >>> > >     migration/colo.c:649
-> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
-> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
-> >>> > >
-> >>> > > --
-> >>> > > View this message in context:
-> >>> > > http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
-> >>> > > Sent from the Developer mailing list archive at Nabble.com.
-> >>> > >
-> >>> >
-> >>> > --
-> >>> > Thanks
-> >>> > Zhang Chen
-> >>> >
-> >>>
-> >>
-> > --
-> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
-> >
-> > .
-> >
->
-
-On 2017/3/22 16:09, address@hidden wrote:
-hi:
-
-Yes, it is better.
-
-And should we delete
-Yes, you are right.
-#ifdef WIN32
-    QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL);
-#endif
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already has it.
- - - - - - - - - - - - -åå§é®ä»¶ - - - -åä»¶äººï¼ address@hidden -æ¶ä»¶äººï¼ç广10165992 -æéäººï¼ address@hidden address@hidden address@hidden address@hidden -æ¥ æ ï¼2017å¹´03æ22æ¥ 15:03 -主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: çå¤: Re: çå¤: Re: [BUG]COLO failover hang - - - - - -Hi, - -On 2017/3/22 9:42, address@hidden wrote: -ï¼ diff --git a/migration/socket.c b/migration/socket.c -ï¼ -ï¼ -ï¼ index 13966f1..d65a0ea 100644 -ï¼ -ï¼ -ï¼ --- a/migration/socket.c -ï¼ -ï¼ -ï¼ +++ b/migration/socket.c -ï¼ -ï¼ -ï¼ @@ -147,8 +147,9 @@ static gboolean -socket_accept_incoming_migration(QIOChannel *ioc, -ï¼ -ï¼ -ï¼ } -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ trace_migration_socket_incoming_accepted() -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming") -ï¼ -ï¼ -ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) -ï¼ -ï¼ -ï¼ migration_channel_process_incoming(migrate_get_current(), -ï¼ -ï¼ -ï¼ QIO_CHANNEL(sioc)) -ï¼ -ï¼ -ï¼ object_unref(OBJECT(sioc)) -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ Is this patch ok? -ï¼ - -Yes, i think this works, but a better way maybe to call -qio_channel_set_feature() -in qio_channel_socket_accept(), we didn't set the SHUTDOWN feature for the -socket accept fd, -Or fix it by this: - -diff --git a/io/channel-socket.c b/io/channel-socket.c -index f546c68..ce6894c 100644 ---- a/io/channel-socket.c -+++ b/io/channel-socket.c -@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc, - Error **errp) - { - QIOChannelSocket *cioc -- -- cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)) -- cioc-ï¼fd = -1 -+ -+ cioc = qio_channel_socket_new() - cioc-ï¼remoteAddrLen = sizeof(ioc-ï¼remoteAddr) - cioc-ï¼localAddrLen = sizeof(ioc-ï¼localAddr) - - -Thanks, -Hailiang - -ï¼ I have test it . The test could not hang any more. 
-ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ åå§é®ä»¶ -ï¼ -ï¼ -ï¼ -ï¼ åä»¶äººï¼ address@hidden -ï¼ æ¶ä»¶äººï¼ address@hidden address@hidden -ï¼ æéäººï¼ address@hidden address@hidden address@hidden -ï¼ æ¥ æ ï¼2017å¹´03æ22æ¥ 09:11 -ï¼ ä¸» é¢ ï¼Re: [Qemu-devel] çå¤: Re: çå¤: Re: [BUG]COLO failover hang -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ -ï¼ On 2017/3/21 19:56, Dr. David Alan Gilbert wrote: -ï¼ ï¼ * Hailiang Zhang (address@hidden) wrote: -ï¼ ï¼ï¼ Hi, -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ Thanks for reporting this, and i confirmed it in my test, and it is a bug. -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ Though we tried to call qemu_file_shutdown() to shutdown the related fd, in -ï¼ ï¼ï¼ case COLO thread/incoming thread is stuck in read/write() while do -failover, -ï¼ ï¼ï¼ but it didn't take effect, because all the fd used by COLO (also migration) -ï¼ ï¼ï¼ has been wrapped by qio channel, and it will not call the shutdown API if -ï¼ ï¼ï¼ we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN). -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ Cc: Dr. David Alan Gilbert address@hidden -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ I doubted migration cancel has the same problem, it may be stuck in write() -ï¼ ï¼ï¼ if we tried to cancel migration. -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ void fd_start_outgoing_migration(MigrationState *s, const char *fdname, -Error **errp) -ï¼ ï¼ï¼ { -ï¼ ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing") -ï¼ ï¼ï¼ migration_channel_connect(s, ioc, NULL) -ï¼ ï¼ï¼ ... ... -ï¼ ï¼ï¼ We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) above, -ï¼ ï¼ï¼ and the -ï¼ ï¼ï¼ migrate_fd_cancel() -ï¼ ï¼ï¼ { -ï¼ ï¼ï¼ ... ... -ï¼ ï¼ï¼ if (s-ï¼state == MIGRATION_STATUS_CANCELLING && f) { -ï¼ ï¼ï¼ qemu_file_shutdown(f) --ï¼ This will not take effect. No ? -ï¼ ï¼ï¼ } -ï¼ ï¼ï¼ } -ï¼ ï¼ -ï¼ ï¼ (cc'd in Daniel Berrange). -ï¼ ï¼ I see that we call qio_channel_set_feature(ioc, -QIO_CHANNEL_FEATURE_SHUTDOWN) at the -ï¼ ï¼ top of qio_channel_socket_new so I think that's safe isn't it? 
-ï¼ ï¼ -ï¼ -ï¼ Hmm, you are right, this problem is only exist for the migration incoming fd, -thanks. -ï¼ -ï¼ ï¼ Dave -ï¼ ï¼ -ï¼ ï¼ï¼ Thanks, -ï¼ ï¼ï¼ Hailiang -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ On 2017/3/21 16:10, address@hidden wrote: -ï¼ ï¼ï¼ï¼ Thank youã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I have test areadyã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ When the Primary Node panic,the Secondary Node qemu hang at the same -placeã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Incorrding -http://wiki.qemu-project.org/Features/COLO -ï¼kill Primary Node -qemu will not produce the problem,but Primary Node panic canã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I think due to the feature of channel does not support -QIO_CHANNEL_FEATURE_SHUTDOWN. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ when failover,channel_shutdown could not shut down the channel. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ so the colo_process_incoming_thread will hang at recvmsg. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I test a patch: -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ diff --git a/migration/socket.c b/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ index 13966f1..d65a0ea 100644 -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ --- a/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ +++ b/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ @@ -147,8 +147,9 @@ static gboolean -socket_accept_incoming_migration(QIOChannel *ioc, -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ } -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ trace_migration_socket_incoming_accepted() -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), -"migration-socket-incoming") -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ migration_channel_process_incoming(migrate_get_current(), -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ QIO_CHANNEL(sioc)) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ object_unref(OBJECT(sioc)) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ My test will not hang any more. 
-ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ åå§é®ä»¶ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ åä»¶äººï¼ address@hidden -ï¼ ï¼ï¼ï¼ æ¶ä»¶äººï¼ç广10165992 address@hidden -ï¼ ï¼ï¼ï¼ æéäººï¼ address@hidden address@hidden -ï¼ ï¼ï¼ï¼ æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 -ï¼ ï¼ï¼ï¼ 主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Hi,Wang. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ You can test this branch: -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ and please follow wiki ensure your own configuration correctly. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -http://wiki.qemu-project.org/Features/COLO -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Thanks -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Zhang Chen -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ On 03/21/2017 03:27 PM, address@hidden wrote: -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ hi. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ I test the git qemu master have the same problem. 
-ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -ï¼ ï¼ï¼ï¼ ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read -ï¼ ï¼ï¼ï¼ ï¼ (address@hidden, address@hidden "", -ï¼ ï¼ï¼ï¼ ï¼ address@hidden, address@hidden) at io/channel.c:114 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, -ï¼ ï¼ï¼ï¼ ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file-channel.c:78 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:295 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/qemu-file.c:555 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:568 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:648 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/colo.c:244 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized -ï¼ ï¼ï¼ï¼ ï¼ outï¼, address@hidden, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ at migration/colo.c:264 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread -ï¼ ï¼ï¼ï¼ ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ 
>>> > (gdb) p ioc->features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -> >>> > -> >>> > $3 = 0 -> >>> > -> >>> > -> >>> > (gdb) bt -> >>> > -> >>> > #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -> >>> > condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -> >>> > -> >>> > #1 0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at -> >>> > gmain.c:3054 -> >>> > -> >>> > #2 g_main_context_dispatch (context=<optimized out>, -> >>> > address@hidden) at gmain.c:3630 -> >>> > -> >>> > #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -> >>> > -> >>> > #4 os_host_main_loop_wait (timeout=<optimized out>) at -> >>> > util/main-loop.c:258 -> >>> > -> >>> > #5 main_loop_wait (address@hidden) at -> >>> > util/main-loop.c:506 -> >>> > -> >>> > #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -> >>> > -> >>> > #7 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized -> >>> > out>) at vl.c:4709 -> >>> > -> >>> > (gdb) p ioc->features -> >>> > -> >>> > $1 = 6 -> >>> > -> >>> > (gdb) p ioc->name -> >>> > -> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -> >>> > -> >>> > -> >>> > May be socket_accept_incoming_migration should -> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -> >>> > -> >>> > -> >>> > thank you.
-> >>> > -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > Original Mail -> >>> > address@hidden -> >>> > address@hidden -> >>> > address@hidden@huawei.com> -> >>> > *Date:* 2017-03-16 14:46 -> >>> > *Subject:* *Re: [Qemu-devel] COLO failover hang* -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > On 03/15/2017 05:06 PM, wangguang wrote: -> >>> > > am testing QEMU COLO feature described here [QEMU -> >>> > > Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -> >>> > > -> >>> > > When the Primary Node panic,the Secondary Node qemu hang. -> >>> > > hang at recvmsg in qio_channel_socket_readv. -> >>> > > And I run { 'execute': 'nbd-server-stop' } and { "execute": -> >>> > > "x-colo-lost-heartbeat" } in Secondary VM's -> >>> > > monitor,the Secondary Node qemu still hang at recvmsg . -> >>> > > -> >>> > > I found that the colo in qemu is not complete yet. -> >>> > > Do the colo have any plan for development? -> >>> > -> >>> > Yes, We are developing. You can see some of patch we pushing. -> >>> > -> >>> > > Has anyone ever run it successfully? Any help is appreciated! -> >>> > -> >>> > In our internal version can run it successfully, -> >>> > The failover detail you can ask Zhanghailiang for help.
-> >>> > Next time if you have some question about COLO, -> >>> > please cc me and zhanghailiang address@hidden -> >>> > -> >>> > -> >>> > Thanks -> >>> > Zhang Chen -> >>> > -> >>> > -> >>> > > -> >>> > > -> >>> > > -> >>> > > centos7.2+qemu2.7.50 -> >>> > > (gdb) bt -> >>> > > #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 -> >>> > > #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized -out>, -> >>> > > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, -errp=0x0) at -> >>> > > io/channel-socket.c:497 -> >>> > > #2 0x00007f3e03329472 in qio_channel_read (address@hidden, -> >>> > > address@hidden "", address@hidden, -> >>> > > address@hidden) at io/channel.c:97 -> >>> > > #3 0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>, -> >>> > > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at -> >>> > > migration/qemu-file-channel.c:78 -> >>> > > #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at -> >>> > > migration/qemu-file.c:257 -> >>> > > #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, -> >>> > > address@hidden) at migration/qemu-file.c:510 -> >>> > > #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at -> >>> > > migration/qemu-file.c:523 -> >>> > > #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at -> >>> > > migration/qemu-file.c:603 -> >>> > > #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, -> >>> > > address@hidden) at migration/colo.c:215 -> >>> > > #9 0x00007f3e0327250d in colo_wait_handle_message -(errp=0x7f3d62bfaa48, -> >>> > > checkpoint_request=<synthetic pointer>, f=<optimized out>) at -> >>> > > migration/colo.c:546 -> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at -> >>> > >
migration/colo.c:649 -> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 -> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc..so.6 -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > > -- -> >>> > > View this message in context: -http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html -> >>> > > Sent from the Developer mailing list archive at Nabble.com. -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > -> >>> > -- -> >>> > Thanks -> >>> > Zhang Chen -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > -> >>> -> >> -> > -- -> > Dr. David Alan Gilbert / address@hidden / Manchester, UK -> > -> > . -> > -> - diff --git a/classification_output/05/mistranslation/74466963 b/classification_output/05/mistranslation/74466963 deleted file mode 100644 index ceba0270..00000000 --- a/classification_output/05/mistranslation/74466963 +++ /dev/null @@ -1,1886 +0,0 @@ -mistranslation: 0.927 -assembly: 0.910 -device: 0.909 -instruction: 0.903 -KVM: 0.903 -graphic: 0.895 -boot: 0.894 -semantic: 0.891 -socket: 0.879 -vnc: 0.878 -other: 0.877 -network: 0.871 - -[Qemu-devel] [TCG only][Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration - -Hi all, - -Does anyboday remember the similar issue post by hailiang months ago -http://patchwork.ozlabs.org/patch/454322/ -At least tow bugs about migration had been fixed since that. -And now we found the same issue at the tcg vm(kvm is fine), after -migration, the content VM's memory is inconsistent. -we add a patch to check memory content, you can find it from affix - -steps to reporduce: -1) apply the patch and re-build qemu -2) prepare the ubuntu guest and run memtest in grub.
-soruce side: -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -pc-i440fx-2.3,accel=tcg,usb=off -destination side: -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881 -3) start migration -with 1000M NIC, migration will finish within 3 min. - -at source: -(qemu) migrate tcp:192.168.2.66:8881 -after saving ram complete -e9e725df678d392b1a83b3a917f332bb -qemu-system-x86_64: end ram md5 -(qemu) - -at destination: -...skip... -Completed load of VM with exit code 0 seq iteration 1264 -Completed load of VM with exit code 0 seq iteration 1265 -Completed load of VM with exit code 0 seq iteration 1266 -qemu-system-x86_64: after loading state section id 2(ram) -49c2dac7bde0e5e22db7280dcb3824f9 -qemu-system-x86_64: end ram md5 -qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init - -49c2dac7bde0e5e22db7280dcb3824f9 -qemu-system-x86_64: end ram md5 - -This occurs occasionally and only at tcg machine. It seems that -some pages dirtied in source side don't transferred to destination. -This problem can be reproduced even if we disable virtio. -Is it OK for some pages that not transferred to destination when do -migration ? Or is it a bug? -Any idea... 
- -=================md5 check patch============================= - -diff --git a/Makefile.target b/Makefile.target -index 962d004..e2cb8e9 100644 ---- a/Makefile.target -+++ b/Makefile.target -@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o - obj-y += memory_mapping.o - obj-y += dump.o - obj-y += migration/ram.o migration/savevm.o --LIBS := $(libs_softmmu) $(LIBS) -+LIBS := $(libs_softmmu) $(LIBS) -lplumb - - # xen support - obj-$(CONFIG_XEN) += xen-common.o -diff --git a/migration/ram.c b/migration/ram.c -index 1eb155a..3b7a09d 100644 ---- a/migration/ram.c -+++ b/migration/ram.c -@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int -version_id) -} - - rcu_read_unlock(); -- DPRINTF("Completed load of VM with exit code %d seq iteration " -+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration " - "%" PRIu64 "\n", ret, seq_iter); - return ret; - } -diff --git a/migration/savevm.c b/migration/savevm.c -index 0ad1b93..3feaa61 100644 ---- a/migration/savevm.c -+++ b/migration/savevm.c -@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f) - - } - -+#include "exec/ram_addr.h" -+#include "qemu/rcu_queue.h" -+#include <clplumbing/md5.h> -+#ifndef MD5_DIGEST_LENGTH -+#define MD5_DIGEST_LENGTH 16 -+#endif -+ -+static void check_host_md5(void) -+{ -+ int i; -+ unsigned char md[MD5_DIGEST_LENGTH]; -+ rcu_read_lock(); -+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check -'pc.ram' block */ -+ rcu_read_unlock(); -+ -+ MD5(block->host, block->used_length, md); -+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) { -+ fprintf(stderr, "%02x", md[i]); -+ } -+ fprintf(stderr, "\n"); -+ error_report("end ram md5"); -+} -+ - void qemu_savevm_state_begin(QEMUFile *f, - const MigrationParams *params) - { -@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile -*f, bool iterable_only) -save_section_header(f, se, QEMU_VM_SECTION_END); - - ret = se->ops->save_live_complete_precopy(f, se->opaque); -+ -+ fprintf(stderr, 
"after saving %s complete\n", se->idstr); -+ check_host_md5(); -+ - trace_savevm_section_end(se->idstr, se->section_id, ret); - save_section_footer(f, se); - if (ret < 0) { -@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f, -MigrationIncomingState *mis) -section_id, le->se->idstr); - return ret; - } -+ if (section_type == QEMU_VM_SECTION_END) { -+ error_report("after loading state section id %d(%s)", -+ section_id, le->se->idstr); -+ check_host_md5(); -+ } - if (!check_section_footer(f, le)) { - return -EINVAL; - } -@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f) - } - - cpu_synchronize_all_post_init(); -+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__); -+ check_host_md5(); - - return ret; - } - -* Li Zhijian (address@hidden) wrote: -> -Hi all, -> -> -Does anyboday remember the similar issue post by hailiang months ago -> -http://patchwork.ozlabs.org/patch/454322/ -> -At least tow bugs about migration had been fixed since that. -Yes, I wondered what happened to that. - -> -And now we found the same issue at the tcg vm(kvm is fine), after migration, -> -the content VM's memory is inconsistent. -Hmm, TCG only - I don't know much about that; but I guess something must -be accessing memory without using the proper macros/functions so -it doesn't mark it as dirty. - -> -we add a patch to check memory content, you can find it from affix -> -> -steps to reporduce: -> -1) apply the patch and re-build qemu -> -2) prepare the ubuntu guest and run memtest in grub. 
-> -soruce side: -> -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -> -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -> -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -> -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -> --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -> -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -> -pc-i440fx-2.3,accel=tcg,usb=off -> -> -destination side: -> -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -> -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -> -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -> -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -> --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -> -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -> -pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881 -> -> -3) start migration -> -with 1000M NIC, migration will finish within 3 min. -> -> -at source: -> -(qemu) migrate tcp:192.168.2.66:8881 -> -after saving ram complete -> -e9e725df678d392b1a83b3a917f332bb -> -qemu-system-x86_64: end ram md5 -> -(qemu) -> -> -at destination: -> -...skip... -> -Completed load of VM with exit code 0 seq iteration 1264 -> -Completed load of VM with exit code 0 seq iteration 1265 -> -Completed load of VM with exit code 0 seq iteration 1266 -> -qemu-system-x86_64: after loading state section id 2(ram) -> -49c2dac7bde0e5e22db7280dcb3824f9 -> -qemu-system-x86_64: end ram md5 -> -qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init -> -> -49c2dac7bde0e5e22db7280dcb3824f9 -> -qemu-system-x86_64: end ram md5 -> -> -This occurs occasionally and only at tcg machine. It seems that -> -some pages dirtied in source side don't transferred to destination. -> -This problem can be reproduced even if we disable virtio. 
-> -> -Is it OK for some pages that not transferred to destination when do -> -migration ? Or is it a bug? -I'm pretty sure that means it's a bug. Hard to find though, I guess -at least memtest is smaller than a big OS. I think I'd dump the whole -of memory on both sides, hexdump and diff them - I'd guess it would -just be one byte/word different, maybe that would offer some idea what -wrote it. - -Dave - -> -Any idea... -> -> -=================md5 check patch============================= -> -> -diff --git a/Makefile.target b/Makefile.target -> -index 962d004..e2cb8e9 100644 -> ---- a/Makefile.target -> -+++ b/Makefile.target -> -@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o -> -obj-y += memory_mapping.o -> -obj-y += dump.o -> -obj-y += migration/ram.o migration/savevm.o -> --LIBS := $(libs_softmmu) $(LIBS) -> -+LIBS := $(libs_softmmu) $(LIBS) -lplumb -> -> -# xen support -> -obj-$(CONFIG_XEN) += xen-common.o -> -diff --git a/migration/ram.c b/migration/ram.c -> -index 1eb155a..3b7a09d 100644 -> ---- a/migration/ram.c -> -+++ b/migration/ram.c -> -@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int -> -version_id) -> -} -> -> -rcu_read_unlock(); -> -- DPRINTF("Completed load of VM with exit code %d seq iteration " -> -+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration " -> -"%" PRIu64 "\n", ret, seq_iter); -> -return ret; -> -} -> -diff --git a/migration/savevm.c b/migration/savevm.c -> -index 0ad1b93..3feaa61 100644 -> ---- a/migration/savevm.c -> -+++ b/migration/savevm.c -> -@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f) -> -> -} -> -> -+#include "exec/ram_addr.h" -> -+#include "qemu/rcu_queue.h" -> -+#include <clplumbing/md5.h> -> -+#ifndef MD5_DIGEST_LENGTH -> -+#define MD5_DIGEST_LENGTH 16 -> -+#endif -> -+ -> -+static void check_host_md5(void) -> -+{ -> -+ int i; -> -+ unsigned char md[MD5_DIGEST_LENGTH]; -> -+ rcu_read_lock(); -> -+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check 
-> -'pc.ram' block */ -> -+ rcu_read_unlock(); -> -+ -> -+ MD5(block->host, block->used_length, md); -> -+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) { -> -+ fprintf(stderr, "%02x", md[i]); -> -+ } -> -+ fprintf(stderr, "\n"); -> -+ error_report("end ram md5"); -> -+} -> -+ -> -void qemu_savevm_state_begin(QEMUFile *f, -> -const MigrationParams *params) -> -{ -> -@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f, -> -bool iterable_only) -> -save_section_header(f, se, QEMU_VM_SECTION_END); -> -> -ret = se->ops->save_live_complete_precopy(f, se->opaque); -> -+ -> -+ fprintf(stderr, "after saving %s complete\n", se->idstr); -> -+ check_host_md5(); -> -+ -> -trace_savevm_section_end(se->idstr, se->section_id, ret); -> -save_section_footer(f, se); -> -if (ret < 0) { -> -@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f, -> -MigrationIncomingState *mis) -> -section_id, le->se->idstr); -> -return ret; -> -} -> -+ if (section_type == QEMU_VM_SECTION_END) { -> -+ error_report("after loading state section id %d(%s)", -> -+ section_id, le->se->idstr); -> -+ check_host_md5(); -> -+ } -> -if (!check_section_footer(f, le)) { -> -return -EINVAL; -> -} -> -@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f) -> -} -> -> -cpu_synchronize_all_post_init(); -> -+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__); -> -+ check_host_md5(); -> -> -return ret; -> -} -> -> -> --- -Dr. David Alan Gilbert / address@hidden / Manchester, UK - -On 2015/12/3 17:24, Dr. David Alan Gilbert wrote: -* Li Zhijian (address@hidden) wrote: -Hi all, - -Does anyboday remember the similar issue post by hailiang months ago -http://patchwork.ozlabs.org/patch/454322/ -At least tow bugs about migration had been fixed since that. -Yes, I wondered what happened to that. -And now we found the same issue at the tcg vm(kvm is fine), after migration, -the content VM's memory is inconsistent. 
-Hmm, TCG only - I don't know much about that; but I guess something must -be accessing memory without using the proper macros/functions so -it doesn't mark it as dirty. -we add a patch to check memory content, you can find it from affix - -steps to reporduce: -1) apply the patch and re-build qemu -2) prepare the ubuntu guest and run memtest in grub. -soruce side: -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -pc-i440fx-2.3,accel=tcg,usb=off - -destination side: -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881 - -3) start migration -with 1000M NIC, migration will finish within 3 min. - -at source: -(qemu) migrate tcp:192.168.2.66:8881 -after saving ram complete -e9e725df678d392b1a83b3a917f332bb -qemu-system-x86_64: end ram md5 -(qemu) - -at destination: -...skip... 
-Completed load of VM with exit code 0 seq iteration 1264 -Completed load of VM with exit code 0 seq iteration 1265 -Completed load of VM with exit code 0 seq iteration 1266 -qemu-system-x86_64: after loading state section id 2(ram) -49c2dac7bde0e5e22db7280dcb3824f9 -qemu-system-x86_64: end ram md5 -qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init - -49c2dac7bde0e5e22db7280dcb3824f9 -qemu-system-x86_64: end ram md5 - -This occurs occasionally and only at tcg machine. It seems that -some pages dirtied in source side don't transferred to destination. -This problem can be reproduced even if we disable virtio. - -Is it OK for some pages that not transferred to destination when do -migration ? Or is it a bug? -I'm pretty sure that means it's a bug. Hard to find though, I guess -at least memtest is smaller than a big OS. I think I'd dump the whole -of memory on both sides, hexdump and diff them - I'd guess it would -just be one byte/word different, maybe that would offer some idea what -wrote it. -Maybe one better way to do that is with the help of userfaultfd's write-protect -capability. It is still in the development by Andrea Arcangeli, but there -is a RFC version available, please refer to -http://www.spinics.net/lists/linux-mm/msg97422.html -(I'm developing live memory snapshot which based on it, maybe this is another -scene where we -can use userfaultfd's WP ;) ). -Dave -Any idea...
- -=================md5 check patch============================= - -diff --git a/Makefile.target b/Makefile.target -index 962d004..e2cb8e9 100644 ---- a/Makefile.target -+++ b/Makefile.target -@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o - obj-y += memory_mapping.o - obj-y += dump.o - obj-y += migration/ram.o migration/savevm.o --LIBS := $(libs_softmmu) $(LIBS) -+LIBS := $(libs_softmmu) $(LIBS) -lplumb - - # xen support - obj-$(CONFIG_XEN) += xen-common.o -diff --git a/migration/ram.c b/migration/ram.c -index 1eb155a..3b7a09d 100644 ---- a/migration/ram.c -+++ b/migration/ram.c -@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int -version_id) - } - - rcu_read_unlock(); -- DPRINTF("Completed load of VM with exit code %d seq iteration " -+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration " - "%" PRIu64 "\n", ret, seq_iter); - return ret; - } -diff --git a/migration/savevm.c b/migration/savevm.c -index 0ad1b93..3feaa61 100644 ---- a/migration/savevm.c -+++ b/migration/savevm.c -@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f) - - } - -+#include "exec/ram_addr.h" -+#include "qemu/rcu_queue.h" -+#include <clplumbing/md5.h> -+#ifndef MD5_DIGEST_LENGTH -+#define MD5_DIGEST_LENGTH 16 -+#endif -+ -+static void check_host_md5(void) -+{ -+ int i; -+ unsigned char md[MD5_DIGEST_LENGTH]; -+ rcu_read_lock(); -+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check -'pc.ram' block */ -+ rcu_read_unlock(); -+ -+ MD5(block->host, block->used_length, md); -+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) { -+ fprintf(stderr, "%02x", md[i]); -+ } -+ fprintf(stderr, "\n"); -+ error_report("end ram md5"); -+} -+ - void qemu_savevm_state_begin(QEMUFile *f, - const MigrationParams *params) - { -@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f, -bool iterable_only) - save_section_header(f, se, QEMU_VM_SECTION_END); - - ret = se->ops->save_live_complete_precopy(f, se->opaque); -+ -+ fprintf(stderr, 
"after saving %s complete\n", se->idstr); -+ check_host_md5(); -+ - trace_savevm_section_end(se->idstr, se->section_id, ret); - save_section_footer(f, se); - if (ret < 0) { -@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f, -MigrationIncomingState *mis) - section_id, le->se->idstr); - return ret; - } -+ if (section_type == QEMU_VM_SECTION_END) { -+ error_report("after loading state section id %d(%s)", -+ section_id, le->se->idstr); -+ check_host_md5(); -+ } - if (!check_section_footer(f, le)) { - return -EINVAL; - } -@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f) - } - - cpu_synchronize_all_post_init(); -+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__); -+ check_host_md5(); - - return ret; - } --- -Dr. David Alan Gilbert / address@hidden / Manchester, UK - -. - -On 12/03/2015 05:37 PM, Hailiang Zhang wrote: -On 2015/12/3 17:24, Dr. David Alan Gilbert wrote: -* Li Zhijian (address@hidden) wrote: -Hi all, - -Does anyboday remember the similar issue post by hailiang months ago -http://patchwork.ozlabs.org/patch/454322/ -At least tow bugs about migration had been fixed since that. -Yes, I wondered what happened to that. -And now we found the same issue at the tcg vm(kvm is fine), after -migration, -the content VM's memory is inconsistent. -Hmm, TCG only - I don't know much about that; but I guess something must -be accessing memory without using the proper macros/functions so -it doesn't mark it as dirty. -we add a patch to check memory content, you can find it from affix - -steps to reporduce: -1) apply the patch and re-build qemu -2) prepare the ubuntu guest and run memtest in grub. 
-soruce side: -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 - --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -pc-i440fx-2.3,accel=tcg,usb=off - -destination side: -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 - --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881 - -3) start migration -with 1000M NIC, migration will finish within 3 min. - -at source: -(qemu) migrate tcp:192.168.2.66:8881 -after saving ram complete -e9e725df678d392b1a83b3a917f332bb -qemu-system-x86_64: end ram md5 -(qemu) - -at destination: -...skip... -Completed load of VM with exit code 0 seq iteration 1264 -Completed load of VM with exit code 0 seq iteration 1265 -Completed load of VM with exit code 0 seq iteration 1266 -qemu-system-x86_64: after loading state section id 2(ram) -49c2dac7bde0e5e22db7280dcb3824f9 -qemu-system-x86_64: end ram md5 -qemu-system-x86_64: qemu_loadvm_state: after -cpu_synchronize_all_post_init - -49c2dac7bde0e5e22db7280dcb3824f9 -qemu-system-x86_64: end ram md5 - -This occurs occasionally and only at tcg machine. It seems that -some pages dirtied in source side don't transferred to destination. -This problem can be reproduced even if we disable virtio. - -Is it OK for some pages that not transferred to destination when do -migration ? Or is it a bug? -I'm pretty sure that means it's a bug. 
Hard to find though, I guess -at least memtest is smaller than a big OS. I think I'd dump the whole -of memory on both sides, hexdump and diff them - I'd guess it would -just be one byte/word different, maybe that would offer some idea what -wrote it. -Maybe one better way to do that is with the help of userfaultfd's -write-protect -capability. It is still in the development by Andrea Arcangeli, but there -is a RFC version available, please refer to -http://www.spinics.net/lists/linux-mm/msg97422.html -(I'm developing live memory snapshot which based on it, maybe this is -another scene where we -can use userfaultfd's WP ;) ). -sounds good. - -thanks -Li -Dave -Any idea... - -=================md5 check patch============================= - -diff --git a/Makefile.target b/Makefile.target -index 962d004..e2cb8e9 100644 ---- a/Makefile.target -+++ b/Makefile.target -@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o - obj-y += memory_mapping.o - obj-y += dump.o - obj-y += migration/ram.o migration/savevm.o --LIBS := $(libs_softmmu) $(LIBS) -+LIBS := $(libs_softmmu) $(LIBS) -lplumb - - # xen support - obj-$(CONFIG_XEN) += xen-common.o -diff --git a/migration/ram.c b/migration/ram.c -index 1eb155a..3b7a09d 100644 ---- a/migration/ram.c -+++ b/migration/ram.c -@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int -version_id) - } - - rcu_read_unlock(); -- DPRINTF("Completed load of VM with exit code %d seq iteration " -+ fprintf(stderr, "Completed load of VM with exit code %d seq -iteration " - "%" PRIu64 "\n", ret, seq_iter); - return ret; - } -diff --git a/migration/savevm.c b/migration/savevm.c -index 0ad1b93..3feaa61 100644 ---- a/migration/savevm.c -+++ b/migration/savevm.c -@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f) - - } - -+#include "exec/ram_addr.h" -+#include "qemu/rcu_queue.h" -+#include <clplumbing/md5.h> -+#ifndef MD5_DIGEST_LENGTH -+#define MD5_DIGEST_LENGTH 16 -+#endif -+ -+static void check_host_md5(void) -+{ -+ int i;
-+ unsigned char md[MD5_DIGEST_LENGTH]; -+ rcu_read_lock(); -+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check -'pc.ram' block */ -+ rcu_read_unlock(); -+ -+ MD5(block->host, block->used_length, md); -+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) { -+ fprintf(stderr, "%02x", md[i]); -+ } -+ fprintf(stderr, "\n"); -+ error_report("end ram md5"); -+} -+ - void qemu_savevm_state_begin(QEMUFile *f, - const MigrationParams *params) - { -@@ -1056,6 +1079,10 @@ void -qemu_savevm_state_complete_precopy(QEMUFile *f, -bool iterable_only) - save_section_header(f, se, QEMU_VM_SECTION_END); - - ret = se->ops->save_live_complete_precopy(f, se->opaque); -+ -+ fprintf(stderr, "after saving %s complete\n", se->idstr); -+ check_host_md5(); -+ - trace_savevm_section_end(se->idstr, se->section_id, ret); - save_section_footer(f, se); - if (ret < 0) { -@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f, -MigrationIncomingState *mis) - section_id, le->se->idstr); - return ret; - } -+ if (section_type == QEMU_VM_SECTION_END) { -+ error_report("after loading state section id %d(%s)", -+ section_id, le->se->idstr); -+ check_host_md5(); -+ } - if (!check_section_footer(f, le)) { - return -EINVAL; - } -@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f) - } - - cpu_synchronize_all_post_init(); -+ error_report("%s: after cpu_synchronize_all_post_init\n", -__func__); -+ check_host_md5(); - - return ret; - } --- -Dr. David Alan Gilbert / address@hidden / Manchester, UK - -. -. --- -Best regards. -Li Zhijian (8555) - -On 12/03/2015 05:24 PM, Dr. David Alan Gilbert wrote: -* Li Zhijian (address@hidden) wrote: -Hi all, - -Does anyboday remember the similar issue post by hailiang months ago -http://patchwork.ozlabs.org/patch/454322/ -At least tow bugs about migration had been fixed since that. -Yes, I wondered what happened to that. -And now we found the same issue at the tcg vm(kvm is fine), after migration, -the content VM's memory is inconsistent. 
-Hmm, TCG only - I don't know much about that; but I guess something must -be accessing memory without using the proper macros/functions so -it doesn't mark it as dirty. -we add a patch to check memory content, you can find it from affix - -steps to reporduce: -1) apply the patch and re-build qemu -2) prepare the ubuntu guest and run memtest in grub. -soruce side: -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -pc-i440fx-2.3,accel=tcg,usb=off - -destination side: -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881 - -3) start migration -with 1000M NIC, migration will finish within 3 min. - -at source: -(qemu) migrate tcp:192.168.2.66:8881 -after saving ram complete -e9e725df678d392b1a83b3a917f332bb -qemu-system-x86_64: end ram md5 -(qemu) - -at destination: -...skip... 
-Completed load of VM with exit code 0 seq iteration 1264 -Completed load of VM with exit code 0 seq iteration 1265 -Completed load of VM with exit code 0 seq iteration 1266 -qemu-system-x86_64: after loading state section id 2(ram) -49c2dac7bde0e5e22db7280dcb3824f9 -qemu-system-x86_64: end ram md5 -qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init - -49c2dac7bde0e5e22db7280dcb3824f9 -qemu-system-x86_64: end ram md5 - -This occurs occasionally and only at tcg machine. It seems that -some pages dirtied in source side don't transferred to destination. -This problem can be reproduced even if we disable virtio. - -Is it OK for some pages that not transferred to destination when do -migration ? Or is it a bug? -I'm pretty sure that means it's a bug. Hard to find though, I guess -at least memtest is smaller than a big OS. I think I'd dump the whole -of memory on both sides, hexdump and diff them - I'd guess it would -just be one byte/word different, maybe that would offer some idea what -wrote it. -I try to dump and compare them, more than 10 pages are different. -in source side, they are random value rather than always 'FF' 'FB' 'EF' -'BF'... in destination. -and not all of the different pages are continuous. - -thanks -Li -Dave -Any idea... 
- -=================md5 check patch============================= - -diff --git a/Makefile.target b/Makefile.target -index 962d004..e2cb8e9 100644 ---- a/Makefile.target -+++ b/Makefile.target -@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o - obj-y += memory_mapping.o - obj-y += dump.o - obj-y += migration/ram.o migration/savevm.o --LIBS := $(libs_softmmu) $(LIBS) -+LIBS := $(libs_softmmu) $(LIBS) -lplumb - - # xen support - obj-$(CONFIG_XEN) += xen-common.o -diff --git a/migration/ram.c b/migration/ram.c -index 1eb155a..3b7a09d 100644 ---- a/migration/ram.c -+++ b/migration/ram.c -@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int -version_id) - } - - rcu_read_unlock(); -- DPRINTF("Completed load of VM with exit code %d seq iteration " -+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration " - "%" PRIu64 "\n", ret, seq_iter); - return ret; - } -diff --git a/migration/savevm.c b/migration/savevm.c -index 0ad1b93..3feaa61 100644 ---- a/migration/savevm.c -+++ b/migration/savevm.c -@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f) - - } - -+#include "exec/ram_addr.h" -+#include "qemu/rcu_queue.h" -+#include <clplumbing/md5.h> -+#ifndef MD5_DIGEST_LENGTH -+#define MD5_DIGEST_LENGTH 16 -+#endif -+ -+static void check_host_md5(void) -+{ -+ int i; -+ unsigned char md[MD5_DIGEST_LENGTH]; -+ rcu_read_lock(); -+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check -'pc.ram' block */ -+ rcu_read_unlock(); -+ -+ MD5(block->host, block->used_length, md); -+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) { -+ fprintf(stderr, "%02x", md[i]); -+ } -+ fprintf(stderr, "\n"); -+ error_report("end ram md5"); -+} -+ - void qemu_savevm_state_begin(QEMUFile *f, - const MigrationParams *params) - { -@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f, -bool iterable_only) - save_section_header(f, se, QEMU_VM_SECTION_END); - - ret = se->ops->save_live_complete_precopy(f, se->opaque); -+ -+ fprintf(stderr, 
"after saving %s complete\n", se->idstr); -+ check_host_md5(); -+ - trace_savevm_section_end(se->idstr, se->section_id, ret); - save_section_footer(f, se); - if (ret < 0) { -@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f, -MigrationIncomingState *mis) - section_id, le->se->idstr); - return ret; - } -+ if (section_type == QEMU_VM_SECTION_END) { -+ error_report("after loading state section id %d(%s)", -+ section_id, le->se->idstr); -+ check_host_md5(); -+ } - if (!check_section_footer(f, le)) { - return -EINVAL; - } -@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f) - } - - cpu_synchronize_all_post_init(); -+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__); -+ check_host_md5(); - - return ret; - } --- -Dr. David Alan Gilbert / address@hidden / Manchester, UK - - -. --- -Best regards. -Li Zhijian (8555) - -* Li Zhijian (address@hidden) wrote: -> -> -> -On 12/03/2015 05:24 PM, Dr. David Alan Gilbert wrote: -> ->* Li Zhijian (address@hidden) wrote: -> ->>Hi all, -> ->> -> ->>Does anyboday remember the similar issue post by hailiang months ago -> ->> -http://patchwork.ozlabs.org/patch/454322/ -> ->>At least tow bugs about migration had been fixed since that. -> -> -> ->Yes, I wondered what happened to that. -> -> -> ->>And now we found the same issue at the tcg vm(kvm is fine), after migration, -> ->>the content VM's memory is inconsistent. -> -> -> ->Hmm, TCG only - I don't know much about that; but I guess something must -> ->be accessing memory without using the proper macros/functions so -> ->it doesn't mark it as dirty. -> -> -> ->>we add a patch to check memory content, you can find it from affix -> ->> -> ->>steps to reporduce: -> ->>1) apply the patch and re-build qemu -> ->>2) prepare the ubuntu guest and run memtest in grub. 
-> ->>soruce side: -> ->>x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -> ->>e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -> ->>if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -> ->>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -> ->>-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -> ->>tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -> ->>pc-i440fx-2.3,accel=tcg,usb=off -> ->> -> ->>destination side: -> ->>x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -> ->>e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -> ->>if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -> ->>virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -> ->>-vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -> ->>tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -> ->>pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881 -> ->> -> ->>3) start migration -> ->>with 1000M NIC, migration will finish within 3 min. -> ->> -> ->>at source: -> ->>(qemu) migrate tcp:192.168.2.66:8881 -> ->>after saving ram complete -> ->>e9e725df678d392b1a83b3a917f332bb -> ->>qemu-system-x86_64: end ram md5 -> ->>(qemu) -> ->> -> ->>at destination: -> ->>...skip... -> ->>Completed load of VM with exit code 0 seq iteration 1264 -> ->>Completed load of VM with exit code 0 seq iteration 1265 -> ->>Completed load of VM with exit code 0 seq iteration 1266 -> ->>qemu-system-x86_64: after loading state section id 2(ram) -> ->>49c2dac7bde0e5e22db7280dcb3824f9 -> ->>qemu-system-x86_64: end ram md5 -> ->>qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init -> ->> -> ->>49c2dac7bde0e5e22db7280dcb3824f9 -> ->>qemu-system-x86_64: end ram md5 -> ->> -> ->>This occurs occasionally and only at tcg machine. It seems that -> ->>some pages dirtied in source side don't transferred to destination. 
-> ->>This problem can be reproduced even if we disable virtio. -> ->> -> ->>Is it OK for some pages that not transferred to destination when do -> ->>migration ? Or is it a bug? -> -> -> ->I'm pretty sure that means it's a bug. Hard to find though, I guess -> ->at least memtest is smaller than a big OS. I think I'd dump the whole -> ->of memory on both sides, hexdump and diff them - I'd guess it would -> ->just be one byte/word different, maybe that would offer some idea what -> ->wrote it. -> -> -I try to dump and compare them, more than 10 pages are different. -> -in source side, they are random value rather than always 'FF' 'FB' 'EF' -> -'BF'... in destination. -> -> -and not all of the different pages are continuous. -I wonder if it happens on all of memtest's different test patterns, -perhaps it might be possible to narrow it down if you tell memtest -to only run one test at a time. - -Dave - -> -> -thanks -> -Li -> -> -> -> -> ->Dave -> -> -> ->>Any idea... -> ->> -> ->>=================md5 check patch============================= -> ->> -> ->>diff --git a/Makefile.target b/Makefile.target -> ->>index 962d004..e2cb8e9 100644 -> ->>--- a/Makefile.target -> ->>+++ b/Makefile.target -> ->>@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o -> ->> obj-y += memory_mapping.o -> ->> obj-y += dump.o -> ->> obj-y += migration/ram.o migration/savevm.o -> ->>-LIBS := $(libs_softmmu) $(LIBS) -> ->>+LIBS := $(libs_softmmu) $(LIBS) -lplumb -> ->> -> ->> # xen support -> ->> obj-$(CONFIG_XEN) += xen-common.o -> ->>diff --git a/migration/ram.c b/migration/ram.c -> ->>index 1eb155a..3b7a09d 100644 -> ->>--- a/migration/ram.c -> ->>+++ b/migration/ram.c -> ->>@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, int -> ->>version_id) -> ->> } -> ->> -> ->> rcu_read_unlock(); -> ->>- DPRINTF("Completed load of VM with exit code %d seq iteration " -> ->>+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration " -> ->> "%" PRIu64 "\n", ret, seq_iter); -> 
->> return ret; -> ->> } -> ->>diff --git a/migration/savevm.c b/migration/savevm.c -> ->>index 0ad1b93..3feaa61 100644 -> ->>--- a/migration/savevm.c -> ->>+++ b/migration/savevm.c -> ->>@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f) -> ->> -> ->> } -> ->> -> ->>+#include "exec/ram_addr.h" -> ->>+#include "qemu/rcu_queue.h" -> ->>+#include <clplumbing/md5.h> -> ->>+#ifndef MD5_DIGEST_LENGTH -> ->>+#define MD5_DIGEST_LENGTH 16 -> ->>+#endif -> ->>+ -> ->>+static void check_host_md5(void) -> ->>+{ -> ->>+ int i; -> ->>+ unsigned char md[MD5_DIGEST_LENGTH]; -> ->>+ rcu_read_lock(); -> ->>+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check -> ->>'pc.ram' block */ -> ->>+ rcu_read_unlock(); -> ->>+ -> ->>+ MD5(block->host, block->used_length, md); -> ->>+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) { -> ->>+ fprintf(stderr, "%02x", md[i]); -> ->>+ } -> ->>+ fprintf(stderr, "\n"); -> ->>+ error_report("end ram md5"); -> ->>+} -> ->>+ -> ->> void qemu_savevm_state_begin(QEMUFile *f, -> ->> const MigrationParams *params) -> ->> { -> ->>@@ -1056,6 +1079,10 @@ void qemu_savevm_state_complete_precopy(QEMUFile *f, -> ->>bool iterable_only) -> ->> save_section_header(f, se, QEMU_VM_SECTION_END); -> ->> -> ->> ret = se->ops->save_live_complete_precopy(f, se->opaque); -> ->>+ -> ->>+ fprintf(stderr, "after saving %s complete\n", se->idstr); -> ->>+ check_host_md5(); -> ->>+ -> ->> trace_savevm_section_end(se->idstr, se->section_id, ret); -> ->> save_section_footer(f, se); -> ->> if (ret < 0) { -> ->>@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f, -> ->>MigrationIncomingState *mis) -> ->> section_id, le->se->idstr); -> ->> return ret; -> ->> } -> ->>+ if (section_type == QEMU_VM_SECTION_END) { -> ->>+ error_report("after loading state section id %d(%s)", -> ->>+ section_id, le->se->idstr); -> ->>+ check_host_md5(); -> ->>+ } -> ->> if (!check_section_footer(f, le)) { -> ->> return -EINVAL; -> ->> } -> ->>@@ -1901,6 +1933,8 @@ int 
qemu_loadvm_state(QEMUFile *f) -> ->> } -> ->> -> ->> cpu_synchronize_all_post_init(); -> ->>+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__); -> ->>+ check_host_md5(); -> ->> -> ->> return ret; -> ->> } -> ->> -> ->> -> ->> -> ->-- -> ->Dr. David Alan Gilbert / address@hidden / Manchester, UK -> -> -> -> -> ->. -> -> -> -> --- -> -Best regards. -> -Li Zhijian (8555) -> -> --- -Dr. David Alan Gilbert / address@hidden / Manchester, UK - -Li Zhijian <address@hidden> wrote: -> -Hi all, -> -> -Does anyboday remember the similar issue post by hailiang months ago -> -http://patchwork.ozlabs.org/patch/454322/ -> -At least tow bugs about migration had been fixed since that. -> -> -And now we found the same issue at the tcg vm(kvm is fine), after -> -migration, the content VM's memory is inconsistent. -> -> -we add a patch to check memory content, you can find it from affix -> -> -steps to reporduce: -> -1) apply the patch and re-build qemu -> -2) prepare the ubuntu guest and run memtest in grub. 
-> -soruce side: -> -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -> -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -> -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -> -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -> --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -> -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -> -pc-i440fx-2.3,accel=tcg,usb=off -> -> -destination side: -> -x86_64-softmmu/qemu-system-x86_64 -netdev tap,id=hn0 -device -> -e1000,id=net-pci0,netdev=hn0,mac=52:54:00:12:34:65 -boot c -drive -> -if=none,file=/home/lizj/ubuntu.raw,id=drive-virtio-disk0 -device -> -virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -> --vnc :7 -m 128 -smp 1 -device piix3-usb-uhci -device usb-tablet -qmp -> -tcp::4444,server,nowait -monitor stdio -cpu qemu64 -machine -> -pc-i440fx-2.3,accel=tcg,usb=off -incoming tcp:0:8881 -> -> -3) start migration -> -with 1000M NIC, migration will finish within 3 min. -> -> -at source: -> -(qemu) migrate tcp:192.168.2.66:8881 -> -after saving ram complete -> -e9e725df678d392b1a83b3a917f332bb -> -qemu-system-x86_64: end ram md5 -> -(qemu) -> -> -at destination: -> -...skip... -> -Completed load of VM with exit code 0 seq iteration 1264 -> -Completed load of VM with exit code 0 seq iteration 1265 -> -Completed load of VM with exit code 0 seq iteration 1266 -> -qemu-system-x86_64: after loading state section id 2(ram) -> -49c2dac7bde0e5e22db7280dcb3824f9 -> -qemu-system-x86_64: end ram md5 -> -qemu-system-x86_64: qemu_loadvm_state: after cpu_synchronize_all_post_init -> -> -49c2dac7bde0e5e22db7280dcb3824f9 -> -qemu-system-x86_64: end ram md5 -> -> -This occurs occasionally and only at tcg machine. It seems that -> -some pages dirtied in source side don't transferred to destination. -> -This problem can be reproduced even if we disable virtio. 
-> -> -Is it OK for some pages that not transferred to destination when do -> -migration ? Or is it a bug? -> -> -Any idea... -Thanks for describing how to reproduce the bug. -If some pages are not transferred to destination then it is a bug, so we -need to know what the problem is, notice that the problem can be that -TCG is not marking dirty some page, that Migration code "forgets" about -that page, or anything eles altogether, that is what we need to find. - -There are more posibilities, I am not sure that memtest is on 32bit -mode, and it is inside posibility that we are missing some state when we -are on real mode. - -Will try to take a look at this. - -THanks, again. - - -> -> -=================md5 check patch============================= -> -> -diff --git a/Makefile.target b/Makefile.target -> -index 962d004..e2cb8e9 100644 -> ---- a/Makefile.target -> -+++ b/Makefile.target -> -@@ -139,7 +139,7 @@ obj-y += memory.o cputlb.o -> -obj-y += memory_mapping.o -> -obj-y += dump.o -> -obj-y += migration/ram.o migration/savevm.o -> --LIBS := $(libs_softmmu) $(LIBS) -> -+LIBS := $(libs_softmmu) $(LIBS) -lplumb -> -> -# xen support -> -obj-$(CONFIG_XEN) += xen-common.o -> -diff --git a/migration/ram.c b/migration/ram.c -> -index 1eb155a..3b7a09d 100644 -> ---- a/migration/ram.c -> -+++ b/migration/ram.c -> -@@ -2513,7 +2513,7 @@ static int ram_load(QEMUFile *f, void *opaque, -> -int version_id) -> -} -> -> -rcu_read_unlock(); -> -- DPRINTF("Completed load of VM with exit code %d seq iteration " -> -+ fprintf(stderr, "Completed load of VM with exit code %d seq iteration " -> -"%" PRIu64 "\n", ret, seq_iter); -> -return ret; -> -} -> -diff --git a/migration/savevm.c b/migration/savevm.c -> -index 0ad1b93..3feaa61 100644 -> ---- a/migration/savevm.c -> -+++ b/migration/savevm.c -> -@@ -891,6 +891,29 @@ void qemu_savevm_state_header(QEMUFile *f) -> -> -} -> -> -+#include "exec/ram_addr.h" -> -+#include "qemu/rcu_queue.h" -> -+#include <clplumbing/md5.h> -> -+#ifndef 
MD5_DIGEST_LENGTH -> -+#define MD5_DIGEST_LENGTH 16 -> -+#endif -> -+ -> -+static void check_host_md5(void) -> -+{ -> -+ int i; -> -+ unsigned char md[MD5_DIGEST_LENGTH]; -> -+ rcu_read_lock(); -> -+ RAMBlock *block = QLIST_FIRST_RCU(&ram_list.blocks);/* Only check -> -'pc.ram' block */ -> -+ rcu_read_unlock(); -> -+ -> -+ MD5(block->host, block->used_length, md); -> -+ for(i = 0; i < MD5_DIGEST_LENGTH; i++) { -> -+ fprintf(stderr, "%02x", md[i]); -> -+ } -> -+ fprintf(stderr, "\n"); -> -+ error_report("end ram md5"); -> -+} -> -+ -> -void qemu_savevm_state_begin(QEMUFile *f, -> -const MigrationParams *params) -> -{ -> -@@ -1056,6 +1079,10 @@ void -> -qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only) -> -save_section_header(f, se, QEMU_VM_SECTION_END); -> -> -ret = se->ops->save_live_complete_precopy(f, se->opaque); -> -+ -> -+ fprintf(stderr, "after saving %s complete\n", se->idstr); -> -+ check_host_md5(); -> -+ -> -trace_savevm_section_end(se->idstr, se->section_id, ret); -> -save_section_footer(f, se); -> -if (ret < 0) { -> -@@ -1791,6 +1818,11 @@ static int qemu_loadvm_state_main(QEMUFile *f, -> -MigrationIncomingState *mis) -> -section_id, le->se->idstr); -> -return ret; -> -} -> -+ if (section_type == QEMU_VM_SECTION_END) { -> -+ error_report("after loading state section id %d(%s)", -> -+ section_id, le->se->idstr); -> -+ check_host_md5(); -> -+ } -> -if (!check_section_footer(f, le)) { -> -return -EINVAL; -> -} -> -@@ -1901,6 +1933,8 @@ int qemu_loadvm_state(QEMUFile *f) -> -} -> -> -cpu_synchronize_all_post_init(); -> -+ error_report("%s: after cpu_synchronize_all_post_init\n", __func__); -> -+ check_host_md5(); -> -> -return ret; -> -} - -> -> -Thanks for describing how to reproduce the bug. 
-> -If some pages are not transferred to destination then it is a bug, so we need -> -to know what the problem is, notice that the problem can be that TCG is not -> -marking dirty some page, that Migration code "forgets" about that page, or -> -anything eles altogether, that is what we need to find. -> -> -There are more posibilities, I am not sure that memtest is on 32bit mode, and -> -it is inside posibility that we are missing some state when we are on real -> -mode. -> -> -Will try to take a look at this. -> -> -THanks, again. -> -Hi Juan & Amit - - Do you think we should add a mechanism to check the data integrity during LM -like Zhijian's patch did? it may be very helpful for developers. - Actually, I did the similar thing before in order to make sure that I did the -right thing we I change the code related to LM. - -Liang - -On (Fri) 04 Dec 2015 [01:43:07], Li, Liang Z wrote: -> -> -> -> Thanks for describing how to reproduce the bug. -> -> If some pages are not transferred to destination then it is a bug, so we -> -> need -> -> to know what the problem is, notice that the problem can be that TCG is not -> -> marking dirty some page, that Migration code "forgets" about that page, or -> -> anything eles altogether, that is what we need to find. -> -> -> -> There are more posibilities, I am not sure that memtest is on 32bit mode, -> -> and -> -> it is inside posibility that we are missing some state when we are on real -> -> mode. -> -> -> -> Will try to take a look at this. -> -> -> -> THanks, again. -> -> -> -> -Hi Juan & Amit -> -> -Do you think we should add a mechanism to check the data integrity during LM -> -like Zhijian's patch did? it may be very helpful for developers. -> -Actually, I did the similar thing before in order to make sure that I did -> -the right thing we I change the code related to LM. -If you mean for debugging, something that's not always on, then I'm -fine with it. 
- -A script that goes along that shows the result of comparison of the -diff will be helpful too, something that shows how many pages are -differnt, how many bytes in a page on average, and so on. - - Amit - diff --git a/classification_output/05/mistranslation/74545755 b/classification_output/05/mistranslation/74545755 deleted file mode 100644 index 7f5ace50..00000000 --- a/classification_output/05/mistranslation/74545755 +++ /dev/null @@ -1,352 +0,0 @@ -mistranslation: 0.752 -device: 0.720 -instruction: 0.700 -other: 0.683 -semantic: 0.669 -KVM: 0.661 -graphic: 0.660 -vnc: 0.650 -assembly: 0.648 -boot: 0.607 -network: 0.550 -socket: 0.549 - -[Bug Report][RFC PATCH 0/1] block: fix failing assert on paused VM migration - -There's a bug (failing assert) which is reproduced during migration of -a paused VM. I am able to reproduce it on a stand with 2 nodes and a common -NFS share, with VM's disk on that share. - -root@fedora40-1-vm:~# virsh domblklist alma8-vm - Target Source ------------------------------------------- - sda /mnt/shared/images/alma8.qcow2 - -root@fedora40-1-vm:~# df -Th /mnt/shared -Filesystem Type Size Used Avail Use% Mounted on -127.0.0.1:/srv/nfsd nfs4 63G 16G 48G 25% /mnt/shared - -On the 1st node: - -root@fedora40-1-vm:~# virsh start alma8-vm ; virsh suspend alma8-vm -root@fedora40-1-vm:~# virsh migrate --compressed --p2p --persistent ---undefinesource --live alma8-vm qemu+ssh://fedora40-2-vm/system - -Then on the 2nd node: - -root@fedora40-2-vm:~# virsh migrate --compressed --p2p --persistent ---undefinesource --live alma8-vm qemu+ssh://fedora40-1-vm/system -error: operation failed: domain is not running - -root@fedora40-2-vm:~# tail -3 /var/log/libvirt/qemu/alma8-vm.log -2024-09-19 13:53:33.336+0000: initiating migration -qemu-system-x86_64: ../block.c:6976: int -bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & -BDRV_O_INACTIVE)' failed. 
-2024-09-19 13:53:42.991+0000: shutting down, reason=crashed - -Backtrace: - -(gdb) bt -#0 0x00007f7eaa2f1664 in __pthread_kill_implementation () at /lib64/libc.so.6 -#1 0x00007f7eaa298c4e in raise () at /lib64/libc.so.6 -#2 0x00007f7eaa280902 in abort () at /lib64/libc.so.6 -#3 0x00007f7eaa28081e in __assert_fail_base.cold () at /lib64/libc.so.6 -#4 0x00007f7eaa290d87 in __assert_fail () at /lib64/libc.so.6 -#5 0x0000563c38b95eb8 in bdrv_inactivate_recurse (bs=0x563c3b6c60c0) at -../block.c:6976 -#6 0x0000563c38b95aeb in bdrv_inactivate_all () at ../block.c:7038 -#7 0x0000563c3884d354 in qemu_savevm_state_complete_precopy_non_iterable -(f=0x563c3b700c20, in_postcopy=false, inactivate_disks=true) - at ../migration/savevm.c:1571 -#8 0x0000563c3884dc1a in qemu_savevm_state_complete_precopy (f=0x563c3b700c20, -iterable_only=false, inactivate_disks=true) at ../migration/savevm.c:1631 -#9 0x0000563c3883a340 in migration_completion_precopy (s=0x563c3b4d51f0, -current_active_state=<optimized out>) at ../migration/migration.c:2780 -#10 migration_completion (s=0x563c3b4d51f0) at ../migration/migration.c:2844 -#11 migration_iteration_run (s=0x563c3b4d51f0) at ../migration/migration.c:3270 -#12 migration_thread (opaque=0x563c3b4d51f0) at ../migration/migration.c:3536 -#13 0x0000563c38dbcf14 in qemu_thread_start (args=0x563c3c2d5bf0) at -../util/qemu-thread-posix.c:541 -#14 0x00007f7eaa2ef6d7 in start_thread () at /lib64/libc.so.6 -#15 0x00007f7eaa373414 in clone () at /lib64/libc.so.6 - -What happens here is that after 1st migration BDS related to HDD remains -inactive as VM is still paused. Then when we initiate 2nd migration, -bdrv_inactivate_all() leads to the attempt to set BDRV_O_INACTIVE flag -on that node which is already set, thus assert fails. - -Attached patch which simply skips setting flag if it's already set is more -of a kludge than a clean solution. 
Should we use more sophisticated logic -which allows some of the nodes be in inactive state prior to the migration, -and takes them into account during bdrv_inactivate_all()? Comments would -be appreciated. - -Andrey - -Andrey Drobyshev (1): - block: do not fail when inactivating node which is inactive - - block.c | 10 +++++++++- - 1 file changed, 9 insertions(+), 1 deletion(-) - --- -2.39.3 - -Instead of throwing an assert let's just ignore that flag is already set -and return. We assume that it's going to be safe to ignore. Otherwise -this assert fails when migrating a paused VM back and forth. - -Ideally we'd like to have a more sophisticated solution, e.g. not even -scan the nodes which should be inactive at this point. - -Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> ---- - block.c | 10 +++++++++- - 1 file changed, 9 insertions(+), 1 deletion(-) - -diff --git a/block.c b/block.c -index 7d90007cae..c1dcf906d1 100644 ---- a/block.c -+++ b/block.c -@@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK -bdrv_inactivate_recurse(BlockDriverState *bs) - return 0; - } - -- assert(!(bs->open_flags & BDRV_O_INACTIVE)); -+ if (bs->open_flags & BDRV_O_INACTIVE) { -+ /* -+ * Return here instead of throwing assert as a workaround to -+ * prevent failure on migrating paused VM. -+ * Here we assume that if we're trying to inactivate BDS that's -+ * already inactive, it's safe to just ignore it. -+ */ -+ return 0; -+ } - - /* Inactivate this node */ - if (bs->drv->bdrv_inactivate) { --- -2.39.3 - -[add migration maintainers] - -On 24.09.24 15:56, Andrey Drobyshev wrote: -Instead of throwing an assert let's just ignore that flag is already set -and return. We assume that it's going to be safe to ignore. Otherwise -this assert fails when migrating a paused VM back and forth. - -Ideally we'd like to have a more sophisticated solution, e.g. not even -scan the nodes which should be inactive at this point. 
- -Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com> ---- - block.c | 10 +++++++++- - 1 file changed, 9 insertions(+), 1 deletion(-) - -diff --git a/block.c b/block.c -index 7d90007cae..c1dcf906d1 100644 ---- a/block.c -+++ b/block.c -@@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK -bdrv_inactivate_recurse(BlockDriverState *bs) - return 0; - } -- assert(!(bs->open_flags & BDRV_O_INACTIVE)); -+ if (bs->open_flags & BDRV_O_INACTIVE) { -+ /* -+ * Return here instead of throwing assert as a workaround to -+ * prevent failure on migrating paused VM. -+ * Here we assume that if we're trying to inactivate BDS that's -+ * already inactive, it's safe to just ignore it. -+ */ -+ return 0; -+ } -/* Inactivate this node */ -if (bs->drv->bdrv_inactivate) { -I doubt that this a correct way to go. - -As far as I understand, "inactive" actually means that "storage is not belong to -qemu, but to someone else (another qemu process for example), and may be changed -transparently". In turn this means that Qemu should do nothing with inactive disks. So the -problem is that nobody called bdrv_activate_all on target, and we shouldn't ignore that. - -Hmm, I see in process_incoming_migration_bh() we do call bdrv_activate_all(), -but only in some scenarios. May be, the condition should be less strict here. - -Why we need any condition here at all? Don't we want to activate block-layer on -target after migration anyway? - --- -Best regards, -Vladimir - -On 9/30/24 12:25 PM, Vladimir Sementsov-Ogievskiy wrote: -> -[add migration maintainers] -> -> -On 24.09.24 15:56, Andrey Drobyshev wrote: -> -> [...] -> -> -I doubt that this a correct way to go. -> -> -As far as I understand, "inactive" actually means that "storage is not -> -belong to qemu, but to someone else (another qemu process for example), -> -and may be changed transparently". In turn this means that Qemu should -> -do nothing with inactive disks. 
So the problem is that nobody called -> -bdrv_activate_all on target, and we shouldn't ignore that. -> -> -Hmm, I see in process_incoming_migration_bh() we do call -> -bdrv_activate_all(), but only in some scenarios. May be, the condition -> -should be less strict here. -> -> -Why we need any condition here at all? Don't we want to activate -> -block-layer on target after migration anyway? -> -Hmm I'm not sure about the unconditional activation, since we at least -have to honor LATE_BLOCK_ACTIVATE cap if it's set (and probably delay it -in such a case). In current libvirt upstream I see such code: - -> -/* Migration capabilities which should always be enabled as long as they -> -> -* are supported by QEMU. If the capability is supposed to be enabled on both -> -> -* sides of migration, it won't be enabled unless both sides support it. -> -> -*/ -> -> -static const qemuMigrationParamsAlwaysOnItem qemuMigrationParamsAlwaysOn[] = -> -{ -> -> -{QEMU_MIGRATION_CAP_PAUSE_BEFORE_SWITCHOVER, -> -> -QEMU_MIGRATION_SOURCE}, -> -> -> -> -{QEMU_MIGRATION_CAP_LATE_BLOCK_ACTIVATE, -> -> -QEMU_MIGRATION_DESTINATION}, -> -> -}; -which means that libvirt always wants LATE_BLOCK_ACTIVATE to be set. - -The code from process_incoming_migration_bh() you're referring to: - -> -/* If capability late_block_activate is set: -> -> -* Only fire up the block code now if we're going to restart the -> -> -* VM, else 'cont' will do it. -> -> -* This causes file locking to happen; so we don't want it to happen -> -> -* unless we really are starting the VM. -> -> -*/ -> -> -if (!migrate_late_block_activate() || -> -> -(autostart && (!global_state_received() || -> -> -runstate_is_live(global_state_get_runstate())))) { -> -> -/* Make sure all file formats throw away their mutable metadata. -> -> -> -* If we get an error here, just don't restart the VM yet. 
*/ -> -> -bdrv_activate_all(&local_err); -> -> -if (local_err) { -> -> -error_report_err(local_err); -> -> -local_err = NULL; -> -> -autostart = false; -> -> -} -> -> -} -It states explicitly that we're either going to start VM right at this -point if (autostart == true), or we wait till "cont" command happens. -None of this is going to happen if we start another migration while -still being in PAUSED state. So I think it seems reasonable to take -such case into account. For instance, this patch does prevent the crash: - -> -diff --git a/migration/migration.c b/migration/migration.c -> -index ae2be31557..3222f6745b 100644 -> ---- a/migration/migration.c -> -+++ b/migration/migration.c -> -@@ -733,7 +733,8 @@ static void process_incoming_migration_bh(void *opaque) -> -*/ -> -if (!migrate_late_block_activate() || -> -(autostart && (!global_state_received() || -> -- runstate_is_live(global_state_get_runstate())))) { -> -+ runstate_is_live(global_state_get_runstate()))) || -> -+ (!autostart && global_state_get_runstate() == RUN_STATE_PAUSED)) { -> -/* Make sure all file formats throw away their mutable metadata. -> -* If we get an error here, just don't restart the VM yet. */ -> -bdrv_activate_all(&local_err); -What are your thoughts on it? - -Andrey - diff --git a/classification_output/05/mistranslation/80604314 b/classification_output/05/mistranslation/80604314 deleted file mode 100644 index cb64e7d6..00000000 --- a/classification_output/05/mistranslation/80604314 +++ /dev/null @@ -1,1488 +0,0 @@ -mistranslation: 0.922 -device: 0.917 -graphic: 0.901 -other: 0.898 -KVM: 0.891 -semantic: 0.890 -assembly: 0.886 -socket: 0.884 -vnc: 0.881 -instruction: 0.877 -network: 0.865 -boot: 0.860 - -[BUG] vhost-vdpa: qemu-system-s390x crashes with second virtio-net-ccw device - -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. 
gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) - -Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -the autogenerated virtio-net-ccw device is present) works. Specifying -several "-device virtio-net-pci" works as well. - -Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -works (in-between state does not compile). - -This is reproducible with tcg as well. Same problem both with ---enable-vhost-vdpa and --disable-vhost-vdpa. - -Have not yet tried to figure out what might be special with -virtio-ccw... anyone have an idea? - -[This should probably be considered a blocker?] - -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -When I start qemu with a second virtio-net-ccw device (i.e. adding -> --device virtio-net-ccw in addition to the autogenerated device), I get -> -a segfault. gdb points to -> -> -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> -config=0x55d6ad9e3f80 "RT") at -> -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> -(backtrace doesn't go further) -> -> -Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -> -the autogenerated virtio-net-ccw device is present) works. Specifying -> -several "-device virtio-net-pci" works as well. -> -> -Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -> -client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -> -works (in-between state does not compile). -Ouch. I didn't test all in-between states :( -But I wish we had a 0-day instrastructure like kernel has, -that catches things like that. 
- -> -This is reproducible with tcg as well. Same problem both with -> ---enable-vhost-vdpa and --disable-vhost-vdpa. -> -> -Have not yet tried to figure out what might be special with -> -virtio-ccw... anyone have an idea? -> -> -[This should probably be considered a blocker?] - -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin" <mst@redhat.com> wrote: - -> -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -> When I start qemu with a second virtio-net-ccw device (i.e. adding -> -> -device virtio-net-ccw in addition to the autogenerated device), I get -> -> a segfault. gdb points to -> -> -> -> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> -> config=0x55d6ad9e3f80 "RT") at -> -> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> -> -> (backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. - -> -> -> -> Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -> -> the autogenerated virtio-net-ccw device is present) works. Specifying -> -> several "-device virtio-net-pci" works as well. 
-> -> -> -> Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -> -> client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -> -> works (in-between state does not compile). -> -> -Ouch. I didn't test all in-between states :( -> -But I wish we had a 0-day instrastructure like kernel has, -> -that catches things like that. -Yep, that would be useful... so patchew only builds the complete series? - -> -> -> This is reproducible with tcg as well. Same problem both with -> -> --enable-vhost-vdpa and --disable-vhost-vdpa. -> -> -> -> Have not yet tried to figure out what might be special with -> -> virtio-ccw... anyone have an idea? -> -> -> -> [This should probably be considered a blocker?] -I think so, as it makes s390x unusable with more that one -virtio-net-ccw device, and I don't even see a workaround. - -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> -On Fri, 24 Jul 2020 09:30:58 -0400 -> -"Michael S. Tsirkin" <mst@redhat.com> wrote: -> -> -> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -> > When I start qemu with a second virtio-net-ccw device (i.e. adding -> -> > -device virtio-net-ccw in addition to the autogenerated device), I get -> -> > a segfault. gdb points to -> -> > -> -> > #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> -> > config=0x55d6ad9e3f80 "RT") at -> -> > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> > 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> > -> -> > (backtrace doesn't go further) -> -> -The core was incomplete, but running under gdb directly shows that it -> -is just a bog-standard config space access (first for that device). -> -> -The cause of the crash is that nc->peer is not set... no idea how that -> -can happen, not that familiar with that part of QEMU. (Should the code -> -check, or is that really something that should not happen?) 
-> -> -What I don't understand is why it is set correctly for the first, -> -autogenerated virtio-net-ccw device, but not for the second one, and -> -why virtio-net-pci doesn't show these problems. The only difference -> -between -ccw and -pci that comes to my mind here is that config space -> -accesses for ccw are done via an asynchronous operation, so timing -> -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? - -> -> > -> -> > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -> -> > the autogenerated virtio-net-ccw device is present) works. Specifying -> -> > several "-device virtio-net-pci" works as well. -> -> > -> -> > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -> -> > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -> -> > works (in-between state does not compile). -> -> -> -> Ouch. I didn't test all in-between states :( -> -> But I wish we had a 0-day instrastructure like kernel has, -> -> that catches things like that. -> -> -Yep, that would be useful... so patchew only builds the complete series? -> -> -> -> -> > This is reproducible with tcg as well. Same problem both with -> -> > --enable-vhost-vdpa and --disable-vhost-vdpa. -> -> > -> -> > Have not yet tried to figure out what might be special with -> -> > virtio-ccw... anyone have an idea? -> -> > -> -> > [This should probably be considered a blocker?] -> -> -I think so, as it makes s390x unusable with more that one -> -virtio-net-ccw device, and I don't even see a workaround. - -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin" <mst@redhat.com> wrote: - -> -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> -> On Fri, 24 Jul 2020 09:30:58 -0400 -> -> "Michael S. 
Tsirkin" <mst@redhat.com> wrote: -> -> -> -> > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -> > > When I start qemu with a second virtio-net-ccw device (i.e. adding -> -> > > -device virtio-net-ccw in addition to the autogenerated device), I get -> -> > > a segfault. gdb points to -> -> > > -> -> > > #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> -> > > config=0x55d6ad9e3f80 "RT") at -> -> > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> > > 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> > > -> -> > > (backtrace doesn't go further) -> -> -> -> The core was incomplete, but running under gdb directly shows that it -> -> is just a bog-standard config space access (first for that device). -> -> -> -> The cause of the crash is that nc->peer is not set... no idea how that -> -> can happen, not that familiar with that part of QEMU. (Should the code -> -> check, or is that really something that should not happen?) -> -> -> -> What I don't understand is why it is set correctly for the first, -> -> autogenerated virtio-net-ccw device, but not for the second one, and -> -> why virtio-net-pci doesn't show these problems. The only difference -> -> between -ccw and -pci that comes to my mind here is that config space -> -> accesses for ccw are done via an asynchronous operation, so timing -> -> might be different. -> -> -Hopefully Jason has an idea. Could you post a full command line -> -please? Do you need a working guest to trigger this? Does this trigger -> -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. 
I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 - --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) - -> -> -> > > -> -> > > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -> -> > > the autogenerated virtio-net-ccw device is present) works. Specifying -> -> > > several "-device virtio-net-pci" works as well. -> -> > > -> -> > > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -> -> > > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -> -> > > works (in-between state does not compile). -> -> > -> -> > Ouch. I didn't test all in-between states :( -> -> > But I wish we had a 0-day instrastructure like kernel has, -> -> > that catches things like that. -> -> -> -> Yep, that would be useful... so patchew only builds the complete series? -> -> -> -> > -> -> > > This is reproducible with tcg as well. Same problem both with -> -> > > --enable-vhost-vdpa and --disable-vhost-vdpa. -> -> > > -> -> > > Have not yet tried to figure out what might be special with -> -> > > virtio-ccw... anyone have an idea? -> -> > > -> -> > > [This should probably be considered a blocker?] -> -> -> -> I think so, as it makes s390x unusable with more that one -> -> virtio-net-ccw device, and I don't even see a workaround. -> - -On 2020/7/24 下午11:34, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S.
Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. 
I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. - -Thanks -0001-virtio-net-check-the-existence-of-peer-before-accesi.patch -Description: -Text Data - -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang <jasowang@redhat.com> wrote: - -> -On 2020/7/24 下午11:34, Cornelia Huck wrote: -> -> On Fri, 24 Jul 2020 11:17:57 -0400 -> -> "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> -> ->> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> ->>> On Fri, 24 Jul 2020 09:30:58 -0400 -> ->>> "Michael S. Tsirkin"<mst@redhat.com> wrote: -> ->>> -> ->>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> ->>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding -> ->>>>> -device virtio-net-ccw in addition to the autogenerated device), I get -> ->>>>> a segfault.
gdb points to -> ->>>>> -> ->>>>> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> ->>>>> config=0x55d6ad9e3f80 "RT") at -> ->>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> ->>>>> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> ->>>>> -> ->>>>> (backtrace doesn't go further) -> ->>> The core was incomplete, but running under gdb directly shows that it -> ->>> is just a bog-standard config space access (first for that device). -> ->>> -> ->>> The cause of the crash is that nc->peer is not set... no idea how that -> ->>> can happen, not that familiar with that part of QEMU. (Should the code -> ->>> check, or is that really something that should not happen?) -> ->>> -> ->>> What I don't understand is why it is set correctly for the first, -> ->>> autogenerated virtio-net-ccw device, but not for the second one, and -> ->>> why virtio-net-pci doesn't show these problems. The only difference -> ->>> between -ccw and -pci that comes to my mind here is that config space -> ->>> accesses for ccw are done via an asynchronous operation, so timing -> ->>> might be different. -> ->> Hopefully Jason has an idea. Could you post a full command line -> ->> please? Do you need a working guest to trigger this? Does this trigger -> ->> on an x86 host? -> -> Yes, it does trigger with tcg-on-x86 as well. I've been using -> -> -> -> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu -> -> qemu,zpci=on -> -> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 -> -> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 -> -> -device -> -> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -> -> -device virtio-net-ccw -> -> -> -> It seems it needs the guest actually doing something with the nics; I -> -> cannot reproduce the crash if I use the old advent calendar moon buggy -> -> image and just add a virtio-net-ccw device. 
-> -> -> -> (I don't think it's a problem with my local build, as I see the problem -> -> both on my laptop and on an LPAR.) -> -> -> -It looks to me we forget the check the existence of peer. -> -> -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck <cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? - -On 2020/7/27 下午2:43, Cornelia Huck wrote: -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang <jasowang@redhat.com> wrote: -On 2020/7/24 下午11:34, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems.
The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck <cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? -It can be hit with virtio-net-pci as well (just start without peer). -For autogenerated virtio-net-cww, I think the reason is that it has -already had a peer set. -Thanks - -On Mon, 27 Jul 2020 15:38:12 +0800 -Jason Wang <jasowang@redhat.com> wrote: - -> -On 2020/7/27 下午2:43, Cornelia Huck wrote: -> -> On Sat, 25 Jul 2020 08:40:07 +0800 -> -> Jason Wang <jasowang@redhat.com> wrote: -> -> -> ->> On 2020/7/24 下午11:34, Cornelia Huck wrote: -> ->>> On Fri, 24 Jul 2020 11:17:57 -0400 -> ->>> "Michael S.
Tsirkin"<mst@redhat.com> wrote: -> ->>> -> ->>>> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> ->>>>> On Fri, 24 Jul 2020 09:30:58 -0400 -> ->>>>> "Michael S. Tsirkin"<mst@redhat.com> wrote: -> ->>>>> -> ->>>>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> ->>>>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding -> ->>>>>>> -device virtio-net-ccw in addition to the autogenerated device), I get -> ->>>>>>> a segfault. gdb points to -> ->>>>>>> -> ->>>>>>> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> ->>>>>>> config=0x55d6ad9e3f80 "RT") at -> ->>>>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> ->>>>>>> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> ->>>>>>> -> ->>>>>>> (backtrace doesn't go further) -> ->>>>> The core was incomplete, but running under gdb directly shows that it -> ->>>>> is just a bog-standard config space access (first for that device). -> ->>>>> -> ->>>>> The cause of the crash is that nc->peer is not set... no idea how that -> ->>>>> can happen, not that familiar with that part of QEMU. (Should the code -> ->>>>> check, or is that really something that should not happen?) -> ->>>>> -> ->>>>> What I don't understand is why it is set correctly for the first, -> ->>>>> autogenerated virtio-net-ccw device, but not for the second one, and -> ->>>>> why virtio-net-pci doesn't show these problems. The only difference -> ->>>>> between -ccw and -pci that comes to my mind here is that config space -> ->>>>> accesses for ccw are done via an asynchronous operation, so timing -> ->>>>> might be different. -> ->>>> Hopefully Jason has an idea. Could you post a full command line -> ->>>> please? Do you need a working guest to trigger this? Does this trigger -> ->>>> on an x86 host? -> ->>> Yes, it does trigger with tcg-on-x86 as well. 
I've been using -> ->>> -> ->>> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu -> ->>> qemu,zpci=on -> ->>> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 -> ->>> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 -> ->>> -device -> ->>> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -> ->>> -device virtio-net-ccw -> ->>> -> ->>> It seems it needs the guest actually doing something with the nics; I -> ->>> cannot reproduce the crash if I use the old advent calendar moon buggy -> ->>> image and just add a virtio-net-ccw device. -> ->>> -> ->>> (I don't think it's a problem with my local build, as I see the problem -> ->>> both on my laptop and on an LPAR.) -> ->> -> ->> It looks to me we forget the check the existence of peer. -> ->> -> ->> Please try the attached patch to see if it works. -> -> Thanks, that patch gets my guest up and running again. So, FWIW, -> -> -> -> Tested-by: Cornelia Huck <cohuck@redhat.com> -> -> -> -> Any idea why this did not hit with virtio-net-pci (or the autogenerated -> -> virtio-net-ccw device)? -> -> -> -It can be hit with virtio-net-pci as well (just start without peer). -Hm, I had not been able to reproduce the crash with a 'naked' -device -virtio-net-pci. But checking seems to be the right idea anyway. - -> -> -For autogenerated virtio-net-cww, I think the reason is that it has -> -already had a peer set. -Ok, that might well be. - -On 2020/7/27 下午4:41, Cornelia Huck wrote: -On Mon, 27 Jul 2020 15:38:12 +0800 -Jason Wang <jasowang@redhat.com> wrote: -On 2020/7/27 下午2:43, Cornelia Huck wrote: -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang <jasowang@redhat.com> wrote: -On 2020/7/24 下午11:34, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S.
Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. 
- -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck <cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? -It can be hit with virtio-net-pci as well (just start without peer). -Hm, I had not been able to reproduce the crash with a 'naked' -device -virtio-net-pci. But checking seems to be the right idea anyway. -Sorry for being unclear, I meant for networking part, you just need -start without peer, and you need a real guest (any Linux) that is trying -to access the config space of virtio-net. -Thanks -For autogenerated virtio-net-cww, I think the reason is that it has -already had a peer set. -Ok, that might well be. - -On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: -> -> -On 2020/7/27 下午4:41, Cornelia Huck wrote: -> -> On Mon, 27 Jul 2020 15:38:12 +0800 -> -> Jason Wang <jasowang@redhat.com> wrote: -> -> -> -> > On 2020/7/27 下午2:43, Cornelia Huck wrote: -> -> > > On Sat, 25 Jul 2020 08:40:07 +0800 -> -> > > Jason Wang <jasowang@redhat.com> wrote: -> -> > > > On 2020/7/24 下午11:34, Cornelia Huck wrote: -> -> > > > > On Fri, 24 Jul 2020 11:17:57 -0400 -> -> > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> -> > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400 -> -> > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -> > > > > > > > > When I start qemu with a second virtio-net-ccw device (i.e.
-> -> > > > > > > > > adding -> -> > > > > > > > > -device virtio-net-ccw in addition to the autogenerated -> -> > > > > > > > > device), I get -> -> > > > > > > > > a segfault. gdb points to -> -> > > > > > > > > -> -> > > > > > > > > #0 0x000055d6ab52681d in virtio_net_get_config -> -> > > > > > > > > (vdev=<optimized out>, -> -> > > > > > > > > config=0x55d6ad9e3f80 "RT") at -> -> > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> > > > > > > > > 146 if (nc->peer->info->type == -> -> > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> > > > > > > > > -> -> > > > > > > > > (backtrace doesn't go further) -> -> > > > > > > The core was incomplete, but running under gdb directly shows -> -> > > > > > > that it -> -> > > > > > > is just a bog-standard config space access (first for that -> -> > > > > > > device). -> -> > > > > > > -> -> > > > > > > The cause of the crash is that nc->peer is not set... no idea -> -> > > > > > > how that -> -> > > > > > > can happen, not that familiar with that part of QEMU. (Should -> -> > > > > > > the code -> -> > > > > > > check, or is that really something that should not happen?) -> -> > > > > > > -> -> > > > > > > What I don't understand is why it is set correctly for the -> -> > > > > > > first, -> -> > > > > > > autogenerated virtio-net-ccw device, but not for the second -> -> > > > > > > one, and -> -> > > > > > > why virtio-net-pci doesn't show these problems. The only -> -> > > > > > > difference -> -> > > > > > > between -ccw and -pci that comes to my mind here is that config -> -> > > > > > > space -> -> > > > > > > accesses for ccw are done via an asynchronous operation, so -> -> > > > > > > timing -> -> > > > > > > might be different. -> -> > > > > > Hopefully Jason has an idea. Could you post a full command line -> -> > > > > > please? Do you need a working guest to trigger this? Does this -> -> > > > > > trigger -> -> > > > > > on an x86 host? 
-> -> > > > > Yes, it does trigger with tcg-on-x86 as well. I've been using -> -> > > > > -> -> > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu -> -> > > > > qemu,zpci=on -> -> > > > > -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 -> -> > > > > -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 -> -> > > > > -device -> -> > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -> -> > > > > -device virtio-net-ccw -> -> > > > > -> -> > > > > It seems it needs the guest actually doing something with the nics; -> -> > > > > I -> -> > > > > cannot reproduce the crash if I use the old advent calendar moon -> -> > > > > buggy -> -> > > > > image and just add a virtio-net-ccw device. -> -> > > > > -> -> > > > > (I don't think it's a problem with my local build, as I see the -> -> > > > > problem -> -> > > > > both on my laptop and on an LPAR.) -> -> > > > It looks to me we forget the check the existence of peer. -> -> > > > -> -> > > > Please try the attached patch to see if it works. -> -> > > Thanks, that patch gets my guest up and running again. So, FWIW, -> -> > > -> -> > > Tested-by: Cornelia Huck <cohuck@redhat.com> -> -> > > -> -> > > Any idea why this did not hit with virtio-net-pci (or the autogenerated -> -> > > virtio-net-ccw device)? -> -> > -> -> > It can be hit with virtio-net-pci as well (just start without peer). -> -> Hm, I had not been able to reproduce the crash with a 'naked' -device -> -> virtio-net-pci. But checking seems to be the right idea anyway. -> -> -> -Sorry for being unclear, I meant for networking part, you just need start -> -without peer, and you need a real guest (any Linux) that is trying to access -> -the config space of virtio-net. -> -> -Thanks -A pxe guest will do it, but that doesn't support ccw, right? - -I'm still unclear why this triggers with ccw but not pci - -any idea? 
- -> -> -> -> -> > For autogenerated virtio-net-cww, I think the reason is that it has -> -> > already had a peer set. -> -> Ok, that might well be. -> -> -> -> - -On 2020/7/27 下午7:43, Michael S. Tsirkin wrote: -On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: -On 2020/7/27 下午4:41, Cornelia Huck wrote: -On Mon, 27 Jul 2020 15:38:12 +0800 -Jason Wang<jasowang@redhat.com> wrote: -On 2020/7/27 下午2:43, Cornelia Huck wrote: -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang<jasowang@redhat.com> wrote: -On 2020/7/24 下午11:34, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different.
-Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck<cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? -It can be hit with virtio-net-pci as well (just start without peer). -Hm, I had not been able to reproduce the crash with a 'naked' -device -virtio-net-pci. But checking seems to be the right idea anyway. -Sorry for being unclear, I meant for networking part, you just need start -without peer, and you need a real guest (any Linux) that is trying to access -the config space of virtio-net. - -Thanks -A pxe guest will do it, but that doesn't support ccw, right? -Yes, it depends on the cli actually. -I'm still unclear why this triggers with ccw but not pci - -any idea? -I don't test pxe but I can reproduce this with pci (just start a linux -guest without a peer). -Thanks - -On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote: -> -> -On 2020/7/27 下午7:43, Michael S.
Tsirkin wrote: -> -> On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: -> -> > On 2020/7/27 4:41 PM, Cornelia Huck wrote: -> -> > > On Mon, 27 Jul 2020 15:38:12 +0800 -> -> > > Jason Wang<jasowang@redhat.com> wrote: -> -> > > -> -> > > > On 2020/7/27 2:43 PM, Cornelia Huck wrote: -> -> > > > > On Sat, 25 Jul 2020 08:40:07 +0800 -> -> > > > > Jason Wang<jasowang@redhat.com> wrote: -> -> > > > > > On 2020/7/24 11:34 PM, Cornelia Huck wrote: -> -> > > > > > > On Fri, 24 Jul 2020 11:17:57 -0400 -> -> > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> > > > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> -> > > > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400 -> -> > > > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> > > > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck -> -> > > > > > > > > > wrote: -> -> > > > > > > > > > > When I start qemu with a second virtio-net-ccw device -> -> > > > > > > > > > > (i.e. adding -> -> > > > > > > > > > > -device virtio-net-ccw in addition to the autogenerated -> -> > > > > > > > > > > device), I get -> -> > > > > > > > > > > a segfault. gdb points to -> -> > > > > > > > > > > -> -> > > > > > > > > > > #0 0x000055d6ab52681d in virtio_net_get_config -> -> > > > > > > > > > > (vdev=<optimized out>, -> -> > > > > > > > > > > config=0x55d6ad9e3f80 "RT") at -> -> > > > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> > > > > > > > > > > 146 if (nc->peer->info->type == -> -> > > > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> > > > > > > > > > > -> -> > > > > > > > > > > (backtrace doesn't go further) -> -> > > > > > > > > The core was incomplete, but running under gdb directly -> -> > > > > > > > > shows that it -> -> > > > > > > > > is just a bog-standard config space access (first for that -> -> > > > > > > > > device). -> -> > > > > > > > > -> -> > > > > > > > > The cause of the crash is that nc->peer is not set... 
no -> -> > > > > > > > > idea how that -> -> > > > > > > > > can happen, not that familiar with that part of QEMU. -> -> > > > > > > > > (Should the code -> -> > > > > > > > > check, or is that really something that should not happen?) -> -> > > > > > > > > -> -> > > > > > > > > What I don't understand is why it is set correctly for the -> -> > > > > > > > > first, -> -> > > > > > > > > autogenerated virtio-net-ccw device, but not for the second -> -> > > > > > > > > one, and -> -> > > > > > > > > why virtio-net-pci doesn't show these problems. The only -> -> > > > > > > > > difference -> -> > > > > > > > > between -ccw and -pci that comes to my mind here is that -> -> > > > > > > > > config space -> -> > > > > > > > > accesses for ccw are done via an asynchronous operation, so -> -> > > > > > > > > timing -> -> > > > > > > > > might be different. -> -> > > > > > > > Hopefully Jason has an idea. Could you post a full command -> -> > > > > > > > line -> -> > > > > > > > please? Do you need a working guest to trigger this? Does -> -> > > > > > > > this trigger -> -> > > > > > > > on an x86 host? -> -> > > > > > > Yes, it does trigger with tcg-on-x86 as well. 
I've been using -> -> > > > > > > -> -> > > > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -> -> > > > > > > -cpu qemu,zpci=on -> -> > > > > > > -m 1024 -nographic -device -> -> > > > > > > virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 -> -> > > > > > > -drive -> -> > > > > > > file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 -> -> > > > > > > -device -> -> > > > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -> -> > > > > > > -device virtio-net-ccw -> -> > > > > > > -> -> > > > > > > It seems it needs the guest actually doing something with the -> -> > > > > > > nics; I -> -> > > > > > > cannot reproduce the crash if I use the old advent calendar -> -> > > > > > > moon buggy -> -> > > > > > > image and just add a virtio-net-ccw device. -> -> > > > > > > -> -> > > > > > > (I don't think it's a problem with my local build, as I see the -> -> > > > > > > problem -> -> > > > > > > both on my laptop and on an LPAR.) -> -> > > > > > It looks to me we forget the check the existence of peer. -> -> > > > > > -> -> > > > > > Please try the attached patch to see if it works. -> -> > > > > Thanks, that patch gets my guest up and running again. So, FWIW, -> -> > > > > -> -> > > > > Tested-by: Cornelia Huck<cohuck@redhat.com> -> -> > > > > -> -> > > > > Any idea why this did not hit with virtio-net-pci (or the -> -> > > > > autogenerated -> -> > > > > virtio-net-ccw device)? -> -> > > > It can be hit with virtio-net-pci as well (just start without peer). -> -> > > Hm, I had not been able to reproduce the crash with a 'naked' -device -> -> > > virtio-net-pci. But checking seems to be the right idea anyway. -> -> > Sorry for being unclear, I meant for networking part, you just need start -> -> > without peer, and you need a real guest (any Linux) that is trying to -> -> > access -> -> > the config space of virtio-net. 
-> -> > -> -> > Thanks -> -> A pxe guest will do it, but that doesn't support ccw, right? -> -> -> -Yes, it depends on the cli actually. -> -> -> -> -> -> I'm still unclear why this triggers with ccw but not pci - -> -> any idea? -> -> -> -I don't test pxe but I can reproduce this with pci (just start a linux guest -> -without a peer). -> -> -Thanks -> -Might be a good addition to a unit test. Not sure what would the -test do exactly: just make sure guest runs? Looks like a lot of work -for an empty test ... maybe we can poke at the guest config with -qtest commands at least. - --- -MST - -On 2020/7/27 9:16 PM, Michael S. Tsirkin wrote: -On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote: -On 2020/7/27 7:43 PM, Michael S. Tsirkin wrote: -On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: -On 2020/7/27 4:41 PM, Cornelia Huck wrote: -On Mon, 27 Jul 2020 15:38:12 +0800 -Jason Wang<jasowang@redhat.com> wrote: -On 2020/7/27 2:43 PM, Cornelia Huck wrote: -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang<jasowang@redhat.com> wrote: -On 2020/7/24 11:34 PM, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). 
- -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck<cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? -It can be hit with virtio-net-pci as well (just start without peer). -Hm, I had not been able to reproduce the crash with a 'naked' -device -virtio-net-pci. But checking seems to be the right idea anyway. 
-Sorry for being unclear, I meant for networking part, you just need start -without peer, and you need a real guest (any Linux) that is trying to access -the config space of virtio-net. - -Thanks -A pxe guest will do it, but that doesn't support ccw, right? -Yes, it depends on the cli actually. -I'm still unclear why this triggers with ccw but not pci - -any idea? -I don't test pxe but I can reproduce this with pci (just start a linux guest -without a peer). - -Thanks -Might be a good addition to a unit test. Not sure what would the -test do exactly: just make sure guest runs? Looks like a lot of work -for an empty test ... maybe we can poke at the guest config with -qtest commands at least. -That should work or we can simply extend the exist virtio-net qtest to -do that. -Thanks - |
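The crash discussed in the thread comes from dereferencing nc->peer when no backend is attached, and the suggested remedy is to check that the peer exists first. The patch attached in the thread is not reproduced here, so the following is only a minimal sketch of that kind of guard in plain C. The struct, enum, and function names are hypothetical stand-ins, not QEMU's actual NetClientState definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for QEMU's net client types (assumption). */
enum client_driver { DRIVER_NONE, DRIVER_TAP, DRIVER_VHOST_VDPA };

struct client_info {
    enum client_driver type;
};

struct net_client {
    const struct client_info *info;
    struct net_client *peer; /* NULL when the NIC was started without a backend */
};

/*
 * Return true only when a peer exists and is a vhost-vdpa backend.
 * Every pointer is checked before it is dereferenced, so a NIC
 * without a peer yields false instead of a segfault.
 */
bool peer_is_vhost_vdpa(const struct net_client *nc)
{
    return nc != NULL &&
           nc->peer != NULL &&
           nc->peer->info != NULL &&
           nc->peer->info->type == DRIVER_VHOST_VDPA;
}
```

A config-space read handler could call a helper like this before taking the vhost-vdpa path and fall back to the locally stored config otherwise. Whether the real fix is structured this way is an assumption.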
