| author | Christian Krinitsin <mail@krinitsin.com> | 2025-07-03 19:39:53 +0200 |
|---|---|---|
| committer | Christian Krinitsin <mail@krinitsin.com> | 2025-07-03 19:39:53 +0200 |
| commit | dee4dcba78baf712cab403d47d9db319ab7f95d6 (patch) | |
| tree | 418478faf06786701a56268672f73d6b0b4eb239 /results/classifier/016/none | |
| parent | 4d9e26c0333abd39bdbd039dcdb30ed429c475ba (diff) | |
| download | emulator-bug-study-dee4dcba78baf712cab403d47d9db319ab7f95d6.tar.gz emulator-bug-study-dee4dcba78baf712cab403d47d9db319ab7f95d6.zip | |
restructure results
Diffstat (limited to 'results/classifier/016/none')
| -rw-r--r-- | results/classifier/016/none/23300761 | 340 | ||||
| -rw-r--r-- | results/classifier/016/none/42613410 | 176 | ||||
| -rw-r--r-- | results/classifier/016/none/42974450 | 456 | ||||
| -rw-r--r-- | results/classifier/016/none/48245039 | 557 | ||||
| -rw-r--r-- | results/classifier/016/none/50773216 | 137 | ||||
| -rw-r--r-- | results/classifier/016/none/55753058 | 320 | ||||
| -rw-r--r-- | results/classifier/016/none/56309929 | 207 | ||||
| -rw-r--r-- | results/classifier/016/none/65781993 | 2820 | ||||
| -rw-r--r-- | results/classifier/016/none/70294255 | 1088 | ||||
| -rw-r--r-- | results/classifier/016/none/70868267 | 67 | ||||
| -rw-r--r-- | results/classifier/016/none/80604314 | 1507 |
11 files changed, 0 insertions, 7675 deletions
diff --git a/results/classifier/016/none/23300761 b/results/classifier/016/none/23300761 deleted file mode 100644 index 2a3e6f16..00000000 --- a/results/classifier/016/none/23300761 +++ /dev/null @@ -1,340 +0,0 @@ -i386: 0.475 -x86: 0.171 -debug: 0.052 -files: 0.038 -performance: 0.030 -register: 0.029 -virtual: 0.027 -PID: 0.025 -TCG: 0.019 -semantic: 0.018 -operating system: 0.017 -socket: 0.013 -boot: 0.013 -hypervisor: 0.012 -device: 0.012 -user-level: 0.011 -risc-v: 0.010 -alpha: 0.007 -ppc: 0.006 -VMM: 0.005 -vnc: 0.004 -network: 0.004 -architecture: 0.003 -permissions: 0.003 -assembly: 0.003 -peripherals: 0.003 -kernel: 0.002 -arm: 0.002 -graphic: 0.002 -mistranslation: 0.001 -KVM: 0.000 - -[Qemu-devel] [BUG] 216 Alerts reported by LGTM for QEMU (some might be release critical) - -Hi, -LGTM reports 16 errors, 81 warnings and 119 recommendations: -https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list -. -Some of them are already know (wrong format strings), others look like -real errors: -- several multiplication results which don't work as they should in -contrib/vhost-user-gpu, block/* (m->nb_clusters * s->cluster_size only -32 bit!), target/i386/translate.c and other files -- potential buffer overflows in gdbstub.c and other files -I am afraid that the overflows in the block code are release critical, -maybe that in target/i386/translate.c and other errors, too. -About half of the alerts are issues which can be fixed later. - -Regards - -Stefan - -On 13/07/19 19:46, Stefan Weil wrote: -> -> -LGTM reports 16 errors, 81 warnings and 119 recommendations: -> -https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list -. 
-> -> -Some of them are already know (wrong format strings), others look like -> -real errors: -> -> -- several multiplication results which don't work as they should in -> -contrib/vhost-user-gpu, block/* (m->nb_clusters * s->cluster_size only -> -32 bit!), target/i386/translate.c and other files -m->nb_clusters here is limited by s->l2_slice_size (see for example -handle_alloc) so I wouldn't be surprised if this is a false positive. I -couldn't find this particular multiplication in Coverity, but it has -about 250 issues marked as intentional or false positive so there's -probably a lot of overlap with what LGTM found. - -Paolo - -Am 13.07.2019 um 21:42 schrieb Paolo Bonzini: -> -On 13/07/19 19:46, Stefan Weil wrote: -> -> LGTM reports 16 errors, 81 warnings and 119 recommendations: -> -> -https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list -. -> -> -> -> Some of them are already known (wrong format strings), others look like -> -> real errors: -> -> -> -> - several multiplication results which don't work as they should in -> -> contrib/vhost-user-gpu, block/* (m->nb_clusters * s->cluster_size only -> -> 32 bit!), target/i386/translate.c and other files -> -m->nb_clusters here is limited by s->l2_slice_size (see for example -> -handle_alloc) so I wouldn't be surprised if this is a false positive. I -> -couldn't find this particular multiplication in Coverity, but it has -> -about 250 issues marked as intentional or false positive so there's -> -probably a lot of overlap with what LGTM found. -> -> -Paolo -> -From other projects I know that there is a certain overlap between the -results from Coverity Scan an LGTM, but it is good to have both -analyzers, and the results from LGTM are typically quite reliable. - -Even if we know that there is no multiplication overflow, the code could -be modified. 
Either the assigned value should use the same data type as -the factors (possible when there is never an overflow, avoids a size -extension), or the multiplication could use the larger data type by -adding a type cast to one of the factors (then an overflow cannot -happen, static code analysers and human reviewers have an easier job, -but the multiplication costs more time). - -Stefan - -Am 14.07.2019 um 15:28 hat Stefan Weil geschrieben: -> -Am 13.07.2019 um 21:42 schrieb Paolo Bonzini: -> -> On 13/07/19 19:46, Stefan Weil wrote: -> ->> LGTM reports 16 errors, 81 warnings and 119 recommendations: -> ->> -https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list -. -> ->> -> ->> Some of them are already known (wrong format strings), others look like -> ->> real errors: -> ->> -> ->> - several multiplication results which don't work as they should in -> ->> contrib/vhost-user-gpu, block/* (m->nb_clusters * s->cluster_size only -> ->> 32 bit!), target/i386/translate.c and other files -Request sizes are limited to 32 bit in the generic block layer before -they are even passed to the individual block drivers, so most if not all -of these are going to be false positives. - -> -> m->nb_clusters here is limited by s->l2_slice_size (see for example -> -> handle_alloc) so I wouldn't be surprised if this is a false positive. I -> -> couldn't find this particular multiplication in Coverity, but it has -> -> about 250 issues marked as intentional or false positive so there's -> -> probably a lot of overlap with what LGTM found. -> -> -> -> Paolo -> -> -From other projects I know that there is a certain overlap between the -> -results from Coverity Scan an LGTM, but it is good to have both -> -analyzers, and the results from LGTM are typically quite reliable. -> -> -Even if we know that there is no multiplication overflow, the code could -> -be modified. 
Either the assigned value should use the same data type as -> -the factors (possible when there is never an overflow, avoids a size -> -extension), or the multiplication could use the larger data type by -> -adding a type cast to one of the factors (then an overflow cannot -> -happen, static code analysers and human reviewers have an easier job, -> -but the multiplication costs more time). -But if you look at the code we're talking about, you see that it's -complaining about things where being more explicit would make things -less readable. - -For example, if complains about the multiplication in this line: - - s->file_size += n * s->header.cluster_size; - -We know that n * s->header.cluster_size fits in 32 bits, but -s->file_size is 64 bits (and has to be 64 bits). Do you really think we -should introduce another uint32_t variable to store the intermediate -result? And if we cast n to uint64_t, not only might the multiplication -cost more time, but also human readers would wonder why the result could -become larger than 32 bits. So a cast would be misleading. - - -It also complains about this line: - - ret = bdrv_truncate(bs->file, (3 + l1_clusters) * s->cluster_size, - PREALLOC_MODE_OFF, &local_err); - -Here, we don't even assign the result to a 64 bit variable, but just -pass it to a function which takes a 64 bit parameter. Again, I don't -think introducing additional variables for the intermediate result or -adding casts would be an improvement of the situation. - - -So I don't think this is a good enough tool to base our code on what it -does and doesn't understand. It would have too much of a negative impact -on our code. We'd rather need a way to mark false positives as such and -move on without changing the code in such cases. - -Kevin - -On Sat, 13 Jul 2019 at 18:46, Stefan Weil <address@hidden> wrote: -> -LGTM reports 16 errors, 81 warnings and 119 recommendations: -> -https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list -. 
-I had a look at some of these before, but mostly I came -to the conclusion that it wasn't worth trying to put the -effort into keeping up with the site because they didn't -seem to provide any useful way to mark things as false -positives. Coverity has its flaws but at least you can do -that kind of thing in its UI (it runs at about a 33% fp -rate, I think.) "Analyzer thinks this multiply can overflow -but in fact it's not possible" is quite a common false -positive cause... - -Anyway, if you want to fish out specific issues, analyse -whether they're false positive or real, and report them -to the mailing list as followups to the patches which -introduced the issue, that's probably the best way for -us to make use of this analyzer. (That is essentially -what I do for coverity.) - -thanks --- PMM - -Am 14.07.2019 um 19:30 schrieb Peter Maydell: -[...] -> -"Analyzer thinks this multiply can overflow -> -but in fact it's not possible" is quite a common false -> -positive cause... -The analysers don't complain because a multiply can overflow. - -They complain because the code indicates that a larger result is -expected, for example uint64_t = uint32_t * uint32_t. They would not -complain for the same multiplication if it were assigned to a uint32_t. - -So there is a simple solution to write the code in a way which avoids -false positives... - -Stefan - -Stefan Weil <address@hidden> writes: - -> -Am 14.07.2019 um 19:30 schrieb Peter Maydell: -> -[...] -> -> "Analyzer thinks this multiply can overflow -> -> but in fact it's not possible" is quite a common false -> -> positive cause... -> -> -> -The analysers don't complain because a multiply can overflow. -> -> -They complain because the code indicates that a larger result is -> -expected, for example uint64_t = uint32_t * uint32_t. They would not -> -complain for the same multiplication if it were assigned to a uint32_t. -I agree this is an anti-pattern. 
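The pattern under discussion can be illustrated with a small model. This is only a sketch: Python integers do not overflow, so the 32-bit truncation of a C `uint32_t * uint32_t` product is simulated with a mask. The names `mul_u32` and `mul_widened` are made up for the illustration and are not QEMU code.

```python
U32_MASK = 0xFFFFFFFF  # models the value range of a C uint32_t


def mul_u32(a: int, b: int) -> int:
    """Model `uint32_t * uint32_t`: the product is truncated to 32 bits
    before it is assigned, even if the destination is a uint64_t."""
    return ((a & U32_MASK) * (b & U32_MASK)) & U32_MASK


def mul_widened(a: int, b: int) -> int:
    """Model `(uint64_t)a * b`: one factor is widened first, so the
    product of two 32-bit values always fits in 64 bits."""
    return (a & U32_MASK) * (b & U32_MASK)


# Small factors: both forms agree, which is the situation when the
# operands are known to be bounded and the warning is a false positive.
small = (mul_u32(16, 65536), mul_widened(16, 65536))

# Large factors: the 32-bit product wraps, the widened one does not.
large = (mul_u32(0x10000, 0x10000), mul_widened(0x10000, 0x10000))
```

The two forms only differ once the true product exceeds 32 bits, which is why an analyser that cannot see the bound on the operands flags the assignment to a wider type.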
- -> -So there is a simple solution to write the code in a way which avoids -> -false positives... -You wrote elsewhere in this thread: - - Either the assigned value should use the same data type as the - factors (possible when there is never an overflow, avoids a size - extension), or the multiplication could use the larger data type by - adding a type cast to one of the factors (then an overflow cannot - happen, static code analysers and human reviewers have an easier - job, but the multiplication costs more time). - -Makes sense to me. - -On 7/14/19 5:30 PM, Peter Maydell wrote: -> -I had a look at some of these before, but mostly I came -> -to the conclusion that it wasn't worth trying to put the -> -effort into keeping up with the site because they didn't -> -seem to provide any useful way to mark things as false -> -positives. Coverity has its flaws but at least you can do -> -that kind of thing in its UI (it runs at about a 33% fp -> -rate, I think.) -Yes, LGTM wants you to modify the source code with - - /* lgtm [cpp/some-warning-code] */ - -and on the same line as the reported problem. Which is mildly annoying in that -you're definitely committing to LGTM in the long term. Also for any -non-trivial bit of code, it will almost certainly run over 80 columns. 
- - -r~ - diff --git a/results/classifier/016/none/42613410 b/results/classifier/016/none/42613410 deleted file mode 100644 index 387e80bd..00000000 --- a/results/classifier/016/none/42613410 +++ /dev/null @@ -1,176 +0,0 @@ -network: 0.116 -x86: 0.043 -TCG: 0.038 -operating system: 0.031 -files: 0.031 -register: 0.030 -socket: 0.029 -virtual: 0.026 -i386: 0.021 -ppc: 0.020 -PID: 0.020 -VMM: 0.020 -hypervisor: 0.018 -arm: 0.018 -device: 0.017 -risc-v: 0.016 -alpha: 0.016 -boot: 0.013 -vnc: 0.013 -semantic: 0.012 -debug: 0.010 -KVM: 0.006 -kernel: 0.005 -user-level: 0.005 -performance: 0.004 -peripherals: 0.003 -architecture: 0.003 -permissions: 0.002 -graphic: 0.002 -assembly: 0.001 -mistranslation: 0.001 - -[Qemu-devel] [PATCH, Bug 1612908] scripts: Add TCP endpoints for qom-* scripts - -From: Carl Allendorph <address@hidden> - -I've created a patch for bug #1612908. The current docs for the scripts -in the "scripts/qmp/" directory suggest that both unix sockets and -tcp endpoints can be used. The TCP endpoints don't work for most of the -scripts, with notable exception of 'qmp-shell'. This patch attempts to -refactor the process of distinguishing between unix path endpoints and -tcp endpoints to work for all of these scripts. - -Carl Allendorph (1): - scripts: Add ability for qom-* python scripts to target tcp endpoints - - scripts/qmp/qmp-shell | 22 ++-------------------- - scripts/qmp/qmp.py | 23 ++++++++++++++++++++--- - 2 files changed, 22 insertions(+), 23 deletions(-) - --- -2.7.4 - -From: Carl Allendorph <address@hidden> - -The current code for QEMUMonitorProtocol accepts both a unix socket -endpoint as a string and a tcp endpoint as a tuple. Most of the scripts -that use this class don't massage the command line argument to generate -a tuple. This patch refactors qmp-shell slightly to reuse the existing -parsing of the "host:port" string for all the qom-* scripts. 
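The endpoint parsing described above can be sketched as a standalone function. This is a minimal illustration that mirrors the logic of the `__get_address` method in the patch; the names `parse_address` and `BadPort` are hypothetical and do not appear in the QEMU scripts.

```python
class BadPort(Exception):
    """Raised when the part after ':' is not an integer port."""


def parse_address(arg: str):
    """Return (host, port) for a "host:port" string, otherwise return
    the string unchanged so it can be used as a unix socket path."""
    parts = arg.split(':')
    if len(parts) == 2:
        try:
            port = int(parts[1])
        except ValueError:
            raise BadPort(arg)
        return (parts[0], port)
    # Anything without exactly one ':' is treated as a socket path.
    return arg


tcp_endpoint = parse_address("localhost:4444")
unix_endpoint = parse_address("/tmp/qmp.sock")
```

With this logic, a unix socket path that happens to contain exactly one colon would be interpreted as a TCP endpoint, and a string with more than one colon falls through to the socket-path case.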
- -Signed-off-by: Carl Allendorph <address@hidden> ---- - scripts/qmp/qmp-shell | 22 ++-------------------- - scripts/qmp/qmp.py | 23 ++++++++++++++++++++--- - 2 files changed, 22 insertions(+), 23 deletions(-) - -diff --git a/scripts/qmp/qmp-shell b/scripts/qmp/qmp-shell -index 0373b24..8a2a437 100755 ---- a/scripts/qmp/qmp-shell -+++ b/scripts/qmp/qmp-shell -@@ -83,9 +83,6 @@ class QMPCompleter(list): - class QMPShellError(Exception): - pass - --class QMPShellBadPort(QMPShellError): -- pass -- - class FuzzyJSON(ast.NodeTransformer): - '''This extension of ast.NodeTransformer filters literal "true/false/null" - values in an AST and replaces them by proper "True/False/None" values that -@@ -103,28 +100,13 @@ class FuzzyJSON(ast.NodeTransformer): - # _execute_cmd()). Let's design a better one. - class QMPShell(qmp.QEMUMonitorProtocol): - def __init__(self, address, pretty=False): -- qmp.QEMUMonitorProtocol.__init__(self, self.__get_address(address)) -+ qmp.QEMUMonitorProtocol.__init__(self, address) - self._greeting = None - self._completer = None - self._pretty = pretty - self._transmode = False - self._actions = list() - -- def __get_address(self, arg): -- """ -- Figure out if the argument is in the port:host form, if it's not it's -- probably a file path. 
-- """ -- addr = arg.split(':') -- if len(addr) == 2: -- try: -- port = int(addr[1]) -- except ValueError: -- raise QMPShellBadPort -- return ( addr[0], port ) -- # socket path -- return arg -- - def _fill_completion(self): - for cmd in self.cmd('query-commands')['return']: - self._completer.append(cmd['name']) -@@ -400,7 +382,7 @@ def main(): - - if qemu is None: - fail_cmdline() -- except QMPShellBadPort: -+ except qmp.QMPShellBadPort: - die('bad port number in command-line') - - try: -diff --git a/scripts/qmp/qmp.py b/scripts/qmp/qmp.py -index 62d3651..261ece8 100644 ---- a/scripts/qmp/qmp.py -+++ b/scripts/qmp/qmp.py -@@ -25,21 +25,23 @@ class QMPCapabilitiesError(QMPError): - class QMPTimeoutError(QMPError): - pass - -+class QMPShellBadPort(QMPError): -+ pass -+ - class QEMUMonitorProtocol: - def __init__(self, address, server=False, debug=False): - """ - Create a QEMUMonitorProtocol class. - - @param address: QEMU address, can be either a unix socket path (string) -- or a tuple in the form ( address, port ) for a TCP -- connection -+ or a TCP endpoint (string in the format "host:port") - @param server: server mode listens on the socket (bool) - @raise socket.error on socket connection errors - @note No connection is established, this is done by the connect() or - accept() methods - """ - self.__events = [] -- self.__address = address -+ self.__address = self.__get_address(address) - self._debug = debug - self.__sock = self.__get_sock() - if server: -@@ -47,6 +49,21 @@ class QEMUMonitorProtocol: - self.__sock.bind(self.__address) - self.__sock.listen(1) - -+ def __get_address(self, arg): -+ """ -+ Figure out if the argument is in the port:host form, if it's not it's -+ probably a file path. 
-+ """ -+ addr = arg.split(':') -+ if len(addr) == 2: -+ try: -+ port = int(addr[1]) -+ except ValueError: -+ raise QMPShellBadPort -+ return ( addr[0], port ) -+ # socket path -+ return arg -+ - def __get_sock(self): - if isinstance(self.__address, tuple): - family = socket.AF_INET --- -2.7.4 - diff --git a/results/classifier/016/none/42974450 b/results/classifier/016/none/42974450 deleted file mode 100644 index 9ab3582a..00000000 --- a/results/classifier/016/none/42974450 +++ /dev/null @@ -1,456 +0,0 @@ -operating system: 0.713 -kernel: 0.463 -debug: 0.442 -hypervisor: 0.390 -x86: 0.334 -virtual: 0.259 -files: 0.192 -TCG: 0.182 -register: 0.171 -device: 0.116 -KVM: 0.071 -i386: 0.064 -VMM: 0.054 -PID: 0.052 -ppc: 0.049 -boot: 0.047 -assembly: 0.037 -architecture: 0.035 -socket: 0.033 -network: 0.028 -user-level: 0.023 -risc-v: 0.023 -arm: 0.022 -semantic: 0.017 -vnc: 0.014 -alpha: 0.007 -peripherals: 0.007 -performance: 0.005 -permissions: 0.004 -graphic: 0.002 -mistranslation: 0.001 - -[Bug Report] Possible Missing Endianness Conversion - -The virtio packed virtqueue support patch[1] suggests converting -endianness by lines: - -virtio_tswap16s(vdev, &e->off_wrap); -virtio_tswap16s(vdev, &e->flags); - -Though both of these conversion statements aren't present in the -latest qemu code here[2] - -Is this intentional? - -[1]: -https://mail.gnu.org/archive/html/qemu-block/2019-10/msg01492.html -[2]: -https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L314 - -CCing Jason. - -On Mon, Jun 24, 2024 at 4:30 PM Xoykie <xoykie@gmail.com> wrote: -> -> -The virtio packed virtqueue support patch[1] suggests converting -> -endianness by lines: -> -> -virtio_tswap16s(vdev, &e->off_wrap); -> -virtio_tswap16s(vdev, &e->flags); -> -> -Though both of these conversion statements aren't present in the -> -latest qemu code here[2] -> -> -Is this intentional? -Good catch! 
- -It looks like it was removed (maybe by mistake) by commit -d152cdd6f6 ("virtio: use virtio accessor to access packed event") - -Jason can you confirm that? - -Thanks, -Stefano - -> -> -[1]: -https://mail.gnu.org/archive/html/qemu-block/2019-10/msg01492.html -> -[2]: -https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L314 -> - -On Mon, 24 Jun 2024 at 16:11, Stefano Garzarella <sgarzare@redhat.com> wrote: -> -> -CCing Jason. -> -> -On Mon, Jun 24, 2024 at 4:30 PM Xoykie <xoykie@gmail.com> wrote: -> -> -> -> The virtio packed virtqueue support patch[1] suggests converting -> -> endianness by lines: -> -> -> -> virtio_tswap16s(vdev, &e->off_wrap); -> -> virtio_tswap16s(vdev, &e->flags); -> -> -> -> Though both of these conversion statements aren't present in the -> -> latest qemu code here[2] -> -> -> -> Is this intentional? -> -> -Good catch! -> -> -It looks like it was removed (maybe by mistake) by commit -> -d152cdd6f6 ("virtio: use virtio accessor to access packed event") -That commit changes from: - -- address_space_read_cached(cache, off_off, &e->off_wrap, -- sizeof(e->off_wrap)); -- virtio_tswap16s(vdev, &e->off_wrap); - -which does a byte read of 2 bytes and then swaps the bytes -depending on the host endianness and the value of -virtio_access_is_big_endian() - -to this: - -+ e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); - -virtio_lduw_phys_cached() is a small function which calls -either lduw_be_phys_cached() or lduw_le_phys_cached() -depending on the value of virtio_access_is_big_endian(). -(And lduw_be_phys_cached() and lduw_le_phys_cached() do -the right thing for the host-endianness to do a "load -a specifically big or little endian 16-bit value".) 
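The difference between the two access patterns can be sketched outside QEMU. This illustration uses only Python's `int.from_bytes`; the helper names are hypothetical. `load_then_swap` models reading raw bytes in host order and swapping afterwards, while `load_explicit` models an accessor that is told the byte order up front, as `virtio_lduw_phys_cached()` is described as doing.

```python
import sys


def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit value."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


def load_then_swap(raw: bytes, device_big_endian: bool) -> int:
    """Read 2 bytes in host byte order, then swap if the device's byte
    order differs from the host's (the older, two-step pattern)."""
    value = int.from_bytes(raw[:2], sys.byteorder)
    host_big_endian = sys.byteorder == "big"
    if host_big_endian != device_big_endian:
        value = swap16(value)
    return value


def load_explicit(raw: bytes, device_big_endian: bool) -> int:
    """Read 2 bytes with the byte order stated up front; no later swap."""
    return int.from_bytes(raw[:2], "big" if device_big_endian else "little")


raw = bytes([0x12, 0x34])
results = [
    (load_then_swap(raw, be), load_explicit(raw, be)) for be in (False, True)
]
```

Both functions return the same value for either device byte order. Applying a further swap to the result of an explicit-order load would undo the conversion whenever host and device byte orders differ.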
- -Which is to say that because we use a load/store function that's -explicit about the size of the data type it is accessing, the -function itself can handle doing the load as big or little -endian, rather than the calling code having to do a manual swap after -it has done a load-as-bag-of-bytes. This is generally preferable -as it's less error-prone. - -(Explicit swap-after-loading still has a place where the -code is doing a load of a whole structure out of the -guest and then swapping each struct field after the fact, -because it means we can do a single load-from-guest-memory -rather than a whole sequence of calls all the way down -through the memory subsystem.) - -thanks --- PMM - -On Mon, Jun 24, 2024 at 04:19:52PM GMT, Peter Maydell wrote: -On Mon, 24 Jun 2024 at 16:11, Stefano Garzarella <sgarzare@redhat.com> wrote: -CCing Jason. - -On Mon, Jun 24, 2024 at 4:30 PM Xoykie <xoykie@gmail.com> wrote: -> -> The virtio packed virtqueue support patch[1] suggests converting -> endianness by lines: -> -> virtio_tswap16s(vdev, &e->off_wrap); -> virtio_tswap16s(vdev, &e->flags); -> -> Though both of these conversion statements aren't present in the -> latest qemu code here[2] -> -> Is this intentional? - -Good catch! - -It looks like it was removed (maybe by mistake) by commit -d152cdd6f6 ("virtio: use virtio accessor to access packed event") -That commit changes from: - -- address_space_read_cached(cache, off_off, &e->off_wrap, -- sizeof(e->off_wrap)); -- virtio_tswap16s(vdev, &e->off_wrap); - -which does a byte read of 2 bytes and then swaps the bytes -depending on the host endianness and the value of -virtio_access_is_big_endian() - -to this: - -+ e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); - -virtio_lduw_phys_cached() is a small function which calls -either lduw_be_phys_cached() or lduw_le_phys_cached() -depending on the value of virtio_access_is_big_endian(). 
-(And lduw_be_phys_cached() and lduw_le_phys_cached() do -the right thing for the host-endianness to do a "load -a specifically big or little endian 16-bit value".) - -Which is to say that because we use a load/store function that's -explicit about the size of the data type it is accessing, the -function itself can handle doing the load as big or little -endian, rather than the calling code having to do a manual swap after -it has done a load-as-bag-of-bytes. This is generally preferable -as it's less error-prone. -Thanks for the details! - -So, should we also remove `virtio_tswap16s(vdev, &e->flags);` ? - -I mean: -diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c -index 893a072c9d..2e5e67bdb9 100644 ---- a/hw/virtio/virtio.c -+++ b/hw/virtio/virtio.c -@@ -323,7 +323,6 @@ static void vring_packed_event_read(VirtIODevice *vdev, - /* Make sure flags is seen before off_wrap */ - smp_rmb(); - e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); -- virtio_tswap16s(vdev, &e->flags); - } - - static void vring_packed_off_wrap_write(VirtIODevice *vdev, - -Thanks, -Stefano -(Explicit swap-after-loading still has a place where the -code is doing a load of a whole structure out of the -guest and then swapping each struct field after the fact, -because it means we can do a single load-from-guest-memory -rather than a whole sequence of calls all the way down -through the memory subsystem.) - -thanks --- PMM - -On Tue, 25 Jun 2024 at 08:18, Stefano Garzarella <sgarzare@redhat.com> wrote: -> -> -On Mon, Jun 24, 2024 at 04:19:52PM GMT, Peter Maydell wrote: -> ->On Mon, 24 Jun 2024 at 16:11, Stefano Garzarella <sgarzare@redhat.com> wrote: -> ->> -> ->> CCing Jason. 
-> ->> -> ->> On Mon, Jun 24, 2024 at 4:30 PM Xoykie <xoykie@gmail.com> wrote: -> ->> > -> ->> > The virtio packed virtqueue support patch[1] suggests converting -> ->> > endianness by lines: -> ->> > -> ->> > virtio_tswap16s(vdev, &e->off_wrap); -> ->> > virtio_tswap16s(vdev, &e->flags); -> ->> > -> ->> > Though both of these conversion statements aren't present in the -> ->> > latest qemu code here[2] -> ->> > -> ->> > Is this intentional? -> ->> -> ->> Good catch! -> ->> -> ->> It looks like it was removed (maybe by mistake) by commit -> ->> d152cdd6f6 ("virtio: use virtio accessor to access packed event") -> -> -> ->That commit changes from: -> -> -> ->- address_space_read_cached(cache, off_off, &e->off_wrap, -> ->- sizeof(e->off_wrap)); -> ->- virtio_tswap16s(vdev, &e->off_wrap); -> -> -> ->which does a byte read of 2 bytes and then swaps the bytes -> ->depending on the host endianness and the value of -> ->virtio_access_is_big_endian() -> -> -> ->to this: -> -> -> ->+ e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); -> -> -> ->virtio_lduw_phys_cached() is a small function which calls -> ->either lduw_be_phys_cached() or lduw_le_phys_cached() -> ->depending on the value of virtio_access_is_big_endian(). -> ->(And lduw_be_phys_cached() and lduw_le_phys_cached() do -> ->the right thing for the host-endianness to do a "load -> ->a specifically big or little endian 16-bit value".) -> -> -> ->Which is to say that because we use a load/store function that's -> ->explicit about the size of the data type it is accessing, the -> ->function itself can handle doing the load as big or little -> ->endian, rather than the calling code having to do a manual swap after -> ->it has done a load-as-bag-of-bytes. This is generally preferable -> ->as it's less error-prone. -> -> -Thanks for the details! -> -> -So, should we also remove `virtio_tswap16s(vdev, &e->flags);` ? 
-> -> -I mean: -> -diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c -> -index 893a072c9d..2e5e67bdb9 100644 -> ---- a/hw/virtio/virtio.c -> -+++ b/hw/virtio/virtio.c -> -@@ -323,7 +323,6 @@ static void vring_packed_event_read(VirtIODevice *vdev, -> -/* Make sure flags is seen before off_wrap */ -> -smp_rmb(); -> -e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); -> -- virtio_tswap16s(vdev, &e->flags); -> -} -That definitely looks like it's probably not correct... - --- PMM - -On Fri, Jun 28, 2024 at 03:53:09PM GMT, Peter Maydell wrote: -On Tue, 25 Jun 2024 at 08:18, Stefano Garzarella <sgarzare@redhat.com> wrote: -On Mon, Jun 24, 2024 at 04:19:52PM GMT, Peter Maydell wrote: ->On Mon, 24 Jun 2024 at 16:11, Stefano Garzarella <sgarzare@redhat.com> wrote: ->> ->> CCing Jason. ->> ->> On Mon, Jun 24, 2024 at 4:30 PM Xoykie <xoykie@gmail.com> wrote: ->> > ->> > The virtio packed virtqueue support patch[1] suggests converting ->> > endianness by lines: ->> > ->> > virtio_tswap16s(vdev, &e->off_wrap); ->> > virtio_tswap16s(vdev, &e->flags); ->> > ->> > Though both of these conversion statements aren't present in the ->> > latest qemu code here[2] ->> > ->> > Is this intentional? ->> ->> Good catch! ->> ->> It looks like it was removed (maybe by mistake) by commit ->> d152cdd6f6 ("virtio: use virtio accessor to access packed event") -> ->That commit changes from: -> ->- address_space_read_cached(cache, off_off, &e->off_wrap, ->- sizeof(e->off_wrap)); ->- virtio_tswap16s(vdev, &e->off_wrap); -> ->which does a byte read of 2 bytes and then swaps the bytes ->depending on the host endianness and the value of ->virtio_access_is_big_endian() -> ->to this: -> ->+ e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); -> ->virtio_lduw_phys_cached() is a small function which calls ->either lduw_be_phys_cached() or lduw_le_phys_cached() ->depending on the value of virtio_access_is_big_endian(). 
->(And lduw_be_phys_cached() and lduw_le_phys_cached() do ->the right thing for the host-endianness to do a "load ->a specifically big or little endian 16-bit value".) -> ->Which is to say that because we use a load/store function that's ->explicit about the size of the data type it is accessing, the ->function itself can handle doing the load as big or little ->endian, rather than the calling code having to do a manual swap after ->it has done a load-as-bag-of-bytes. This is generally preferable ->as it's less error-prone. - -Thanks for the details! - -So, should we also remove `virtio_tswap16s(vdev, &e->flags);` ? - -I mean: -diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c -index 893a072c9d..2e5e67bdb9 100644 ---- a/hw/virtio/virtio.c -+++ b/hw/virtio/virtio.c -@@ -323,7 +323,6 @@ static void vring_packed_event_read(VirtIODevice *vdev, - /* Make sure flags is seen before off_wrap */ - smp_rmb(); - e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); -- virtio_tswap16s(vdev, &e->flags); - } -That definitely looks like it's probably not correct... -Yeah, I just sent that patch: -https://lore.kernel.org/qemu-devel/20240701075208.19634-1-sgarzare@redhat.com -We can continue the discussion there. 
- -Thanks, -Stefano - diff --git a/results/classifier/016/none/48245039 b/results/classifier/016/none/48245039 deleted file mode 100644 index 913c2333..00000000 --- a/results/classifier/016/none/48245039 +++ /dev/null @@ -1,557 +0,0 @@ -user-level: 0.787 -performance: 0.642 -operating system: 0.416 -risc-v: 0.375 -debug: 0.341 -x86: 0.185 -TCG: 0.172 -ppc: 0.166 -device: 0.139 -arm: 0.119 -VMM: 0.111 -boot: 0.111 -files: 0.104 -PID: 0.099 -vnc: 0.095 -register: 0.088 -socket: 0.085 -network: 0.081 -i386: 0.071 -alpha: 0.059 -hypervisor: 0.056 -virtual: 0.055 -peripherals: 0.054 -kernel: 0.026 -semantic: 0.025 -architecture: 0.011 -KVM: 0.010 -mistranslation: 0.005 -assembly: 0.004 -graphic: 0.004 -permissions: 0.002 - -[Qemu-devel] [BUG] gcov support appears to be broken - -Hello, according to out docs, here is the procedure that should produce -coverage report for execution of the complete "make check": - -#./configure --enable-gcov -#make -#make check -#make coverage-report - -It seems that first three commands execute as expected. (For example, there are -plenty of files generated by "make check" that would've not been generated if -"enable-gcov" hadn't been chosen.) However, the last command complains about -some missing files related to FP support. If those files are added (for -example, artificially, using "touch <missing-file"), that it starts complaining -about missing some decodetree-generated files. Other kinds of files are -involved too. - -It would be nice to have coverage support working. Please somebody take a look, -or explain if I make a mistake or misunderstood our gcov support. 
- -Yours, -Aleksandar - -On Mon, 5 Aug 2019 at 11:39, Aleksandar Markovic <address@hidden> wrote: -> -> -Hello, according to out docs, here is the procedure that should produce -> -coverage report for execution of the complete "make check": -> -> -#./configure --enable-gcov -> -#make -> -#make check -> -#make coverage-report -> -> -It seems that first three commands execute as expected. (For example, there -> -are plenty of files generated by "make check" that would've not been -> -generated if "enable-gcov" hadn't been chosen.) However, the last command -> -complains about some missing files related to FP support. If those files are -> -added (for example, artificially, using "touch <missing-file"), that it -> -starts complaining about missing some decodetree-generated files. Other kinds -> -of files are involved too. -> -> -It would be nice to have coverage support working. Please somebody take a -> -look, or explain if I make a mistake or misunderstood our gcov support. -Cc'ing Alex who's probably the closest we have to a gcov expert. - -(make/make check of a --enable-gcov build is in the set of things our -Travis CI setup runs, so we do defend that part against regressions.) - -thanks --- PMM - -Peter Maydell <address@hidden> writes: - -> -On Mon, 5 Aug 2019 at 11:39, Aleksandar Markovic <address@hidden> wrote: -> -> -> -> Hello, according to out docs, here is the procedure that should produce -> -> coverage report for execution of the complete "make check": -> -> -> -> #./configure --enable-gcov -> -> #make -> -> #make check -> -> #make coverage-report -> -> -> -> It seems that first three commands execute as expected. (For example, -> -> there are plenty of files generated by "make check" that would've not -> -> been generated if "enable-gcov" hadn't been chosen.) However, the -> -> last command complains about some missing files related to FP -> -> support. 
If those files are added (for example, artificially, using -> -> "touch <missing-file"), that it starts complaining about missing some -> -> decodetree-generated files. Other kinds of files are involved too. -The gcov tool is fairly noisy about missing files but that just -indicates the tests haven't exercised those code paths. "make check" -especially doesn't touch much of the TCG code and a chunk of floating -point. - -> -> -> -> It would be nice to have coverage support working. Please somebody -> -> take a look, or explain if I make a mistake or misunderstood our gcov -> -> support. -So your failure mode is no report is generated at all? It's working for -me here. - -> -> -Cc'ing Alex who's probably the closest we have to a gcov expert. -> -> -(make/make check of a --enable-gcov build is in the set of things our -> -Travis CI setup runs, so we do defend that part against regressions.) -We defend the build but I have just checked and it seems our -check_coverage script is currently failing: -https://travis-ci.org/stsquad/qemu/jobs/567809808#L10328 -But as it's an after_success script it doesn't fail the build. - -> -> -thanks -> --- PMM --- -Alex Bennée - -> -> #./configure --enable-gcov -> -> #make -> -> #make check -> -> #make coverage-report -> -> -> -> It seems that first three commands execute as expected. (For example, -> -> there are plenty of files generated by "make check" that would've not -> -> been generated if "enable-gcov" hadn't been chosen.) However, the -> -> last command complains about some missing files related to FP -> -So your failure mode is no report is generated at all? It's working for -> -me here. -Alex, no report is generated for my test setups - in fact, "make -coverage-report" even says that it explicitly deletes what appears to be the -main coverage report html file). 
- -This is the terminal output of an unsuccessful executions of "make -coverage-report" for recent ToT: - -~/Build/qemu-TOT-TEST$ make coverage-report -make[1]: Entering directory '/home/user/Build/qemu-TOT-TEST/slirp' -make[1]: Nothing to be done for 'all'. -make[1]: Leaving directory '/home/user/Build/qemu-TOT-TEST/slirp' - CHK version_gen.h - GEN coverage-report.html -Traceback (most recent call last): - File "/usr/bin/gcovr", line 1970, in <module> - print_html_report(covdata, options.html_details) - File "/usr/bin/gcovr", line 1473, in print_html_report - INPUT = open(data['FILENAME'], 'r') -IOError: [Errno 2] No such file or directory: 'wrap.inc.c' -Makefile:1048: recipe for target -'/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html' failed -make: *** -[/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html] Error 1 -make: *** Deleting file -'/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html' - -This instance is executed in QEMU 3.0 source tree: (so, it looks the problem -existed for quite some time) - -~/Build/qemu-3.0$ make coverage-report - CHK version_gen.h - GEN coverage-report.html -Traceback (most recent call last): - File "/usr/bin/gcovr", line 1970, in <module> - print_html_report(covdata, options.html_details) - File "/usr/bin/gcovr", line 1473, in print_html_report - INPUT = open(data['FILENAME'], 'r') -IOError: [Errno 2] No such file or directory: -'/home/user/Build/qemu-3.0/target/openrisc/decode.inc.c' -Makefile:992: recipe for target -'/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html' failed -make: *** [/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html] -Error 1 -make: *** Deleting file -'/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html' - -Fond regards, -Aleksandar - - -> -Alex Bennée - -> -> #./configure --enable-gcov -> -> #make -> -> #make check -> -> #make coverage-report -> -> -> -> It seems that first three commands execute as expected. 
(For example, -> -> there are plenty of files generated by "make check" that would've not -> -> been generated if "enable-gcov" hadn't been chosen.) However, the -> -> last command complains about some missing files related to FP -> -So your failure mode is no report is generated at all? It's working for -> -me here. -Another piece of info: - -~/Build/qemu-TOT-TEST$ gcov --version -gcov (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010 -Copyright (C) 2015 Free Software Foundation, Inc. -This is free software; see the source for copying conditions. -There is NO warranty; not even for MERCHANTABILITY or -FITNESS FOR A PARTICULAR PURPOSE. - -:~/Build/qemu-TOT-TEST$ gcc --version -gcc (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0 -Copyright (C) 2017 Free Software Foundation, Inc. -This is free software; see the source for copying conditions. There is NO -warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. - - - - -Alex, no report is generated for my test setups - in fact, "make -coverage-report" even says that it explicitly deletes what appears to be the -main coverage report html file). - -This is the terminal output of an unsuccessful executions of "make -coverage-report" for recent ToT: - -~/Build/qemu-TOT-TEST$ make coverage-report -make[1]: Entering directory '/home/user/Build/qemu-TOT-TEST/slirp' -make[1]: Nothing to be done for 'all'. 
-make[1]: Leaving directory '/home/user/Build/qemu-TOT-TEST/slirp' - CHK version_gen.h - GEN coverage-report.html -Traceback (most recent call last): - File "/usr/bin/gcovr", line 1970, in <module> - print_html_report(covdata, options.html_details) - File "/usr/bin/gcovr", line 1473, in print_html_report - INPUT = open(data['FILENAME'], 'r') -IOError: [Errno 2] No such file or directory: 'wrap.inc.c' -Makefile:1048: recipe for target -'/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html' failed -make: *** -[/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html] Error 1 -make: *** Deleting file -'/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html' - -This instance is executed in QEMU 3.0 source tree: (so, it looks the problem -existed for quite some time) - -~/Build/qemu-3.0$ make coverage-report - CHK version_gen.h - GEN coverage-report.html -Traceback (most recent call last): - File "/usr/bin/gcovr", line 1970, in <module> - print_html_report(covdata, options.html_details) - File "/usr/bin/gcovr", line 1473, in print_html_report - INPUT = open(data['FILENAME'], 'r') -IOError: [Errno 2] No such file or directory: -'/home/user/Build/qemu-3.0/target/openrisc/decode.inc.c' -Makefile:992: recipe for target -'/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html' failed -make: *** [/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html] -Error 1 -make: *** Deleting file -'/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html' - -Fond regards, -Aleksandar - - -> -Alex Bennée - -> -> #./configure --enable-gcov -> -> #make -> -> #make check -> -> #make coverage-report -> -> -> -> It seems that first three commands execute as expected. (For example, -> -> there are plenty of files generated by "make check" that would've not -> -> been generated if "enable-gcov" hadn't been chosen.) 
However, the -> -> last command complains about some missing files related to FP -> -So your failure mode is no report is generated at all? It's working for -> -me here. -Alex, here is the thing: - -Seeing that my gcovr is relatively old (2014) 3.2 version, I upgraded it from -git repo to the most recent 4.1 (actually, to a dev version, from the very tip -of the tree), and "make coverage-report" started generating coverage reports. -It did emit some error messages (totally different than previous), but still it -did not stop like it used to do with gcovr 3.2. - -Perhaps you would want to add some gcov/gcovr minimal version info in our docs. -(or at least a statement "this was tested with such and such gcc, gcov and -gcovr", etc.?) - -Coverage report looked fine at first glance, but it a kind of disappointed me -when I digged deeper into its content - for example, it shows very low coverage -for our FP code (softfloat), while, in fact, we know that "make check" contains -detailed tests on FP functionalities. But this is most likely a separate -problem of a very different nature, perhaps the issue of separate git repo for -FP tests (testfloat) that our FP tests use as a mid-layer. - -I'll try how everything works with my test examples, and will let you know. - -Your help is greatly appreciated, -Aleksandar - -Fond regards, -Aleksandar - - -> -Alex Bennée - -Aleksandar Markovic <address@hidden> writes: - -> ->> #./configure --enable-gcov -> ->> #make -> ->> #make check -> ->> #make coverage-report -> ->> -> ->> It seems that first three commands execute as expected. (For example, -> ->> there are plenty of files generated by "make check" that would've not -> ->> been generated if "enable-gcov" hadn't been chosen.) However, the -> ->> last command complains about some missing files related to FP -> -> -> So your failure mode is no report is generated at all? It's working for -> -> me here. 
-> -> -Alex, here is the thing: -> -> -Seeing that my gcovr is relatively old (2014) 3.2 version, I upgraded it from -> -git repo to the most recent 4.1 (actually, to a dev version, from the very -> -tip of the tree), and "make coverage-report" started generating coverage -> -reports. It did emit some error messages (totally different than previous), -> -but still it did not stop like it used to do with gcovr 3.2. -> -> -Perhaps you would want to add some gcov/gcovr minimal version info in our -> -docs. (or at least a statement "this was tested with such and such gcc, gcov -> -and gcovr", etc.?) -> -> -Coverage report looked fine at first glance, but it a kind of -> -disappointed me when I digged deeper into its content - for example, -> -it shows very low coverage for our FP code (softfloat), while, in -> -fact, we know that "make check" contains detailed tests on FP -> -functionalities. But this is most likely a separate problem of a very -> -different nature, perhaps the issue of separate git repo for FP tests -> -(testfloat) that our FP tests use as a mid-layer. -I get: - -68.6 % 2593 / 3782 62.2 % 1690 / 2718 - -Which is not bad considering we don't exercise the 80 and 128 bit -softfloat code at all (which is not shared by the re-factored 16/32/64 -bit code). - -> -> -I'll try how everything works with my test examples, and will let you know. -> -> -Your help is greatly appreciated, -> -Aleksandar -> -> -Fond regards, -> -Aleksandar -> -> -> -> Alex Bennée --- -Alex Bennée - -> -> it shows very low coverage for our FP code (softfloat), while, in -> -> fact, we know that "make check" contains detailed tests on FP -> -> functionalities. But this is most likely a separate problem of a very -> -> different nature, perhaps the issue of separate git repo for FP tests -> -> (testfloat) that our FP tests use as a mid-layer. -> -> -I get: -> -> -68.6 % 2593 / 3782 62.2 % 1690 / 2718 -> -I would expect that kind of result too. 
- -However, I get: - -File: fpu/softfloat.c Lines: 8 3334 0.2 % -Date: 2019-08-05 19:56:58 Branches: 3 2376 0.1 % - -:( - -OK, I'll try to figure that out, and most likely I could live with it if it is -an isolated problem. - -Thank you for your assistance in this matter, -Aleksandar - -> -Which is not bad considering we don't exercise the 80 and 128 bit -> -softfloat code at all (which is not shared by the re-factored 16/32/64 -> -bit code). -> -> -Alex Bennée - -> -> it shows very low coverage for our FP code (softfloat), while, in -> -> fact, we know that "make check" contains detailed tests on FP -> -> functionalities. But this is most likely a separate problem of a very -> -> different nature, perhaps the issue of separate git repo for FP tests -> -> (testfloat) that our FP tests use as a mid-layer. -> -> -I get: -> -> -68.6 % 2593 / 3782 62.2 % 1690 / 2718 -> -This problem is solved too. (and it is my fault) - -I worked with multiple versions of QEMU, and my previous low-coverage results -were for QEMU 3.0, and for that version the directory tests/fp did not even -exist. :D (<blush>) - -For QEMU ToT, I get now: - -fpu/softfloat.c - 68.8 % 2592 / 3770 62.3 % 1693 / 2718 - -which is identical for all intents and purposes to your result. 
- -Yours cordially, -Aleksandar - diff --git a/results/classifier/016/none/50773216 b/results/classifier/016/none/50773216 deleted file mode 100644 index 5a856c2f..00000000 --- a/results/classifier/016/none/50773216 +++ /dev/null @@ -1,137 +0,0 @@ -virtual: 0.431 -debug: 0.366 -register: 0.178 -x86: 0.170 -vnc: 0.116 -operating system: 0.087 -hypervisor: 0.084 -files: 0.082 -PID: 0.068 -TCG: 0.058 -i386: 0.053 -network: 0.041 -user-level: 0.040 -performance: 0.039 -kernel: 0.038 -semantic: 0.031 -socket: 0.027 -alpha: 0.027 -ppc: 0.026 -device: 0.018 -boot: 0.017 -permissions: 0.007 -assembly: 0.006 -arm: 0.006 -peripherals: 0.004 -risc-v: 0.004 -VMM: 0.004 -graphic: 0.003 -architecture: 0.003 -KVM: 0.002 -mistranslation: 0.002 - -[Qemu-devel] Can I have someone's feedback on [bug 1809075] Concurrency bug on keyboard events: capslock LED messing up keycode streams causes character misses at guest kernel - -Hi everyone. -Can I please have someone's feedback on this bug? -https://bugs.launchpad.net/qemu/+bug/1809075 -Briefly, guest OS loses characters sent to it via vnc. And I spot the -bug in relation to ps2 driver. -I'm thinking of possible fixes and I might want to use a memory barrier. -But I would really like to have some suggestion from a qemu developer -first. For example, can we brutally drop capslock LED key events in ps2 -queue? -It is actually relevant to openQA, an automated QA tool for openSUSE. -And this bug blocks a few test cases for us. -Thank you in advance! - -Kind regards, -Gao Zhiyuan - -Cc'ing Marc-André & Gerd. - -On 12/19/18 10:31 AM, Gao Zhiyuan wrote: -> -Hi everyone. -> -> -Can I please have someone's feedback on this bug? -> -https://bugs.launchpad.net/qemu/+bug/1809075 -> -Briefly, guest OS loses characters sent to it via vnc. And I spot the -> -bug in relation to ps2 driver. -> -> -I'm thinking of possible fixes and I might want to use a memory barrier. -> -But I would really like to have some suggestion from a qemu developer -> -first. 
For example, can we brutally drop capslock LED key events in ps2 -> -queue? -> -> -It is actually relevant to openQA, an automated QA tool for openSUSE. -> -And this bug blocks a few test cases for us. -> -> -Thank you in advance! -> -> -Kind regards, -> -Gao Zhiyuan -> - -On Thu, Jan 03, 2019 at 12:05:54PM +0100, Philippe Mathieu-Daudé wrote: -> -Cc'ing Marc-André & Gerd. -> -> -On 12/19/18 10:31 AM, Gao Zhiyuan wrote: -> -> Hi everyone. -> -> -> -> Can I please have someone's feedback on this bug? -> -> -https://bugs.launchpad.net/qemu/+bug/1809075 -> -> Briefly, guest OS loses characters sent to it via vnc. And I spot the -> -> bug in relation to ps2 driver. -> -> -> -> I'm thinking of possible fixes and I might want to use a memory barrier. -> -> But I would really like to have some suggestion from a qemu developer -> -> first. For example, can we brutally drop capslock LED key events in ps2 -> -> queue? -There is no "capslock LED key event". 0xfa is KBD_REPLY_ACK, and the -device queues it in response to guest port writes. Yes, the ack can -race with actual key events. But IMO that isn't a bug in qemu. - -Probably the linux kernel just throws away everything until it got the -ack for the port write, and that way the key event gets lost. On -physical hardware you will not notice because it is next to impossible -to type fast enough to hit the race window. - -So, go fix the kernel. - -Alternatively fix vncdotool to send uppercase letters properly with -shift key pressed. Then qemu wouldn't generate capslock key events -(that happens because qemu thinks guest and host capslock state is out -of sync) and the guests's capslock led update request wouldn't get into -the way. 
- -cheers, - Gerd - diff --git a/results/classifier/016/none/55753058 b/results/classifier/016/none/55753058 deleted file mode 100644 index 7cfee9ee..00000000 --- a/results/classifier/016/none/55753058 +++ /dev/null @@ -1,320 +0,0 @@ -x86: 0.784 -operating system: 0.778 -kernel: 0.648 -debug: 0.645 -user-level: 0.550 -hypervisor: 0.097 -files: 0.093 -performance: 0.091 -assembly: 0.071 -virtual: 0.065 -PID: 0.040 -TCG: 0.037 -register: 0.035 -ppc: 0.022 -semantic: 0.010 -network: 0.007 -device: 0.006 -boot: 0.005 -architecture: 0.004 -i386: 0.004 -alpha: 0.003 -arm: 0.003 -socket: 0.002 -permissions: 0.002 -risc-v: 0.002 -vnc: 0.002 -graphic: 0.002 -peripherals: 0.001 -VMM: 0.001 -mistranslation: 0.001 -KVM: 0.000 - -[RESEND][BUG FIX HELP] QEMU main thread endlessly hangs in __ppoll() - -Hi Genius, -I am a user of QEMU v4.2.0 and stuck in an interesting bug, which may still -exist in the mainline. -Thanks in advance to heroes who can take a look and share understanding. - -The qemu main thread endlessly hangs in the handle of the qmp statement: -{'execute': 'human-monitor-command', 'arguments':{ 'command-line': -'drive_del replication0' } } -and we have the call trace looks like: -#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1, -timeout=<optimized out>, timeout@entry=0x7ffc56c66db0, -sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44 -#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0, -__nfds=<optimized out>, __fds=<optimized out>) -at /usr/include/x86_64-linux-gnu/bits/poll2.h:77 -#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, -timeout=<optimized out>) at util/qemu-timer.c:348 -#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0, -blocking=blocking@entry=true) at util/aio-posix.c:669 -#4 0x000055561019268d in bdrv_do_drained_begin (poll=true, -ignore_bds_parents=false, parent=0x0, recursive=false, -bs=0x55561138b0a0) at block/io.c:430 -#5 bdrv_do_drained_begin (bs=0x55561138b0a0, 
recursive=<optimized out>, -parent=0x0, ignore_bds_parents=<optimized out>, -poll=<optimized out>) at block/io.c:396 -#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0, -child=0x7f36dc0ce380, errp=<optimized out>) -at block/quorum.c:1063 -#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120 -"colo-disk0", has_child=<optimized out>, -child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0, -errp=0x7ffc56c66f98) at blockdev.c:4494 -#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized -out>, ret=<optimized out>, errp=0x7ffc56c67018) -at qapi/qapi-commands-block-core.c:1538 -#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010, -allow_oob=<optimized out>, request=<optimized out>, -cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132 -#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized -out>, allow_oob=<optimized out>) -at qapi/qmp-dispatch.c:175 -#11 0x00005556100d4c4d in monitor_qmp_dispatch (mon=0x5556113a6f40, -req=<optimized out>) at monitor/qmp.c:145 -#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized out>) -at monitor/qmp.c:234 -#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164bGrateful0) at -util/async.c:117 -#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117 -#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at -util/aio-posix.c:459 -#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>, -callback=<optimized out>, user_data=<optimized out>) -at util/async.c:260 -#17 0x00007f3c22302fbd in g_main_context_dispatch () from -/lib/x86_64-linux-gnu/libglib-2.0.so.0 -#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219 -#19 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:242 -#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518 -#21 0x000055560ff600fe in main_loop () at vl.c:1814 -#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized 
out>, -envp=<optimized out>) at vl.c:4503 -We found that we're doing endless check in the line of -block/io.c:bdrv_do_drained_begin(): -BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent)); -and it turns out that the bdrv_drain_poll() always get true from: -- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents) -- AND atomic_read(&bs->in_flight) - -I personally think this is a deadlock issue in the a QEMU block layer -(as we know, we have some #FIXME comments in related codes, such as block -permisson update). -Any comments are welcome and appreciated. - ---- -thx,likexu - -On 2/28/21 9:39 PM, Like Xu wrote: -Hi Genius, -I am a user of QEMU v4.2.0 and stuck in an interesting bug, which may -still exist in the mainline. -Thanks in advance to heroes who can take a look and share understanding. -Do you have a test case that reproduces on 5.2? It'd be nice to know if -it was still a problem in the latest source tree or not. ---js -The qemu main thread endlessly hangs in the handle of the qmp statement: -{'execute': 'human-monitor-command', 'arguments':{ 'command-line': -'drive_del replication0' } } -and we have the call trace looks like: -#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1, -timeout=<optimized out>, timeout@entry=0x7ffc56c66db0, -sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44 -#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0, -__nfds=<optimized out>, __fds=<optimized out>) -at /usr/include/x86_64-linux-gnu/bits/poll2.h:77 -#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, -timeout=<optimized out>) at util/qemu-timer.c:348 -#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0, -blocking=blocking@entry=true) at util/aio-posix.c:669 -#4 0x000055561019268d in bdrv_do_drained_begin (poll=true, -ignore_bds_parents=false, parent=0x0, recursive=false, -bs=0x55561138b0a0) at block/io.c:430 -#5 bdrv_do_drained_begin (bs=0x55561138b0a0, recursive=<optimized out>, 
-parent=0x0, ignore_bds_parents=<optimized out>, -poll=<optimized out>) at block/io.c:396 -#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0, -child=0x7f36dc0ce380, errp=<optimized out>) -at block/quorum.c:1063 -#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120 -"colo-disk0", has_child=<optimized out>, -child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0, -errp=0x7ffc56c66f98) at blockdev.c:4494 -#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized -out>, ret=<optimized out>, errp=0x7ffc56c67018) -at qapi/qapi-commands-block-core.c:1538 -#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010, -allow_oob=<optimized out>, request=<optimized out>, -cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132 -#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized -out>, allow_oob=<optimized out>) -at qapi/qmp-dispatch.c:175 -#11 0x00005556100d4c4d in monitor_qmp_dispatch (mon=0x5556113a6f40, -req=<optimized out>) at monitor/qmp.c:145 -#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized -out>) at monitor/qmp.c:234 -#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164bGrateful0) at -util/async.c:117 -#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117 -#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at -util/aio-posix.c:459 -#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>, -callback=<optimized out>, user_data=<optimized out>) -at util/async.c:260 -#17 0x00007f3c22302fbd in g_main_context_dispatch () from -/lib/x86_64-linux-gnu/libglib-2.0.so.0 -#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219 -#19 os_host_main_loop_wait (timeout=<optimized out>) at -util/main-loop.c:242 -#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518 -#21 0x000055560ff600fe in main_loop () at vl.c:1814 -#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized -out>, envp=<optimized 
out>) at vl.c:4503 -We found that we're doing endless check in the line of -block/io.c:bdrv_do_drained_begin(): -    BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent)); -and it turns out that the bdrv_drain_poll() always get true from: -- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents) -- AND atomic_read(&bs->in_flight) - -I personally think this is a deadlock issue in the a QEMU block layer -(as we know, we have some #FIXME comments in related codes, such as -block permisson update). -Any comments are welcome and appreciated. - ---- -thx,likexu - -Hi John, - -Thanks for your comment. - -On 2021/3/5 7:53, John Snow wrote: -On 2/28/21 9:39 PM, Like Xu wrote: -Hi Genius, -I am a user of QEMU v4.2.0 and stuck in an interesting bug, which may -still exist in the mainline. -Thanks in advance to heroes who can take a look and share understanding. -Do you have a test case that reproduces on 5.2? It'd be nice to know if it -was still a problem in the latest source tree or not. -We narrowed down the source of the bug, which basically came from -the following qmp usage: -{'execute': 'human-monitor-command', 'arguments':{ 'command-line': -'drive_del replication0' } } -One of the test cases is the COLO usage (docs/colo-proxy.txt). - -This issue is sporadic,the probability may be 1/15 for a io-heavy guest. - -I believe it's reproducible on 5.2 and the latest tree.
---js -The qemu main thread endlessly hangs in the handle of the qmp statement: -{'execute': 'human-monitor-command', 'arguments':{ 'command-line': -'drive_del replication0' } } -and we have the call trace looks like: -#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1, -timeout=<optimized out>, timeout@entry=0x7ffc56c66db0, -sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44 -#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0, -__nfds=<optimized out>, __fds=<optimized out>) -at /usr/include/x86_64-linux-gnu/bits/poll2.h:77 -#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, -timeout=<optimized out>) at util/qemu-timer.c:348 -#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0, -blocking=blocking@entry=true) at util/aio-posix.c:669 -#4 0x000055561019268d in bdrv_do_drained_begin (poll=true, -ignore_bds_parents=false, parent=0x0, recursive=false, -bs=0x55561138b0a0) at block/io.c:430 -#5 bdrv_do_drained_begin (bs=0x55561138b0a0, recursive=<optimized out>, -parent=0x0, ignore_bds_parents=<optimized out>, -poll=<optimized out>) at block/io.c:396 -#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0, -child=0x7f36dc0ce380, errp=<optimized out>) -at block/quorum.c:1063 -#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120 -"colo-disk0", has_child=<optimized out>, -child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0, -errp=0x7ffc56c66f98) at blockdev.c:4494 -#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized -out>, ret=<optimized out>, errp=0x7ffc56c67018) -at qapi/qapi-commands-block-core.c:1538 -#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010, -allow_oob=<optimized out>, request=<optimized out>, -cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132 -#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized -out>, allow_oob=<optimized out>) -at qapi/qmp-dispatch.c:175 -#11 0x00005556100d4c4d in 
monitor_qmp_dispatch (mon=0x5556113a6f40, -req=<optimized out>) at monitor/qmp.c:145 -#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized -out>) at monitor/qmp.c:234 -#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164bGrateful0) at -util/async.c:117 -#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117 -#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at -util/aio-posix.c:459 -#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>, -callback=<optimized out>, user_data=<optimized out>) -at util/async.c:260 -#17 0x00007f3c22302fbd in g_main_context_dispatch () from -/lib/x86_64-linux-gnu/libglib-2.0.so.0 -#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219 -#19 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:242 -#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518 -#21 0x000055560ff600fe in main_loop () at vl.c:1814 -#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized -out>, envp=<optimized out>) at vl.c:4503 -We found that we're doing endless check in the line of -block/io.c:bdrv_do_drained_begin(): -     BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent)); -and it turns out that the bdrv_drain_poll() always get true from: -- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents) -- AND atomic_read(&bs->in_flight) - -I personally think this is a deadlock issue in the a QEMU block layer -(as we know, we have some #FIXME comments in related codes, such as block -permisson update). -Any comments are welcome and appreciated. - ---- -thx,likexu - -On 3/4/21 10:08 PM, Like Xu wrote: -Hi John, - -Thanks for your comment. - -On 2021/3/5 7:53, John Snow wrote: -On 2/28/21 9:39 PM, Like Xu wrote: -Hi Genius, -I am a user of QEMU v4.2.0 and stuck in an interesting bug, which may -still exist in the mainline. -Thanks in advance to heroes who can take a look and share understanding.
-Do you have a test case that reproduces on 5.2? It'd be nice to know -if it was still a problem in the latest source tree or not. -We narrowed down the source of the bug, which basically came from -the following qmp usage: -{'execute': 'human-monitor-command', 'arguments':{ 'command-line': -'drive_del replication0' } } -One of the test cases is the COLO usage (docs/colo-proxy.txt). - -This issue is sporadic,the probability may be 1/15 for a io-heavy guest. - -I believe it's reproducible on 5.2 and the latest tree. -Can you please test and confirm that this is the case, and then file a -bug report on the LP: -https://launchpad.net/qemu -and include: -- The exact commit you used (current origin/master debug build would be -the most ideal.) -- Which QEMU binary you are using (qemu-system-x86_64?) -- The shortest command line you are aware of that reproduces the problem -- The host OS and kernel version -- An updated call trace -- Any relevant commands issued prior to the one that caused the hang; or -detailed reproduction steps if possible. -Thanks, ---js - diff --git a/results/classifier/016/none/56309929 b/results/classifier/016/none/56309929 deleted file mode 100644 index 9eb7151e..00000000 --- a/results/classifier/016/none/56309929 +++ /dev/null @@ -1,207 +0,0 @@ -kernel: 0.698 -files: 0.249 -operating system: 0.101 -semantic: 0.071 -TCG: 0.056 -debug: 0.041 -virtual: 0.024 -ppc: 0.017 -PID: 0.015 -hypervisor: 0.015 -register: 0.013 -VMM: 0.013 -x86: 0.011 -user-level: 0.007 -device: 0.007 -performance: 0.007 -network: 0.004 -architecture: 0.003 -risc-v: 0.003 -alpha: 0.003 -KVM: 0.003 -permissions: 0.002 -arm: 0.002 -peripherals: 0.002 -vnc: 0.002 -socket: 0.002 -boot: 0.002 -graphic: 0.001 -assembly: 0.001 -mistranslation: 0.001 -i386: 0.001 - -[Qemu-devel] [BUG 2.6] Broken CONFIG_TPM? 
- -A compilation test with clang -Weverything reported this problem: - -config-host.h:112:20: warning: '$' in identifier -[-Wdollar-in-identifier-extension] - -The line of code looks like this: - -#define CONFIG_TPM $(CONFIG_SOFTMMU) - -This is fine for Makefile code, but won't work as expected in C code. - -Am 28.04.2016 um 22:33 schrieb Stefan Weil: -> -A compilation test with clang -Weverything reported this problem: -> -> -config-host.h:112:20: warning: '$' in identifier -> -[-Wdollar-in-identifier-extension] -> -> -The line of code looks like this: -> -> -#define CONFIG_TPM $(CONFIG_SOFTMMU) -> -> -This is fine for Makefile code, but won't work as expected in C code. -> -A complete 64 bit build with clang -Weverything creates a log file of -1.7 GB. -Here are the uniq warnings sorted by their frequency: - - 1 -Wflexible-array-extensions - 1 -Wgnu-folding-constant - 1 -Wunknown-pragmas - 1 -Wunknown-warning-option - 1 -Wunreachable-code-loop-increment - 2 -Warray-bounds-pointer-arithmetic - 2 -Wdollar-in-identifier-extension - 3 -Woverlength-strings - 3 -Wweak-vtables - 4 -Wgnu-empty-struct - 4 -Wstring-conversion - 6 -Wclass-varargs - 7 -Wc99-extensions - 7 -Wc++-compat - 8 -Wfloat-equal - 11 -Wformat-nonliteral - 16 -Wshift-negative-value - 19 -Wglobal-constructors - 28 -Wc++11-long-long - 29 -Wembedded-directive - 38 -Wvla - 40 -Wcovered-switch-default - 40 -Wmissing-variable-declarations - 49 -Wold-style-cast - 53 -Wgnu-conditional-omitted-operand - 56 -Wformat-pedantic - 61 -Wvariadic-macros - 77 -Wc++11-extensions - 83 -Wgnu-flexible-array-initializer - 83 -Wzero-length-array - 96 -Wgnu-designator - 102 -Wmissing-noreturn - 103 -Wconditional-uninitialized - 107 -Wdisabled-macro-expansion - 115 -Wunreachable-code-return - 134 -Wunreachable-code - 243 -Wunreachable-code-break - 257 -Wfloat-conversion - 280 -Wswitch-enum - 291 -Wpointer-arith - 298 -Wshadow - 378 -Wassign-enum - 395 -Wused-but-marked-unused - 420 -Wreserved-id-macro - 493 -Wdocumentation - 
510 -Wshift-sign-overflow - 565 -Wgnu-case-range - 566 -Wgnu-zero-variadic-macro-arguments - 650 -Wbad-function-cast - 705 -Wmissing-field-initializers - 817 -Wgnu-statement-expression - 968 -Wdocumentation-unknown-command - 1021 -Wextra-semi - 1112 -Wgnu-empty-initializer - 1138 -Wcast-qual - 1509 -Wcast-align - 1766 -Wextended-offsetof - 1937 -Wsign-compare - 2130 -Wpacked - 2404 -Wunused-macros - 3081 -Wpadded - 4182 -Wconversion - 5430 -Wlanguage-extension-token - 6655 -Wshorten-64-to-32 - 6995 -Wpedantic - 7354 -Wunused-parameter - 27659 -Wsign-conversion - -Stefan Weil <address@hidden> writes: - -> -A compilation test with clang -Weverything reported this problem: -> -> -config-host.h:112:20: warning: '$' in identifier -> -[-Wdollar-in-identifier-extension] -> -> -The line of code looks like this: -> -> -#define CONFIG_TPM $(CONFIG_SOFTMMU) -> -> -This is fine for Makefile code, but won't work as expected in C code. -Broken in commit 3b8acc1 "configure: fix TPM logic". Cc'ing Paolo. - -Impact: #ifdef CONFIG_TPM never disables code. There are no other uses -of CONFIG_TPM in C code. - -I had a quick peek at configure and create_config, but refrained from -attempting to fix this, since I don't understand when exactly CONFIG_TPM -should be defined. - -On 29 April 2016 at 08:42, Markus Armbruster <address@hidden> wrote: -> -Stefan Weil <address@hidden> writes: -> -> -> A compilation test with clang -Weverything reported this problem: -> -> -> -> config-host.h:112:20: warning: '$' in identifier -> -> [-Wdollar-in-identifier-extension] -> -> -> -> The line of code looks like this: -> -> -> -> #define CONFIG_TPM $(CONFIG_SOFTMMU) -> -> -> -> This is fine for Makefile code, but won't work as expected in C code. -> -> -Broken in commit 3b8acc1 "configure: fix TPM logic". Cc'ing Paolo. -> -> -Impact: #ifdef CONFIG_TPM never disables code. There are no other uses -> -of CONFIG_TPM in C code. 
-> -> -I had a quick peek at configure and create_config, but refrained from -> -attempting to fix this, since I don't understand when exactly CONFIG_TPM -> -should be defined. -Looking at 'git blame' suggests this has been wrong like this for -some years, so we don't need to scramble to fix it for 2.6. - -thanks --- PMM - diff --git a/results/classifier/016/none/65781993 b/results/classifier/016/none/65781993 deleted file mode 100644 index 92cd0275..00000000 --- a/results/classifier/016/none/65781993 +++ /dev/null @@ -1,2820 +0,0 @@ -debug: 0.668 -hypervisor: 0.630 -operating system: 0.229 -socket: 0.188 -files: 0.071 -performance: 0.053 -network: 0.043 -x86: 0.027 -virtual: 0.022 -register: 0.021 -TCG: 0.019 -kernel: 0.013 -i386: 0.011 -device: 0.010 -permissions: 0.009 -alpha: 0.008 -PID: 0.008 -semantic: 0.006 -ppc: 0.006 -assembly: 0.004 -risc-v: 0.004 -user-level: 0.004 -boot: 0.003 -architecture: 0.003 -arm: 0.002 -vnc: 0.002 -VMM: 0.002 -mistranslation: 0.002 -graphic: 0.002 -peripherals: 0.001 -KVM: 0.001 - -[Qemu-devel] Reply: Re: Reply: Re: [BUG]COLO failover hang - -Thank you. - -I have tested already. - -When the Primary Node panic, the Secondary Node qemu hang at the same place. - -According to -http://wiki.qemu-project.org/Features/COLO -, kill Primary Node qemu -will not produce the problem, but Primary Node panic can. - -I think due to the feature of channel does not support -QIO_CHANNEL_FEATURE_SHUTDOWN. - - -when failover, channel_shutdown could not shut down the channel. - - -so the colo_process_incoming_thread will hang at recvmsg. 
- - -I test a patch: - - -diff --git a/migration/socket.c b/migration/socket.c - - -index 13966f1..d65a0ea 100644 - - ---- a/migration/socket.c - - -+++ b/migration/socket.c - - -@@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel -*ioc, - - - } - - - - - - trace_migration_socket_incoming_accepted(); - - - - - - qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming"); - - -+ qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN); - - - migration_channel_process_incoming(migrate_get_current(), - - - QIO_CHANNEL(sioc)); - - - object_unref(OBJECT(sioc)); - - - - -My test will not hang any more. - - - - - - - - - - - - - - - - - -Original Mail - - - -From: address@hidden -To: Wang Guang 10165992 address@hidden -Cc: address@hidden address@hidden -Date: 2017-03-21 15:58 -Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang - - - - - -Hi, Wang. - -You can test this branch: -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -and please follow wiki ensure your own configuration correctly. -http://wiki.qemu-project.org/Features/COLO -Thanks - -Zhang Chen - - -On 03/21/2017 03:27 PM, address@hidden wrote: -> -> hi. -> -> I test the git qemu master have the same problem. 
-> -> (gdb) bt -> -> #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -> niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -> -> #1 0x00007f658e4aa0c2 in qio_channel_read -> (address@hidden, address@hidden "", -> address@hidden, address@hidden) at io/channel.c:114 -> -> #2 0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>, -> buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at -> migration/qemu-file-channel.c:78 -> -> #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -> migration/qemu-file.c:295 -> -> #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -> address@hidden) at migration/qemu-file.c:555 -> -> #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -> migration/qemu-file.c:568 -> -> #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -> migration/qemu-file.c:648 -> -> #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -> address@hidden) at migration/colo.c:244 -> -> #8 0x00007f658e3e681e in colo_receive_check_message (f=<optimized -> out>, address@hidden, -> address@hidden) -> -> at migration/colo.c:264 -> -> #9 0x00007f658e3e740e in colo_process_incoming_thread -> (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577 -> -> #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -> -> #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -> -> (gdb) p ioc->name -> -> $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -> -> (gdb) p ioc->features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -> -> $3 = 0 -> -> -> (gdb) bt -> -> #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -> condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -> -> #1 0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at -> gmain.c:3054 -> -> #2 g_main_context_dispatch (context=<optimized out>, -> address@hidden) at gmain.c:3630 -> -> #3 
0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -> -> #4 os_host_main_loop_wait (timeout=<optimized out>) at -> util/main-loop.c:258 -> -> #5 main_loop_wait (address@hidden) at -> util/main-loop.c:506 -> -> #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -> -> #7 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized -> out>) at vl.c:4709 -> -> (gdb) p ioc->features -> -> $1 = 6 -> -> (gdb) p ioc->name -> -> $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -> -> -> May be socket_accept_incoming_migration should -> call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -> -> -> thank you. -> -> -> -> -> -> Original Mail -> address@hidden -> address@hidden -> address@hidden@huawei.com> -> *Date:* 2017-03-16 14:46 -> *Subject:* *Re: [Qemu-devel] COLO failover hang* -> -> -> -> -> On 03/15/2017 05:06 PM, wangguang wrote: -> > am testing QEMU COLO feature described here [QEMU -> > Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -> > -> > When the Primary Node panic, the Secondary Node qemu hang. -> > hang at recvmsg in qio_channel_socket_readv. -> > And I run { 'execute': 'nbd-server-stop' } and { "execute": -> > "x-colo-lost-heartbeat" } in Secondary VM's -> > monitor, the Secondary Node qemu still hang at recvmsg. -> > -> > I found that the colo in qemu is not complete yet. -> > Do the colo have any plan for development? -> -> Yes, We are developing. You can see some of patch we pushing. -> -> > Has anyone ever run it successfully? Any help is appreciated! -> -> In our internal version can run it successfully, -> The failover detail you can ask Zhanghailiang for help. 
-> Next time if you have some question about COLO, -> please cc me and zhanghailiang address@hidden -> -> -> Thanks -> Zhang Chen -> -> -> > -> > -> > -> > centos7.2+qemu2.7.50 -> > (gdb) bt -> > #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 -> > #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>, -> > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at -> > io/channel-socket.c:497 -> > #2 0x00007f3e03329472 in qio_channel_read (address@hidden, -> > address@hidden "", address@hidden, -> > address@hidden) at io/channel.c:97 -> > #3 0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>, -> > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at -> > migration/qemu-file-channel.c:78 -> > #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at -> > migration/qemu-file.c:257 -> > #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, -> > address@hidden) at migration/qemu-file.c:510 -> > #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at -> > migration/qemu-file.c:523 -> > #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at -> > migration/qemu-file.c:603 -> > #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, -> > address@hidden) at migration/colo.c:215 -> > #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, -> > checkpoint_request=<synthetic pointer>, f=<optimized out>) at -> > migration/colo.c:546 -> > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at -> > migration/colo.c:649 -> > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 -> > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 -> > -> > -> > -> > -> > -> > -- -> > View this message in context: -http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html -> > Sent from the Developer mailing list archive at Nabble.com. 
-> > -> > -> > -> > -> -> -- -> Thanks -> Zhang Chen -> -> -> -> -> - --- -Thanks -Zhang Chen - -Hi, - -On 2017/3/21 16:10, address@hidden wrote: -Thank you. - -I have tested already. - -When the Primary Node panic, the Secondary Node qemu hang at the same place. - -According to -http://wiki.qemu-project.org/Features/COLO -, kill Primary Node qemu -will not produce the problem, but Primary Node panic can. - -I think due to the feature of channel does not support -QIO_CHANNEL_FEATURE_SHUTDOWN. -Yes, you are right, when we do failover for primary/secondary VM, we will -shutdown the related -fd in case it is stuck in the read/write fd. - -It seems that you didn't follow the above introduction exactly to do the test. -Could you -share your test procedures? Especially the commands used in the test. - -Thanks, -Hailiang -when failover, channel_shutdown could not shut down the channel. - - -so the colo_process_incoming_thread will hang at recvmsg. - - -I test a patch: - - -diff --git a/migration/socket.c b/migration/socket.c - - -index 13966f1..d65a0ea 100644 - - ---- a/migration/socket.c - - -+++ b/migration/socket.c - - -@@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel -*ioc, - - - } - - - - - - trace_migration_socket_incoming_accepted(); - - - - - - qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming"); - - -+ qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN); - - - migration_channel_process_incoming(migrate_get_current(), - - - QIO_CHANNEL(sioc)); - - - object_unref(OBJECT(sioc)); - - - - -My test will not hang any more. - - - - - - - - - - - - - - - - - -Original Mail - - - -From: address@hidden -To: Wang Guang 10165992 address@hidden -Cc: address@hidden address@hidden -Date: 2017-03-21 15:58 -Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang - - - - - -Hi, Wang. 
- -You can test this branch: -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -and please follow wiki ensure your own configuration correctly. -http://wiki.qemu-project.org/Features/COLO -Thanks - -Zhang Chen - - -On 03/21/2017 03:27 PM, address@hidden wrote: -> -> hi. -> -> I test the git qemu master have the same problem. -> -> (gdb) bt -> -> #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -> niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -> -> #1 0x00007f658e4aa0c2 in qio_channel_read -> (address@hidden, address@hidden "", -> address@hidden, address@hidden) at io/channel.c:114 -> -> #2 0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>, -> buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at -> migration/qemu-file-channel.c:78 -> -> #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -> migration/qemu-file.c:295 -> -> #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -> address@hidden) at migration/qemu-file.c:555 -> -> #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -> migration/qemu-file.c:568 -> -> #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -> migration/qemu-file.c:648 -> -> #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -> address@hidden) at migration/colo.c:244 -> -> #8 0x00007f658e3e681e in colo_receive_check_message (f=<optimized -> out>, address@hidden, -> address@hidden) -> -> at migration/colo.c:264 -> -> #9 0x00007f658e3e740e in colo_process_incoming_thread -> (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577 -> -> #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -> -> #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -> -> (gdb) p ioc->name -> -> $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -> -> (gdb) p ioc->features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -> -> $3 = 0 -> 
-> -> (gdb) bt -> -> #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -> condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -> -> #1 0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at -> gmain.c:3054 -> -> #2 g_main_context_dispatch (context=<optimized out>, -> address@hidden) at gmain.c:3630 -> -> #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -> -> #4 os_host_main_loop_wait (timeout=<optimized out>) at -> util/main-loop.c:258 -> -> #5 main_loop_wait (address@hidden) at -> util/main-loop.c:506 -> -> #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -> -> #7 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized -> out>) at vl.c:4709 -> -> (gdb) p ioc->features -> -> $1 = 6 -> -> (gdb) p ioc->name -> -> $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -> -> -> May be socket_accept_incoming_migration should -> call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -> -> -> thank you. -> -> -> -> -> -> Original Mail -> address@hidden -> address@hidden -> address@hidden@huawei.com> -> *Date:* 2017-03-16 14:46 -> *Subject:* *Re: [Qemu-devel] COLO failover hang* -> -> -> -> -> On 03/15/2017 05:06 PM, wangguang wrote: -> > am testing QEMU COLO feature described here [QEMU -> > Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -> > -> > When the Primary Node panic, the Secondary Node qemu hang. -> > hang at recvmsg in qio_channel_socket_readv. -> > And I run { 'execute': 'nbd-server-stop' } and { "execute": -> > "x-colo-lost-heartbeat" } in Secondary VM's -> > monitor, the Secondary Node qemu still hang at recvmsg. -> > -> > I found that the colo in qemu is not complete yet. -> > Do the colo have any plan for development? -> -> Yes, We are developing. You can see some of patch we pushing. -> -> > Has anyone ever run it successfully? Any help is appreciated! 
-> -> In our internal version can run it successfully, -> The failover detail you can ask Zhanghailiang for help. -> Next time if you have some question about COLO, -> please cc me and zhanghailiang address@hidden -> -> -> Thanks -> Zhang Chen -> -> -> > -> > -> > -> > centos7.2+qemu2.7.50 -> > (gdb) bt -> > #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 -> > #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>, -> > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at -> > io/channel-socket.c:497 -> > #2 0x00007f3e03329472 in qio_channel_read (address@hidden, -> > address@hidden "", address@hidden, -> > address@hidden) at io/channel.c:97 -> > #3 0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>, -> > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at -> > migration/qemu-file-channel.c:78 -> > #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at -> > migration/qemu-file.c:257 -> > #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, -> > address@hidden) at migration/qemu-file.c:510 -> > #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at -> > migration/qemu-file.c:523 -> > #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at -> > migration/qemu-file.c:603 -> > #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, -> > address@hidden) at migration/colo.c:215 -> > #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, -> > checkpoint_request=<synthetic pointer>, f=<optimized out>) at -> > migration/colo.c:546 -> > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at -> > migration/colo.c:649 -> > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 -> > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 -> > -> > -> > -> > -> > -> > -- -> > View this message in context: 
-http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html -> > Sent from the Developer mailing list archive at Nabble.com. -> > -> > -> > -> > -> -> -- -> Thanks -> Zhang Chen -> -> -> -> -> - -Hi, - -Thanks for reporting this, and I confirmed it in my test, and it is a bug. - -Though we tried to call qemu_file_shutdown() to shutdown the related fd, in -case COLO thread/incoming thread is stuck in read/write() while do failover, -but it didn't take effect, because all the fd used by COLO (also migration) -has been wrapped by qio channel, and it will not call the shutdown API if -we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN). - -Cc: Dr. David Alan Gilbert <address@hidden> - -I doubted migration cancel has the same problem, it may be stuck in write() -if we tried to cancel migration. - -void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error -**errp) -{ - qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing"); - migration_channel_connect(s, ioc, NULL); - ... ... -We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) above, -and the -migrate_fd_cancel() -{ - ... ... - if (s->state == MIGRATION_STATUS_CANCELLING && f) { - qemu_file_shutdown(f); --> This will not take effect. No ? - } -} - -Thanks, -Hailiang - -On 2017/3/21 16:10, address@hidden wrote: -Thank you. - -I have tested already. - -When the Primary Node panic, the Secondary Node qemu hang at the same place. - -According to -http://wiki.qemu-project.org/Features/COLO -, kill Primary Node qemu -will not produce the problem, but Primary Node panic can. - -I think due to the feature of channel does not support -QIO_CHANNEL_FEATURE_SHUTDOWN. - - -when failover, channel_shutdown could not shut down the channel. - - -so the colo_process_incoming_thread will hang at recvmsg. 
- - -I test a patch: - - -diff --git a/migration/socket.c b/migration/socket.c - - -index 13966f1..d65a0ea 100644 - - ---- a/migration/socket.c - - -+++ b/migration/socket.c - - -@@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel -*ioc, - - - } - - - - - - trace_migration_socket_incoming_accepted(); - - - - - - qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming"); - - -+ qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN); - - - migration_channel_process_incoming(migrate_get_current(), - - - QIO_CHANNEL(sioc)); - - - object_unref(OBJECT(sioc)); - - - - -My test will not hang any more. - - - - - - - - - - - - - - - - - -Original Mail - - - -From: address@hidden -To: Wang Guang 10165992 address@hidden -Cc: address@hidden address@hidden -Date: 2017-03-21 15:58 -Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang - - - - - -Hi, Wang. - -You can test this branch: -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -and please follow wiki ensure your own configuration correctly. -http://wiki.qemu-project.org/Features/COLO -Thanks - -Zhang Chen - - -On 03/21/2017 03:27 PM, address@hidden wrote: -> -> hi. -> -> I test the git qemu master have the same problem. 
-> -> (gdb) bt -> -> #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -> niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -> -> #1 0x00007f658e4aa0c2 in qio_channel_read -> (address@hidden, address@hidden "", -> address@hidden, address@hidden) at io/channel.c:114 -> -> #2 0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>, -> buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at -> migration/qemu-file-channel.c:78 -> -> #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -> migration/qemu-file.c:295 -> -> #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -> address@hidden) at migration/qemu-file.c:555 -> -> #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -> migration/qemu-file.c:568 -> -> #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -> migration/qemu-file.c:648 -> -> #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -> address@hidden) at migration/colo.c:244 -> -> #8 0x00007f658e3e681e in colo_receive_check_message (f=<optimized -> out>, address@hidden, -> address@hidden) -> -> at migration/colo.c:264 -> -> #9 0x00007f658e3e740e in colo_process_incoming_thread -> (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577 -> -> #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -> -> #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -> -> (gdb) p ioc->name -> -> $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -> -> (gdb) p ioc->features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -> -> $3 = 0 -> -> -> (gdb) bt -> -> #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -> condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -> -> #1 0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at -> gmain.c:3054 -> -> #2 g_main_context_dispatch (context=<optimized out>, -> address@hidden) at gmain.c:3630 -> -> #3 
0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -> -> #4 os_host_main_loop_wait (timeout=<optimized out>) at -> util/main-loop.c:258 -> -> #5 main_loop_wait (address@hidden) at -> util/main-loop.c:506 -> -> #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -> -> #7 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized -> out>) at vl.c:4709 -> -> (gdb) p ioc->features -> -> $1 = 6 -> -> (gdb) p ioc->name -> -> $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -> -> -> May be socket_accept_incoming_migration should -> call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -> -> -> thank you. -> -> -> -> -> -> Original Mail -> address@hidden -> address@hidden -> address@hidden@huawei.com> -> *Date:* 2017-03-16 14:46 -> *Subject:* *Re: [Qemu-devel] COLO failover hang* -> -> -> -> -> On 03/15/2017 05:06 PM, wangguang wrote: -> > am testing QEMU COLO feature described here [QEMU -> > Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -> > -> > When the Primary Node panic, the Secondary Node qemu hang. -> > hang at recvmsg in qio_channel_socket_readv. -> > And I run { 'execute': 'nbd-server-stop' } and { "execute": -> > "x-colo-lost-heartbeat" } in Secondary VM's -> > monitor, the Secondary Node qemu still hang at recvmsg. -> > -> > I found that the colo in qemu is not complete yet. -> > Do the colo have any plan for development? -> -> Yes, We are developing. You can see some of patch we pushing. -> -> > Has anyone ever run it successfully? Any help is appreciated! -> -> In our internal version can run it successfully, -> The failover detail you can ask Zhanghailiang for help. 
-> Next time if you have some question about COLO, -> please cc me and zhanghailiang address@hidden -> -> -> Thanks -> Zhang Chen -> -> -> > -> > -> > -> > centos7.2+qemu2.7.50 -> > (gdb) bt -> > #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 -> > #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>, -> > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at -> > io/channel-socket.c:497 -> > #2 0x00007f3e03329472 in qio_channel_read (address@hidden, -> > address@hidden "", address@hidden, -> > address@hidden) at io/channel.c:97 -> > #3 0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>, -> > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at -> > migration/qemu-file-channel.c:78 -> > #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at -> > migration/qemu-file.c:257 -> > #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, -> > address@hidden) at migration/qemu-file.c:510 -> > #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at -> > migration/qemu-file.c:523 -> > #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at -> > migration/qemu-file.c:603 -> > #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, -> > address@hidden) at migration/colo.c:215 -> > #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, -> > checkpoint_request=<synthetic pointer>, f=<optimized out>) at -> > migration/colo.c:546 -> > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at -> > migration/colo.c:649 -> > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 -> > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 -> > -> > -> > -> > -> > -> > -- -> > View this message in context: -http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html -> > Sent from the Developer mailing list archive at Nabble.com. 
-> > -> > -> > -> > -> -> -- -> Thanks -> Zhang Chen -> -> -> -> -> - -* Hailiang Zhang (address@hidden) wrote: -> -Hi, -> -> -Thanks for reporting this, and I confirmed it in my test, and it is a bug. -> -> -Though we tried to call qemu_file_shutdown() to shutdown the related fd, in -> -case COLO thread/incoming thread is stuck in read/write() while do failover, -> -but it didn't take effect, because all the fd used by COLO (also migration) -> -has been wrapped by qio channel, and it will not call the shutdown API if -> -we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), -> -QIO_CHANNEL_FEATURE_SHUTDOWN). -> -> -Cc: Dr. David Alan Gilbert <address@hidden> -> -> -I doubted migration cancel has the same problem, it may be stuck in write() -> -if we tried to cancel migration. -> -> -void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error -> -**errp) -> -{ -> -qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing"); -> -migration_channel_connect(s, ioc, NULL); -> -... ... -> -We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), -> -QIO_CHANNEL_FEATURE_SHUTDOWN) above, -> -and the -> -migrate_fd_cancel() -> -{ -> -... ... -> -if (s->state == MIGRATION_STATUS_CANCELLING && f) { -> -qemu_file_shutdown(f); --> This will not take effect. No ? -> -} -> -} -(cc'd in Daniel Berrange). -I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); -at the -top of qio_channel_socket_new; so I think that's safe isn't it? - -Dave - -> -Thanks, -> -Hailiang -> -> -On 2017/3/21 16:10, address@hidden wrote: -> -> Thank you. -> -> -> -> I have tested already. -> -> -> -> When the Primary Node panic, the Secondary Node qemu hang at the same place. -> -> -> -> According to -http://wiki.qemu-project.org/Features/COLO -, kill Primary Node -> -> qemu will not produce the problem, but Primary Node panic can. -> -> -> -> I think due to the feature of channel does not support -> -> QIO_CHANNEL_FEATURE_SHUTDOWN. 
-> -> -> -> -> -> when failover, channel_shutdown could not shut down the channel. -> -> -> -> -> -> so the colo_process_incoming_thread will hang at recvmsg. -> -> -> -> -> -> I test a patch: -> -> -> -> -> -> diff --git a/migration/socket.c b/migration/socket.c -> -> -> -> -> -> index 13966f1..d65a0ea 100644 -> -> -> -> -> -> --- a/migration/socket.c -> -> -> -> -> -> +++ b/migration/socket.c -> -> -> -> -> -> @@ -147,8 +147,9 @@ static gboolean -> -> socket_accept_incoming_migration(QIOChannel *ioc, -> -> -> -> -> -> } -> -> -> -> -> -> -> -> -> -> -> -> trace_migration_socket_incoming_accepted(); -> -> -> -> -> -> -> -> -> -> -> -> qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming"); -> -> -> -> -> -> + qio_channel_set_feature(QIO_CHANNEL(sioc), -> -> QIO_CHANNEL_FEATURE_SHUTDOWN); -> -> -> -> -> -> migration_channel_process_incoming(migrate_get_current(), -> -> -> -> -> -> QIO_CHANNEL(sioc)); -> -> -> -> -> -> object_unref(OBJECT(sioc)); -> -> -> -> -> -> -> -> -> -> My test will not hang any more. -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> -> Original Mail -> -> -> -> -> -> -> -> From: address@hidden -> -> To: Wang Guang 10165992 address@hidden -> -> Cc: address@hidden address@hidden -> -> Date: 2017-03-21 15:58 -> -> Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang -> -> -> -> -> -> -> -> -> -> -> -> Hi, Wang. -> -> -> -> You can test this branch: -> -> -> -> -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -> -> -> -> and please follow wiki ensure your own configuration correctly. -> -> -> -> -http://wiki.qemu-project.org/Features/COLO -> -> -> -> -> -> Thanks -> -> -> -> Zhang Chen -> -> -> -> -> -> On 03/21/2017 03:27 PM, address@hidden wrote: -> -> > -> -> > hi. -> -> > -> -> > I test the git qemu master have the same problem. 
-> -> > -> -> > (gdb) bt -> -> > -> -> > #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -> -> > niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -> -> > -> -> > #1 0x00007f658e4aa0c2 in qio_channel_read -> -> > (address@hidden, address@hidden "", -> -> > address@hidden, address@hidden) at io/channel.c:114 -> -> > -> -> > #2 0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>, -> -> > buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at -> -> > migration/qemu-file-channel.c:78 -> -> > -> -> > #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -> -> > migration/qemu-file.c:295 -> -> > -> -> > #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -> -> > address@hidden) at migration/qemu-file.c:555 -> -> > -> -> > #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -> -> > migration/qemu-file.c:568 -> -> > -> -> > #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -> -> > migration/qemu-file.c:648 -> -> > -> -> > #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -> -> > address@hidden) at migration/colo.c:244 -> -> > -> -> > #8 0x00007f658e3e681e in colo_receive_check_message (f=<optimized -> -> > out>, address@hidden, -> -> > address@hidden) -> -> > -> -> > at migration/colo.c:264 -> -> > -> -> > #9 0x00007f658e3e740e in colo_process_incoming_thread -> -> > (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577 -> -> > -> -> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -> -> > -> -> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -> -> > -> -> > (gdb) p ioc->name -> -> > -> -> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -> -> > -> -> > (gdb) p ioc->features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -> -> > -> -> > $3 = 0 -> -> > -> -> > -> -> > (gdb) bt -> -> > -> -> > #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -> -> > 
condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -> -> ï¼ -> -> ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at -> -> ï¼ gmain.c:3054 -> -> ï¼ -> -> ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, -> -> ï¼ address@hidden) at gmain.c:3630 -> -> ï¼ -> -> ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -> -> ï¼ -> -> ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at -> -> ï¼ util/main-loop.c:258 -> -> ï¼ -> -> ï¼ #5 main_loop_wait (address@hidden) at -> -> ï¼ util/main-loop.c:506 -> -> ï¼ -> -> ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -> -> ï¼ -> -> ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized -> -> ï¼ outï¼) at vl.c:4709 -> -> ï¼ -> -> ï¼ (gdb) p ioc-ï¼features -> -> ï¼ -> -> ï¼ $1 = 6 -> -> ï¼ -> -> ï¼ (gdb) p ioc-ï¼name -> -> ï¼ -> -> ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -> -> ï¼ -> -> ï¼ -> -> ï¼ May be socket_accept_incoming_migration should -> -> ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -> -> ï¼ -> -> ï¼ -> -> ï¼ thank you. -> -> ï¼ -> -> ï¼ -> -> ï¼ -> -> ï¼ -> -> ï¼ -> -> ï¼ åå§é®ä»¶ -> -> ï¼ address@hidden -> -> ï¼ address@hidden -> -> ï¼ address@hidden@huawei.comï¼ -> -> ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 -> -> ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* -> -> ï¼ -> -> ï¼ -> -> ï¼ -> -> ï¼ -> -> ï¼ On 03/15/2017 05:06 PM, wangguang wrote: -> -> ï¼ ï¼ am testing QEMU COLO feature described here [QEMU -> -> ï¼ ï¼ Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -> -> ï¼ ï¼ -> -> ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. -> -> ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. -> -> ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": -> -> ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's -> -> ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . -> -> ï¼ ï¼ -> -> ï¼ ï¼ I found that the colo in qemu is not complete yet. 
-> -> ï¼ ï¼ Do the colo have any plan for development? -> -> ï¼ -> -> ï¼ Yes, We are developing. You can see some of patch we pushing. -> -> ï¼ -> -> ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! -> -> ï¼ -> -> ï¼ In our internal version can run it successfully, -> -> ï¼ The failover detail you can ask Zhanghailiang for help. -> -> ï¼ Next time if you have some question about COLO, -> -> ï¼ please cc me and zhanghailiang address@hidden -> -> ï¼ -> -> ï¼ -> -> ï¼ Thanks -> -> ï¼ Zhang Chen -> -> ï¼ -> -> ï¼ -> -> ï¼ ï¼ -> -> ï¼ ï¼ -> -> ï¼ ï¼ -> -> ï¼ ï¼ centos7.2+qemu2.7.50 -> -> ï¼ ï¼ (gdb) bt -> -> ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 -> -> ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized outï¼, -> -> ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, errp=0x0) -> -> at -> -> ï¼ ï¼ io/channel-socket.c:497 -> -> ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, -> -> ï¼ ï¼ address@hidden "", address@hidden, -> -> ï¼ ï¼ address@hidden) at io/channel.c:97 -> -> ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized outï¼, -> -> ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at -> -> ï¼ ï¼ migration/qemu-file-channel.c:78 -> -> ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at -> -> ï¼ ï¼ migration/qemu-file.c:257 -> -> ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, -> -> ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 -> -> ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at -> -> ï¼ ï¼ migration/qemu-file.c:523 -> -> ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at -> -> ï¼ ï¼ migration/qemu-file.c:603 -> -> ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, -> -> ï¼ ï¼ address@hidden) at migration/colo.c:215 -> -> ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, -> -> ï¼ ï¼ checkpoint_request=ï¼synthetic 
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
-> * Hailiang Zhang (address@hidden) wrote:
-> > Hi,
-> >
-> > Thanks for reporting this, and i confirmed it in my test, and it is a bug.
-> >
-> > Though we tried to call qemu_file_shutdown() to shutdown the related fd, in
-> > case COLO thread/incoming thread is stuck in read/write() while do failover,
-> > but it didn't take effect, because all the fd used by COLO (also migration)
-> > has been wrapped by qio channel, and it will not call the shutdown API if
-> > we didn't qio_channel_set_feature(QIO_CHANNEL(sioc),
-> > QIO_CHANNEL_FEATURE_SHUTDOWN).
-> >
-> > Cc: Dr. David Alan Gilbert <address@hidden>
-> >
-> > I doubted migration cancel has the same problem, it may be stuck in write()
-> > if we tried to cancel migration.
-> >
-> > void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error
-> > **errp)
-> > {
-> >     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
-> >     migration_channel_connect(s, ioc, NULL);
-> >     ... ...
-> > We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc),
-> > QIO_CHANNEL_FEATURE_SHUTDOWN) above,
-> > and the
-> > migrate_fd_cancel()
-> > {
-> >     ... ...
-> >     if (s->state == MIGRATION_STATUS_CANCELLING && f) {
-> >         qemu_file_shutdown(f); --> This will not take effect.
No ?
-> >     }
-> > }
->
-> (cc'd in Daniel Berrange).
-> I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
-> at the top of qio_channel_socket_new; so I think that's safe isn't it?
-
-Hmm, you are right, this problem only exists for the migration incoming fd,
-thanks.
-
-> Dave
->
-> > Thanks,
-> > Hailiang
-> >
-> > On 2017/3/21 16:10, address@hidden wrote:
-> > > Thank you.
-> > >
-> > > I have tested already.
-> > >
-> > > When the Primary Node panics, the Secondary Node qemu hangs at the same place.
-> > >
-> > > According to http://wiki.qemu-project.org/Features/COLO, killing the Primary
-> > > Node qemu will not produce the problem, but a Primary Node panic can.
-> > >
-> > > I think this is because the channel does not support
-> > > QIO_CHANNEL_FEATURE_SHUTDOWN.
-> > >
-> > > On failover, channel_shutdown could not shut down the channel,
-> > > so the colo_process_incoming_thread will hang at recvmsg.
-> > >
-> > > I tested a patch:
-> > >
-> > > diff --git a/migration/socket.c b/migration/socket.c
-> > > index 13966f1..d65a0ea 100644
-> > > --- a/migration/socket.c
-> > > +++ b/migration/socket.c
-> > > @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
-> > >      }
-> > >
-> > >      trace_migration_socket_incoming_accepted();
-> > >
-> > >      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
-> > > +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
-> > >      migration_channel_process_incoming(migrate_get_current(),
-> > >                                         QIO_CHANNEL(sioc));
-> > >      object_unref(OBJECT(sioc));
-> > >
-> > > My test will not hang any more.
-> > >
-> > > Original Mail
-> > > From: address@hidden
-> > > To: Wang Guang 10165992 address@hidden
-> > > Cc: address@hidden address@hidden
-> > > Date: 2017-03-21 15:58
-> > > Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang
-> > >
-> > > Hi, Wang.
-> > >
-> > > You can test this branch:
-> > > https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
-> > > and please follow the wiki to ensure your own configuration is correct.
-> > > http://wiki.qemu-project.org/Features/COLO
-> > >
-> > > Thanks
-> > > Zhang Chen
-> > >
-> > > On 03/21/2017 03:27 PM, address@hidden wrote:
-> > > > hi.
-> > > >
-> > > > I tested git qemu master and it has the same problem.
-> > > >
-> > > > (gdb) bt
-> > > > #0  qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
-> > > >     niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
-> > > > #1  0x00007f658e4aa0c2 in qio_channel_read
-> > > >     (address@hidden, address@hidden "",
-> > > >     address@hidden, address@hidden) at io/channel.c:114
-> > > > #2  0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
-> > > >     buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at
-> > > >     migration/qemu-file-channel.c:78
-> > > > #3  0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at
-> > > >     migration/qemu-file.c:295
-> > > > #4  0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden,
-> > > >     address@hidden) at migration/qemu-file.c:555
-> > > > #5  0x00007f658e3ea34b in qemu_get_byte (address@hidden) at
-> > > >     migration/qemu-file.c:568
-> > > > #6  0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at
-> > > >     migration/qemu-file.c:648
-> > > > #7  0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800,
-> > > >     address@hidden) at migration/colo.c:244
-> > > > #8  0x00007f658e3e681e in colo_receive_check_message (f=<optimized
-> > > >     out>, address@hidden,
-> > > >     address@hidden)
-> > > >     at migration/colo.c:264
-> > > > #9  0x00007f658e3e740e in colo_process_incoming_thread
-> > > >     (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
-> > > > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
-> > > > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
-> > > >
-> > > > (gdb) p ioc->name
-> > > > $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
-> > > >
-> > > > (gdb) p ioc->features    (does not support QIO_CHANNEL_FEATURE_SHUTDOWN)
-> > > > $3 = 0
-> > > >
-> > > > (gdb) bt
-> > > > #0  socket_accept_incoming_migration (ioc=0x7fdcceeafa90,
-> > > >     condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
-> > > > #1  0x00007fdcc6966350 in
g_main_dispatch (context=<optimized out>) at
-> > > >     gmain.c:3054
-> > > > #2  g_main_context_dispatch (context=<optimized out>,
-> > > >     address@hidden) at gmain.c:3630
-> > > > #3  0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
-> > > > #4  os_host_main_loop_wait (timeout=<optimized out>) at
-> > > >     util/main-loop.c:258
-> > > > #5  main_loop_wait (address@hidden) at
-> > > >     util/main-loop.c:506
-> > > > #6  0x00007fdccb526187 in main_loop () at vl.c:1898
-> > > > #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized
-> > > >     out>) at vl.c:4709
-> > > >
-> > > > (gdb) p ioc->features
-> > > > $1 = 6
-> > > >
-> > > > (gdb) p ioc->name
-> > > > $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
-> > > >
-> > > > Maybe socket_accept_incoming_migration should
-> > > > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?
-> > > >
-> > > > thank you.
-> > > >
-> > > > Original Mail
-> > > > address@hidden
-> > > > address@hidden
-> > > > address@hidden@huawei.com
-> > > > *Date:* 2017-03-16 14:46
-> > > > *Subject:* *Re: [Qemu-devel] COLO failover hang*
-> > > >
-> > > > On 03/15/2017 05:06 PM, wangguang wrote:
-> > > > > am testing QEMU COLO feature described here [QEMU
-> > > > > Wiki](http://wiki.qemu-project.org/Features/COLO).
-> > > > >
-> > > > > When the Primary Node panics, the Secondary Node qemu hangs.
-> > > > > It hangs at recvmsg in qio_channel_socket_readv.
-> > > > > And when I run { 'execute': 'nbd-server-stop' } and { "execute":
-> > > > > "x-colo-lost-heartbeat" } in the Secondary VM's
-> > > > > monitor, the Secondary Node qemu still hangs at recvmsg.
-> > > > >
-> > > > > I found that the colo in qemu is not complete yet.
-> > > > > Do the colo have any plan for development?
-> > > >
-> > > > Yes, we are developing it. You can see some of the patches we are pushing.
-> > > >
-> > > > > Has anyone ever run it successfully? Any help is appreciated!
-> > > >
-> > > > Our internal version can run it successfully.
-> > > > For the failover details you can ask Zhanghailiang for help.
-> > > > Next time if you have some question about COLO,
-> > > > please cc me and zhanghailiang address@hidden
-> > > >
-> > > > Thanks
-> > > > Zhang Chen
-> > > >
-> > > > > centos7.2+qemu2.7.50
-> > > > > (gdb) bt
-> > > > > #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
-> > > > > #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>,
-> > > > >     iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at
-> > > > >     io/channel-socket.c:497
-> > > > > #2  0x00007f3e03329472 in qio_channel_read (address@hidden,
-> > > > >     address@hidden "", address@hidden,
-> > > > >     address@hidden) at io/channel.c:97
-> > > > > #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
-> > > > >     buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at
-> > > > >     migration/qemu-file-channel.c:78
-> > > > > #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at
-> > > > >     migration/qemu-file.c:257
-> > > > > #5  0x00007f3e03274a41 in qemu_peek_byte (address@hidden,
-> > > > >     address@hidden) at migration/qemu-file.c:510
-> > > > > #6  0x00007f3e03274aab in qemu_get_byte (address@hidden) at
-> > > > >     migration/qemu-file.c:523
-> > > > > #7  0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at
-> > > > >     migration/qemu-file.c:603
-> > > > > #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00,
-> > > > >     address@hidden) at migration/colo.c:215
-> > > > > #9  0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48,
-> > > > >     checkpoint_request=<synthetic pointer>, f=<optimized out>) at
-> > > > >     migration/colo.c:546
-> > > > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at
-> > > > >     migration/colo.c:649
-> > > > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
-> > > > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
-> > > > >
-> > > > > --
-> > > > > View this message in context:
-> > > > > http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
-> > > > > Sent from the Developer mailing list archive at Nabble.com.
-> > > >
-> > > > --
-> > > > Thanks
-> > > > Zhang Chen
-> > > >
->
-> --
-> Dr. David Alan Gilbert / address@hidden / Manchester, UK
->
-> .
-
-* Hailiang Zhang (address@hidden) wrote:
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
-> > * Hailiang Zhang (address@hidden) wrote:
-> > > Hi,
-> > >
-> > > Thanks for reporting this, and i confirmed it in my test, and it is a bug.
-> > >
-> > > Though we tried to call qemu_file_shutdown() to shutdown the related fd, in
-> > > case COLO thread/incoming thread is stuck in read/write() while do failover,
-> > > but it didn't take effect, because all the fd used by COLO (also migration)
-> > > has been wrapped by qio channel, and it will not call the shutdown API if
-> > > we didn't qio_channel_set_feature(QIO_CHANNEL(sioc),
-> > > QIO_CHANNEL_FEATURE_SHUTDOWN).
-> > >
-> > > Cc: Dr. David Alan Gilbert <address@hidden>
-> > >
-> > > I doubted migration cancel has the same problem, it may be stuck in write()
-> > > if we tried to cancel migration.
-> > >
-> > > void fd_start_outgoing_migration(MigrationState *s, const char *fdname,
-> > > Error **errp)
-> > > {
-> > >     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
-> > >     migration_channel_connect(s, ioc, NULL);
-> > >     ... ...
-> > > We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc),
-> > > QIO_CHANNEL_FEATURE_SHUTDOWN) above,
-> > > and the
-> > > migrate_fd_cancel()
-> > > {
-> > >     ... ...
-> > >     if (s->state == MIGRATION_STATUS_CANCELLING && f) {
-> > >         qemu_file_shutdown(f); --> This will not take effect. No ?
-> > >     }
-> > > }
-> >
-> > (cc'd in Daniel Berrange).
-> > I see that we call qio_channel_set_feature(ioc,
-> > QIO_CHANNEL_FEATURE_SHUTDOWN); at the
-> > top of qio_channel_socket_new; so I think that's safe isn't it?
->
-> Hmm, you are right, this problem is only exist for the migration incoming fd,
-> thanks.
-
-Yes, and I don't think we normally do a cancel on the incoming side of a
-migration.
-
-Dave
-
-> > Dave
-> >
-> > > Thanks,
-> > > Hailiang
-> > >
-> > > On 2017/3/21 16:10, address@hidden wrote:
-> > > > Thank you.
-> > > >
-> > > > I have tested already.
-> > > >
-> > > > When the Primary Node panics, the Secondary Node qemu hangs at the same
-> > > > place.
-> > > >
-> > > > According to http://wiki.qemu-project.org/Features/COLO, killing the Primary
-> > > > Node qemu will not produce the problem, but a Primary Node panic can.
-> > > >
-> > > > I think this is because the channel does not support
-> > > > QIO_CHANNEL_FEATURE_SHUTDOWN.
-> > > >
-> > > > On failover, channel_shutdown could not shut down the channel,
-> > > > so the colo_process_incoming_thread will hang at recvmsg.
-> > > >
-> > > > I tested a patch:
-> > > >
-> > > > diff --git a/migration/socket.c b/migration/socket.c
-> > > > index 13966f1..d65a0ea 100644
-> > > > --- a/migration/socket.c
-> > > > +++ b/migration/socket.c
-> > > > @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
-> > > >      }
-> > > >
-> > > >      trace_migration_socket_incoming_accepted();
-> > > >
-> > > >      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
-> > > > +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
-> > > >      migration_channel_process_incoming(migrate_get_current(),
-> > > >                                         QIO_CHANNEL(sioc));
-> > > >      object_unref(OBJECT(sioc));
-> > > >
-> > > > My test will not hang any more.
-> >
-> > --
-> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
-> >
-> > .
->
---
-Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
diff --git a/results/classifier/016/none/70294255 b/results/classifier/016/none/70294255
deleted file mode 100644
index 286c1cc4..00000000
--- a/results/classifier/016/none/70294255
+++ /dev/null
@@ -1,1088 +0,0 @@
-socket: 0.753
-debug: 0.454
-network: 0.316
-operating system: 0.278
-files: 0.213
-hypervisor: 0.060
-virtual: 0.046
-kernel: 0.045
-performance: 0.044
-i386: 0.043
-alpha: 0.037
-TCG: 0.036
-permissions: 0.032
-device: 0.025
-x86: 0.024
-PID: 0.021
-boot: 0.014
-KVM: 0.013
-semantic: 0.011
-risc-v: 0.010
-VMM: 0.010
-register: 0.009
-assembly: 0.009
-architecture: 0.008
-arm: 0.007
-vnc: 0.006
-ppc: 0.006
-peripherals: 0.005
-user-level: 0.004
-graphic: 0.003
-mistranslation: 0.002
-
-[Qemu-devel] Reply: Re: Reply: Re: Reply: Re: Reply: Re: [BUG]COLO failover hang
-
-hi:
-
-Yes, it is better.
-
-And should we delete
-
-#ifdef WIN32
-    QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL);
-#endif
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already has it.
-
-Original Mail
-From: address@hidden
-To: Wang Guang 10165992
-Cc: address@hidden address@hidden address@hidden address@hidden
-Date: 2017-03-22 15:03
-Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: Reply: Re: [BUG]COLO failover hang
-
-Hi,
-
-On 2017/3/22 9:42, address@hidden wrote:
-> diff --git a/migration/socket.c b/migration/socket.c
-> index 13966f1..d65a0ea 100644
-> --- a/migration/socket.c
-> +++ b/migration/socket.c
-> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
->      }
->
->      trace_migration_socket_incoming_accepted();
->
->      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
-> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
->      migration_channel_process_incoming(migrate_get_current(),
->                                         QIO_CHANNEL(sioc));
->      object_unref(OBJECT(sioc));
->
-> Is this patch ok?
-
-Yes, I think this works, but a better way may be to call qio_channel_set_feature()
-in qio_channel_socket_accept(), since we didn't set the SHUTDOWN feature for the
-socket accept fd.
-Or fix it by this:
-
-diff --git a/io/channel-socket.c b/io/channel-socket.c
-index f546c68..ce6894c 100644
---- a/io/channel-socket.c
-+++ b/io/channel-socket.c
-@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
-                           Error **errp)
- {
-     QIOChannelSocket *cioc;
--
--    cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
--    cioc->fd = -1;
-+
-+    cioc = qio_channel_socket_new();
-     cioc->remoteAddrLen = sizeof(ioc->remoteAddr);
-     cioc->localAddrLen = sizeof(ioc->localAddr);
-
-Thanks,
-Hailiang
-
-> I have tested it. The test does not hang any more.
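The mechanism debated above can be illustrated outside QEMU. The following is a minimal POSIX sketch, not QEMU code: `demo_channel`, `DEMO_FEATURE_SHUTDOWN`, `demo_channel_shutdown` and `demo_read_after_shutdown_request` are hypothetical names that only mimic the idea of a shutdown request being skipped when a feature flag is missing, and a plain `socketpair()` stands in for the migration socket. It assumes a Linux-like system where `shutdown()` on a stream socket makes `recv()` report end-of-stream. `MSG_DONTWAIT` is used so that the "reader would still be blocked" case can be observed without actually hanging.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical feature bit, standing in for a "shutdown supported" flag. */
#define DEMO_FEATURE_SHUTDOWN (1u << 0)

/* Hypothetical channel wrapper: a file descriptor plus a feature bitmask. */
struct demo_channel {
    int fd;
    unsigned features;
};

/* Shut the socket down only if the feature flag is set; otherwise do nothing. */
static int demo_channel_shutdown(struct demo_channel *c)
{
    assert(c != NULL);
    if (!(c->features & DEMO_FEATURE_SHUTDOWN)) {
        return -1; /* request skipped: a reader on c->fd is not woken up */
    }
    return shutdown(c->fd, SHUT_RDWR);
}

/*
 * Request a shutdown, then probe the socket with a non-blocking recv().
 * Returns 0 if recv() reports end-of-stream (a blocked reader would wake),
 *         1 if recv() would block (a blocked reader would keep hanging),
 *        -1 on any other error.
 */
int demo_read_after_shutdown_request(unsigned features)
{
    int sv[2];
    char buf[16];
    struct demo_channel chan;
    ssize_t n;
    int result;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    chan.fd = sv[0];
    chan.features = features;

    (void)demo_channel_shutdown(&chan); /* may be skipped, see above */

    n = recv(chan.fd, buf, sizeof(buf), MSG_DONTWAIT);
    if (n == 0) {
        result = 0;
    } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        result = 1;
    } else {
        result = -1;
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

Calling `demo_read_after_shutdown_request(DEMO_FEATURE_SHUTDOWN)` and `demo_read_after_shutdown_request(0)` from a small `main()` contrasts the two cases: with the flag the read ends immediately, without it the read would still be waiting for data that never arrives.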
->
-> Original Mail
-> From: address@hidden
-> To: address@hidden address@hidden
-> Cc: address@hidden address@hidden address@hidden
-> Date: 2017-03-22 09:11
-> Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: [BUG]COLO failover hang
->
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
-> > * Hailiang Zhang (address@hidden) wrote:
-> > > Hi,
-> > >
-> > > Thanks for reporting this, and i confirmed it in my test, and it is a bug.
-> > >
-> > > Though we tried to call qemu_file_shutdown() to shutdown the related fd, in
-> > > case COLO thread/incoming thread is stuck in read/write() while do failover,
-> > > but it didn't take effect, because all the fd used by COLO (also migration)
-> > > has been wrapped by qio channel, and it will not call the shutdown API if
-> > > we didn't qio_channel_set_feature(QIO_CHANNEL(sioc),
-> > > QIO_CHANNEL_FEATURE_SHUTDOWN).
-> > >
-> > > Cc: Dr. David Alan Gilbert <address@hidden>
-> > >
-> > > I doubted migration cancel has the same problem, it may be stuck in write()
-> > > if we tried to cancel migration.
-> > >
-> > > void fd_start_outgoing_migration(MigrationState *s, const char *fdname,
-> > > Error **errp)
-> > > {
-> > >     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
-> > >     migration_channel_connect(s, ioc, NULL);
-> > >     ... ...
-> > > We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc),
-> > > QIO_CHANNEL_FEATURE_SHUTDOWN) above,
-> > > and the
-> > > migrate_fd_cancel()
-> > > {
-> > >     ... ...
-> > >     if (s->state == MIGRATION_STATUS_CANCELLING && f) {
-> > >         qemu_file_shutdown(f); --> This will not take effect. No ?
-> > >     }
-> > > }
-> >
-> > (cc'd in Daniel Berrange).
-> > I see that we call qio_channel_set_feature(ioc,
-> > QIO_CHANNEL_FEATURE_SHUTDOWN); at the
-> > top of qio_channel_socket_new; so I think that's safe isn't it?
-ï¼ ï¼ -ï¼ -ï¼ Hmm, you are right, this problem is only exist for the migration incoming fd, -thanks. -ï¼ -ï¼ ï¼ Dave -ï¼ ï¼ -ï¼ ï¼ï¼ Thanks, -ï¼ ï¼ï¼ Hailiang -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ On 2017/3/21 16:10, address@hidden wrote: -ï¼ ï¼ï¼ï¼ Thank youã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I have test areadyã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ When the Primary Node panic,the Secondary Node qemu hang at the same -placeã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Incorrding -http://wiki.qemu-project.org/Features/COLO -ï¼kill Primary Node -qemu will not produce the problem,but Primary Node panic canã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I think due to the feature of channel does not support -QIO_CHANNEL_FEATURE_SHUTDOWN. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ when failover,channel_shutdown could not shut down the channel. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ so the colo_process_incoming_thread will hang at recvmsg. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I test a patch: -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ diff --git a/migration/socket.c b/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ index 13966f1..d65a0ea 100644 -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ --- a/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ +++ b/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ @@ -147,8 +147,9 @@ static gboolean -socket_accept_incoming_migration(QIOChannel *ioc, -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ } -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ trace_migration_socket_incoming_accepted() -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), -"migration-socket-incoming") -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ migration_channel_process_incoming(migrate_get_current(), -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ QIO_CHANNEL(sioc)) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ object_unref(OBJECT(sioc)) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ My test will not hang any more. 
-> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> Original Mail -> >>> -> >>> -> >>> -> >>> From: address@hidden -> >>> To: 王广10165992 address@hidden -> >>> Cc: address@hidden address@hidden -> >>> Date: 2017-03-21 15:58 -> >>> Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> Hi,Wang. -> >>> -> >>> You can test this branch: -> >>> -> >>> -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -> >>> -> >>> and please follow wiki ensure your own configuration correctly. -> >>> -> >>> -http://wiki.qemu-project.org/Features/COLO -> >>> -> >>> -> >>> Thanks -> >>> -> >>> Zhang Chen -> >>> -> >>> -> >>> On 03/21/2017 03:27 PM, address@hidden wrote: -> >>> > -> >>> > hi. -> >>> > -> >>> > I test the git qemu master have the same problem.
-> >>> > -> >>> > (gdb) bt -> >>> > -> >>> > #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -> >>> > niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -> >>> > -> >>> > #1 0x00007f658e4aa0c2 in qio_channel_read -> >>> > (address@hidden, address@hidden "", -> >>> > address@hidden, address@hidden) at io/channel.c:114 -> >>> > -> >>> > #2 0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>, -> >>> > buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at -> >>> > migration/qemu-file-channel.c:78 -> >>> > -> >>> > #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -> >>> > migration/qemu-file.c:295 -> >>> > -> >>> > #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -> >>> > address@hidden) at migration/qemu-file.c:555 -> >>> > -> >>> > #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -> >>> > migration/qemu-file.c:568 -> >>> > -> >>> > #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -> >>> > migration/qemu-file.c:648 -> >>> > -> >>> > #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -> >>> > address@hidden) at migration/colo.c:244 -> >>> > -> >>> > #8 0x00007f658e3e681e in colo_receive_check_message (f=<optimized -> >>> > out>, address@hidden, -> >>> > address@hidden) -> >>> > -> >>> > at migration/colo.c:264 -> >>> > -> >>> > #9 0x00007f658e3e740e in colo_process_incoming_thread -> >>> > (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577 -> >>> > -> >>> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -> >>> > -> >>> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -> >>> > -> >>> > (gdb) p ioc->name -> >>> > -> >>> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -> >>> > ->
>>> > (gdb) p ioc->features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -> >>> > -> >>> > $3 = 0 -> >>> > -> >>> > -> >>> > (gdb) bt -> >>> > -> >>> > #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -> >>> > condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -> >>> > -> >>> > #1 0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at -> >>> > gmain.c:3054 -> >>> > -> >>> > #2 g_main_context_dispatch (context=<optimized out>, -> >>> > address@hidden) at gmain.c:3630 -> >>> > -> >>> > #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -> >>> > -> >>> > #4 os_host_main_loop_wait (timeout=<optimized out>) at -> >>> > util/main-loop.c:258 -> >>> > -> >>> > #5 main_loop_wait (address@hidden) at -> >>> > util/main-loop.c:506 -> >>> > -> >>> > #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -> >>> > -> >>> > #7 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized -> >>> > out>) at vl.c:4709 -> >>> > -> >>> > (gdb) p ioc->features -> >>> > -> >>> > $1 = 6 -> >>> > -> >>> > (gdb) p ioc->name -> >>> > -> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -> >>> > -> >>> > -> >>> > May be socket_accept_incoming_migration should -> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -> >>> > -> >>> > -> >>> > thank you.
-> >>> > -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > Original Mail -> >>> > address@hidden -> >>> > address@hidden -> >>> > address@hidden@huawei.com> -> >>> > *Date:* 2017-03-16 14:46 -> >>> > *Subject:* *Re: [Qemu-devel] COLO failover hang* -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > On 03/15/2017 05:06 PM, wangguang wrote: -> >>> > > am testing QEMU COLO feature described here [QEMU -> >>> > > Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -> >>> > > -> >>> > > When the Primary Node panic,the Secondary Node qemu hang. -> >>> > > hang at recvmsg in qio_channel_socket_readv. -> >>> > > And I run { 'execute': 'nbd-server-stop' } and { "execute": -> >>> > > "x-colo-lost-heartbeat" } in Secondary VM's -> >>> > > monitor,the Secondary Node qemu still hang at recvmsg . -> >>> > > -> >>> > > I found that the colo in qemu is not complete yet. -> >>> > > Do the colo have any plan for development? -> >>> > -> >>> > Yes, We are developing. You can see some of patch we pushing. -> >>> > -> >>> > > Has anyone ever run it successfully? Any help is appreciated! -> >>> > -> >>> > In our internal version can run it successfully, -> >>> > The failover detail you can ask Zhanghailiang for help.
-> >>> > Next time if you have some question about COLO, -> >>> > please cc me and zhanghailiang address@hidden -> >>> > -> >>> > -> >>> > Thanks -> >>> > Zhang Chen -> >>> > -> >>> > -> >>> > > -> >>> > > -> >>> > > -> >>> > > centos7.2+qemu2.7.50 -> >>> > > (gdb) bt -> >>> > > #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 -> >>> > > #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized -out>, -> >>> > > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, -errp=0x0) at -> >>> > > io/channel-socket.c:497 -> >>> > > #2 0x00007f3e03329472 in qio_channel_read (address@hidden, -> >>> > > address@hidden "", address@hidden, -> >>> > > address@hidden) at io/channel.c:97 -> >>> > > #3 0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>, -> >>> > > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at -> >>> > > migration/qemu-file-channel.c:78 -> >>> > > #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at -> >>> > > migration/qemu-file.c:257 -> >>> > > #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, -> >>> > > address@hidden) at migration/qemu-file.c:510 -> >>> > > #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at -> >>> > > migration/qemu-file.c:523 -> >>> > > #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at -> >>> > > migration/qemu-file.c:603 -> >>> > > #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, -> >>> > > address@hidden) at migration/colo.c:215 -> >>> > > #9 0x00007f3e0327250d in colo_wait_handle_message -(errp=0x7f3d62bfaa48, -> >>> > > checkpoint_request=<synthetic pointer>, f=<optimized out>) at -> >>> > > migration/colo.c:546 -> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at -> >>> > >
migration/colo.c:649 -> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 -> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc..so.6 -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > > -- -> >>> > > View this message in context: -http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html -> >>> > > Sent from the Developer mailing list archive at Nabble.com. -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > -> >>> > -- -> >>> > Thanks -> >>> > Zhang Chen -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > -> >>> -> >> -> > -- -> > Dr. David Alan Gilbert / address@hidden / Manchester, UK -> > -> > . -> > -> - -On 2017/3/22 16:09, address@hidden wrote: -hi: - -yes.it is better. - -And should we delete -Yes, you are right. -#ifdef WIN32 - - QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL); - -#endif - - - - -in qio_channel_socket_accept? - -qio_channel_socket_new already have it.
- - - - - - - - - - - - -Original Mail - - - -From: address@hidden -To: 王广10165992 -Cc: address@hidden address@hidden address@hidden address@hidden -Date: 2017-03-22 15:03 -Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: Reply: Re: [BUG]COLO failover hang - - - - - -Hi, - -On 2017/3/22 9:42, address@hidden wrote: -> diff --git a/migration/socket.c b/migration/socket.c -> -> -> index 13966f1..d65a0ea 100644 -> -> -> --- a/migration/socket.c -> -> -> +++ b/migration/socket.c -> -> -> @@ -147,8 +147,9 @@ static gboolean -socket_accept_incoming_migration(QIOChannel *ioc, -> -> -> } -> -> -> -> -> -> trace_migration_socket_incoming_accepted(); -> -> -> -> -> -> qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming"); -> -> -> + qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN); -> -> -> migration_channel_process_incoming(migrate_get_current(), -> -> -> QIO_CHANNEL(sioc)); -> -> -> object_unref(OBJECT(sioc)); -> -> -> -> -> Is this patch ok? -> - -Yes, i think this works, but a better way maybe to call -qio_channel_set_feature() -in qio_channel_socket_accept(), we didn't set the SHUTDOWN feature for the -socket accept fd, -Or fix it by this: - -diff --git a/io/channel-socket.c b/io/channel-socket.c -index f546c68..ce6894c 100644 ---- a/io/channel-socket.c -+++ b/io/channel-socket.c -@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc, - Error **errp) - { - QIOChannelSocket *cioc; -- -- cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)); -- cioc->fd = -1; -+ -+ cioc = qio_channel_socket_new(); - cioc->remoteAddrLen = sizeof(ioc->remoteAddr); - cioc->localAddrLen = sizeof(ioc->localAddr); - - -Thanks, -Hailiang - -> I have test it . The test could not hang any more.
-> -> -> -> -> -> -> -> -> -> -> -> -> Original Mail -> -> -> -> From: address@hidden -> To: address@hidden address@hidden -> Cc: address@hidden address@hidden address@hidden -> Date: 2017-03-22 09:11 -> Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: [BUG]COLO failover hang -> -> -> -> -> -> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote: -> > * Hailiang Zhang (address@hidden) wrote: -> >> Hi, -> >> -> >> Thanks for reporting this, and i confirmed it in my test, and it is a bug. -> >> -> >> Though we tried to call qemu_file_shutdown() to shutdown the related fd, in -> >> case COLO thread/incoming thread is stuck in read/write() while do -failover, -> >> but it didn't take effect, because all the fd used by COLO (also migration) -> >> has been wrapped by qio channel, and it will not call the shutdown API if -> >> we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN). -> >> -> >> Cc: Dr. David Alan Gilbert address@hidden -> >> -> >> I doubted migration cancel has the same problem, it may be stuck in write() -> >> if we tried to cancel migration. -> >> -> >> void fd_start_outgoing_migration(MigrationState *s, const char *fdname, -Error **errp) -> >> { -> >> qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing"); -> >> migration_channel_connect(s, ioc, NULL); -> >> ... ... -> >> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) above, -> >> and the -> >> migrate_fd_cancel() -> >> { -> >> ... ... -> >> if (s->state == MIGRATION_STATUS_CANCELLING && f) { -> >> qemu_file_shutdown(f); --> This will not take effect. No ? -> >> } -> >> } -> > -> > (cc'd in Daniel Berrange). -> > I see that we call qio_channel_set_feature(ioc, -QIO_CHANNEL_FEATURE_SHUTDOWN) at the -> > top of qio_channel_socket_new so I think that's safe isn't it?
-> > -> -> Hmm, you are right, this problem is only exist for the migration incoming fd, -thanks. -> -> > Dave -> > -> >> Thanks, -> >> Hailiang -> >> -> >> On 2017/3/21 16:10, address@hidden wrote: -> >>> Thank you. -> >>> -> >>> I have test aready. -> >>> -> >>> When the Primary Node panic,the Secondary Node qemu hang at the same -place. -> >>> -> >>> Incorrding -http://wiki.qemu-project.org/Features/COLO -,kill Primary Node -qemu will not produce the problem,but Primary Node panic can. -> >>> -> >>> I think due to the feature of channel does not support -QIO_CHANNEL_FEATURE_SHUTDOWN. -> >>> -> >>> -> >>> when failover,channel_shutdown could not shut down the channel. -> >>> -> >>> -> >>> so the colo_process_incoming_thread will hang at recvmsg. -> >>> -> >>> -> >>> I test a patch: -> >>> -> >>> -> >>> diff --git a/migration/socket.c b/migration/socket.c -> >>> -> >>> -> >>> index 13966f1..d65a0ea 100644 -> >>> -> >>> -> >>> --- a/migration/socket.c -> >>> -> >>> -> >>> +++ b/migration/socket.c -> >>> -> >>> -> >>> @@ -147,8 +147,9 @@ static gboolean -socket_accept_incoming_migration(QIOChannel *ioc, -> >>> -> >>> -> >>> } -> >>> -> >>> -> >>> -> >>> -> >>> trace_migration_socket_incoming_accepted(); -> >>> -> >>> -> >>> -> >>> -> >>> qio_channel_set_name(QIO_CHANNEL(sioc), -"migration-socket-incoming"); -> >>> -> >>> -> >>> + qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN); -> >>> -> >>> -> >>> migration_channel_process_incoming(migrate_get_current(), -> >>> -> >>> -> >>> QIO_CHANNEL(sioc)); -> >>> -> >>> -> >>> object_unref(OBJECT(sioc)); -> >>> -> >>> -> >>> -> >>> -> >>> My test will not hang any more.
-> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> Original Mail -> >>> -> >>> -> >>> -> >>> From: address@hidden -> >>> To: 王广10165992 address@hidden -> >>> Cc: address@hidden address@hidden -> >>> Date: 2017-03-21 15:58 -> >>> Subject: Re: [Qemu-devel] Reply: Re: [BUG]COLO failover hang -> >>> -> >>> -> >>> -> >>> -> >>> -> >>> Hi,Wang. -> >>> -> >>> You can test this branch: -> >>> -> >>> -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -> >>> -> >>> and please follow wiki ensure your own configuration correctly. -> >>> -> >>> -http://wiki.qemu-project.org/Features/COLO -> >>> -> >>> -> >>> Thanks -> >>> -> >>> Zhang Chen -> >>> -> >>> -> >>> On 03/21/2017 03:27 PM, address@hidden wrote: -> >>> > -> >>> > hi. -> >>> > -> >>> > I test the git qemu master have the same problem.
-> >>> > -> >>> > (gdb) bt -> >>> > -> >>> > #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -> >>> > niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -> >>> > -> >>> > #1 0x00007f658e4aa0c2 in qio_channel_read -> >>> > (address@hidden, address@hidden "", -> >>> > address@hidden, address@hidden) at io/channel.c:114 -> >>> > -> >>> > #2 0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>, -> >>> > buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at -> >>> > migration/qemu-file-channel.c:78 -> >>> > -> >>> > #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -> >>> > migration/qemu-file.c:295 -> >>> > -> >>> > #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -> >>> > address@hidden) at migration/qemu-file.c:555 -> >>> > -> >>> > #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -> >>> > migration/qemu-file.c:568 -> >>> > -> >>> > #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -> >>> > migration/qemu-file.c:648 -> >>> > -> >>> > #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -> >>> > address@hidden) at migration/colo.c:244 -> >>> > -> >>> > #8 0x00007f658e3e681e in colo_receive_check_message (f=<optimized -> >>> > out>, address@hidden, -> >>> > address@hidden) -> >>> > -> >>> > at migration/colo.c:264 -> >>> > -> >>> > #9 0x00007f658e3e740e in colo_process_incoming_thread -> >>> > (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577 -> >>> > -> >>> > #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -> >>> > -> >>> > #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -> >>> > -> >>> > (gdb) p ioc->name -> >>> > -> >>> > $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -> >>> > ->
>>> > (gdb) p ioc->features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -> >>> > -> >>> > $3 = 0 -> >>> > -> >>> > -> >>> > (gdb) bt -> >>> > -> >>> > #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -> >>> > condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -> >>> > -> >>> > #1 0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at -> >>> > gmain.c:3054 -> >>> > -> >>> > #2 g_main_context_dispatch (context=<optimized out>, -> >>> > address@hidden) at gmain.c:3630 -> >>> > -> >>> > #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -> >>> > -> >>> > #4 os_host_main_loop_wait (timeout=<optimized out>) at -> >>> > util/main-loop.c:258 -> >>> > -> >>> > #5 main_loop_wait (address@hidden) at -> >>> > util/main-loop.c:506 -> >>> > -> >>> > #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -> >>> > -> >>> > #7 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized -> >>> > out>) at vl.c:4709 -> >>> > -> >>> > (gdb) p ioc->features -> >>> > -> >>> > $1 = 6 -> >>> > -> >>> > (gdb) p ioc->name -> >>> > -> >>> > $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -> >>> > -> >>> > -> >>> > May be socket_accept_incoming_migration should -> >>> > call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -> >>> > -> >>> > -> >>> > thank you.
-> >>> > -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > Original Mail -> >>> > address@hidden -> >>> > address@hidden -> >>> > address@hidden@huawei.com> -> >>> > *Date:* 2017-03-16 14:46 -> >>> > *Subject:* *Re: [Qemu-devel] COLO failover hang* -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > On 03/15/2017 05:06 PM, wangguang wrote: -> >>> > > am testing QEMU COLO feature described here [QEMU -> >>> > > Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -> >>> > > -> >>> > > When the Primary Node panic,the Secondary Node qemu hang. -> >>> > > hang at recvmsg in qio_channel_socket_readv. -> >>> > > And I run { 'execute': 'nbd-server-stop' } and { "execute": -> >>> > > "x-colo-lost-heartbeat" } in Secondary VM's -> >>> > > monitor,the Secondary Node qemu still hang at recvmsg . -> >>> > > -> >>> > > I found that the colo in qemu is not complete yet. -> >>> > > Do the colo have any plan for development? -> >>> > -> >>> > Yes, We are developing. You can see some of patch we pushing. -> >>> > -> >>> > > Has anyone ever run it successfully? Any help is appreciated! -> >>> > -> >>> > In our internal version can run it successfully, -> >>> > The failover detail you can ask Zhanghailiang for help.
-> >>> > Next time if you have some question about COLO, -> >>> > please cc me and zhanghailiang address@hidden -> >>> > -> >>> > -> >>> > Thanks -> >>> > Zhang Chen -> >>> > -> >>> > -> >>> > > -> >>> > > -> >>> > > -> >>> > > centos7.2+qemu2.7.50 -> >>> > > (gdb) bt -> >>> > > #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 -> >>> > > #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized -out>, -> >>> > > iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, -errp=0x0) at -> >>> > > io/channel-socket.c:497 -> >>> > > #2 0x00007f3e03329472 in qio_channel_read (address@hidden, -> >>> > > address@hidden "", address@hidden, -> >>> > > address@hidden) at io/channel.c:97 -> >>> > > #3 0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>, -> >>> > > buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at -> >>> > > migration/qemu-file-channel.c:78 -> >>> > > #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at -> >>> > > migration/qemu-file.c:257 -> >>> > > #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, -> >>> > > address@hidden) at migration/qemu-file.c:510 -> >>> > > #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at -> >>> > > migration/qemu-file.c:523 -> >>> > > #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at -> >>> > > migration/qemu-file.c:603 -> >>> > > #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, -> >>> > > address@hidden) at migration/colo.c:215 -> >>> > > #9 0x00007f3e0327250d in colo_wait_handle_message -(errp=0x7f3d62bfaa48, -> >>> > > checkpoint_request=<synthetic pointer>, f=<optimized out>) at -> >>> > > migration/colo.c:546 -> >>> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at -> >>> > >
migration/colo.c:649 -> >>> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 -> >>> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc..so.6 -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > > -- -> >>> > > View this message in context: -http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html -> >>> > > Sent from the Developer mailing list archive at Nabble.com. -> >>> > > -> >>> > > -> >>> > > -> >>> > > -> >>> > -> >>> > -- -> >>> > Thanks -> >>> > Zhang Chen -> >>> > -> >>> > -> >>> > -> >>> > -> >>> > -> >>> -> >> -> > -- -> > Dr. David Alan Gilbert / address@hidden / Manchester, UK -> > -> > . -> > -> - diff --git a/results/classifier/016/none/70868267 b/results/classifier/016/none/70868267 deleted file mode 100644 index 3f50c2ef..00000000 --- a/results/classifier/016/none/70868267 +++ /dev/null @@ -1,67 +0,0 @@ -x86: 0.245 -operating system: 0.079 -files: 0.026 -hypervisor: 0.023 -TCG: 0.023 -debug: 0.020 -network: 0.019 -PID: 0.018 -i386: 0.011 -virtual: 0.008 -register: 0.006 -user-level: 0.004 -ppc: 0.003 -semantic: 0.003 -device: 0.002 -socket: 0.002 -assembly: 0.002 -VMM: 0.002 -kernel: 0.002 -performance: 0.001 -arm: 0.001 -alpha: 0.001 -graphic: 0.001 -vnc: 0.001 -peripherals: 0.001 -architecture: 0.001 -boot: 0.001 -risc-v: 0.001 -permissions: 0.000 -KVM: 0.000 -mistranslation: 0.000 - -[Qemu-devel] [BUG] Failed to compile using gcc7.1 - -Hi all, - -After upgrading gcc from 6.3.1 to 7.1.1, qemu can't be compiled with gcc.
- -The error is: - ------- - CC block/blkdebug.o -block/blkdebug.c: In function 'blkdebug_refresh_filename': -block/blkdebug.c:693:31: error: '%s' directive output may be truncated -writing up to 4095 bytes into a region of size 4086 -[-Werror=format-truncation=] -"blkdebug:%s:%s", s->config_file ?: "", - ^~ -In file included from /usr/include/stdio.h:939:0, - from /home/adam/qemu/include/qemu/osdep.h:68, - from block/blkdebug.c:25: -/usr/include/bits/stdio2.h:64:10: note: '__builtin___snprintf_chk' -output 11 or more bytes (assuming 4106) into a destination of size 4096 -return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1, - ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - __bos (__s), __fmt, __va_arg_pack ()); - ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -cc1: all warnings being treated as errors -make: *** [/home/adam/qemu/rules.mak:69: block/blkdebug.o] Error 1 ------- - -It seems that gcc 7 is introducing more restrict check for printf. -If using clang, although there are some extra warning, it can at least -pass the compile. -Thanks, -Qu - diff --git a/results/classifier/016/none/80604314 b/results/classifier/016/none/80604314 deleted file mode 100644 index 8112f757..00000000 --- a/results/classifier/016/none/80604314 +++ /dev/null @@ -1,1507 +0,0 @@ -hypervisor: 0.669 -network: 0.654 -debug: 0.554 -operating system: 0.404 -virtual: 0.190 -files: 0.103 -TCG: 0.102 -PID: 0.097 -boot: 0.095 -device: 0.090 -user-level: 0.089 -VMM: 0.084 -vnc: 0.081 -register: 0.062 -socket: 0.055 -kernel: 0.042 -KVM: 0.021 -risc-v: 0.019 -performance: 0.014 -assembly: 0.011 -semantic: 0.008 -architecture: 0.007 -alpha: 0.004 -ppc: 0.003 -permissions: 0.003 -graphic: 0.003 -peripherals: 0.002 -mistranslation: 0.002 -x86: 0.001 -arm: 0.001 -i386: 0.000 - -[BUG] vhost-vdpa: qemu-system-s390x crashes with second virtio-net-ccw device - -When I start qemu with a second virtio-net-ccw device (i.e. 
adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) - -Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -the autogenerated virtio-net-ccw device is present) works. Specifying -several "-device virtio-net-pci" works as well. - -Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -works (in-between state does not compile). - -This is reproducible with tcg as well. Same problem both with ---enable-vhost-vdpa and --disable-vhost-vdpa. - -Have not yet tried to figure out what might be special with -virtio-ccw... anyone have an idea? - -[This should probably be considered a blocker?] - -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -When I start qemu with a second virtio-net-ccw device (i.e. adding -> --device virtio-net-ccw in addition to the autogenerated device), I get -> -a segfault. gdb points to -> -> -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> -config=0x55d6ad9e3f80 "RT") at -> -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> -(backtrace doesn't go further) -> -> -Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -> -the autogenerated virtio-net-ccw device is present) works. Specifying -> -several "-device virtio-net-pci" works as well. -> -> -Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -> -client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -> -works (in-between state does not compile). -Ouch. 
I didn't test all in-between states :( -But I wish we had a 0-day instrastructure like kernel has, -that catches things like that. - -> -This is reproducible with tcg as well. Same problem both with -> ---enable-vhost-vdpa and --disable-vhost-vdpa. -> -> -Have not yet tried to figure out what might be special with -> -virtio-ccw... anyone have an idea? -> -> -[This should probably be considered a blocker?] - -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin" <mst@redhat.com> wrote: - -> -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -> When I start qemu with a second virtio-net-ccw device (i.e. adding -> -> -device virtio-net-ccw in addition to the autogenerated device), I get -> -> a segfault. gdb points to -> -> -> -> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> -> config=0x55d6ad9e3f80 "RT") at -> -> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> -> -> (backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. - -> -> -> -> Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -> -> the autogenerated virtio-net-ccw device is present) works. Specifying -> -> several "-device virtio-net-pci" works as well. 
-> -> -> -> Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -> -> client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -> -> works (in-between state does not compile). -> -> -Ouch. I didn't test all in-between states :( -> -But I wish we had a 0-day instrastructure like kernel has, -> -that catches things like that. -Yep, that would be useful... so patchew only builds the complete series? - -> -> -> This is reproducible with tcg as well. Same problem both with -> -> --enable-vhost-vdpa and --disable-vhost-vdpa. -> -> -> -> Have not yet tried to figure out what might be special with -> -> virtio-ccw... anyone have an idea? -> -> -> -> [This should probably be considered a blocker?] -I think so, as it makes s390x unusable with more that one -virtio-net-ccw device, and I don't even see a workaround. - -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> -On Fri, 24 Jul 2020 09:30:58 -0400 -> -"Michael S. Tsirkin" <mst@redhat.com> wrote: -> -> -> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -> > When I start qemu with a second virtio-net-ccw device (i.e. adding -> -> > -device virtio-net-ccw in addition to the autogenerated device), I get -> -> > a segfault. gdb points to -> -> > -> -> > #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> -> > config=0x55d6ad9e3f80 "RT") at -> -> > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> > 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> > -> -> > (backtrace doesn't go further) -> -> -The core was incomplete, but running under gdb directly shows that it -> -is just a bog-standard config space access (first for that device). -> -> -The cause of the crash is that nc->peer is not set... no idea how that -> -can happen, not that familiar with that part of QEMU. (Should the code -> -check, or is that really something that should not happen?) 
-> -> -What I don't understand is why it is set correctly for the first, -> -autogenerated virtio-net-ccw device, but not for the second one, and -> -why virtio-net-pci doesn't show these problems. The only difference -> -between -ccw and -pci that comes to my mind here is that config space -> -accesses for ccw are done via an asynchronous operation, so timing -> -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? - -> -> > -> -> > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -> -> > the autogenerated virtio-net-ccw device is present) works. Specifying -> -> > several "-device virtio-net-pci" works as well. -> -> > -> -> > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -> -> > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -> -> > works (in-between state does not compile). -> -> -> -> Ouch. I didn't test all in-between states :( -> -> But I wish we had a 0-day instrastructure like kernel has, -> -> that catches things like that. -> -> -Yep, that would be useful... so patchew only builds the complete series? -> -> -> -> -> > This is reproducible with tcg as well. Same problem both with -> -> > --enable-vhost-vdpa and --disable-vhost-vdpa. -> -> > -> -> > Have not yet tried to figure out what might be special with -> -> > virtio-ccw... anyone have an idea? -> -> > -> -> > [This should probably be considered a blocker?] -> -> -I think so, as it makes s390x unusable with more that one -> -virtio-net-ccw device, and I don't even see a workaround. - -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin" <mst@redhat.com> wrote: - -> -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> -> On Fri, 24 Jul 2020 09:30:58 -0400 -> -> "Michael S. 
Tsirkin" <mst@redhat.com> wrote: -> -> -> -> > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -> > > When I start qemu with a second virtio-net-ccw device (i.e. adding -> -> > > -device virtio-net-ccw in addition to the autogenerated device), I get -> -> > > a segfault. gdb points to -> -> > > -> -> > > #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> -> > > config=0x55d6ad9e3f80 "RT") at -> -> > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> > > 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> > > -> -> > > (backtrace doesn't go further) -> -> -> -> The core was incomplete, but running under gdb directly shows that it -> -> is just a bog-standard config space access (first for that device). -> -> -> -> The cause of the crash is that nc->peer is not set... no idea how that -> -> can happen, not that familiar with that part of QEMU. (Should the code -> -> check, or is that really something that should not happen?) -> -> -> -> What I don't understand is why it is set correctly for the first, -> -> autogenerated virtio-net-ccw device, but not for the second one, and -> -> why virtio-net-pci doesn't show these problems. The only difference -> -> between -ccw and -pci that comes to my mind here is that config space -> -> accesses for ccw are done via an asynchronous operation, so timing -> -> might be different. -> -> -Hopefully Jason has an idea. Could you post a full command line -> -please? Do you need a working guest to trigger this? Does this trigger -> -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. 
I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 - --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) - -> -> -> > > -> -> > > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only -> -> > > the autogenerated virtio-net-ccw device is present) works. Specifying -> -> > > several "-device virtio-net-pci" works as well. -> -> > > -> -> > > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net -> -> > > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") -> -> > > works (in-between state does not compile). -> -> > -> -> > Ouch. I didn't test all in-between states :( -> -> > But I wish we had a 0-day instrastructure like kernel has, -> -> > that catches things like that. -> -> -> -> Yep, that would be useful... so patchew only builds the complete series? -> -> -> -> > -> -> > > This is reproducible with tcg as well. Same problem both with -> -> > > --enable-vhost-vdpa and --disable-vhost-vdpa. -> -> > > -> -> > > Have not yet tried to figure out what might be special with -> -> > > virtio-ccw... anyone have an idea? -> -> > > -> -> > > [This should probably be considered a blocker?] -> -> -> -> I think so, as it makes s390x unusable with more that one -> -> virtio-net-ccw device, and I don't even see a workaround. -> - -On 2020/7/24 11:34 PM, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S.
Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. 
I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. - -Thanks -0001-virtio-net-check-the-existence-of-peer-before-accesi.patch -Description: -Text Data - -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang <jasowang@redhat.com> wrote: - -> -On 2020/7/24 11:34 PM, Cornelia Huck wrote: -> -> On Fri, 24 Jul 2020 11:17:57 -0400 -> -> "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> -> ->> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> ->>> On Fri, 24 Jul 2020 09:30:58 -0400 -> ->>> "Michael S. Tsirkin"<mst@redhat.com> wrote: -> ->>> -> ->>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> ->>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding -> ->>>>> -device virtio-net-ccw in addition to the autogenerated device), I get -> ->>>>> a segfault.
gdb points to -> ->>>>> -> ->>>>> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> ->>>>> config=0x55d6ad9e3f80 "RT") at -> ->>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> ->>>>> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> ->>>>> -> ->>>>> (backtrace doesn't go further) -> ->>> The core was incomplete, but running under gdb directly shows that it -> ->>> is just a bog-standard config space access (first for that device). -> ->>> -> ->>> The cause of the crash is that nc->peer is not set... no idea how that -> ->>> can happen, not that familiar with that part of QEMU. (Should the code -> ->>> check, or is that really something that should not happen?) -> ->>> -> ->>> What I don't understand is why it is set correctly for the first, -> ->>> autogenerated virtio-net-ccw device, but not for the second one, and -> ->>> why virtio-net-pci doesn't show these problems. The only difference -> ->>> between -ccw and -pci that comes to my mind here is that config space -> ->>> accesses for ccw are done via an asynchronous operation, so timing -> ->>> might be different. -> ->> Hopefully Jason has an idea. Could you post a full command line -> ->> please? Do you need a working guest to trigger this? Does this trigger -> ->> on an x86 host? -> -> Yes, it does trigger with tcg-on-x86 as well. I've been using -> -> -> -> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu -> -> qemu,zpci=on -> -> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 -> -> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 -> -> -device -> -> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -> -> -device virtio-net-ccw -> -> -> -> It seems it needs the guest actually doing something with the nics; I -> -> cannot reproduce the crash if I use the old advent calendar moon buggy -> -> image and just add a virtio-net-ccw device. 
-> -> -> -> (I don't think it's a problem with my local build, as I see the problem -> -> both on my laptop and on an LPAR.) -> -> -> -It looks to me we forget the check the existence of peer. -> -> -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck <cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? - -On 2020/7/27 2:43 PM, Cornelia Huck wrote: -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang <jasowang@redhat.com> wrote: -On 2020/7/24 11:34 PM, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems.
The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck <cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? -It can be hit with virtio-net-pci as well (just start without peer). -For autogenerated virtio-net-cww, I think the reason is that it has -already had a peer set. -Thanks - -On Mon, 27 Jul 2020 15:38:12 +0800 -Jason Wang <jasowang@redhat.com> wrote: - -> -On 2020/7/27 2:43 PM, Cornelia Huck wrote: -> -> On Sat, 25 Jul 2020 08:40:07 +0800 -> -> Jason Wang <jasowang@redhat.com> wrote: -> -> -> ->> On 2020/7/24 11:34 PM, Cornelia Huck wrote: -> ->>> On Fri, 24 Jul 2020 11:17:57 -0400 -> ->>> "Michael S.
Tsirkin"<mst@redhat.com> wrote: -> ->>> -> ->>>> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> ->>>>> On Fri, 24 Jul 2020 09:30:58 -0400 -> ->>>>> "Michael S. Tsirkin"<mst@redhat.com> wrote: -> ->>>>> -> ->>>>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> ->>>>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding -> ->>>>>>> -device virtio-net-ccw in addition to the autogenerated device), I get -> ->>>>>>> a segfault. gdb points to -> ->>>>>>> -> ->>>>>>> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, -> ->>>>>>> config=0x55d6ad9e3f80 "RT") at -> ->>>>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> ->>>>>>> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { -> ->>>>>>> -> ->>>>>>> (backtrace doesn't go further) -> ->>>>> The core was incomplete, but running under gdb directly shows that it -> ->>>>> is just a bog-standard config space access (first for that device). -> ->>>>> -> ->>>>> The cause of the crash is that nc->peer is not set... no idea how that -> ->>>>> can happen, not that familiar with that part of QEMU. (Should the code -> ->>>>> check, or is that really something that should not happen?) -> ->>>>> -> ->>>>> What I don't understand is why it is set correctly for the first, -> ->>>>> autogenerated virtio-net-ccw device, but not for the second one, and -> ->>>>> why virtio-net-pci doesn't show these problems. The only difference -> ->>>>> between -ccw and -pci that comes to my mind here is that config space -> ->>>>> accesses for ccw are done via an asynchronous operation, so timing -> ->>>>> might be different. -> ->>>> Hopefully Jason has an idea. Could you post a full command line -> ->>>> please? Do you need a working guest to trigger this? Does this trigger -> ->>>> on an x86 host? -> ->>> Yes, it does trigger with tcg-on-x86 as well. 
I've been using -> ->>> -> ->>> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu -> ->>> qemu,zpci=on -> ->>> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 -> ->>> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 -> ->>> -device -> ->>> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -> ->>> -device virtio-net-ccw -> ->>> -> ->>> It seems it needs the guest actually doing something with the nics; I -> ->>> cannot reproduce the crash if I use the old advent calendar moon buggy -> ->>> image and just add a virtio-net-ccw device. -> ->>> -> ->>> (I don't think it's a problem with my local build, as I see the problem -> ->>> both on my laptop and on an LPAR.) -> ->> -> ->> It looks to me we forget the check the existence of peer. -> ->> -> ->> Please try the attached patch to see if it works. -> -> Thanks, that patch gets my guest up and running again. So, FWIW, -> -> -> -> Tested-by: Cornelia Huck <cohuck@redhat.com> -> -> -> -> Any idea why this did not hit with virtio-net-pci (or the autogenerated -> -> virtio-net-ccw device)? -> -> -> -It can be hit with virtio-net-pci as well (just start without peer). -Hm, I had not been able to reproduce the crash with a 'naked' -device -virtio-net-pci. But checking seems to be the right idea anyway. - -> -> -For autogenerated virtio-net-cww, I think the reason is that it has -> -already had a peer set. -Ok, that might well be. - -On 2020/7/27 4:41 PM, Cornelia Huck wrote: -On Mon, 27 Jul 2020 15:38:12 +0800 -Jason Wang <jasowang@redhat.com> wrote: -On 2020/7/27 2:43 PM, Cornelia Huck wrote: -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang <jasowang@redhat.com> wrote: -On 2020/7/24 11:34 PM, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S.
Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. 
- -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck <cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? -It can be hit with virtio-net-pci as well (just start without peer). -Hm, I had not been able to reproduce the crash with a 'naked' -device -virtio-net-pci. But checking seems to be the right idea anyway. -Sorry for being unclear, I meant for networking part, you just need -start without peer, and you need a real guest (any Linux) that is trying -to access the config space of virtio-net. -Thanks -For autogenerated virtio-net-cww, I think the reason is that it has -already had a peer set. -Ok, that might well be. - -On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: -> -> -On 2020/7/27 4:41 PM, Cornelia Huck wrote: -> -> On Mon, 27 Jul 2020 15:38:12 +0800 -> -> Jason Wang <jasowang@redhat.com> wrote: -> -> -> -> > On 2020/7/27 2:43 PM, Cornelia Huck wrote: -> -> > > On Sat, 25 Jul 2020 08:40:07 +0800 -> -> > > Jason Wang <jasowang@redhat.com> wrote: -> -> > > > On 2020/7/24 11:34 PM, Cornelia Huck wrote: -> -> > > > > On Fri, 24 Jul 2020 11:17:57 -0400 -> -> > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> -> > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400 -> -> > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -> -> > > > > > > > > When I start qemu with a second virtio-net-ccw device (i.e.
-> -> > > > > > > > > adding -> -> > > > > > > > > -device virtio-net-ccw in addition to the autogenerated -> -> > > > > > > > > device), I get -> -> > > > > > > > > a segfault. gdb points to -> -> > > > > > > > > -> -> > > > > > > > > #0 0x000055d6ab52681d in virtio_net_get_config -> -> > > > > > > > > (vdev=<optimized out>, -> -> > > > > > > > > config=0x55d6ad9e3f80 "RT") at -> -> > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> > > > > > > > > 146 if (nc->peer->info->type == -> -> > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> > > > > > > > > -> -> > > > > > > > > (backtrace doesn't go further) -> -> > > > > > > The core was incomplete, but running under gdb directly shows -> -> > > > > > > that it -> -> > > > > > > is just a bog-standard config space access (first for that -> -> > > > > > > device). -> -> > > > > > > -> -> > > > > > > The cause of the crash is that nc->peer is not set... no idea -> -> > > > > > > how that -> -> > > > > > > can happen, not that familiar with that part of QEMU. (Should -> -> > > > > > > the code -> -> > > > > > > check, or is that really something that should not happen?) -> -> > > > > > > -> -> > > > > > > What I don't understand is why it is set correctly for the -> -> > > > > > > first, -> -> > > > > > > autogenerated virtio-net-ccw device, but not for the second -> -> > > > > > > one, and -> -> > > > > > > why virtio-net-pci doesn't show these problems. The only -> -> > > > > > > difference -> -> > > > > > > between -ccw and -pci that comes to my mind here is that config -> -> > > > > > > space -> -> > > > > > > accesses for ccw are done via an asynchronous operation, so -> -> > > > > > > timing -> -> > > > > > > might be different. -> -> > > > > > Hopefully Jason has an idea. Could you post a full command line -> -> > > > > > please? Do you need a working guest to trigger this? Does this -> -> > > > > > trigger -> -> > > > > > on an x86 host? 
-> -> > > > > Yes, it does trigger with tcg-on-x86 as well. I've been using -> -> > > > > -> -> > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu -> -> > > > > qemu,zpci=on -> -> > > > > -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 -> -> > > > > -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 -> -> > > > > -device -> -> > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -> -> > > > > -device virtio-net-ccw -> -> > > > > -> -> > > > > It seems it needs the guest actually doing something with the nics; -> -> > > > > I -> -> > > > > cannot reproduce the crash if I use the old advent calendar moon -> -> > > > > buggy -> -> > > > > image and just add a virtio-net-ccw device. -> -> > > > > -> -> > > > > (I don't think it's a problem with my local build, as I see the -> -> > > > > problem -> -> > > > > both on my laptop and on an LPAR.) -> -> > > > It looks to me we forget the check the existence of peer. -> -> > > > -> -> > > > Please try the attached patch to see if it works. -> -> > > Thanks, that patch gets my guest up and running again. So, FWIW, -> -> > > -> -> > > Tested-by: Cornelia Huck <cohuck@redhat.com> -> -> > > -> -> > > Any idea why this did not hit with virtio-net-pci (or the autogenerated -> -> > > virtio-net-ccw device)? -> -> > -> -> > It can be hit with virtio-net-pci as well (just start without peer). -> -> Hm, I had not been able to reproduce the crash with a 'naked' -device -> -> virtio-net-pci. But checking seems to be the right idea anyway. -> -> -> -Sorry for being unclear, I meant for networking part, you just need start -> -without peer, and you need a real guest (any Linux) that is trying to access -> -the config space of virtio-net. -> -> -Thanks -A pxe guest will do it, but that doesn't support ccw, right? - -I'm still unclear why this triggers with ccw but not pci - -any idea? 
- -> -> -> -> -> > For autogenerated virtio-net-cww, I think the reason is that it has -> -> > already had a peer set. -> -> Ok, that might well be. -> -> -> -> - -On 2020/7/27 7:43 PM, Michael S. Tsirkin wrote: -On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: -On 2020/7/27 4:41 PM, Cornelia Huck wrote: -On Mon, 27 Jul 2020 15:38:12 +0800 -Jason Wang<jasowang@redhat.com> wrote: -On 2020/7/27 2:43 PM, Cornelia Huck wrote: -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang<jasowang@redhat.com> wrote: -On 2020/7/24 11:34 PM, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device). - -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different.
-Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck<cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? -It can be hit with virtio-net-pci as well (just start without peer). -Hm, I had not been able to reproduce the crash with a 'naked' -device -virtio-net-pci. But checking seems to be the right idea anyway. -Sorry for being unclear, I meant for networking part, you just need start -without peer, and you need a real guest (any Linux) that is trying to access -the config space of virtio-net. - -Thanks -A pxe guest will do it, but that doesn't support ccw, right? -Yes, it depends on the cli actually. -I'm still unclear why this triggers with ccw but not pci - -any idea? -I don't test pxe but I can reproduce this with pci (just start a linux -guest without a peer). -Thanks - -On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote: -> -> -On 2020/7/27 7:43 PM, Michael S.
Tsirkin wrote: -> -> On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: -> -> > On 2020/7/27 4:41 PM, Cornelia Huck wrote: -> -> > > On Mon, 27 Jul 2020 15:38:12 +0800 -> -> > > Jason Wang<jasowang@redhat.com> wrote: -> -> > > -> -> > > > On 2020/7/27 2:43 PM, Cornelia Huck wrote: -> -> > > > > On Sat, 25 Jul 2020 08:40:07 +0800 -> -> > > > > Jason Wang<jasowang@redhat.com> wrote: -> -> > > > > > On 2020/7/24 11:34 PM, Cornelia Huck wrote: -> -> > > > > > > On Fri, 24 Jul 2020 11:17:57 -0400 -> -> > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> > > > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -> -> > > > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400 -> -> > > > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: -> -> > > > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck -> -> > > > > > > > > > wrote: -> -> > > > > > > > > > > When I start qemu with a second virtio-net-ccw device -> -> > > > > > > > > > > (i.e. adding -> -> > > > > > > > > > > -device virtio-net-ccw in addition to the autogenerated -> -> > > > > > > > > > > device), I get -> -> > > > > > > > > > > a segfault. gdb points to -> -> > > > > > > > > > > -> -> > > > > > > > > > > #0 0x000055d6ab52681d in virtio_net_get_config -> -> > > > > > > > > > > (vdev=<optimized out>, -> -> > > > > > > > > > > config=0x55d6ad9e3f80 "RT") at -> -> > > > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 -> -> > > > > > > > > > > 146 if (nc->peer->info->type == -> -> > > > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) { -> -> > > > > > > > > > > -> -> > > > > > > > > > > (backtrace doesn't go further) -> -> > > > > > > > > The core was incomplete, but running under gdb directly -> -> > > > > > > > > shows that it -> -> > > > > > > > > is just a bog-standard config space access (first for that -> -> > > > > > > > > device). -> -> > > > > > > > > -> -> > > > > > > > > The cause of the crash is that nc->peer is not set...
no -> -> > > > > > > > > idea how that -> -> > > > > > > > > can happen, not that familiar with that part of QEMU. -> -> > > > > > > > > (Should the code -> -> > > > > > > > > check, or is that really something that should not happen?) -> -> > > > > > > > > -> -> > > > > > > > > What I don't understand is why it is set correctly for the -> -> > > > > > > > > first, -> -> > > > > > > > > autogenerated virtio-net-ccw device, but not for the second -> -> > > > > > > > > one, and -> -> > > > > > > > > why virtio-net-pci doesn't show these problems. The only -> -> > > > > > > > > difference -> -> > > > > > > > > between -ccw and -pci that comes to my mind here is that -> -> > > > > > > > > config space -> -> > > > > > > > > accesses for ccw are done via an asynchronous operation, so -> -> > > > > > > > > timing -> -> > > > > > > > > might be different. -> -> > > > > > > > Hopefully Jason has an idea. Could you post a full command -> -> > > > > > > > line -> -> > > > > > > > please? Do you need a working guest to trigger this? Does -> -> > > > > > > > this trigger -> -> > > > > > > > on an x86 host? -> -> > > > > > > Yes, it does trigger with tcg-on-x86 as well. 
I've been using -> -> > > > > > > -> -> > > > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -> -> > > > > > > -cpu qemu,zpci=on -> -> > > > > > > -m 1024 -nographic -device -> -> > > > > > > virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 -> -> > > > > > > -drive -> -> > > > > > > file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 -> -> > > > > > > -device -> -> > > > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -> -> > > > > > > -device virtio-net-ccw -> -> > > > > > > -> -> > > > > > > It seems it needs the guest actually doing something with the -> -> > > > > > > nics; I -> -> > > > > > > cannot reproduce the crash if I use the old advent calendar -> -> > > > > > > moon buggy -> -> > > > > > > image and just add a virtio-net-ccw device. -> -> > > > > > > -> -> > > > > > > (I don't think it's a problem with my local build, as I see the -> -> > > > > > > problem -> -> > > > > > > both on my laptop and on an LPAR.) -> -> > > > > > It looks to me we forget the check the existence of peer. -> -> > > > > > -> -> > > > > > Please try the attached patch to see if it works. -> -> > > > > Thanks, that patch gets my guest up and running again. So, FWIW, -> -> > > > > -> -> > > > > Tested-by: Cornelia Huck<cohuck@redhat.com> -> -> > > > > -> -> > > > > Any idea why this did not hit with virtio-net-pci (or the -> -> > > > > autogenerated -> -> > > > > virtio-net-ccw device)? -> -> > > > It can be hit with virtio-net-pci as well (just start without peer). -> -> > > Hm, I had not been able to reproduce the crash with a 'naked' -device -> -> > > virtio-net-pci. But checking seems to be the right idea anyway. -> -> > Sorry for being unclear, I meant for networking part, you just need start -> -> > without peer, and you need a real guest (any Linux) that is trying to -> -> > access -> -> > the config space of virtio-net. 
-> -> > -> -> > Thanks -> -> A pxe guest will do it, but that doesn't support ccw, right? -> -> -> -Yes, it depends on the cli actually. -> -> -> -> -> -> I'm still unclear why this triggers with ccw but not pci - -> -> any idea? -> -> -> -I don't test pxe but I can reproduce this with pci (just start a linux guest -> -without a peer). -> -> -Thanks -> -Might be a good addition to a unit test. Not sure what would the -test do exactly: just make sure guest runs? Looks like a lot of work -for an empty test ... maybe we can poke at the guest config with -qtest commands at least. - --- -MST - -On 2020/7/27 9:16 PM, Michael S. Tsirkin wrote: -On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote: -On 2020/7/27 7:43 PM, Michael S. Tsirkin wrote: -On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: -On 2020/7/27 4:41 PM, Cornelia Huck wrote: -On Mon, 27 Jul 2020 15:38:12 +0800 -Jason Wang<jasowang@redhat.com> wrote: -On 2020/7/27 2:43 PM, Cornelia Huck wrote: -On Sat, 25 Jul 2020 08:40:07 +0800 -Jason Wang<jasowang@redhat.com> wrote: -On 2020/7/24 11:34 PM, Cornelia Huck wrote: -On Fri, 24 Jul 2020 11:17:57 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: -On Fri, 24 Jul 2020 09:30:58 -0400 -"Michael S. Tsirkin"<mst@redhat.com> wrote: -On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: -When I start qemu with a second virtio-net-ccw device (i.e. adding --device virtio-net-ccw in addition to the autogenerated device), I get -a segfault. gdb points to - -#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, - config=0x55d6ad9e3f80 "RT") at -/home/cohuck/git/qemu/hw/net/virtio-net.c:146 -146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { - -(backtrace doesn't go further) -The core was incomplete, but running under gdb directly shows that it -is just a bog-standard config space access (first for that device).
- -The cause of the crash is that nc->peer is not set... no idea how that -can happen, not that familiar with that part of QEMU. (Should the code -check, or is that really something that should not happen?) - -What I don't understand is why it is set correctly for the first, -autogenerated virtio-net-ccw device, but not for the second one, and -why virtio-net-pci doesn't show these problems. The only difference -between -ccw and -pci that comes to my mind here is that config space -accesses for ccw are done via an asynchronous operation, so timing -might be different. -Hopefully Jason has an idea. Could you post a full command line -please? Do you need a working guest to trigger this? Does this trigger -on an x86 host? -Yes, it does trigger with tcg-on-x86 as well. I've been using - -s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on --m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 --drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 --device -scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 --device virtio-net-ccw - -It seems it needs the guest actually doing something with the nics; I -cannot reproduce the crash if I use the old advent calendar moon buggy -image and just add a virtio-net-ccw device. - -(I don't think it's a problem with my local build, as I see the problem -both on my laptop and on an LPAR.) -It looks to me we forget the check the existence of peer. - -Please try the attached patch to see if it works. -Thanks, that patch gets my guest up and running again. So, FWIW, - -Tested-by: Cornelia Huck<cohuck@redhat.com> - -Any idea why this did not hit with virtio-net-pci (or the autogenerated -virtio-net-ccw device)? -It can be hit with virtio-net-pci as well (just start without peer). -Hm, I had not been able to reproduce the crash with a 'naked' -device -virtio-net-pci. But checking seems to be the right idea anyway. 
-Sorry for being unclear, I meant for networking part, you just need start -without peer, and you need a real guest (any Linux) that is trying to access -the config space of virtio-net. - -Thanks -A pxe guest will do it, but that doesn't support ccw, right? -Yes, it depends on the cli actually. -I'm still unclear why this triggers with ccw but not pci - -any idea? -I don't test pxe but I can reproduce this with pci (just start a linux guest -without a peer). - -Thanks -Might be a good addition to a unit test. Not sure what would the -test do exactly: just make sure guest runs? Looks like a lot of work -for an empty test ... maybe we can poke at the guest config with -qtest commands at least. -That should work or we can simply extend the exist virtio-net qtest to -do that. -Thanks - |
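
The crash discussed in the thread comes from dereferencing `nc->peer` when no backend is attached to the NIC. The following is only a minimal sketch of the kind of guard the thread refers to ("check the existence of peer"); it is not the patch that was attached to the thread. Every type and function name in it (`DemoClientState`, `DemoClientInfo`, `demo_peer_is_vhost_vdpa`, and so on) is a hypothetical stand-in, not part of QEMU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for QEMU's net client types. */
typedef enum {
    DEMO_DRIVER_TAP,
    DEMO_DRIVER_VHOST_VDPA,
} DemoDriverType;

typedef struct DemoClientInfo {
    DemoDriverType type;
} DemoClientInfo;

typedef struct DemoClientState {
    const DemoClientInfo *info;
    struct DemoClientState *peer;   /* NULL when no backend is attached */
} DemoClientState;

/*
 * Returns true only when a peer exists and it is a vhost-vdpa backend.
 * The unguarded form (nc->peer->info->type) crashes when peer is NULL,
 * which is the situation described for a NIC started without a peer.
 */
static bool demo_peer_is_vhost_vdpa(const DemoClientState *nc)
{
    if (nc == NULL || nc->peer == NULL || nc->peer->info == NULL) {
        return false;
    }
    return nc->peer->info->type == DEMO_DRIVER_VHOST_VDPA;
}
```

With a guard of this shape, a config space read on a device without a peer would skip the vhost-vdpa path instead of faulting. Whether the real fix returns early or falls through to the default config handling is not stated in the thread.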