diff options
Diffstat (limited to 'results/classifier/016/none')
| -rw-r--r-- | results/classifier/016/none/23300761 | 340 | ||||
| -rw-r--r-- | results/classifier/016/none/42613410 | 176 | ||||
| -rw-r--r-- | results/classifier/016/none/42974450 | 456 | ||||
| -rw-r--r-- | results/classifier/016/none/48245039 | 557 | ||||
| -rw-r--r-- | results/classifier/016/none/50773216 | 137 | ||||
| -rw-r--r-- | results/classifier/016/none/55753058 | 320 | ||||
| -rw-r--r-- | results/classifier/016/none/56309929 | 207 | ||||
| -rw-r--r-- | results/classifier/016/none/65781993 | 2820 | ||||
| -rw-r--r-- | results/classifier/016/none/70294255 | 1088 | ||||
| -rw-r--r-- | results/classifier/016/none/70868267 | 67 | ||||
| -rw-r--r-- | results/classifier/016/none/80604314 | 1507 |
11 files changed, 7675 insertions, 0 deletions
diff --git a/results/classifier/016/none/23300761 b/results/classifier/016/none/23300761 new file mode 100644 index 00000000..2a3e6f16 --- /dev/null +++ b/results/classifier/016/none/23300761 @@ -0,0 +1,340 @@ +i386: 0.475 +x86: 0.171 +debug: 0.052 +files: 0.038 +performance: 0.030 +register: 0.029 +virtual: 0.027 +PID: 0.025 +TCG: 0.019 +semantic: 0.018 +operating system: 0.017 +socket: 0.013 +boot: 0.013 +hypervisor: 0.012 +device: 0.012 +user-level: 0.011 +risc-v: 0.010 +alpha: 0.007 +ppc: 0.006 +VMM: 0.005 +vnc: 0.004 +network: 0.004 +architecture: 0.003 +permissions: 0.003 +assembly: 0.003 +peripherals: 0.003 +kernel: 0.002 +arm: 0.002 +graphic: 0.002 +mistranslation: 0.001 +KVM: 0.000 + +[Qemu-devel] [BUG] 216 Alerts reported by LGTM for QEMU (some might be release critical) + +Hi, +LGTM reports 16 errors, 81 warnings and 119 recommendations: +https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list +. +Some of them are already know (wrong format strings), others look like +real errors: +- several multiplication results which don't work as they should in +contrib/vhost-user-gpu, block/* (m->nb_clusters * s->cluster_size only +32 bit!), target/i386/translate.c and other files +- potential buffer overflows in gdbstub.c and other files +I am afraid that the overflows in the block code are release critical, +maybe that in target/i386/translate.c and other errors, too. +About half of the alerts are issues which can be fixed later. + +Regards + +Stefan + +On 13/07/19 19:46, Stefan Weil wrote: +> +> +LGTM reports 16 errors, 81 warnings and 119 recommendations: +> +https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list +. +> +> +Some of them are already know (wrong format strings), others look like +> +real errors: +> +> +- several multiplication results which don't work as they should in +> +contrib/vhost-user-gpu, block/* (m->nb_clusters * s->cluster_size only +> +32 bit!), target/i386/translate.c and other files +m->nb_clusters here is limited by s->l2_slice_size (see for example +handle_alloc) so I wouldn't be surprised if this is a false positive. I +couldn't find this particular multiplication in Coverity, but it has +about 250 issues marked as intentional or false positive so there's +probably a lot of overlap with what LGTM found. + +Paolo + +Am 13.07.2019 um 21:42 schrieb Paolo Bonzini: +> +On 13/07/19 19:46, Stefan Weil wrote: +> +> LGTM reports 16 errors, 81 warnings and 119 recommendations: +> +> +https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list +. +> +> +> +> Some of them are already known (wrong format strings), others look like +> +> real errors: +> +> +> +> - several multiplication results which don't work as they should in +> +> contrib/vhost-user-gpu, block/* (m->nb_clusters * s->cluster_size only +> +> 32 bit!), target/i386/translate.c and other files +> +m->nb_clusters here is limited by s->l2_slice_size (see for example +> +handle_alloc) so I wouldn't be surprised if this is a false positive. I +> +couldn't find this particular multiplication in Coverity, but it has +> +about 250 issues marked as intentional or false positive so there's +> +probably a lot of overlap with what LGTM found. +> +> +Paolo +> +From other projects I know that there is a certain overlap between the +results from Coverity Scan an LGTM, but it is good to have both +analyzers, and the results from LGTM are typically quite reliable. + +Even if we know that there is no multiplication overflow, the code could +be modified. Either the assigned value should use the same data type as +the factors (possible when there is never an overflow, avoids a size +extension), or the multiplication could use the larger data type by +adding a type cast to one of the factors (then an overflow cannot +happen, static code analysers and human reviewers have an easier job, +but the multiplication costs more time). + +Stefan + +Am 14.07.2019 um 15:28 hat Stefan Weil geschrieben: +> +Am 13.07.2019 um 21:42 schrieb Paolo Bonzini: +> +> On 13/07/19 19:46, Stefan Weil wrote: +> +>> LGTM reports 16 errors, 81 warnings and 119 recommendations: +> +>> +https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list +. +> +>> +> +>> Some of them are already known (wrong format strings), others look like +> +>> real errors: +> +>> +> +>> - several multiplication results which don't work as they should in +> +>> contrib/vhost-user-gpu, block/* (m->nb_clusters * s->cluster_size only +> +>> 32 bit!), target/i386/translate.c and other files +Request sizes are limited to 32 bit in the generic block layer before +they are even passed to the individual block drivers, so most if not all +of these are going to be false positives. + +> +> m->nb_clusters here is limited by s->l2_slice_size (see for example +> +> handle_alloc) so I wouldn't be surprised if this is a false positive. I +> +> couldn't find this particular multiplication in Coverity, but it has +> +> about 250 issues marked as intentional or false positive so there's +> +> probably a lot of overlap with what LGTM found. +> +> +> +> Paolo +> +> +From other projects I know that there is a certain overlap between the +> +results from Coverity Scan an LGTM, but it is good to have both +> +analyzers, and the results from LGTM are typically quite reliable. +> +> +Even if we know that there is no multiplication overflow, the code could +> +be modified. Either the assigned value should use the same data type as +> +the factors (possible when there is never an overflow, avoids a size +> +extension), or the multiplication could use the larger data type by +> +adding a type cast to one of the factors (then an overflow cannot +> +happen, static code analysers and human reviewers have an easier job, +> +but the multiplication costs more time). +But if you look at the code we're talking about, you see that it's +complaining about things where being more explicit would make things +less readable. + +For example, if complains about the multiplication in this line: + + s->file_size += n * s->header.cluster_size; + +We know that n * s->header.cluster_size fits in 32 bits, but +s->file_size is 64 bits (and has to be 64 bits). Do you really think we +should introduce another uint32_t variable to store the intermediate +result? And if we cast n to uint64_t, not only might the multiplication +cost more time, but also human readers would wonder why the result could +become larger than 32 bits. So a cast would be misleading. + + +It also complains about this line: + + ret = bdrv_truncate(bs->file, (3 + l1_clusters) * s->cluster_size, + PREALLOC_MODE_OFF, &local_err); + +Here, we don't even assign the result to a 64 bit variable, but just +pass it to a function which takes a 64 bit parameter. Again, I don't +think introducing additional variables for the intermediate result or +adding casts would be an improvement of the situation. + + +So I don't think this is a good enough tool to base our code on what it +does and doesn't understand. It would have too much of a negative impact +on our code. We'd rather need a way to mark false positives as such and +move on without changing the code in such cases. + +Kevin + +On Sat, 13 Jul 2019 at 18:46, Stefan Weil <address@hidden> wrote: +> +LGTM reports 16 errors, 81 warnings and 119 recommendations: +> +https://lgtm.com/projects/g/qemu/qemu/alerts/?mode=list +. +I had a look at some of these before, but mostly I came +to the conclusion that it wasn't worth trying to put the +effort into keeping up with the site because they didn't +seem to provide any useful way to mark things as false +positives. Coverity has its flaws but at least you can do +that kind of thing in its UI (it runs at about a 33% fp +rate, I think.) "Analyzer thinks this multiply can overflow +but in fact it's not possible" is quite a common false +positive cause... + +Anyway, if you want to fish out specific issues, analyse +whether they're false positive or real, and report them +to the mailing list as followups to the patches which +introduced the issue, that's probably the best way for +us to make use of this analyzer. (That is essentially +what I do for coverity.) + +thanks +-- PMM + +Am 14.07.2019 um 19:30 schrieb Peter Maydell: +[...] +> +"Analyzer thinks this multiply can overflow +> +but in fact it's not possible" is quite a common false +> +positive cause... +The analysers don't complain because a multiply can overflow. + +They complain because the code indicates that a larger result is +expected, for example uint64_t = uint32_t * uint32_t. They would not +complain for the same multiplication if it were assigned to a uint32_t. + +So there is a simple solution to write the code in a way which avoids +false positives... + +Stefan + +Stefan Weil <address@hidden> writes: + +> +Am 14.07.2019 um 19:30 schrieb Peter Maydell: +> +[...] +> +> "Analyzer thinks this multiply can overflow +> +> but in fact it's not possible" is quite a common false +> +> positive cause... +> +> +> +The analysers don't complain because a multiply can overflow. +> +> +They complain because the code indicates that a larger result is +> +expected, for example uint64_t = uint32_t * uint32_t. They would not +> +complain for the same multiplication if it were assigned to a uint32_t. +I agree this is an anti-pattern. + +> +So there is a simple solution to write the code in a way which avoids +> +false positives... +You wrote elsewhere in this thread: + + Either the assigned value should use the same data type as the + factors (possible when there is never an overflow, avoids a size + extension), or the multiplication could use the larger data type by + adding a type cast to one of the factors (then an overflow cannot + happen, static code analysers and human reviewers have an easier + job, but the multiplication costs more time). + +Makes sense to me. + +On 7/14/19 5:30 PM, Peter Maydell wrote: +> +I had a look at some of these before, but mostly I came +> +to the conclusion that it wasn't worth trying to put the +> +effort into keeping up with the site because they didn't +> +seem to provide any useful way to mark things as false +> +positives. Coverity has its flaws but at least you can do +> +that kind of thing in its UI (it runs at about a 33% fp +> +rate, I think.) +Yes, LGTM wants you to modify the source code with + + /* lgtm [cpp/some-warning-code] */ + +and on the same line as the reported problem. Which is mildly annoying in that +you're definitely committing to LGTM in the long term. Also for any +non-trivial bit of code, it will almost certainly run over 80 columns. + + +r~ + diff --git a/results/classifier/016/none/42613410 b/results/classifier/016/none/42613410 new file mode 100644 index 00000000..387e80bd --- /dev/null +++ b/results/classifier/016/none/42613410 @@ -0,0 +1,176 @@ +network: 0.116 +x86: 0.043 +TCG: 0.038 +operating system: 0.031 +files: 0.031 +register: 0.030 +socket: 0.029 +virtual: 0.026 +i386: 0.021 +ppc: 0.020 +PID: 0.020 +VMM: 0.020 +hypervisor: 0.018 +arm: 0.018 +device: 0.017 +risc-v: 0.016 +alpha: 0.016 +boot: 0.013 +vnc: 0.013 +semantic: 0.012 +debug: 0.010 +KVM: 0.006 +kernel: 0.005 +user-level: 0.005 +performance: 0.004 +peripherals: 0.003 +architecture: 0.003 +permissions: 0.002 +graphic: 0.002 +assembly: 0.001 +mistranslation: 0.001 + +[Qemu-devel] [PATCH, Bug 1612908] scripts: Add TCP endpoints for qom-* scripts + +From: Carl Allendorph <address@hidden> + +I've created a patch for bug #1612908. The current docs for the scripts +in the "scripts/qmp/" directory suggest that both unix sockets and +tcp endpoints can be used. The TCP endpoints don't work for most of the +scripts, with notable exception of 'qmp-shell'. This patch attempts to +refactor the process of distinguishing between unix path endpoints and +tcp endpoints to work for all of these scripts. + +Carl Allendorph (1): + scripts: Add ability for qom-* python scripts to target tcp endpoints + + scripts/qmp/qmp-shell | 22 ++-------------------- + scripts/qmp/qmp.py | 23 ++++++++++++++++++++--- + 2 files changed, 22 insertions(+), 23 deletions(-) + +-- +2.7.4 + +From: Carl Allendorph <address@hidden> + +The current code for QEMUMonitorProtocol accepts both a unix socket +endpoint as a string and a tcp endpoint as a tuple. Most of the scripts +that use this class don't massage the command line argument to generate +a tuple. This patch refactors qmp-shell slightly to reuse the existing +parsing of the "host:port" string for all the qom-* scripts. + +Signed-off-by: Carl Allendorph <address@hidden> +--- + scripts/qmp/qmp-shell | 22 ++-------------------- + scripts/qmp/qmp.py | 23 ++++++++++++++++++++--- + 2 files changed, 22 insertions(+), 23 deletions(-) + +diff --git a/scripts/qmp/qmp-shell b/scripts/qmp/qmp-shell +index 0373b24..8a2a437 100755 +--- a/scripts/qmp/qmp-shell ++++ b/scripts/qmp/qmp-shell +@@ -83,9 +83,6 @@ class QMPCompleter(list): + class QMPShellError(Exception): + pass + +-class QMPShellBadPort(QMPShellError): +- pass +- + class FuzzyJSON(ast.NodeTransformer): + '''This extension of ast.NodeTransformer filters literal "true/false/null" + values in an AST and replaces them by proper "True/False/None" values that +@@ -103,28 +100,13 @@ class FuzzyJSON(ast.NodeTransformer): + # _execute_cmd()). Let's design a better one. + class QMPShell(qmp.QEMUMonitorProtocol): + def __init__(self, address, pretty=False): +- qmp.QEMUMonitorProtocol.__init__(self, self.__get_address(address)) ++ qmp.QEMUMonitorProtocol.__init__(self, address) + self._greeting = None + self._completer = None + self._pretty = pretty + self._transmode = False + self._actions = list() + +- def __get_address(self, arg): +- """ +- Figure out if the argument is in the port:host form, if it's not it's +- probably a file path. +- """ +- addr = arg.split(':') +- if len(addr) == 2: +- try: +- port = int(addr[1]) +- except ValueError: +- raise QMPShellBadPort +- return ( addr[0], port ) +- # socket path +- return arg +- + def _fill_completion(self): + for cmd in self.cmd('query-commands')['return']: + self._completer.append(cmd['name']) +@@ -400,7 +382,7 @@ def main(): + + if qemu is None: + fail_cmdline() +- except QMPShellBadPort: ++ except qmp.QMPShellBadPort: + die('bad port number in command-line') + + try: +diff --git a/scripts/qmp/qmp.py b/scripts/qmp/qmp.py +index 62d3651..261ece8 100644 +--- a/scripts/qmp/qmp.py ++++ b/scripts/qmp/qmp.py +@@ -25,21 +25,23 @@ class QMPCapabilitiesError(QMPError): + class QMPTimeoutError(QMPError): + pass + ++class QMPShellBadPort(QMPError): ++ pass ++ + class QEMUMonitorProtocol: + def __init__(self, address, server=False, debug=False): + """ + Create a QEMUMonitorProtocol class. + + @param address: QEMU address, can be either a unix socket path (string) +- or a tuple in the form ( address, port ) for a TCP +- connection ++ or a TCP endpoint (string in the format "host:port") + @param server: server mode listens on the socket (bool) + @raise socket.error on socket connection errors + @note No connection is established, this is done by the connect() or + accept() methods + """ + self.__events = [] +- self.__address = address ++ self.__address = self.__get_address(address) + self._debug = debug + self.__sock = self.__get_sock() + if server: +@@ -47,6 +49,21 @@ class QEMUMonitorProtocol: + self.__sock.bind(self.__address) + self.__sock.listen(1) + ++ def __get_address(self, arg): ++ """ ++ Figure out if the argument is in the port:host form, if it's not it's ++ probably a file path. ++ """ ++ addr = arg.split(':') ++ if len(addr) == 2: ++ try: ++ port = int(addr[1]) ++ except ValueError: ++ raise QMPShellBadPort ++ return ( addr[0], port ) ++ # socket path ++ return arg ++ + def __get_sock(self): + if isinstance(self.__address, tuple): + family = socket.AF_INET +-- +2.7.4 + diff --git a/results/classifier/016/none/42974450 b/results/classifier/016/none/42974450 new file mode 100644 index 00000000..9ab3582a --- /dev/null +++ b/results/classifier/016/none/42974450 @@ -0,0 +1,456 @@ +operating system: 0.713 +kernel: 0.463 +debug: 0.442 +hypervisor: 0.390 +x86: 0.334 +virtual: 0.259 +files: 0.192 +TCG: 0.182 +register: 0.171 +device: 0.116 +KVM: 0.071 +i386: 0.064 +VMM: 0.054 +PID: 0.052 +ppc: 0.049 +boot: 0.047 +assembly: 0.037 +architecture: 0.035 +socket: 0.033 +network: 0.028 +user-level: 0.023 +risc-v: 0.023 +arm: 0.022 +semantic: 0.017 +vnc: 0.014 +alpha: 0.007 +peripherals: 0.007 +performance: 0.005 +permissions: 0.004 +graphic: 0.002 +mistranslation: 0.001 + +[Bug Report] Possible Missing Endianness Conversion + +The virtio packed virtqueue support patch[1] suggests converting +endianness by lines: + +virtio_tswap16s(vdev, &e->off_wrap); +virtio_tswap16s(vdev, &e->flags); + +Though both of these conversion statements aren't present in the +latest qemu code here[2] + +Is this intentional? + +[1]: +https://mail.gnu.org/archive/html/qemu-block/2019-10/msg01492.html +[2]: +https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L314 + +CCing Jason. + +On Mon, Jun 24, 2024 at 4:30â¯PM Xoykie <xoykie@gmail.com> wrote: +> +> +The virtio packed virtqueue support patch[1] suggests converting +> +endianness by lines: +> +> +virtio_tswap16s(vdev, &e->off_wrap); +> +virtio_tswap16s(vdev, &e->flags); +> +> +Though both of these conversion statements aren't present in the +> +latest qemu code here[2] +> +> +Is this intentional? +Good catch! + +It looks like it was removed (maybe by mistake) by commit +d152cdd6f6 ("virtio: use virtio accessor to access packed event") + +Jason can you confirm that? + +Thanks, +Stefano + +> +> +[1]: +https://mail.gnu.org/archive/html/qemu-block/2019-10/msg01492.html +> +[2]: +https://elixir.bootlin.com/qemu/latest/source/hw/virtio/virtio.c#L314 +> + +On Mon, 24 Jun 2024 at 16:11, Stefano Garzarella <sgarzare@redhat.com> wrote: +> +> +CCing Jason. +> +> +On Mon, Jun 24, 2024 at 4:30â¯PM Xoykie <xoykie@gmail.com> wrote: +> +> +> +> The virtio packed virtqueue support patch[1] suggests converting +> +> endianness by lines: +> +> +> +> virtio_tswap16s(vdev, &e->off_wrap); +> +> virtio_tswap16s(vdev, &e->flags); +> +> +> +> Though both of these conversion statements aren't present in the +> +> latest qemu code here[2] +> +> +> +> Is this intentional? +> +> +Good catch! +> +> +It looks like it was removed (maybe by mistake) by commit +> +d152cdd6f6 ("virtio: use virtio accessor to access packed event") +That commit changes from: + +- address_space_read_cached(cache, off_off, &e->off_wrap, +- sizeof(e->off_wrap)); +- virtio_tswap16s(vdev, &e->off_wrap); + +which does a byte read of 2 bytes and then swaps the bytes +depending on the host endianness and the value of +virtio_access_is_big_endian() + +to this: + ++ e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); + +virtio_lduw_phys_cached() is a small function which calls +either lduw_be_phys_cached() or lduw_le_phys_cached() +depending on the value of virtio_access_is_big_endian(). +(And lduw_be_phys_cached() and lduw_le_phys_cached() do +the right thing for the host-endianness to do a "load +a specifically big or little endian 16-bit value".) + +Which is to say that because we use a load/store function that's +explicit about the size of the data type it is accessing, the +function itself can handle doing the load as big or little +endian, rather than the calling code having to do a manual swap after +it has done a load-as-bag-of-bytes. This is generally preferable +as it's less error-prone. + +(Explicit swap-after-loading still has a place where the +code is doing a load of a whole structure out of the +guest and then swapping each struct field after the fact, +because it means we can do a single load-from-guest-memory +rather than a whole sequence of calls all the way down +through the memory subsystem.) + +thanks +-- PMM + +On Mon, Jun 24, 2024 at 04:19:52PM GMT, Peter Maydell wrote: +On Mon, 24 Jun 2024 at 16:11, Stefano Garzarella <sgarzare@redhat.com> wrote: +CCing Jason. + +On Mon, Jun 24, 2024 at 4:30â¯PM Xoykie <xoykie@gmail.com> wrote: +> +> The virtio packed virtqueue support patch[1] suggests converting +> endianness by lines: +> +> virtio_tswap16s(vdev, &e->off_wrap); +> virtio_tswap16s(vdev, &e->flags); +> +> Though both of these conversion statements aren't present in the +> latest qemu code here[2] +> +> Is this intentional? + +Good catch! + +It looks like it was removed (maybe by mistake) by commit +d152cdd6f6 ("virtio: use virtio accessor to access packed event") +That commit changes from: + +- address_space_read_cached(cache, off_off, &e->off_wrap, +- sizeof(e->off_wrap)); +- virtio_tswap16s(vdev, &e->off_wrap); + +which does a byte read of 2 bytes and then swaps the bytes +depending on the host endianness and the value of +virtio_access_is_big_endian() + +to this: + ++ e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); + +virtio_lduw_phys_cached() is a small function which calls +either lduw_be_phys_cached() or lduw_le_phys_cached() +depending on the value of virtio_access_is_big_endian(). +(And lduw_be_phys_cached() and lduw_le_phys_cached() do +the right thing for the host-endianness to do a "load +a specifically big or little endian 16-bit value".) + +Which is to say that because we use a load/store function that's +explicit about the size of the data type it is accessing, the +function itself can handle doing the load as big or little +endian, rather than the calling code having to do a manual swap after +it has done a load-as-bag-of-bytes. This is generally preferable +as it's less error-prone. +Thanks for the details! + +So, should we also remove `virtio_tswap16s(vdev, &e->flags);` ? + +I mean: +diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c +index 893a072c9d..2e5e67bdb9 100644 +--- a/hw/virtio/virtio.c ++++ b/hw/virtio/virtio.c +@@ -323,7 +323,6 @@ static void vring_packed_event_read(VirtIODevice *vdev, + /* Make sure flags is seen before off_wrap */ + smp_rmb(); + e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); +- virtio_tswap16s(vdev, &e->flags); + } + + static void vring_packed_off_wrap_write(VirtIODevice *vdev, + +Thanks, +Stefano +(Explicit swap-after-loading still has a place where the +code is doing a load of a whole structure out of the +guest and then swapping each struct field after the fact, +because it means we can do a single load-from-guest-memory +rather than a whole sequence of calls all the way down +through the memory subsystem.) + +thanks +-- PMM + +On Tue, 25 Jun 2024 at 08:18, Stefano Garzarella <sgarzare@redhat.com> wrote: +> +> +On Mon, Jun 24, 2024 at 04:19:52PM GMT, Peter Maydell wrote: +> +>On Mon, 24 Jun 2024 at 16:11, Stefano Garzarella <sgarzare@redhat.com> wrote: +> +>> +> +>> CCing Jason. +> +>> +> +>> On Mon, Jun 24, 2024 at 4:30â¯PM Xoykie <xoykie@gmail.com> wrote: +> +>> > +> +>> > The virtio packed virtqueue support patch[1] suggests converting +> +>> > endianness by lines: +> +>> > +> +>> > virtio_tswap16s(vdev, &e->off_wrap); +> +>> > virtio_tswap16s(vdev, &e->flags); +> +>> > +> +>> > Though both of these conversion statements aren't present in the +> +>> > latest qemu code here[2] +> +>> > +> +>> > Is this intentional? +> +>> +> +>> Good catch! +> +>> +> +>> It looks like it was removed (maybe by mistake) by commit +> +>> d152cdd6f6 ("virtio: use virtio accessor to access packed event") +> +> +> +>That commit changes from: +> +> +> +>- address_space_read_cached(cache, off_off, &e->off_wrap, +> +>- sizeof(e->off_wrap)); +> +>- virtio_tswap16s(vdev, &e->off_wrap); +> +> +> +>which does a byte read of 2 bytes and then swaps the bytes +> +>depending on the host endianness and the value of +> +>virtio_access_is_big_endian() +> +> +> +>to this: +> +> +> +>+ e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); +> +> +> +>virtio_lduw_phys_cached() is a small function which calls +> +>either lduw_be_phys_cached() or lduw_le_phys_cached() +> +>depending on the value of virtio_access_is_big_endian(). +> +>(And lduw_be_phys_cached() and lduw_le_phys_cached() do +> +>the right thing for the host-endianness to do a "load +> +>a specifically big or little endian 16-bit value".) +> +> +> +>Which is to say that because we use a load/store function that's +> +>explicit about the size of the data type it is accessing, the +> +>function itself can handle doing the load as big or little +> +>endian, rather than the calling code having to do a manual swap after +> +>it has done a load-as-bag-of-bytes. This is generally preferable +> +>as it's less error-prone. +> +> +Thanks for the details! +> +> +So, should we also remove `virtio_tswap16s(vdev, &e->flags);` ? +> +> +I mean: +> +diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c +> +index 893a072c9d..2e5e67bdb9 100644 +> +--- a/hw/virtio/virtio.c +> ++++ b/hw/virtio/virtio.c +> +@@ -323,7 +323,6 @@ static void vring_packed_event_read(VirtIODevice *vdev, +> +/* Make sure flags is seen before off_wrap */ +> +smp_rmb(); +> +e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); +> +- virtio_tswap16s(vdev, &e->flags); +> +} +That definitely looks like it's probably not correct... + +-- PMM + +On Fri, Jun 28, 2024 at 03:53:09PM GMT, Peter Maydell wrote: +On Tue, 25 Jun 2024 at 08:18, Stefano Garzarella <sgarzare@redhat.com> wrote: +On Mon, Jun 24, 2024 at 04:19:52PM GMT, Peter Maydell wrote: +>On Mon, 24 Jun 2024 at 16:11, Stefano Garzarella <sgarzare@redhat.com> wrote: +>> +>> CCing Jason. +>> +>> On Mon, Jun 24, 2024 at 4:30â¯PM Xoykie <xoykie@gmail.com> wrote: +>> > +>> > The virtio packed virtqueue support patch[1] suggests converting +>> > endianness by lines: +>> > +>> > virtio_tswap16s(vdev, &e->off_wrap); +>> > virtio_tswap16s(vdev, &e->flags); +>> > +>> > Though both of these conversion statements aren't present in the +>> > latest qemu code here[2] +>> > +>> > Is this intentional? +>> +>> Good catch! +>> +>> It looks like it was removed (maybe by mistake) by commit +>> d152cdd6f6 ("virtio: use virtio accessor to access packed event") +> +>That commit changes from: +> +>- address_space_read_cached(cache, off_off, &e->off_wrap, +>- sizeof(e->off_wrap)); +>- virtio_tswap16s(vdev, &e->off_wrap); +> +>which does a byte read of 2 bytes and then swaps the bytes +>depending on the host endianness and the value of +>virtio_access_is_big_endian() +> +>to this: +> +>+ e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); +> +>virtio_lduw_phys_cached() is a small function which calls +>either lduw_be_phys_cached() or lduw_le_phys_cached() +>depending on the value of virtio_access_is_big_endian(). +>(And lduw_be_phys_cached() and lduw_le_phys_cached() do +>the right thing for the host-endianness to do a "load +>a specifically big or little endian 16-bit value".) +> +>Which is to say that because we use a load/store function that's +>explicit about the size of the data type it is accessing, the +>function itself can handle doing the load as big or little +>endian, rather than the calling code having to do a manual swap after +>it has done a load-as-bag-of-bytes. This is generally preferable +>as it's less error-prone. + +Thanks for the details! + +So, should we also remove `virtio_tswap16s(vdev, &e->flags);` ? + +I mean: +diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c +index 893a072c9d..2e5e67bdb9 100644 +--- a/hw/virtio/virtio.c ++++ b/hw/virtio/virtio.c +@@ -323,7 +323,6 @@ static void vring_packed_event_read(VirtIODevice *vdev, + /* Make sure flags is seen before off_wrap */ + smp_rmb(); + e->off_wrap = virtio_lduw_phys_cached(vdev, cache, off_off); +- virtio_tswap16s(vdev, &e->flags); + } +That definitely looks like it's probably not correct... +Yeah, I just sent that patch: +20240701075208.19634-1-sgarzare@redhat.com +">https://lore.kernel.org/qemu-devel/ +20240701075208.19634-1-sgarzare@redhat.com +We can continue the discussion there. + +Thanks, +Stefano + diff --git a/results/classifier/016/none/48245039 b/results/classifier/016/none/48245039 new file mode 100644 index 00000000..913c2333 --- /dev/null +++ b/results/classifier/016/none/48245039 @@ -0,0 +1,557 @@ +user-level: 0.787 +performance: 0.642 +operating system: 0.416 +risc-v: 0.375 +debug: 0.341 +x86: 0.185 +TCG: 0.172 +ppc: 0.166 +device: 0.139 +arm: 0.119 +VMM: 0.111 +boot: 0.111 +files: 0.104 +PID: 0.099 +vnc: 0.095 +register: 0.088 +socket: 0.085 +network: 0.081 +i386: 0.071 +alpha: 0.059 +hypervisor: 0.056 +virtual: 0.055 +peripherals: 0.054 +kernel: 0.026 +semantic: 0.025 +architecture: 0.011 +KVM: 0.010 +mistranslation: 0.005 +assembly: 0.004 +graphic: 0.004 +permissions: 0.002 + +[Qemu-devel] [BUG] gcov support appears to be broken + +Hello, according to out docs, here is the procedure that should produce +coverage report for execution of the complete "make check": + +#./configure --enable-gcov +#make +#make check +#make coverage-report + +It seems that first three commands execute as expected. (For example, there are +plenty of files generated by "make check" that would've not been generated if +"enable-gcov" hadn't been chosen.) However, the last command complains about +some missing files related to FP support. If those files are added (for +example, artificially, using "touch <missing-file"), that it starts complaining +about missing some decodetree-generated files. Other kinds of files are +involved too. + +It would be nice to have coverage support working. Please somebody take a look, +or explain if I make a mistake or misunderstood our gcov support. + +Yours, +Aleksandar + +On Mon, 5 Aug 2019 at 11:39, Aleksandar Markovic <address@hidden> wrote: +> +> +Hello, according to out docs, here is the procedure that should produce +> +coverage report for execution of the complete "make check": +> +> +#./configure --enable-gcov +> +#make +> +#make check +> +#make coverage-report +> +> +It seems that first three commands execute as expected. (For example, there +> +are plenty of files generated by "make check" that would've not been +> +generated if "enable-gcov" hadn't been chosen.) However, the last command +> +complains about some missing files related to FP support. If those files are +> +added (for example, artificially, using "touch <missing-file"), that it +> +starts complaining about missing some decodetree-generated files. Other kinds +> +of files are involved too. +> +> +It would be nice to have coverage support working. Please somebody take a +> +look, or explain if I make a mistake or misunderstood our gcov support. +Cc'ing Alex who's probably the closest we have to a gcov expert. + +(make/make check of a --enable-gcov build is in the set of things our +Travis CI setup runs, so we do defend that part against regressions.) + +thanks +-- PMM + +Peter Maydell <address@hidden> writes: + +> +On Mon, 5 Aug 2019 at 11:39, Aleksandar Markovic <address@hidden> wrote: +> +> +> +> Hello, according to out docs, here is the procedure that should produce +> +> coverage report for execution of the complete "make check": +> +> +> +> #./configure --enable-gcov +> +> #make +> +> #make check +> +> #make coverage-report +> +> +> +> It seems that first three commands execute as expected. (For example, +> +> there are plenty of files generated by "make check" that would've not +> +> been generated if "enable-gcov" hadn't been chosen.) However, the +> +> last command complains about some missing files related to FP +> +> support. If those files are added (for example, artificially, using +> +> "touch <missing-file"), that it starts complaining about missing some +> +> decodetree-generated files. Other kinds of files are involved too. +The gcov tool is fairly noisy about missing files but that just +indicates the tests haven't exercised those code paths. "make check" +especially doesn't touch much of the TCG code and a chunk of floating +point. + +> +> +> +> It would be nice to have coverage support working. Please somebody +> +> take a look, or explain if I make a mistake or misunderstood our gcov +> +> support. +So your failure mode is no report is generated at all? It's working for +me here. + +> +> +Cc'ing Alex who's probably the closest we have to a gcov expert. +> +> +(make/make check of a --enable-gcov build is in the set of things our +> +Travis CI setup runs, so we do defend that part against regressions.) +We defend the build but I have just checked and it seems our +check_coverage script is currently failing: +https://travis-ci.org/stsquad/qemu/jobs/567809808#L10328 +But as it's an after_success script it doesn't fail the build. + +> +> +thanks +> +-- PMM +-- +Alex Bennée + +> +> #./configure --enable-gcov +> +> #make +> +> #make check +> +> #make coverage-report +> +> +> +> It seems that first three commands execute as expected. (For example, +> +> there are plenty of files generated by "make check" that would've not +> +> been generated if "enable-gcov" hadn't been chosen.) However, the +> +> last command complains about some missing files related to FP +> +So your failure mode is no report is generated at all? It's working for +> +me here. +Alex, no report is generated for my test setups - in fact, "make +coverage-report" even says that it explicitly deletes what appears to be the +main coverage report html file). + +This is the terminal output of an unsuccessful executions of "make +coverage-report" for recent ToT: + +~/Build/qemu-TOT-TEST$ make coverage-report +make[1]: Entering directory '/home/user/Build/qemu-TOT-TEST/slirp' +make[1]: Nothing to be done for 'all'. +make[1]: Leaving directory '/home/user/Build/qemu-TOT-TEST/slirp' + CHK version_gen.h + GEN coverage-report.html +Traceback (most recent call last): + File "/usr/bin/gcovr", line 1970, in <module> + print_html_report(covdata, options.html_details) + File "/usr/bin/gcovr", line 1473, in print_html_report + INPUT = open(data['FILENAME'], 'r') +IOError: [Errno 2] No such file or directory: 'wrap.inc.c' +Makefile:1048: recipe for target +'/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html' failed +make: *** +[/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html] Error 1 +make: *** Deleting file +'/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html' + +This instance is executed in QEMU 3.0 source tree: (so, it looks the problem +existed for quite some time) + +~/Build/qemu-3.0$ make coverage-report + CHK version_gen.h + GEN coverage-report.html +Traceback (most recent call last): + File "/usr/bin/gcovr", line 1970, in <module> + print_html_report(covdata, options.html_details) + File "/usr/bin/gcovr", line 1473, in print_html_report + INPUT = open(data['FILENAME'], 'r') +IOError: [Errno 2] No such file or directory: +'/home/user/Build/qemu-3.0/target/openrisc/decode.inc.c' +Makefile:992: recipe for target +'/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html' failed +make: *** [/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html] +Error 1 +make: *** Deleting file +'/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html' + +Fond regards, +Aleksandar + + +> +Alex Bennée + +> +> #./configure --enable-gcov +> +> #make +> +> #make check +> +> #make coverage-report +> +> +> +> It seems that first three commands execute as expected. (For example, +> +> there are plenty of files generated by "make check" that would've not +> +> been generated if "enable-gcov" hadn't been chosen.) However, the +> +> last command complains about some missing files related to FP +> +So your failure mode is no report is generated at all? It's working for +> +me here. +Another piece of info: + +~/Build/qemu-TOT-TEST$ gcov --version +gcov (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010 +Copyright (C) 2015 Free Software Foundation, Inc. +This is free software; see the source for copying conditions. +There is NO warranty; not even for MERCHANTABILITY or +FITNESS FOR A PARTICULAR PURPOSE. + +:~/Build/qemu-TOT-TEST$ gcc --version +gcc (Ubuntu 7.2.0-1ubuntu1~16.04) 7.2.0 +Copyright (C) 2017 Free Software Foundation, Inc. +This is free software; see the source for copying conditions. There is NO +warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + + + + +Alex, no report is generated for my test setups - in fact, "make +coverage-report" even says that it explicitly deletes what appears to be the +main coverage report html file). + +This is the terminal output of an unsuccessful executions of "make +coverage-report" for recent ToT: + +~/Build/qemu-TOT-TEST$ make coverage-report +make[1]: Entering directory '/home/user/Build/qemu-TOT-TEST/slirp' +make[1]: Nothing to be done for 'all'. +make[1]: Leaving directory '/home/user/Build/qemu-TOT-TEST/slirp' + CHK version_gen.h + GEN coverage-report.html +Traceback (most recent call last): + File "/usr/bin/gcovr", line 1970, in <module> + print_html_report(covdata, options.html_details) + File "/usr/bin/gcovr", line 1473, in print_html_report + INPUT = open(data['FILENAME'], 'r') +IOError: [Errno 2] No such file or directory: 'wrap.inc.c' +Makefile:1048: recipe for target +'/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html' failed +make: *** +[/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html] Error 1 +make: *** Deleting file +'/home/user/Build/qemu-TOT-TEST/reports/coverage/coverage-report.html' + +This instance is executed in QEMU 3.0 source tree: (so, it looks the problem +existed for quite some time) + +~/Build/qemu-3.0$ make coverage-report + CHK version_gen.h + GEN coverage-report.html +Traceback (most recent call last): + File "/usr/bin/gcovr", line 1970, in <module> + print_html_report(covdata, options.html_details) + File "/usr/bin/gcovr", line 1473, in print_html_report + INPUT = open(data['FILENAME'], 'r') +IOError: [Errno 2] No such file or directory: +'/home/user/Build/qemu-3.0/target/openrisc/decode.inc.c' +Makefile:992: recipe for target +'/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html' failed +make: *** [/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html] +Error 1 +make: *** Deleting file +'/home/user/Build/qemu-3.0/reports/coverage/coverage-report.html' + +Fond regards, +Aleksandar + + +> +Alex Bennée + +> +> #./configure --enable-gcov +> +> #make +> +> #make check +> +> #make coverage-report +> +> +> +> It seems that first three commands execute as expected. (For example, +> +> there are plenty of files generated by "make check" that would've not +> +> been generated if "enable-gcov" hadn't been chosen.) However, the +> +> last command complains about some missing files related to FP +> +So your failure mode is no report is generated at all? It's working for +> +me here. +Alex, here is the thing: + +Seeing that my gcovr is relatively old (2014) 3.2 version, I upgraded it from +git repo to the most recent 4.1 (actually, to a dev version, from the very tip +of the tree), and "make coverage-report" started generating coverage reports. +It did emit some error messages (totally different than previous), but still it +did not stop like it used to do with gcovr 3.2. + +Perhaps you would want to add some gcov/gcovr minimal version info in our docs. +(or at least a statement "this was tested with such and such gcc, gcov and +gcovr", etc.?) + +Coverage report looked fine at first glance, but it a kind of disappointed me +when I digged deeper into its content - for example, it shows very low coverage +for our FP code (softfloat), while, in fact, we know that "make check" contains +detailed tests on FP functionalities. But this is most likely a separate +problem of a very different nature, perhaps the issue of separate git repo for +FP tests (testfloat) that our FP tests use as a mid-layer. + +I'll try how everything works with my test examples, and will let you know. + +Your help is greatly appreciated, +Aleksandar + +Fond regards, +Aleksandar + + +> +Alex Bennée + +Aleksandar Markovic <address@hidden> writes: + +> +>> #./configure --enable-gcov +> +>> #make +> +>> #make check +> +>> #make coverage-report +> +>> +> +>> It seems that first three commands execute as expected. (For example, +> +>> there are plenty of files generated by "make check" that would've not +> +>> been generated if "enable-gcov" hadn't been chosen.) However, the +> +>> last command complains about some missing files related to FP +> +> +> So your failure mode is no report is generated at all? It's working for +> +> me here. +> +> +Alex, here is the thing: +> +> +Seeing that my gcovr is relatively old (2014) 3.2 version, I upgraded it from +> +git repo to the most recent 4.1 (actually, to a dev version, from the very +> +tip of the tree), and "make coverage-report" started generating coverage +> +reports. It did emit some error messages (totally different than previous), +> +but still it did not stop like it used to do with gcovr 3.2. +> +> +Perhaps you would want to add some gcov/gcovr minimal version info in our +> +docs. (or at least a statement "this was tested with such and such gcc, gcov +> +and gcovr", etc.?) +> +> +Coverage report looked fine at first glance, but it a kind of +> +disappointed me when I digged deeper into its content - for example, +> +it shows very low coverage for our FP code (softfloat), while, in +> +fact, we know that "make check" contains detailed tests on FP +> +functionalities. But this is most likely a separate problem of a very +> +different nature, perhaps the issue of separate git repo for FP tests +> +(testfloat) that our FP tests use as a mid-layer. +I get: + +68.6 % 2593 / 3782 62.2 % 1690 / 2718 + +Which is not bad considering we don't exercise the 80 and 128 bit +softfloat code at all (which is not shared by the re-factored 16/32/64 +bit code). + +> +> +I'll try how everything works with my test examples, and will let you know. +> +> +Your help is greatly appreciated, +> +Aleksandar +> +> +Fond regards, +> +Aleksandar +> +> +> +> Alex Bennée +-- +Alex Bennée + +> +> it shows very low coverage for our FP code (softfloat), while, in +> +> fact, we know that "make check" contains detailed tests on FP +> +> functionalities. But this is most likely a separate problem of a very +> +> different nature, perhaps the issue of separate git repo for FP tests +> +> (testfloat) that our FP tests use as a mid-layer. +> +> +I get: +> +> +68.6 % 2593 / 3782 62.2 % 1690 / 2718 +> +I would expect that kind of result too. + +However, I get: + +File: fpu/softfloat.c Lines: 8 3334 0.2 % +Date: 2019-08-05 19:56:58 Branches: 3 2376 0.1 % + +:( + +OK, I'll try to figure that out, and most likely I could live with it if it is +an isolated problem. + +Thank you for your assistance in this matter, +Aleksandar + +> +Which is not bad considering we don't exercise the 80 and 128 bit +> +softfloat code at all (which is not shared by the re-factored 16/32/64 +> +bit code). +> +> +Alex Bennée + +> +> it shows very low coverage for our FP code (softfloat), while, in +> +> fact, we know that "make check" contains detailed tests on FP +> +> functionalities. But this is most likely a separate problem of a very +> +> different nature, perhaps the issue of separate git repo for FP tests +> +> (testfloat) that our FP tests use as a mid-layer. +> +> +I get: +> +> +68.6 % 2593 / 3782 62.2 % 1690 / 2718 +> +This problem is solved too. (and it is my fault) + +I worked with multiple versions of QEMU, and my previous low-coverage results +were for QEMU 3.0, and for that version the directory tests/fp did not even +exist. :D (<blush>) + +For QEMU ToT, I get now: + +fpu/softfloat.c + 68.8 % 2592 / 3770 62.3 % 1693 / 2718 + +which is identical for all intents and purposes to your result. + +Yours cordially, +Aleksandar + diff --git a/results/classifier/016/none/50773216 b/results/classifier/016/none/50773216 new file mode 100644 index 00000000..5a856c2f --- /dev/null +++ b/results/classifier/016/none/50773216 @@ -0,0 +1,137 @@ +virtual: 0.431 +debug: 0.366 +register: 0.178 +x86: 0.170 +vnc: 0.116 +operating system: 0.087 +hypervisor: 0.084 +files: 0.082 +PID: 0.068 +TCG: 0.058 +i386: 0.053 +network: 0.041 +user-level: 0.040 +performance: 0.039 +kernel: 0.038 +semantic: 0.031 +socket: 0.027 +alpha: 0.027 +ppc: 0.026 +device: 0.018 +boot: 0.017 +permissions: 0.007 +assembly: 0.006 +arm: 0.006 +peripherals: 0.004 +risc-v: 0.004 +VMM: 0.004 +graphic: 0.003 +architecture: 0.003 +KVM: 0.002 +mistranslation: 0.002 + +[Qemu-devel] Can I have someone's feedback on [bug 1809075] Concurrency bug on keyboard events: capslock LED messing up keycode streams causes character misses at guest kernel + +Hi everyone. +Can I please have someone's feedback on this bug? +https://bugs.launchpad.net/qemu/+bug/1809075 +Briefly, guest OS loses characters sent to it via vnc. And I spot the +bug in relation to ps2 driver. +I'm thinking of possible fixes and I might want to use a memory barrier. +But I would really like to have some suggestion from a qemu developer +first. For example, can we brutally drop capslock LED key events in ps2 +queue? +It is actually relevant to openQA, an automated QA tool for openSUSE. +And this bug blocks a few test cases for us. +Thank you in advance! + +Kind regards, +Gao Zhiyuan + +Cc'ing Marc-André & Gerd. + +On 12/19/18 10:31 AM, Gao Zhiyuan wrote: +> +Hi everyone. +> +> +Can I please have someone's feedback on this bug? +> +https://bugs.launchpad.net/qemu/+bug/1809075 +> +Briefly, guest OS loses characters sent to it via vnc. And I spot the +> +bug in relation to ps2 driver. +> +> +I'm thinking of possible fixes and I might want to use a memory barrier. +> +But I would really like to have some suggestion from a qemu developer +> +first. For example, can we brutally drop capslock LED key events in ps2 +> +queue? +> +> +It is actually relevant to openQA, an automated QA tool for openSUSE. +> +And this bug blocks a few test cases for us. +> +> +Thank you in advance! +> +> +Kind regards, +> +Gao Zhiyuan +> + +On Thu, Jan 03, 2019 at 12:05:54PM +0100, Philippe Mathieu-Daudé wrote: +> +Cc'ing Marc-André & Gerd. +> +> +On 12/19/18 10:31 AM, Gao Zhiyuan wrote: +> +> Hi everyone. +> +> +> +> Can I please have someone's feedback on this bug? +> +> +https://bugs.launchpad.net/qemu/+bug/1809075 +> +> Briefly, guest OS loses characters sent to it via vnc. And I spot the +> +> bug in relation to ps2 driver. +> +> +> +> I'm thinking of possible fixes and I might want to use a memory barrier. +> +> But I would really like to have some suggestion from a qemu developer +> +> first. For example, can we brutally drop capslock LED key events in ps2 +> +> queue? +There is no "capslock LED key event". 0xfa is KBD_REPLY_ACK, and the +device queues it in response to guest port writes. Yes, the ack can +race with actual key events. But IMO that isn't a bug in qemu. + +Probably the linux kernel just throws away everything until it got the +ack for the port write, and that way the key event gets lost. On +physical hardware you will not notice because it is next to impossible +to type fast enough to hit the race window. + +So, go fix the kernel. + +Alternatively fix vncdotool to send uppercase letters properly with +shift key pressed. Then qemu wouldn't generate capslock key events +(that happens because qemu thinks guest and host capslock state is out +of sync) and the guests's capslock led update request wouldn't get into +the way. + +cheers, + Gerd + diff --git a/results/classifier/016/none/55753058 b/results/classifier/016/none/55753058 new file mode 100644 index 00000000..7cfee9ee --- /dev/null +++ b/results/classifier/016/none/55753058 @@ -0,0 +1,320 @@ +x86: 0.784 +operating system: 0.778 +kernel: 0.648 +debug: 0.645 +user-level: 0.550 +hypervisor: 0.097 +files: 0.093 +performance: 0.091 +assembly: 0.071 +virtual: 0.065 +PID: 0.040 +TCG: 0.037 +register: 0.035 +ppc: 0.022 +semantic: 0.010 +network: 0.007 +device: 0.006 +boot: 0.005 +architecture: 0.004 +i386: 0.004 +alpha: 0.003 +arm: 0.003 +socket: 0.002 +permissions: 0.002 +risc-v: 0.002 +vnc: 0.002 +graphic: 0.002 +peripherals: 0.001 +VMM: 0.001 +mistranslation: 0.001 +KVM: 0.000 + +[RESEND][BUG FIX HELP] QEMU main thread endlessly hangs in __ppoll() + +Hi Genius, +I am a user of QEMU v4.2.0 and stuck in an interesting bug, which may still +exist in the mainline. +Thanks in advance to heroes who can take a look and share understanding. + +The qemu main thread endlessly hangs in the handle of the qmp statement: +{'execute': 'human-monitor-command', 'arguments':{ 'command-line': +'drive_del replication0' } } +and we have the call trace looks like: +#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1, +timeout=<optimized out>, timeout@entry=0x7ffc56c66db0, +sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44 +#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0, +__nfds=<optimized out>, __fds=<optimized out>) +at /usr/include/x86_64-linux-gnu/bits/poll2.h:77 +#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, +timeout=<optimized out>) at util/qemu-timer.c:348 +#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0, +blocking=blocking@entry=true) at util/aio-posix.c:669 +#4 0x000055561019268d in bdrv_do_drained_begin (poll=true, +ignore_bds_parents=false, parent=0x0, recursive=false, +bs=0x55561138b0a0) at block/io.c:430 +#5 bdrv_do_drained_begin (bs=0x55561138b0a0, recursive=<optimized out>, +parent=0x0, ignore_bds_parents=<optimized out>, +poll=<optimized out>) at block/io.c:396 +#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0, +child=0x7f36dc0ce380, errp=<optimized out>) +at block/quorum.c:1063 +#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120 +"colo-disk0", has_child=<optimized out>, +child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0, +errp=0x7ffc56c66f98) at blockdev.c:4494 +#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized +out>, ret=<optimized out>, errp=0x7ffc56c67018) +at qapi/qapi-commands-block-core.c:1538 +#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010, +allow_oob=<optimized out>, request=<optimized out>, +cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132 +#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized +out>, allow_oob=<optimized out>) +at qapi/qmp-dispatch.c:175 +#11 0x00005556100d4c4d in monitor_qmp_dispatch (mon=0x5556113a6f40, +req=<optimized out>) at monitor/qmp.c:145 +#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized out>) +at monitor/qmp.c:234 +#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164bGrateful0) at +util/async.c:117 +#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117 +#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at +util/aio-posix.c:459 +#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>, +callback=<optimized out>, user_data=<optimized out>) +at util/async.c:260 +#17 0x00007f3c22302fbd in g_main_context_dispatch () from +/lib/x86_64-linux-gnu/libglib-2.0.so.0 +#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219 +#19 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:242 +#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518 +#21 0x000055560ff600fe in main_loop () at vl.c:1814 +#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized out>, +envp=<optimized out>) at vl.c:4503 +We found that we're doing endless check in the line of +block/io.c:bdrv_do_drained_begin(): +BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent)); +and it turns out that the bdrv_drain_poll() always get true from: +- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents) +- AND atomic_read(&bs->in_flight) + +I personally think this is a deadlock issue in the a QEMU block layer +(as we know, we have some #FIXME comments in related codes, such as block +permisson update). +Any comments are welcome and appreciated. + +--- +thx,likexu + +On 2/28/21 9:39 PM, Like Xu wrote: +Hi Genius, +I am a user of QEMU v4.2.0 and stuck in an interesting bug, which may +still exist in the mainline. +Thanks in advance to heroes who can take a look and share understanding. +Do you have a test case that reproduces on 5.2? It'd be nice to know if +it was still a problem in the latest source tree or not. +--js +The qemu main thread endlessly hangs in the handle of the qmp statement: +{'execute': 'human-monitor-command', 'arguments':{ 'command-line': +'drive_del replication0' } } +and we have the call trace looks like: +#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1, +timeout=<optimized out>, timeout@entry=0x7ffc56c66db0, +sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44 +#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0, +__nfds=<optimized out>, __fds=<optimized out>) +at /usr/include/x86_64-linux-gnu/bits/poll2.h:77 +#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, +timeout=<optimized out>) at util/qemu-timer.c:348 +#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0, +blocking=blocking@entry=true) at util/aio-posix.c:669 +#4 0x000055561019268d in bdrv_do_drained_begin (poll=true, +ignore_bds_parents=false, parent=0x0, recursive=false, +bs=0x55561138b0a0) at block/io.c:430 +#5 bdrv_do_drained_begin (bs=0x55561138b0a0, recursive=<optimized out>, +parent=0x0, ignore_bds_parents=<optimized out>, +poll=<optimized out>) at block/io.c:396 +#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0, +child=0x7f36dc0ce380, errp=<optimized out>) +at block/quorum.c:1063 +#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120 +"colo-disk0", has_child=<optimized out>, +child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0, +errp=0x7ffc56c66f98) at blockdev.c:4494 +#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized +out>, ret=<optimized out>, errp=0x7ffc56c67018) +at qapi/qapi-commands-block-core.c:1538 +#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010, +allow_oob=<optimized out>, request=<optimized out>, +cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132 +#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized +out>, allow_oob=<optimized out>) +at qapi/qmp-dispatch.c:175 +#11 0x00005556100d4c4d in monitor_qmp_dispatch (mon=0x5556113a6f40, +req=<optimized out>) at monitor/qmp.c:145 +#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized +out>) at monitor/qmp.c:234 +#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164bGrateful0) at +util/async.c:117 +#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117 +#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at +util/aio-posix.c:459 +#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>, +callback=<optimized out>, user_data=<optimized out>) +at util/async.c:260 +#17 0x00007f3c22302fbd in g_main_context_dispatch () from +/lib/x86_64-linux-gnu/libglib-2.0.so.0 +#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219 +#19 os_host_main_loop_wait (timeout=<optimized out>) at +util/main-loop.c:242 +#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518 +#21 0x000055560ff600fe in main_loop () at vl.c:1814 +#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized +out>, envp=<optimized out>) at vl.c:4503 +We found that we're doing endless check in the line of +block/io.c:bdrv_do_drained_begin(): +    BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent)); +and it turns out that the bdrv_drain_poll() always get true from: +- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents) +- AND atomic_read(&bs->in_flight) + +I personally think this is a deadlock issue in the a QEMU block layer +(as we know, we have some #FIXME comments in related codes, such as +block permisson update). +Any comments are welcome and appreciated. + +--- +thx,likexu + +Hi John, + +Thanks for your comment. + +On 2021/3/5 7:53, John Snow wrote: +On 2/28/21 9:39 PM, Like Xu wrote: +Hi Genius, +I am a user of QEMU v4.2.0 and stuck in an interesting bug, which may +still exist in the mainline. +Thanks in advance to heroes who can take a look and share understanding. +Do you have a test case that reproduces on 5.2? It'd be nice to know if it +was still a problem in the latest source tree or not. +We narrowed down the source of the bug, which basically came from +the following qmp usage: +{'execute': 'human-monitor-command', 'arguments':{ 'command-line': +'drive_del replication0' } } +One of the test cases is the COLO usage (docs/colo-proxy.txt). + +This issue is sporadic,the probability may be 1/15 for a io-heavy guest. + +I believe it's reproducible on 5.2 and the latest tree. +--js +The qemu main thread endlessly hangs in the handle of the qmp statement: +{'execute': 'human-monitor-command', 'arguments':{ 'command-line': +'drive_del replication0' } } +and we have the call trace looks like: +#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1, +timeout=<optimized out>, timeout@entry=0x7ffc56c66db0, +sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44 +#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0, +__nfds=<optimized out>, __fds=<optimized out>) +at /usr/include/x86_64-linux-gnu/bits/poll2.h:77 +#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, +timeout=<optimized out>) at util/qemu-timer.c:348 +#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0, +blocking=blocking@entry=true) at util/aio-posix.c:669 +#4 0x000055561019268d in bdrv_do_drained_begin (poll=true, +ignore_bds_parents=false, parent=0x0, recursive=false, +bs=0x55561138b0a0) at block/io.c:430 +#5 bdrv_do_drained_begin (bs=0x55561138b0a0, recursive=<optimized out>, +parent=0x0, ignore_bds_parents=<optimized out>, +poll=<optimized out>) at block/io.c:396 +#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0, +child=0x7f36dc0ce380, errp=<optimized out>) +at block/quorum.c:1063 +#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120 +"colo-disk0", has_child=<optimized out>, +child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0, +errp=0x7ffc56c66f98) at blockdev.c:4494 +#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized +out>, ret=<optimized out>, errp=0x7ffc56c67018) +at qapi/qapi-commands-block-core.c:1538 +#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010, +allow_oob=<optimized out>, request=<optimized out>, +cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132 +#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized +out>, allow_oob=<optimized out>) +at qapi/qmp-dispatch.c:175 +#11 0x00005556100d4c4d in monitor_qmp_dispatch (mon=0x5556113a6f40, +req=<optimized out>) at monitor/qmp.c:145 +#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized +out>) at monitor/qmp.c:234 +#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164bGrateful0) at +util/async.c:117 +#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117 +#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at +util/aio-posix.c:459 +#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>, +callback=<optimized out>, user_data=<optimized out>) +at util/async.c:260 +#17 0x00007f3c22302fbd in g_main_context_dispatch () from +/lib/x86_64-linux-gnu/libglib-2.0.so.0 +#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219 +#19 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:242 +#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518 +#21 0x000055560ff600fe in main_loop () at vl.c:1814 +#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized +out>, envp=<optimized out>) at vl.c:4503 +We found that we're doing endless check in the line of +block/io.c:bdrv_do_drained_begin(): +     BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent)); +and it turns out that the bdrv_drain_poll() always get true from: +- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents) +- AND atomic_read(&bs->in_flight) + +I personally think this is a deadlock issue in the a QEMU block layer +(as we know, we have some #FIXME comments in related codes, such as block +permisson update). +Any comments are welcome and appreciated. + +--- +thx,likexu + +On 3/4/21 10:08 PM, Like Xu wrote: +Hi John, + +Thanks for your comment. + +On 2021/3/5 7:53, John Snow wrote: +On 2/28/21 9:39 PM, Like Xu wrote: +Hi Genius, +I am a user of QEMU v4.2.0 and stuck in an interesting bug, which may +still exist in the mainline. +Thanks in advance to heroes who can take a look and share understanding. +Do you have a test case that reproduces on 5.2? It'd be nice to know +if it was still a problem in the latest source tree or not. +We narrowed down the source of the bug, which basically came from +the following qmp usage: +{'execute': 'human-monitor-command', 'arguments':{ 'command-line': +'drive_del replication0' } } +One of the test cases is the COLO usage (docs/colo-proxy.txt). + +This issue is sporadic,the probability may be 1/15 for a io-heavy guest. + +I believe it's reproducible on 5.2 and the latest tree. +Can you please test and confirm that this is the case, and then file a +bug report on the LP: +https://launchpad.net/qemu +and include: +- The exact commit you used (current origin/master debug build would be +the most ideal.) +- Which QEMU binary you are using (qemu-system-x86_64?) +- The shortest command line you are aware of that reproduces the problem +- The host OS and kernel version +- An updated call trace +- Any relevant commands issued prior to the one that caused the hang; or +detailed reproduction steps if possible. +Thanks, +--js + diff --git a/results/classifier/016/none/56309929 b/results/classifier/016/none/56309929 new file mode 100644 index 00000000..9eb7151e --- /dev/null +++ b/results/classifier/016/none/56309929 @@ -0,0 +1,207 @@ +kernel: 0.698 +files: 0.249 +operating system: 0.101 +semantic: 0.071 +TCG: 0.056 +debug: 0.041 +virtual: 0.024 +ppc: 0.017 +PID: 0.015 +hypervisor: 0.015 +register: 0.013 +VMM: 0.013 +x86: 0.011 +user-level: 0.007 +device: 0.007 +performance: 0.007 +network: 0.004 +architecture: 0.003 +risc-v: 0.003 +alpha: 0.003 +KVM: 0.003 +permissions: 0.002 +arm: 0.002 +peripherals: 0.002 +vnc: 0.002 +socket: 0.002 +boot: 0.002 +graphic: 0.001 +assembly: 0.001 +mistranslation: 0.001 +i386: 0.001 + +[Qemu-devel] [BUG 2.6] Broken CONFIG_TPM? + +A compilation test with clang -Weverything reported this problem: + +config-host.h:112:20: warning: '$' in identifier +[-Wdollar-in-identifier-extension] + +The line of code looks like this: + +#define CONFIG_TPM $(CONFIG_SOFTMMU) + +This is fine for Makefile code, but won't work as expected in C code. + +Am 28.04.2016 um 22:33 schrieb Stefan Weil: +> +A compilation test with clang -Weverything reported this problem: +> +> +config-host.h:112:20: warning: '$' in identifier +> +[-Wdollar-in-identifier-extension] +> +> +The line of code looks like this: +> +> +#define CONFIG_TPM $(CONFIG_SOFTMMU) +> +> +This is fine for Makefile code, but won't work as expected in C code. +> +A complete 64 bit build with clang -Weverything creates a log file of +1.7 GB. +Here are the uniq warnings sorted by their frequency: + + 1 -Wflexible-array-extensions + 1 -Wgnu-folding-constant + 1 -Wunknown-pragmas + 1 -Wunknown-warning-option + 1 -Wunreachable-code-loop-increment + 2 -Warray-bounds-pointer-arithmetic + 2 -Wdollar-in-identifier-extension + 3 -Woverlength-strings + 3 -Wweak-vtables + 4 -Wgnu-empty-struct + 4 -Wstring-conversion + 6 -Wclass-varargs + 7 -Wc99-extensions + 7 -Wc++-compat + 8 -Wfloat-equal + 11 -Wformat-nonliteral + 16 -Wshift-negative-value + 19 -Wglobal-constructors + 28 -Wc++11-long-long + 29 -Wembedded-directive + 38 -Wvla + 40 -Wcovered-switch-default + 40 -Wmissing-variable-declarations + 49 -Wold-style-cast + 53 -Wgnu-conditional-omitted-operand + 56 -Wformat-pedantic + 61 -Wvariadic-macros + 77 -Wc++11-extensions + 83 -Wgnu-flexible-array-initializer + 83 -Wzero-length-array + 96 -Wgnu-designator + 102 -Wmissing-noreturn + 103 -Wconditional-uninitialized + 107 -Wdisabled-macro-expansion + 115 -Wunreachable-code-return + 134 -Wunreachable-code + 243 -Wunreachable-code-break + 257 -Wfloat-conversion + 280 -Wswitch-enum + 291 -Wpointer-arith + 298 -Wshadow + 378 -Wassign-enum + 395 -Wused-but-marked-unused + 420 -Wreserved-id-macro + 493 -Wdocumentation + 510 -Wshift-sign-overflow + 565 -Wgnu-case-range + 566 -Wgnu-zero-variadic-macro-arguments + 650 -Wbad-function-cast + 705 -Wmissing-field-initializers + 817 -Wgnu-statement-expression + 968 -Wdocumentation-unknown-command + 1021 -Wextra-semi + 1112 -Wgnu-empty-initializer + 1138 -Wcast-qual + 1509 -Wcast-align + 1766 -Wextended-offsetof + 1937 -Wsign-compare + 2130 -Wpacked + 2404 -Wunused-macros + 3081 -Wpadded + 4182 -Wconversion + 5430 -Wlanguage-extension-token + 6655 -Wshorten-64-to-32 + 6995 -Wpedantic + 7354 -Wunused-parameter + 27659 -Wsign-conversion + +Stefan Weil <address@hidden> writes: + +> +A compilation test with clang -Weverything reported this problem: +> +> +config-host.h:112:20: warning: '$' in identifier +> +[-Wdollar-in-identifier-extension] +> +> +The line of code looks like this: +> +> +#define CONFIG_TPM $(CONFIG_SOFTMMU) +> +> +This is fine for Makefile code, but won't work as expected in C code. +Broken in commit 3b8acc1 "configure: fix TPM logic". Cc'ing Paolo. + +Impact: #ifdef CONFIG_TPM never disables code. There are no other uses +of CONFIG_TPM in C code. + +I had a quick peek at configure and create_config, but refrained from +attempting to fix this, since I don't understand when exactly CONFIG_TPM +should be defined. + +On 29 April 2016 at 08:42, Markus Armbruster <address@hidden> wrote: +> +Stefan Weil <address@hidden> writes: +> +> +> A compilation test with clang -Weverything reported this problem: +> +> +> +> config-host.h:112:20: warning: '$' in identifier +> +> [-Wdollar-in-identifier-extension] +> +> +> +> The line of code looks like this: +> +> +> +> #define CONFIG_TPM $(CONFIG_SOFTMMU) +> +> +> +> This is fine for Makefile code, but won't work as expected in C code. +> +> +Broken in commit 3b8acc1 "configure: fix TPM logic". Cc'ing Paolo. +> +> +Impact: #ifdef CONFIG_TPM never disables code. There are no other uses +> +of CONFIG_TPM in C code. +> +> +I had a quick peek at configure and create_config, but refrained from +> +attempting to fix this, since I don't understand when exactly CONFIG_TPM +> +should be defined. +Looking at 'git blame' suggests this has been wrong like this for +some years, so we don't need to scramble to fix it for 2.6. + +thanks +-- PMM + diff --git a/results/classifier/016/none/65781993 b/results/classifier/016/none/65781993 new file mode 100644 index 00000000..92cd0275 --- /dev/null +++ b/results/classifier/016/none/65781993 @@ -0,0 +1,2820 @@ +debug: 0.668 +hypervisor: 0.630 +operating system: 0.229 +socket: 0.188 +files: 0.071 +performance: 0.053 +network: 0.043 +x86: 0.027 +virtual: 0.022 +register: 0.021 +TCG: 0.019 +kernel: 0.013 +i386: 0.011 +device: 0.010 +permissions: 0.009 +alpha: 0.008 +PID: 0.008 +semantic: 0.006 +ppc: 0.006 +assembly: 0.004 +risc-v: 0.004 +user-level: 0.004 +boot: 0.003 +architecture: 0.003 +arm: 0.002 +vnc: 0.002 +VMM: 0.002 +mistranslation: 0.002 +graphic: 0.002 +peripherals: 0.001 +KVM: 0.001 + +[Qemu-devel] 答复: Re: 答复: Re: [BUG]COLO failover hang + +Thank youã + +I have test areadyã + +When the Primary Node panic,the Secondary Node qemu hang at the same placeã + +Incorrding +http://wiki.qemu-project.org/Features/COLO +ï¼kill Primary Node qemu +will not produce the problem,but Primary Node panic canã + +I think due to the feature of channel does not support +QIO_CHANNEL_FEATURE_SHUTDOWN. + + +when failover,channel_shutdown could not shut down the channel. + + +so the colo_process_incoming_thread will hang at recvmsg. + + +I test a patch: + + +diff --git a/migration/socket.c b/migration/socket.c + + +index 13966f1..d65a0ea 100644 + + +--- a/migration/socket.c + + ++++ b/migration/socket.c + + +@@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel +*ioc, + + + } + + + + + + trace_migration_socket_incoming_accepted() + + + + + + qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming") + + ++ qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) + + + migration_channel_process_incoming(migrate_get_current(), + + + QIO_CHANNEL(sioc)) + + + object_unref(OBJECT(sioc)) + + + + +My test will not hang any more. + + + + + + + + + + + + + + + + + +åå§é®ä»¶ + + + +åä»¶äººï¼ address@hidden +æ¶ä»¶äººï¼ç广10165992 address@hidden +æéäººï¼ address@hidden address@hidden +æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 +主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang + + + + + +Hi,Wang. + +You can test this branch: +https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk +and please follow wiki ensure your own configuration correctly. +http://wiki.qemu-project.org/Features/COLO +Thanks + +Zhang Chen + + +On 03/21/2017 03:27 PM, address@hidden wrote: +ï¼ +ï¼ hi. +ï¼ +ï¼ I test the git qemu master have the same problem. +ï¼ +ï¼ (gdb) bt +ï¼ +ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, +ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 +ï¼ +ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read +ï¼ (address@hidden, address@hidden "", +ï¼ address@hidden, address@hidden) at io/channel.c:114 +ï¼ +ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ migration/qemu-file-channel.c:78 +ï¼ +ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at +ï¼ migration/qemu-file.c:295 +ï¼ +ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, +ï¼ address@hidden) at migration/qemu-file.c:555 +ï¼ +ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at +ï¼ migration/qemu-file.c:568 +ï¼ +ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at +ï¼ migration/qemu-file.c:648 +ï¼ +ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, +ï¼ address@hidden) at migration/colo.c:244 +ï¼ +ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized +ï¼ outï¼, address@hidden, +ï¼ address@hidden) +ï¼ +ï¼ at migration/colo.c:264 +ï¼ +ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread +ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 +ï¼ +ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ +ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 +ï¼ +ï¼ (gdb) p ioc-ï¼name +ï¼ +ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" +ï¼ +ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN +ï¼ +ï¼ $3 = 0 +ï¼ +ï¼ +ï¼ (gdb) bt +ï¼ +ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, +ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 +ï¼ +ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at +ï¼ gmain.c:3054 +ï¼ +ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, +ï¼ address@hidden) at gmain.c:3630 +ï¼ +ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 +ï¼ +ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at +ï¼ util/main-loop.c:258 +ï¼ +ï¼ #5 main_loop_wait (address@hidden) at +ï¼ util/main-loop.c:506 +ï¼ +ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 +ï¼ +ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized +ï¼ outï¼) at vl.c:4709 +ï¼ +ï¼ (gdb) p ioc-ï¼features +ï¼ +ï¼ $1 = 6 +ï¼ +ï¼ (gdb) p ioc-ï¼name +ï¼ +ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" +ï¼ +ï¼ +ï¼ May be socket_accept_incoming_migration should +ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? +ï¼ +ï¼ +ï¼ thank you. +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ åå§é®ä»¶ +ï¼ address@hidden +ï¼ address@hidden +ï¼ address@hidden@huawei.comï¼ +ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 +ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ On 03/15/2017 05:06 PM, wangguang wrote: +ï¼ ï¼ am testing QEMU COLO feature described here [QEMU +ï¼ ï¼ Wiki]( +http://wiki.qemu-project.org/Features/COLO +). +ï¼ ï¼ +ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. +ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. +ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": +ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's +ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . +ï¼ ï¼ +ï¼ ï¼ I found that the colo in qemu is not complete yet. +ï¼ ï¼ Do the colo have any plan for development? +ï¼ +ï¼ Yes, We are developing. You can see some of patch we pushing. +ï¼ +ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! +ï¼ +ï¼ In our internal version can run it successfully, +ï¼ The failover detail you can ask Zhanghailiang for help. +ï¼ Next time if you have some question about COLO, +ï¼ please cc me and zhanghailiang address@hidden +ï¼ +ï¼ +ï¼ Thanks +ï¼ Zhang Chen +ï¼ +ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ centos7.2+qemu2.7.50 +ï¼ ï¼ (gdb) bt +ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 +ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized outï¼, +ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, errp=0x0) at +ï¼ ï¼ io/channel-socket.c:497 +ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, +ï¼ ï¼ address@hidden "", address@hidden, +ï¼ ï¼ address@hidden) at io/channel.c:97 +ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ ï¼ migration/qemu-file-channel.c:78 +ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at +ï¼ ï¼ migration/qemu-file.c:257 +ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, +ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 +ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at +ï¼ ï¼ migration/qemu-file.c:523 +ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at +ï¼ ï¼ migration/qemu-file.c:603 +ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, +ï¼ ï¼ address@hidden) at migration/colo.c:215 +ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, +ï¼ ï¼ checkpoint_request=ï¼synthetic pointerï¼, f=ï¼optimized outï¼) at +ï¼ ï¼ migration/colo.c:546 +ï¼ ï¼ #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at +ï¼ ï¼ migration/colo.c:649 +ï¼ ï¼ #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ ï¼ #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ -- +ï¼ ï¼ View this message in context: +http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html +ï¼ ï¼ Sent from the Developer mailing list archive at Nabble.com. +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ +ï¼ -- +ï¼ Thanks +ï¼ Zhang Chen +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ + +-- +Thanks +Zhang Chen + +Hi, + +On 2017/3/21 16:10, address@hidden wrote: +Thank youã + +I have test areadyã + +When the Primary Node panic,the Secondary Node qemu hang at the same placeã + +Incorrding +http://wiki.qemu-project.org/Features/COLO +ï¼kill Primary Node qemu +will not produce the problem,but Primary Node panic canã + +I think due to the feature of channel does not support +QIO_CHANNEL_FEATURE_SHUTDOWN. +Yes, you are right, when we do failover for primary/secondary VM, we will +shutdown the related +fd in case it is stuck in the read/write fd. + +It seems that you didn't follow the above introduction exactly to do the test. +Could you +share your test procedures ? Especially the commands used in the test. + +Thanks, +Hailiang +when failover,channel_shutdown could not shut down the channel. + + +so the colo_process_incoming_thread will hang at recvmsg. + + +I test a patch: + + +diff --git a/migration/socket.c b/migration/socket.c + + +index 13966f1..d65a0ea 100644 + + +--- a/migration/socket.c + + ++++ b/migration/socket.c + + +@@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel +*ioc, + + + } + + + + + + trace_migration_socket_incoming_accepted() + + + + + + qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming") + + ++ qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) + + + migration_channel_process_incoming(migrate_get_current(), + + + QIO_CHANNEL(sioc)) + + + object_unref(OBJECT(sioc)) + + + + +My test will not hang any more. + + + + + + + + + + + + + + + + + +åå§é®ä»¶ + + + +åä»¶äººï¼ address@hidden +æ¶ä»¶äººï¼ç广10165992 address@hidden +æéäººï¼ address@hidden address@hidden +æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 +主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang + + + + + +Hi,Wang. + +You can test this branch: +https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk +and please follow wiki ensure your own configuration correctly. +http://wiki.qemu-project.org/Features/COLO +Thanks + +Zhang Chen + + +On 03/21/2017 03:27 PM, address@hidden wrote: +ï¼ +ï¼ hi. +ï¼ +ï¼ I test the git qemu master have the same problem. +ï¼ +ï¼ (gdb) bt +ï¼ +ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, +ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 +ï¼ +ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read +ï¼ (address@hidden, address@hidden "", +ï¼ address@hidden, address@hidden) at io/channel.c:114 +ï¼ +ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ migration/qemu-file-channel.c:78 +ï¼ +ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at +ï¼ migration/qemu-file.c:295 +ï¼ +ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, +ï¼ address@hidden) at migration/qemu-file.c:555 +ï¼ +ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at +ï¼ migration/qemu-file.c:568 +ï¼ +ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at +ï¼ migration/qemu-file.c:648 +ï¼ +ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, +ï¼ address@hidden) at migration/colo.c:244 +ï¼ +ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized +ï¼ outï¼, address@hidden, +ï¼ address@hidden) +ï¼ +ï¼ at migration/colo.c:264 +ï¼ +ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread +ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 +ï¼ +ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ +ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 +ï¼ +ï¼ (gdb) p ioc-ï¼name +ï¼ +ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" +ï¼ +ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN +ï¼ +ï¼ $3 = 0 +ï¼ +ï¼ +ï¼ (gdb) bt +ï¼ +ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, +ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 +ï¼ +ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at +ï¼ gmain.c:3054 +ï¼ +ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, +ï¼ address@hidden) at gmain.c:3630 +ï¼ +ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 +ï¼ +ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at +ï¼ util/main-loop.c:258 +ï¼ +ï¼ #5 main_loop_wait (address@hidden) at +ï¼ util/main-loop.c:506 +ï¼ +ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 +ï¼ +ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized +ï¼ outï¼) at vl.c:4709 +ï¼ +ï¼ (gdb) p ioc-ï¼features +ï¼ +ï¼ $1 = 6 +ï¼ +ï¼ (gdb) p ioc-ï¼name +ï¼ +ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" +ï¼ +ï¼ +ï¼ May be socket_accept_incoming_migration should +ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? +ï¼ +ï¼ +ï¼ thank you. +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ åå§é®ä»¶ +ï¼ address@hidden +ï¼ address@hidden +ï¼ address@hidden@huawei.comï¼ +ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 +ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ On 03/15/2017 05:06 PM, wangguang wrote: +ï¼ ï¼ am testing QEMU COLO feature described here [QEMU +ï¼ ï¼ Wiki]( +http://wiki.qemu-project.org/Features/COLO +). +ï¼ ï¼ +ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. +ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. +ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": +ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's +ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . +ï¼ ï¼ +ï¼ ï¼ I found that the colo in qemu is not complete yet. +ï¼ ï¼ Do the colo have any plan for development? +ï¼ +ï¼ Yes, We are developing. You can see some of patch we pushing. +ï¼ +ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! +ï¼ +ï¼ In our internal version can run it successfully, +ï¼ The failover detail you can ask Zhanghailiang for help. +ï¼ Next time if you have some question about COLO, +ï¼ please cc me and zhanghailiang address@hidden +ï¼ +ï¼ +ï¼ Thanks +ï¼ Zhang Chen +ï¼ +ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ centos7.2+qemu2.7.50 +ï¼ ï¼ (gdb) bt +ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 +ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized outï¼, +ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, errp=0x0) at +ï¼ ï¼ io/channel-socket.c:497 +ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, +ï¼ ï¼ address@hidden "", address@hidden, +ï¼ ï¼ address@hidden) at io/channel.c:97 +ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ ï¼ migration/qemu-file-channel.c:78 +ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at +ï¼ ï¼ migration/qemu-file.c:257 +ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, +ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 +ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at +ï¼ ï¼ migration/qemu-file.c:523 +ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at +ï¼ ï¼ migration/qemu-file.c:603 +ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, +ï¼ ï¼ address@hidden) at migration/colo.c:215 +ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, +ï¼ ï¼ checkpoint_request=ï¼synthetic pointerï¼, f=ï¼optimized outï¼) at +ï¼ ï¼ migration/colo.c:546 +ï¼ ï¼ #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at +ï¼ ï¼ migration/colo.c:649 +ï¼ ï¼ #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ ï¼ #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ -- +ï¼ ï¼ View this message in context: +http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html +ï¼ ï¼ Sent from the Developer mailing list archive at Nabble.com. +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ +ï¼ -- +ï¼ Thanks +ï¼ Zhang Chen +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ + +Hi, + +Thanks for reporting this, and i confirmed it in my test, and it is a bug. + +Though we tried to call qemu_file_shutdown() to shutdown the related fd, in +case COLO thread/incoming thread is stuck in read/write() while do failover, +but it didn't take effect, because all the fd used by COLO (also migration) +has been wrapped by qio channel, and it will not call the shutdown API if +we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN). + +Cc: Dr. David Alan Gilbert <address@hidden> + +I doubted migration cancel has the same problem, it may be stuck in write() +if we tried to cancel migration. + +void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error +**errp) +{ + qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing"); + migration_channel_connect(s, ioc, NULL); + ... ... +We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN) above, +and the +migrate_fd_cancel() +{ + ... ... + if (s->state == MIGRATION_STATUS_CANCELLING && f) { + qemu_file_shutdown(f); --> This will not take effect. No ? + } +} + +Thanks, +Hailiang + +On 2017/3/21 16:10, address@hidden wrote: +Thank youã + +I have test areadyã + +When the Primary Node panic,the Secondary Node qemu hang at the same placeã + +Incorrding +http://wiki.qemu-project.org/Features/COLO +ï¼kill Primary Node qemu +will not produce the problem,but Primary Node panic canã + +I think due to the feature of channel does not support +QIO_CHANNEL_FEATURE_SHUTDOWN. + + +when failover,channel_shutdown could not shut down the channel. + + +so the colo_process_incoming_thread will hang at recvmsg. + + +I test a patch: + + +diff --git a/migration/socket.c b/migration/socket.c + + +index 13966f1..d65a0ea 100644 + + +--- a/migration/socket.c + + ++++ b/migration/socket.c + + +@@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel +*ioc, + + + } + + + + + + trace_migration_socket_incoming_accepted() + + + + + + qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming") + + ++ qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) + + + migration_channel_process_incoming(migrate_get_current(), + + + QIO_CHANNEL(sioc)) + + + object_unref(OBJECT(sioc)) + + + + +My test will not hang any more. + + + + + + + + + + + + + + + + + +åå§é®ä»¶ + + + +åä»¶äººï¼ address@hidden +æ¶ä»¶äººï¼ç广10165992 address@hidden +æéäººï¼ address@hidden address@hidden +æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 +主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang + + + + + +Hi,Wang. + +You can test this branch: +https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk +and please follow wiki ensure your own configuration correctly. +http://wiki.qemu-project.org/Features/COLO +Thanks + +Zhang Chen + + +On 03/21/2017 03:27 PM, address@hidden wrote: +ï¼ +ï¼ hi. +ï¼ +ï¼ I test the git qemu master have the same problem. +ï¼ +ï¼ (gdb) bt +ï¼ +ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, +ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 +ï¼ +ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read +ï¼ (address@hidden, address@hidden "", +ï¼ address@hidden, address@hidden) at io/channel.c:114 +ï¼ +ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ migration/qemu-file-channel.c:78 +ï¼ +ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at +ï¼ migration/qemu-file.c:295 +ï¼ +ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, +ï¼ address@hidden) at migration/qemu-file.c:555 +ï¼ +ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at +ï¼ migration/qemu-file.c:568 +ï¼ +ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at +ï¼ migration/qemu-file.c:648 +ï¼ +ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, +ï¼ address@hidden) at migration/colo.c:244 +ï¼ +ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized +ï¼ outï¼, address@hidden, +ï¼ address@hidden) +ï¼ +ï¼ at migration/colo.c:264 +ï¼ +ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread +ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 +ï¼ +ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ +ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 +ï¼ +ï¼ (gdb) p ioc-ï¼name +ï¼ +ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" +ï¼ +ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN +ï¼ +ï¼ $3 = 0 +ï¼ +ï¼ +ï¼ (gdb) bt +ï¼ +ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, +ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 +ï¼ +ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at +ï¼ gmain.c:3054 +ï¼ +ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, +ï¼ address@hidden) at gmain.c:3630 +ï¼ +ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 +ï¼ +ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at +ï¼ util/main-loop.c:258 +ï¼ +ï¼ #5 main_loop_wait (address@hidden) at +ï¼ util/main-loop.c:506 +ï¼ +ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 +ï¼ +ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized +ï¼ outï¼) at vl.c:4709 +ï¼ +ï¼ (gdb) p ioc-ï¼features +ï¼ +ï¼ $1 = 6 +ï¼ +ï¼ (gdb) p ioc-ï¼name +ï¼ +ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" +ï¼ +ï¼ +ï¼ May be socket_accept_incoming_migration should +ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? +ï¼ +ï¼ +ï¼ thank you. +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ åå§é®ä»¶ +ï¼ address@hidden +ï¼ address@hidden +ï¼ address@hidden@huawei.comï¼ +ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 +ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ On 03/15/2017 05:06 PM, wangguang wrote: +ï¼ ï¼ am testing QEMU COLO feature described here [QEMU +ï¼ ï¼ Wiki]( +http://wiki.qemu-project.org/Features/COLO +). +ï¼ ï¼ +ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. +ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. +ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": +ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's +ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . +ï¼ ï¼ +ï¼ ï¼ I found that the colo in qemu is not complete yet. +ï¼ ï¼ Do the colo have any plan for development? +ï¼ +ï¼ Yes, We are developing. You can see some of patch we pushing. +ï¼ +ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! +ï¼ +ï¼ In our internal version can run it successfully, +ï¼ The failover detail you can ask Zhanghailiang for help. +ï¼ Next time if you have some question about COLO, +ï¼ please cc me and zhanghailiang address@hidden +ï¼ +ï¼ +ï¼ Thanks +ï¼ Zhang Chen +ï¼ +ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ centos7.2+qemu2.7.50 +ï¼ ï¼ (gdb) bt +ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 +ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized outï¼, +ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, errp=0x0) at +ï¼ ï¼ io/channel-socket.c:497 +ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, +ï¼ ï¼ address@hidden "", address@hidden, +ï¼ ï¼ address@hidden) at io/channel.c:97 +ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ ï¼ migration/qemu-file-channel.c:78 +ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at +ï¼ ï¼ migration/qemu-file.c:257 +ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, +ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 +ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at +ï¼ ï¼ migration/qemu-file.c:523 +ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at +ï¼ ï¼ migration/qemu-file.c:603 +ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, +ï¼ ï¼ address@hidden) at migration/colo.c:215 +ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, +ï¼ ï¼ checkpoint_request=ï¼synthetic pointerï¼, f=ï¼optimized outï¼) at +ï¼ ï¼ migration/colo.c:546 +ï¼ ï¼ #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at +ï¼ ï¼ migration/colo.c:649 +ï¼ ï¼ #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ ï¼ #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ -- +ï¼ ï¼ View this message in context: +http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html +ï¼ ï¼ Sent from the Developer mailing list archive at Nabble.com. +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ +ï¼ -- +ï¼ Thanks +ï¼ Zhang Chen +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ + +* Hailiang Zhang (address@hidden) wrote: +> +Hi, +> +> +Thanks for reporting this, and i confirmed it in my test, and it is a bug. +> +> +Though we tried to call qemu_file_shutdown() to shutdown the related fd, in +> +case COLO thread/incoming thread is stuck in read/write() while do failover, +> +but it didn't take effect, because all the fd used by COLO (also migration) +> +has been wrapped by qio channel, and it will not call the shutdown API if +> +we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), +> +QIO_CHANNEL_FEATURE_SHUTDOWN). +> +> +Cc: Dr. David Alan Gilbert <address@hidden> +> +> +I doubted migration cancel has the same problem, it may be stuck in write() +> +if we tried to cancel migration. +> +> +void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error +> +**errp) +> +{ +> +qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing"); +> +migration_channel_connect(s, ioc, NULL); +> +... ... +> +We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), +> +QIO_CHANNEL_FEATURE_SHUTDOWN) above, +> +and the +> +migrate_fd_cancel() +> +{ +> +... ... +> +if (s->state == MIGRATION_STATUS_CANCELLING && f) { +> +qemu_file_shutdown(f); --> This will not take effect. No ? +> +} +> +} +(cc'd in Daniel Berrange). +I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); +at the +top of qio_channel_socket_new; so I think that's safe isn't it? + +Dave + +> +Thanks, +> +Hailiang +> +> +On 2017/3/21 16:10, address@hidden wrote: +> +> Thank youã +> +> +> +> I have test areadyã +> +> +> +> When the Primary Node panic,the Secondary Node qemu hang at the same placeã +> +> +> +> Incorrding +http://wiki.qemu-project.org/Features/COLO +ï¼kill Primary Node +> +> qemu will not produce the problem,but Primary Node panic canã +> +> +> +> I think due to the feature of channel does not support +> +> QIO_CHANNEL_FEATURE_SHUTDOWN. +> +> +> +> +> +> when failover,channel_shutdown could not shut down the channel. +> +> +> +> +> +> so the colo_process_incoming_thread will hang at recvmsg. +> +> +> +> +> +> I test a patch: +> +> +> +> +> +> diff --git a/migration/socket.c b/migration/socket.c +> +> +> +> +> +> index 13966f1..d65a0ea 100644 +> +> +> +> +> +> --- a/migration/socket.c +> +> +> +> +> +> +++ b/migration/socket.c +> +> +> +> +> +> @@ -147,8 +147,9 @@ static gboolean +> +> socket_accept_incoming_migration(QIOChannel *ioc, +> +> +> +> +> +> } +> +> +> +> +> +> +> +> +> +> +> +> trace_migration_socket_incoming_accepted() +> +> +> +> +> +> +> +> +> +> +> +> qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming") +> +> +> +> +> +> + qio_channel_set_feature(QIO_CHANNEL(sioc), +> +> QIO_CHANNEL_FEATURE_SHUTDOWN) +> +> +> +> +> +> migration_channel_process_incoming(migrate_get_current(), +> +> +> +> +> +> QIO_CHANNEL(sioc)) +> +> +> +> +> +> object_unref(OBJECT(sioc)) +> +> +> +> +> +> +> +> +> +> My test will not hang any more. +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> +> åå§é®ä»¶ +> +> +> +> +> +> +> +> åä»¶äººï¼ address@hidden +> +> æ¶ä»¶äººï¼ç广10165992 address@hidden +> +> æéäººï¼ address@hidden address@hidden +> +> æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 +> +> 主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang +> +> +> +> +> +> +> +> +> +> +> +> Hi,Wang. +> +> +> +> You can test this branch: +> +> +> +> +https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk +> +> +> +> and please follow wiki ensure your own configuration correctly. +> +> +> +> +http://wiki.qemu-project.org/Features/COLO +> +> +> +> +> +> Thanks +> +> +> +> Zhang Chen +> +> +> +> +> +> On 03/21/2017 03:27 PM, address@hidden wrote: +> +> ï¼ +> +> ï¼ hi. +> +> ï¼ +> +> ï¼ I test the git qemu master have the same problem. +> +> ï¼ +> +> ï¼ (gdb) bt +> +> ï¼ +> +> ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, +> +> ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 +> +> ï¼ +> +> ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read +> +> ï¼ (address@hidden, address@hidden "", +> +> ï¼ address@hidden, address@hidden) at io/channel.c:114 +> +> ï¼ +> +> ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, +> +> ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at +> +> ï¼ migration/qemu-file-channel.c:78 +> +> ï¼ +> +> ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at +> +> ï¼ migration/qemu-file.c:295 +> +> ï¼ +> +> ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, +> +> ï¼ address@hidden) at migration/qemu-file.c:555 +> +> ï¼ +> +> ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at +> +> ï¼ migration/qemu-file.c:568 +> +> ï¼ +> +> ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at +> +> ï¼ migration/qemu-file.c:648 +> +> ï¼ +> +> ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, +> +> ï¼ address@hidden) at migration/colo.c:244 +> +> ï¼ +> +> ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized +> +> ï¼ outï¼, address@hidden, +> +> ï¼ address@hidden) +> +> ï¼ +> +> ï¼ at migration/colo.c:264 +> +> ï¼ +> +> ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread +> +> ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 +> +> ï¼ +> +> ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 +> +> ï¼ +> +> ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 +> +> ï¼ +> +> ï¼ (gdb) p ioc-ï¼name +> +> ï¼ +> +> ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" +> +> ï¼ +> +> ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN +> +> ï¼ +> +> ï¼ $3 = 0 +> +> ï¼ +> +> ï¼ +> +> ï¼ (gdb) bt +> +> ï¼ +> +> ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, +> +> ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 +> +> ï¼ +> +> ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at +> +> ï¼ gmain.c:3054 +> +> ï¼ +> +> ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, +> +> ï¼ address@hidden) at gmain.c:3630 +> +> ï¼ +> +> ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 +> +> ï¼ +> +> ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at +> +> ï¼ util/main-loop.c:258 +> +> ï¼ +> +> ï¼ #5 main_loop_wait (address@hidden) at +> +> ï¼ util/main-loop.c:506 +> +> ï¼ +> +> ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 +> +> ï¼ +> +> ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized +> +> ï¼ outï¼) at vl.c:4709 +> +> ï¼ +> +> ï¼ (gdb) p ioc-ï¼features +> +> ï¼ +> +> ï¼ $1 = 6 +> +> ï¼ +> +> ï¼ (gdb) p ioc-ï¼name +> +> ï¼ +> +> ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" +> +> ï¼ +> +> ï¼ +> +> ï¼ May be socket_accept_incoming_migration should +> +> ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? +> +> ï¼ +> +> ï¼ +> +> ï¼ thank you. +> +> ï¼ +> +> ï¼ +> +> ï¼ +> +> ï¼ +> +> ï¼ +> +> ï¼ åå§é®ä»¶ +> +> ï¼ address@hidden +> +> ï¼ address@hidden +> +> ï¼ address@hidden@huawei.comï¼ +> +> ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 +> +> ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* +> +> ï¼ +> +> ï¼ +> +> ï¼ +> +> ï¼ +> +> ï¼ On 03/15/2017 05:06 PM, wangguang wrote: +> +> ï¼ ï¼ am testing QEMU COLO feature described here [QEMU +> +> ï¼ ï¼ Wiki]( +http://wiki.qemu-project.org/Features/COLO +). +> +> ï¼ ï¼ +> +> ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. +> +> ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. +> +> ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": +> +> ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's +> +> ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . +> +> ï¼ ï¼ +> +> ï¼ ï¼ I found that the colo in qemu is not complete yet. +> +> ï¼ ï¼ Do the colo have any plan for development? +> +> ï¼ +> +> ï¼ Yes, We are developing. You can see some of patch we pushing. +> +> ï¼ +> +> ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! +> +> ï¼ +> +> ï¼ In our internal version can run it successfully, +> +> ï¼ The failover detail you can ask Zhanghailiang for help. +> +> ï¼ Next time if you have some question about COLO, +> +> ï¼ please cc me and zhanghailiang address@hidden +> +> ï¼ +> +> ï¼ +> +> ï¼ Thanks +> +> ï¼ Zhang Chen +> +> ï¼ +> +> ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ centos7.2+qemu2.7.50 +> +> ï¼ ï¼ (gdb) bt +> +> ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 +> +> ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized outï¼, +> +> ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, errp=0x0) +> +> at +> +> ï¼ ï¼ io/channel-socket.c:497 +> +> ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, +> +> ï¼ ï¼ address@hidden "", address@hidden, +> +> ï¼ ï¼ address@hidden) at io/channel.c:97 +> +> ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized outï¼, +> +> ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at +> +> ï¼ ï¼ migration/qemu-file-channel.c:78 +> +> ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at +> +> ï¼ ï¼ migration/qemu-file.c:257 +> +> ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, +> +> ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 +> +> ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at +> +> ï¼ ï¼ migration/qemu-file.c:523 +> +> ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at +> +> ï¼ ï¼ migration/qemu-file.c:603 +> +> ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, +> +> ï¼ ï¼ address@hidden) at migration/colo.c:215 +> +> ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, +> +> ï¼ ï¼ checkpoint_request=ï¼synthetic pointerï¼, f=ï¼optimized outï¼) at +> +> ï¼ ï¼ migration/colo.c:546 +> +> ï¼ ï¼ #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at +> +> ï¼ ï¼ migration/colo.c:649 +> +> ï¼ ï¼ #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 +> +> ï¼ ï¼ #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ -- +> +> ï¼ ï¼ View this message in context: +> +> +http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html +> +> ï¼ ï¼ Sent from the Developer mailing list archive at Nabble.com. +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ ï¼ +> +> ï¼ +> +> ï¼ -- +> +> ï¼ Thanks +> +> ï¼ Zhang Chen +> +> ï¼ +> +> ï¼ +> +> ï¼ +> +> ï¼ +> +> ï¼ +> +> +> +-- +Dr. David Alan Gilbert / address@hidden / Manchester, UK + +On 2017/3/21 19:56, Dr. David Alan Gilbert wrote: +* Hailiang Zhang (address@hidden) wrote: +Hi, + +Thanks for reporting this, and i confirmed it in my test, and it is a bug. + +Though we tried to call qemu_file_shutdown() to shutdown the related fd, in +case COLO thread/incoming thread is stuck in read/write() while do failover, +but it didn't take effect, because all the fd used by COLO (also migration) +has been wrapped by qio channel, and it will not call the shutdown API if +we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN). + +Cc: Dr. David Alan Gilbert <address@hidden> + +I doubted migration cancel has the same problem, it may be stuck in write() +if we tried to cancel migration. + +void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error +**errp) +{ + qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing"); + migration_channel_connect(s, ioc, NULL); + ... ... +We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN) above, +and the +migrate_fd_cancel() +{ + ... ... + if (s->state == MIGRATION_STATUS_CANCELLING && f) { + qemu_file_shutdown(f); --> This will not take effect. No ? + } +} +(cc'd in Daniel Berrange). +I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN); +at the +top of qio_channel_socket_new; so I think that's safe isn't it? +Hmm, you are right, this problem is only exist for the migration incoming fd, +thanks. +Dave +Thanks, +Hailiang + +On 2017/3/21 16:10, address@hidden wrote: +Thank youã + +I have test areadyã + +When the Primary Node panic,the Secondary Node qemu hang at the same placeã + +Incorrding +http://wiki.qemu-project.org/Features/COLO +ï¼kill Primary Node qemu +will not produce the problem,but Primary Node panic canã + +I think due to the feature of channel does not support +QIO_CHANNEL_FEATURE_SHUTDOWN. + + +when failover,channel_shutdown could not shut down the channel. + + +so the colo_process_incoming_thread will hang at recvmsg. + + +I test a patch: + + +diff --git a/migration/socket.c b/migration/socket.c + + +index 13966f1..d65a0ea 100644 + + +--- a/migration/socket.c + + ++++ b/migration/socket.c + + +@@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel +*ioc, + + + } + + + + + + trace_migration_socket_incoming_accepted() + + + + + + qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming") + + ++ qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) + + + migration_channel_process_incoming(migrate_get_current(), + + + QIO_CHANNEL(sioc)) + + + object_unref(OBJECT(sioc)) + + + + +My test will not hang any more. + + + + + + + + + + + + + + + + + +åå§é®ä»¶ + + + +åä»¶äººï¼ address@hidden +æ¶ä»¶äººï¼ç广10165992 address@hidden +æéäººï¼ address@hidden address@hidden +æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 +主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang + + + + + +Hi,Wang. + +You can test this branch: +https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk +and please follow wiki ensure your own configuration correctly. +http://wiki.qemu-project.org/Features/COLO +Thanks + +Zhang Chen + + +On 03/21/2017 03:27 PM, address@hidden wrote: +ï¼ +ï¼ hi. +ï¼ +ï¼ I test the git qemu master have the same problem. +ï¼ +ï¼ (gdb) bt +ï¼ +ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, +ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 +ï¼ +ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read +ï¼ (address@hidden, address@hidden "", +ï¼ address@hidden, address@hidden) at io/channel.c:114 +ï¼ +ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ migration/qemu-file-channel.c:78 +ï¼ +ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at +ï¼ migration/qemu-file.c:295 +ï¼ +ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, +ï¼ address@hidden) at migration/qemu-file.c:555 +ï¼ +ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at +ï¼ migration/qemu-file.c:568 +ï¼ +ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at +ï¼ migration/qemu-file.c:648 +ï¼ +ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, +ï¼ address@hidden) at migration/colo.c:244 +ï¼ +ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized +ï¼ outï¼, address@hidden, +ï¼ address@hidden) +ï¼ +ï¼ at migration/colo.c:264 +ï¼ +ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread +ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 +ï¼ +ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ +ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 +ï¼ +ï¼ (gdb) p ioc-ï¼name +ï¼ +ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" +ï¼ +ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN +ï¼ +ï¼ $3 = 0 +ï¼ +ï¼ +ï¼ (gdb) bt +ï¼ +ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, +ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 +ï¼ +ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at +ï¼ gmain.c:3054 +ï¼ +ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, +ï¼ address@hidden) at gmain.c:3630 +ï¼ +ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 +ï¼ +ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at +ï¼ util/main-loop.c:258 +ï¼ +ï¼ #5 main_loop_wait (address@hidden) at +ï¼ util/main-loop.c:506 +ï¼ +ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 +ï¼ +ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized +ï¼ outï¼) at vl.c:4709 +ï¼ +ï¼ (gdb) p ioc-ï¼features +ï¼ +ï¼ $1 = 6 +ï¼ +ï¼ (gdb) p ioc-ï¼name +ï¼ +ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" +ï¼ +ï¼ +ï¼ May be socket_accept_incoming_migration should +ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? +ï¼ +ï¼ +ï¼ thank you. +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ åå§é®ä»¶ +ï¼ address@hidden +ï¼ address@hidden +ï¼ address@hidden@huawei.comï¼ +ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 +ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ On 03/15/2017 05:06 PM, wangguang wrote: +ï¼ ï¼ am testing QEMU COLO feature described here [QEMU +ï¼ ï¼ Wiki]( +http://wiki.qemu-project.org/Features/COLO +). +ï¼ ï¼ +ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. +ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. +ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": +ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's +ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . +ï¼ ï¼ +ï¼ ï¼ I found that the colo in qemu is not complete yet. +ï¼ ï¼ Do the colo have any plan for development? +ï¼ +ï¼ Yes, We are developing. You can see some of patch we pushing. +ï¼ +ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! +ï¼ +ï¼ In our internal version can run it successfully, +ï¼ The failover detail you can ask Zhanghailiang for help. +ï¼ Next time if you have some question about COLO, +ï¼ please cc me and zhanghailiang address@hidden +ï¼ +ï¼ +ï¼ Thanks +ï¼ Zhang Chen +ï¼ +ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ centos7.2+qemu2.7.50 +ï¼ ï¼ (gdb) bt +ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 +ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized outï¼, +ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, errp=0x0) at +ï¼ ï¼ io/channel-socket.c:497 +ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, +ï¼ ï¼ address@hidden "", address@hidden, +ï¼ ï¼ address@hidden) at io/channel.c:97 +ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ ï¼ migration/qemu-file-channel.c:78 +ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at +ï¼ ï¼ migration/qemu-file.c:257 +ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, +ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 +ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at +ï¼ ï¼ migration/qemu-file.c:523 +ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at +ï¼ ï¼ migration/qemu-file.c:603 +ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, +ï¼ ï¼ address@hidden) at migration/colo.c:215 +ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48, +ï¼ ï¼ checkpoint_request=ï¼synthetic pointerï¼, f=ï¼optimized outï¼) at +ï¼ ï¼ migration/colo.c:546 +ï¼ ï¼ #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at +ï¼ ï¼ migration/colo.c:649 +ï¼ ï¼ #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ ï¼ #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ -- +ï¼ ï¼ View this message in context: +http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html +ï¼ ï¼ Sent from the Developer mailing list archive at Nabble.com. +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ ï¼ +ï¼ +ï¼ -- +ï¼ Thanks +ï¼ Zhang Chen +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +-- +Dr. David Alan Gilbert / address@hidden / Manchester, UK + +. + +* Hailiang Zhang (address@hidden) wrote: +> +On 2017/3/21 19:56, Dr. David Alan Gilbert wrote: +> +> * Hailiang Zhang (address@hidden) wrote: +> +> > Hi, +> +> > +> +> > Thanks for reporting this, and i confirmed it in my test, and it is a bug. +> +> > +> +> > Though we tried to call qemu_file_shutdown() to shutdown the related fd, +> +> > in +> +> > case COLO thread/incoming thread is stuck in read/write() while do +> +> > failover, +> +> > but it didn't take effect, because all the fd used by COLO (also +> +> > migration) +> +> > has been wrapped by qio channel, and it will not call the shutdown API if +> +> > we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), +> +> > QIO_CHANNEL_FEATURE_SHUTDOWN). +> +> > +> +> > Cc: Dr. David Alan Gilbert <address@hidden> +> +> > +> +> > I doubted migration cancel has the same problem, it may be stuck in +> +> > write() +> +> > if we tried to cancel migration. +> +> > +> +> > void fd_start_outgoing_migration(MigrationState *s, const char *fdname, +> +> > Error **errp) +> +> > { +> +> > qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing"); +> +> > migration_channel_connect(s, ioc, NULL); +> +> > ... ... +> +> > We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), +> +> > QIO_CHANNEL_FEATURE_SHUTDOWN) above, +> +> > and the +> +> > migrate_fd_cancel() +> +> > { +> +> > ... ... +> +> > if (s->state == MIGRATION_STATUS_CANCELLING && f) { +> +> > qemu_file_shutdown(f); --> This will not take effect. No ? +> +> > } +> +> > } +> +> +> +> (cc'd in Daniel Berrange). +> +> I see that we call qio_channel_set_feature(ioc, +> +> QIO_CHANNEL_FEATURE_SHUTDOWN); at the +> +> top of qio_channel_socket_new; so I think that's safe isn't it? +> +> +> +> +Hmm, you are right, this problem is only exist for the migration incoming fd, +> +thanks. +Yes, and I don't think we normally do a cancel on the incoming side of a +migration. + +Dave + +> +> Dave +> +> +> +> > Thanks, +> +> > Hailiang +> +> > +> +> > On 2017/3/21 16:10, address@hidden wrote: +> +> > > Thank youã +> +> > > +> +> > > I have test areadyã +> +> > > +> +> > > When the Primary Node panic,the Secondary Node qemu hang at the same +> +> > > placeã +> +> > > +> +> > > Incorrding +http://wiki.qemu-project.org/Features/COLO +ï¼kill Primary +> +> > > Node qemu will not produce the problem,but Primary Node panic canã +> +> > > +> +> > > I think due to the feature of channel does not support +> +> > > QIO_CHANNEL_FEATURE_SHUTDOWN. +> +> > > +> +> > > +> +> > > when failover,channel_shutdown could not shut down the channel. +> +> > > +> +> > > +> +> > > so the colo_process_incoming_thread will hang at recvmsg. +> +> > > +> +> > > +> +> > > I test a patch: +> +> > > +> +> > > +> +> > > diff --git a/migration/socket.c b/migration/socket.c +> +> > > +> +> > > +> +> > > index 13966f1..d65a0ea 100644 +> +> > > +> +> > > +> +> > > --- a/migration/socket.c +> +> > > +> +> > > +> +> > > +++ b/migration/socket.c +> +> > > +> +> > > +> +> > > @@ -147,8 +147,9 @@ static gboolean +> +> > > socket_accept_incoming_migration(QIOChannel *ioc, +> +> > > +> +> > > +> +> > > } +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > trace_migration_socket_incoming_accepted() +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > qio_channel_set_name(QIO_CHANNEL(sioc), +> +> > > "migration-socket-incoming") +> +> > > +> +> > > +> +> > > + qio_channel_set_feature(QIO_CHANNEL(sioc), +> +> > > QIO_CHANNEL_FEATURE_SHUTDOWN) +> +> > > +> +> > > +> +> > > migration_channel_process_incoming(migrate_get_current(), +> +> > > +> +> > > +> +> > > QIO_CHANNEL(sioc)) +> +> > > +> +> > > +> +> > > object_unref(OBJECT(sioc)) +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > My test will not hang any more. +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > åå§é®ä»¶ +> +> > > +> +> > > +> +> > > +> +> > > åä»¶äººï¼ address@hidden +> +> > > æ¶ä»¶äººï¼ç广10165992 address@hidden +> +> > > æéäººï¼ address@hidden address@hidden +> +> > > æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 +> +> > > 主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > +> +> > > Hi,Wang. +> +> > > +> +> > > You can test this branch: +> +> > > +> +> > > +https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk +> +> > > +> +> > > and please follow wiki ensure your own configuration correctly. +> +> > > +> +> > > +http://wiki.qemu-project.org/Features/COLO +> +> > > +> +> > > +> +> > > Thanks +> +> > > +> +> > > Zhang Chen +> +> > > +> +> > > +> +> > > On 03/21/2017 03:27 PM, address@hidden wrote: +> +> > > ï¼ +> +> > > ï¼ hi. +> +> > > ï¼ +> +> > > ï¼ I test the git qemu master have the same problem. +> +> > > ï¼ +> +> > > ï¼ (gdb) bt +> +> > > ï¼ +> +> > > ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, +> +> > > ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 +> +> > > ï¼ +> +> > > ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read +> +> > > ï¼ (address@hidden, address@hidden "", +> +> > > ï¼ address@hidden, address@hidden) at io/channel.c:114 +> +> > > ï¼ +> +> > > ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, +> +> > > ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at +> +> > > ï¼ migration/qemu-file-channel.c:78 +> +> > > ï¼ +> +> > > ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at +> +> > > ï¼ migration/qemu-file.c:295 +> +> > > ï¼ +> +> > > ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, +> +> > > ï¼ address@hidden) at migration/qemu-file.c:555 +> +> > > ï¼ +> +> > > ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at +> +> > > ï¼ migration/qemu-file.c:568 +> +> > > ï¼ +> +> > > ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at +> +> > > ï¼ migration/qemu-file.c:648 +> +> > > ï¼ +> +> > > ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, +> +> > > ï¼ address@hidden) at migration/colo.c:244 +> +> > > ï¼ +> +> > > ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized +> +> > > ï¼ outï¼, address@hidden, +> +> > > ï¼ address@hidden) +> +> > > ï¼ +> +> > > ï¼ at migration/colo.c:264 +> +> > > ï¼ +> +> > > ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread +> +> > > ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 +> +> > > ï¼ +> +> > > ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 +> +> > > ï¼ +> +> > > ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 +> +> > > ï¼ +> +> > > ï¼ (gdb) p ioc-ï¼name +> +> > > ï¼ +> +> > > ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" +> +> > > ï¼ +> +> > > ï¼ (gdb) p ioc-ï¼features Do not support +> +> > > QIO_CHANNEL_FEATURE_SHUTDOWN +> +> > > ï¼ +> +> > > ï¼ $3 = 0 +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ (gdb) bt +> +> > > ï¼ +> +> > > ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, +> +> > > ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 +> +> > > ï¼ +> +> > > ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at +> +> > > ï¼ gmain.c:3054 +> +> > > ï¼ +> +> > > ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, +> +> > > ï¼ address@hidden) at gmain.c:3630 +> +> > > ï¼ +> +> > > ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 +> +> > > ï¼ +> +> > > ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at +> +> > > ï¼ util/main-loop.c:258 +> +> > > ï¼ +> +> > > ï¼ #5 main_loop_wait (address@hidden) at +> +> > > ï¼ util/main-loop.c:506 +> +> > > ï¼ +> +> > > ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 +> +> > > ï¼ +> +> > > ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized +> +> > > ï¼ outï¼) at vl.c:4709 +> +> > > ï¼ +> +> > > ï¼ (gdb) p ioc-ï¼features +> +> > > ï¼ +> +> > > ï¼ $1 = 6 +> +> > > ï¼ +> +> > > ï¼ (gdb) p ioc-ï¼name +> +> > > ï¼ +> +> > > ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ May be socket_accept_incoming_migration should +> +> > > ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ thank you. +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ åå§é®ä»¶ +> +> > > ï¼ address@hidden +> +> > > ï¼ address@hidden +> +> > > ï¼ address@hidden@huawei.comï¼ +> +> > > ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 +> +> > > ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ On 03/15/2017 05:06 PM, wangguang wrote: +> +> > > ï¼ ï¼ am testing QEMU COLO feature described here [QEMU +> +> > > ï¼ ï¼ Wiki]( +http://wiki.qemu-project.org/Features/COLO +). +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. +> +> > > ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. +> +> > > ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": +> +> > > ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's +> +> > > ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ I found that the colo in qemu is not complete yet. +> +> > > ï¼ ï¼ Do the colo have any plan for development? +> +> > > ï¼ +> +> > > ï¼ Yes, We are developing. You can see some of patch we pushing. +> +> > > ï¼ +> +> > > ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! +> +> > > ï¼ +> +> > > ï¼ In our internal version can run it successfully, +> +> > > ï¼ The failover detail you can ask Zhanghailiang for help. +> +> > > ï¼ Next time if you have some question about COLO, +> +> > > ï¼ please cc me and zhanghailiang address@hidden +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ Thanks +> +> > > ï¼ Zhang Chen +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ centos7.2+qemu2.7.50 +> +> > > ï¼ ï¼ (gdb) bt +> +> > > ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 +> +> > > ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized +> +> > > outï¼, +> +> > > ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, +> +> > > errp=0x0) at +> +> > > ï¼ ï¼ io/channel-socket.c:497 +> +> > > ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, +> +> > > ï¼ ï¼ address@hidden "", address@hidden, +> +> > > ï¼ ï¼ address@hidden) at io/channel.c:97 +> +> > > ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized +> +> > > outï¼, +> +> > > ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at +> +> > > ï¼ ï¼ migration/qemu-file-channel.c:78 +> +> > > ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at +> +> > > ï¼ ï¼ migration/qemu-file.c:257 +> +> > > ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, +> +> > > ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 +> +> > > ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at +> +> > > ï¼ ï¼ migration/qemu-file.c:523 +> +> > > ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at +> +> > > ï¼ ï¼ migration/qemu-file.c:603 +> +> > > ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, +> +> > > ï¼ ï¼ address@hidden) at migration/colo.c:215 +> +> > > ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message +> +> > > (errp=0x7f3d62bfaa48, +> +> > > ï¼ ï¼ checkpoint_request=ï¼synthetic pointerï¼, f=ï¼optimized outï¼) at +> +> > > ï¼ ï¼ migration/colo.c:546 +> +> > > ï¼ ï¼ #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at +> +> > > ï¼ ï¼ migration/colo.c:649 +> +> > > ï¼ ï¼ #11 0x00007f3e00cc1df3 in start_thread () from +> +> > > /lib64/libpthread.so.0 +> +> > > ï¼ ï¼ #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6 +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ -- +> +> > > ï¼ ï¼ View this message in context: +> +> > > +http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html +> +> > > ï¼ ï¼ Sent from the Developer mailing list archive at Nabble.com. +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ ï¼ +> +> > > ï¼ +> +> > > ï¼ -- +> +> > > ï¼ Thanks +> +> > > ï¼ Zhang Chen +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ +> +> > > ï¼ +> +> > > +> +> > +> +> -- +> +> Dr. David Alan Gilbert / address@hidden / Manchester, UK +> +> +> +> . +> +> +> +-- +Dr. David Alan Gilbert / address@hidden / Manchester, UK + diff --git a/results/classifier/016/none/70294255 b/results/classifier/016/none/70294255 new file mode 100644 index 00000000..286c1cc4 --- /dev/null +++ b/results/classifier/016/none/70294255 @@ -0,0 +1,1088 @@ +socket: 0.753 +debug: 0.454 +network: 0.316 +operating system: 0.278 +files: 0.213 +hypervisor: 0.060 +virtual: 0.046 +kernel: 0.045 +performance: 0.044 +i386: 0.043 +alpha: 0.037 +TCG: 0.036 +permissions: 0.032 +device: 0.025 +x86: 0.024 +PID: 0.021 +boot: 0.014 +KVM: 0.013 +semantic: 0.011 +risc-v: 0.010 +VMM: 0.010 +register: 0.009 +assembly: 0.009 +architecture: 0.008 +arm: 0.007 +vnc: 0.006 +ppc: 0.006 +peripherals: 0.005 +user-level: 0.004 +graphic: 0.003 +mistranslation: 0.002 + +[Qemu-devel] 答复: Re: 答复: Re: 答复: Re: 答复: Re: [BUG]COLO failover hang + +hi: + +yes.it is better. + +And should we delete + + + + +#ifdef WIN32 + + QIO_CHANNEL(cioc)-ï¼event = CreateEvent(NULL, FALSE, FALSE, NULL) + +#endif + + + + +in qio_channel_socket_acceptï¼ + +qio_channel_socket_new already have it. + + + + + + + + + + + + +åå§é®ä»¶ + + + +åä»¶äººï¼ address@hidden +æ¶ä»¶äººï¼ç广10165992 +æéäººï¼ address@hidden address@hidden address@hidden address@hidden +æ¥ æ ï¼2017å¹´03æ22æ¥ 15:03 +主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: çå¤: Re: çå¤: Re: [BUG]COLO failover hang + + + + + +Hi, + +On 2017/3/22 9:42, address@hidden wrote: +ï¼ diff --git a/migration/socket.c b/migration/socket.c +ï¼ +ï¼ +ï¼ index 13966f1..d65a0ea 100644 +ï¼ +ï¼ +ï¼ --- a/migration/socket.c +ï¼ +ï¼ +ï¼ +++ b/migration/socket.c +ï¼ +ï¼ +ï¼ @@ -147,8 +147,9 @@ static gboolean +socket_accept_incoming_migration(QIOChannel *ioc, +ï¼ +ï¼ +ï¼ } +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ trace_migration_socket_incoming_accepted() +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming") +ï¼ +ï¼ +ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) +ï¼ +ï¼ +ï¼ migration_channel_process_incoming(migrate_get_current(), +ï¼ +ï¼ +ï¼ QIO_CHANNEL(sioc)) +ï¼ +ï¼ +ï¼ object_unref(OBJECT(sioc)) +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ Is this patch ok? +ï¼ + +Yes, i think this works, but a better way maybe to call +qio_channel_set_feature() +in qio_channel_socket_accept(), we didn't set the SHUTDOWN feature for the +socket accept fd, +Or fix it by this: + +diff --git a/io/channel-socket.c b/io/channel-socket.c +index f546c68..ce6894c 100644 +--- a/io/channel-socket.c ++++ b/io/channel-socket.c +@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc, + Error **errp) + { + QIOChannelSocket *cioc +- +- cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)) +- cioc-ï¼fd = -1 ++ ++ cioc = qio_channel_socket_new() + cioc-ï¼remoteAddrLen = sizeof(ioc-ï¼remoteAddr) + cioc-ï¼localAddrLen = sizeof(ioc-ï¼localAddr) + + +Thanks, +Hailiang + +ï¼ I have test it . The test could not hang any more. +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ åå§é®ä»¶ +ï¼ +ï¼ +ï¼ +ï¼ åä»¶äººï¼ address@hidden +ï¼ æ¶ä»¶äººï¼ address@hidden address@hidden +ï¼ æéäººï¼ address@hidden address@hidden address@hidden +ï¼ æ¥ æ ï¼2017å¹´03æ22æ¥ 09:11 +ï¼ ä¸» é¢ ï¼Re: [Qemu-devel] çå¤: Re: çå¤: Re: [BUG]COLO failover hang +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ On 2017/3/21 19:56, Dr. David Alan Gilbert wrote: +ï¼ ï¼ * Hailiang Zhang (address@hidden) wrote: +ï¼ ï¼ï¼ Hi, +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ Thanks for reporting this, and i confirmed it in my test, and it is a bug. +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ Though we tried to call qemu_file_shutdown() to shutdown the related fd, in +ï¼ ï¼ï¼ case COLO thread/incoming thread is stuck in read/write() while do +failover, +ï¼ ï¼ï¼ but it didn't take effect, because all the fd used by COLO (also migration) +ï¼ ï¼ï¼ has been wrapped by qio channel, and it will not call the shutdown API if +ï¼ ï¼ï¼ we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN). +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ Cc: Dr. David Alan Gilbert address@hidden +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ I doubted migration cancel has the same problem, it may be stuck in write() +ï¼ ï¼ï¼ if we tried to cancel migration. +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ void fd_start_outgoing_migration(MigrationState *s, const char *fdname, +Error **errp) +ï¼ ï¼ï¼ { +ï¼ ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing") +ï¼ ï¼ï¼ migration_channel_connect(s, ioc, NULL) +ï¼ ï¼ï¼ ... ... +ï¼ ï¼ï¼ We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN) above, +ï¼ ï¼ï¼ and the +ï¼ ï¼ï¼ migrate_fd_cancel() +ï¼ ï¼ï¼ { +ï¼ ï¼ï¼ ... ... +ï¼ ï¼ï¼ if (s-ï¼state == MIGRATION_STATUS_CANCELLING && f) { +ï¼ ï¼ï¼ qemu_file_shutdown(f) --ï¼ This will not take effect. No ? +ï¼ ï¼ï¼ } +ï¼ ï¼ï¼ } +ï¼ ï¼ +ï¼ ï¼ (cc'd in Daniel Berrange). +ï¼ ï¼ I see that we call qio_channel_set_feature(ioc, +QIO_CHANNEL_FEATURE_SHUTDOWN) at the +ï¼ ï¼ top of qio_channel_socket_new so I think that's safe isn't it? +ï¼ ï¼ +ï¼ +ï¼ Hmm, you are right, this problem is only exist for the migration incoming fd, +thanks. +ï¼ +ï¼ ï¼ Dave +ï¼ ï¼ +ï¼ ï¼ï¼ Thanks, +ï¼ ï¼ï¼ Hailiang +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ On 2017/3/21 16:10, address@hidden wrote: +ï¼ ï¼ï¼ï¼ Thank youã +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ I have test areadyã +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ When the Primary Node panic,the Secondary Node qemu hang at the same +placeã +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ Incorrding +http://wiki.qemu-project.org/Features/COLO +ï¼kill Primary Node +qemu will not produce the problem,but Primary Node panic canã +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ I think due to the feature of channel does not support +QIO_CHANNEL_FEATURE_SHUTDOWN. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ when failover,channel_shutdown could not shut down the channel. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ so the colo_process_incoming_thread will hang at recvmsg. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ I test a patch: +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ diff --git a/migration/socket.c b/migration/socket.c +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ index 13966f1..d65a0ea 100644 +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ --- a/migration/socket.c +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +++ b/migration/socket.c +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ @@ -147,8 +147,9 @@ static gboolean +socket_accept_incoming_migration(QIOChannel *ioc, +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ } +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ trace_migration_socket_incoming_accepted() +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), +"migration-socket-incoming") +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN) +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ migration_channel_process_incoming(migrate_get_current(), +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ QIO_CHANNEL(sioc)) +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ object_unref(OBJECT(sioc)) +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ My test will not hang any more. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ åå§é®ä»¶ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ åä»¶äººï¼ address@hidden +ï¼ ï¼ï¼ï¼ æ¶ä»¶äººï¼ç广10165992 address@hidden +ï¼ ï¼ï¼ï¼ æéäººï¼ address@hidden address@hidden +ï¼ ï¼ï¼ï¼ æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 +ï¼ ï¼ï¼ï¼ 主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ Hi,Wang. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ You can test this branch: +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ and please follow wiki ensure your own configuration correctly. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +http://wiki.qemu-project.org/Features/COLO +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ Thanks +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ Zhang Chen +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ On 03/21/2017 03:27 PM, address@hidden wrote: +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ hi. +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ I test the git qemu master have the same problem. +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, +ï¼ ï¼ï¼ï¼ ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read +ï¼ ï¼ï¼ï¼ ï¼ (address@hidden, address@hidden "", +ï¼ ï¼ï¼ï¼ ï¼ address@hidden, address@hidden) at io/channel.c:114 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ ï¼ï¼ï¼ ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file-channel.c:78 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at +ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:295 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/qemu-file.c:555 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:568 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:648 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, +ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/colo.c:244 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized +ï¼ ï¼ï¼ï¼ ï¼ outï¼, address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ address@hidden) +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ at migration/colo.c:264 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread +ï¼ ï¼ï¼ï¼ ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ $3 = 0 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, +ï¼ ï¼ï¼ï¼ ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at +ï¼ ï¼ï¼ï¼ ï¼ gmain.c:3054 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, +ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at gmain.c:3630 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at +ï¼ ï¼ï¼ï¼ ï¼ util/main-loop.c:258 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #5 main_loop_wait (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ util/main-loop.c:506 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized +ï¼ ï¼ï¼ï¼ ï¼ outï¼) at vl.c:4709 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼features +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ $1 = 6 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ May be socket_accept_incoming_migration should +ï¼ ï¼ï¼ï¼ ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ thank you. +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ åå§é®ä»¶ +ï¼ ï¼ï¼ï¼ ï¼ address@hidden +ï¼ ï¼ï¼ï¼ ï¼ address@hidden +ï¼ ï¼ï¼ï¼ ï¼ address@hidden@huawei.comï¼ +ï¼ ï¼ï¼ï¼ ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 +ï¼ ï¼ï¼ï¼ ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ On 03/15/2017 05:06 PM, wangguang wrote: +ï¼ ï¼ï¼ï¼ ï¼ ï¼ am testing QEMU COLO feature described here [QEMU +ï¼ ï¼ï¼ï¼ ï¼ ï¼ Wiki]( +http://wiki.qemu-project.org/Features/COLO +). +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. +ï¼ ï¼ï¼ï¼ ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. +ï¼ ï¼ï¼ï¼ ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": +ï¼ ï¼ï¼ï¼ ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's +ï¼ ï¼ï¼ï¼ ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ I found that the colo in qemu is not complete yet. +ï¼ ï¼ï¼ï¼ ï¼ ï¼ Do the colo have any plan for development? +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ Yes, We are developing. You can see some of patch we pushing. +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ In our internal version can run it successfully, +ï¼ ï¼ï¼ï¼ ï¼ The failover detail you can ask Zhanghailiang for help. +ï¼ ï¼ï¼ï¼ ï¼ Next time if you have some question about COLO, +ï¼ ï¼ï¼ï¼ ï¼ please cc me and zhanghailiang address@hidden +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ Thanks +ï¼ ï¼ï¼ï¼ ï¼ Zhang Chen +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ centos7.2+qemu2.7.50 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ (gdb) bt +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized +outï¼, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, +errp=0x0) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ io/channel-socket.c:497 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ address@hidden "", address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ address@hidden) at io/channel.c:97 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/qemu-file-channel.c:78 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/qemu-file.c:257 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/qemu-file.c:523 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/qemu-file.c:603 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ address@hidden) at migration/colo.c:215 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message +(errp=0x7f3d62bfaa48, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ checkpoint_request=ï¼synthetic pointerï¼, f=ï¼optimized outï¼) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/colo.c:546 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/colo.c:649 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #12 0x00007f3dfc9c03ed in clone () from /lib64/libc..so.6 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ -- +ï¼ ï¼ï¼ï¼ ï¼ ï¼ View this message in context: +http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html +ï¼ ï¼ï¼ï¼ ï¼ ï¼ Sent from the Developer mailing list archive at Nabble.com. +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ -- +ï¼ ï¼ï¼ï¼ ï¼ Thanks +ï¼ ï¼ï¼ï¼ ï¼ Zhang Chen +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ +ï¼ ï¼ -- +ï¼ ï¼ Dr. David Alan Gilbert / address@hidden / Manchester, UK +ï¼ ï¼ +ï¼ ï¼ . +ï¼ ï¼ +ï¼ + +On 2017/3/22 16:09, address@hidden wrote: +hi: + +yes.it is better. + +And should we delete +Yes, you are right. +#ifdef WIN32 + + QIO_CHANNEL(cioc)-ï¼event = CreateEvent(NULL, FALSE, FALSE, NULL) + +#endif + + + + +in qio_channel_socket_acceptï¼ + +qio_channel_socket_new already have it. + + + + + + + + + + + + +åå§é®ä»¶ + + + +åä»¶äººï¼ address@hidden +æ¶ä»¶äººï¼ç广10165992 +æéäººï¼ address@hidden address@hidden address@hidden address@hidden +æ¥ æ ï¼2017å¹´03æ22æ¥ 15:03 +主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: çå¤: Re: çå¤: Re: [BUG]COLO failover hang + + + + + +Hi, + +On 2017/3/22 9:42, address@hidden wrote: +ï¼ diff --git a/migration/socket.c b/migration/socket.c +ï¼ +ï¼ +ï¼ index 13966f1..d65a0ea 100644 +ï¼ +ï¼ +ï¼ --- a/migration/socket.c +ï¼ +ï¼ +ï¼ +++ b/migration/socket.c +ï¼ +ï¼ +ï¼ @@ -147,8 +147,9 @@ static gboolean +socket_accept_incoming_migration(QIOChannel *ioc, +ï¼ +ï¼ +ï¼ } +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ trace_migration_socket_incoming_accepted() +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming") +ï¼ +ï¼ +ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN) +ï¼ +ï¼ +ï¼ migration_channel_process_incoming(migrate_get_current(), +ï¼ +ï¼ +ï¼ QIO_CHANNEL(sioc)) +ï¼ +ï¼ +ï¼ object_unref(OBJECT(sioc)) +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ Is this patch ok? +ï¼ + +Yes, i think this works, but a better way maybe to call +qio_channel_set_feature() +in qio_channel_socket_accept(), we didn't set the SHUTDOWN feature for the +socket accept fd, +Or fix it by this: + +diff --git a/io/channel-socket.c b/io/channel-socket.c +index f546c68..ce6894c 100644 +--- a/io/channel-socket.c ++++ b/io/channel-socket.c +@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc, + Error **errp) + { + QIOChannelSocket *cioc +- +- cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET)) +- cioc-ï¼fd = -1 ++ ++ cioc = qio_channel_socket_new() + cioc-ï¼remoteAddrLen = sizeof(ioc-ï¼remoteAddr) + cioc-ï¼localAddrLen = sizeof(ioc-ï¼localAddr) + + +Thanks, +Hailiang + +ï¼ I have test it . The test could not hang any more. +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ åå§é®ä»¶ +ï¼ +ï¼ +ï¼ +ï¼ åä»¶äººï¼ address@hidden +ï¼ æ¶ä»¶äººï¼ address@hidden address@hidden +ï¼ æéäººï¼ address@hidden address@hidden address@hidden +ï¼ æ¥ æ ï¼2017å¹´03æ22æ¥ 09:11 +ï¼ ä¸» é¢ ï¼Re: [Qemu-devel] çå¤: Re: çå¤: Re: [BUG]COLO failover hang +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ +ï¼ On 2017/3/21 19:56, Dr. David Alan Gilbert wrote: +ï¼ ï¼ * Hailiang Zhang (address@hidden) wrote: +ï¼ ï¼ï¼ Hi, +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ Thanks for reporting this, and i confirmed it in my test, and it is a bug. +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ Though we tried to call qemu_file_shutdown() to shutdown the related fd, in +ï¼ ï¼ï¼ case COLO thread/incoming thread is stuck in read/write() while do +failover, +ï¼ ï¼ï¼ but it didn't take effect, because all the fd used by COLO (also migration) +ï¼ ï¼ï¼ has been wrapped by qio channel, and it will not call the shutdown API if +ï¼ ï¼ï¼ we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN). +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ Cc: Dr. David Alan Gilbert address@hidden +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ I doubted migration cancel has the same problem, it may be stuck in write() +ï¼ ï¼ï¼ if we tried to cancel migration. +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ void fd_start_outgoing_migration(MigrationState *s, const char *fdname, +Error **errp) +ï¼ ï¼ï¼ { +ï¼ ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing") +ï¼ ï¼ï¼ migration_channel_connect(s, ioc, NULL) +ï¼ ï¼ï¼ ... ... +ï¼ ï¼ï¼ We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN) above, +ï¼ ï¼ï¼ and the +ï¼ ï¼ï¼ migrate_fd_cancel() +ï¼ ï¼ï¼ { +ï¼ ï¼ï¼ ... ... +ï¼ ï¼ï¼ if (s-ï¼state == MIGRATION_STATUS_CANCELLING && f) { +ï¼ ï¼ï¼ qemu_file_shutdown(f) --ï¼ This will not take effect. No ? +ï¼ ï¼ï¼ } +ï¼ ï¼ï¼ } +ï¼ ï¼ +ï¼ ï¼ (cc'd in Daniel Berrange). +ï¼ ï¼ I see that we call qio_channel_set_feature(ioc, +QIO_CHANNEL_FEATURE_SHUTDOWN) at the +ï¼ ï¼ top of qio_channel_socket_new so I think that's safe isn't it? +ï¼ ï¼ +ï¼ +ï¼ Hmm, you are right, this problem is only exist for the migration incoming fd, +thanks. +ï¼ +ï¼ ï¼ Dave +ï¼ ï¼ +ï¼ ï¼ï¼ Thanks, +ï¼ ï¼ï¼ Hailiang +ï¼ ï¼ï¼ +ï¼ ï¼ï¼ On 2017/3/21 16:10, address@hidden wrote: +ï¼ ï¼ï¼ï¼ Thank youã +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ I have test areadyã +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ When the Primary Node panic,the Secondary Node qemu hang at the same +placeã +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ Incorrding +http://wiki.qemu-project.org/Features/COLO +ï¼kill Primary Node +qemu will not produce the problem,but Primary Node panic canã +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ I think due to the feature of channel does not support +QIO_CHANNEL_FEATURE_SHUTDOWN. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ when failover,channel_shutdown could not shut down the channel. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ so the colo_process_incoming_thread will hang at recvmsg. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ I test a patch: +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ diff --git a/migration/socket.c b/migration/socket.c +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ index 13966f1..d65a0ea 100644 +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ --- a/migration/socket.c +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +++ b/migration/socket.c +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ @@ -147,8 +147,9 @@ static gboolean +socket_accept_incoming_migration(QIOChannel *ioc, +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ } +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ trace_migration_socket_incoming_accepted() +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), +"migration-socket-incoming") +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), +QIO_CHANNEL_FEATURE_SHUTDOWN) +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ migration_channel_process_incoming(migrate_get_current(), +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ QIO_CHANNEL(sioc)) +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ object_unref(OBJECT(sioc)) +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ My test will not hang any more. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ åå§é®ä»¶ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ åä»¶äººï¼ address@hidden +ï¼ ï¼ï¼ï¼ æ¶ä»¶äººï¼ç广10165992 address@hidden +ï¼ ï¼ï¼ï¼ æéäººï¼ address@hidden address@hidden +ï¼ ï¼ï¼ï¼ æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 +ï¼ ï¼ï¼ï¼ 主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ Hi,Wang. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ You can test this branch: +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ and please follow wiki ensure your own configuration correctly. +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +http://wiki.qemu-project.org/Features/COLO +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ Thanks +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ Zhang Chen +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ï¼ On 03/21/2017 03:27 PM, address@hidden wrote: +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ hi. +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ I test the git qemu master have the same problem. +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, +ï¼ ï¼ï¼ï¼ ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read +ï¼ ï¼ï¼ï¼ ï¼ (address@hidden, address@hidden "", +ï¼ ï¼ï¼ï¼ ï¼ address@hidden, address@hidden) at io/channel.c:114 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ ï¼ï¼ï¼ ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file-channel.c:78 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at +ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:295 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/qemu-file.c:555 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:568 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:648 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, +ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/colo.c:244 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized +ï¼ ï¼ï¼ï¼ ï¼ outï¼, address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ address@hidden) +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ at migration/colo.c:264 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread +ï¼ ï¼ï¼ï¼ ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ $3 = 0 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, +ï¼ ï¼ï¼ï¼ ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) at +ï¼ ï¼ï¼ï¼ ï¼ gmain.c:3054 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, +ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at gmain.c:3630 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at +ï¼ ï¼ï¼ï¼ ï¼ util/main-loop.c:258 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #5 main_loop_wait (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ util/main-loop.c:506 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized +ï¼ ï¼ï¼ï¼ ï¼ outï¼) at vl.c:4709 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼features +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ $1 = 6 +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ May be socket_accept_incoming_migration should +ï¼ ï¼ï¼ï¼ ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ thank you. +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ åå§é®ä»¶ +ï¼ ï¼ï¼ï¼ ï¼ address@hidden +ï¼ ï¼ï¼ï¼ ï¼ address@hidden +ï¼ ï¼ï¼ï¼ ï¼ address@hidden@huawei.comï¼ +ï¼ ï¼ï¼ï¼ ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 +ï¼ ï¼ï¼ï¼ ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ On 03/15/2017 05:06 PM, wangguang wrote: +ï¼ ï¼ï¼ï¼ ï¼ ï¼ am testing QEMU COLO feature described here [QEMU +ï¼ ï¼ï¼ï¼ ï¼ ï¼ Wiki]( +http://wiki.qemu-project.org/Features/COLO +). +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. +ï¼ ï¼ï¼ï¼ ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. +ï¼ ï¼ï¼ï¼ ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": +ï¼ ï¼ï¼ï¼ ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's +ï¼ ï¼ï¼ï¼ ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ I found that the colo in qemu is not complete yet. +ï¼ ï¼ï¼ï¼ ï¼ ï¼ Do the colo have any plan for development? +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ Yes, We are developing. You can see some of patch we pushing. +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ In our internal version can run it successfully, +ï¼ ï¼ï¼ï¼ ï¼ The failover detail you can ask Zhanghailiang for help. +ï¼ ï¼ï¼ï¼ ï¼ Next time if you have some question about COLO, +ï¼ ï¼ï¼ï¼ ï¼ please cc me and zhanghailiang address@hidden +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ Thanks +ï¼ ï¼ï¼ï¼ ï¼ Zhang Chen +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ centos7.2+qemu2.7.50 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ (gdb) bt +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #0 0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #1 0x00007f3e0332b738 in qio_channel_socket_readv (ioc=ï¼optimized +outï¼, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ iov=ï¼optimized outï¼, niov=ï¼optimized outï¼, fds=0x0, nfds=0x0, +errp=0x0) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ io/channel-socket.c:497 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #2 0x00007f3e03329472 in qio_channel_read (address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ address@hidden "", address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ address@hidden) at io/channel.c:97 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #3 0x00007f3e032750e0 in channel_get_buffer (opaque=ï¼optimized outï¼, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ buf=0x7f3e05910f38 "", pos=ï¼optimized outï¼, size=32768) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/qemu-file-channel.c:78 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #4 0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/qemu-file.c:257 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #5 0x00007f3e03274a41 in qemu_peek_byte (address@hidden, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ address@hidden) at migration/qemu-file.c:510 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #6 0x00007f3e03274aab in qemu_get_byte (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/qemu-file.c:523 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #7 0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/qemu-file.c:603 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #8 0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ address@hidden) at migration/colo.c:215 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #9 0x00007f3e0327250d in colo_wait_handle_message +(errp=0x7f3d62bfaa48, +ï¼ ï¼ï¼ï¼ ï¼ ï¼ checkpoint_request=ï¼synthetic pointerï¼, f=ï¼optimized outï¼) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/colo.c:546 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at +ï¼ ï¼ï¼ï¼ ï¼ ï¼ migration/colo.c:649 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ #12 0x00007f3dfc9c03ed in clone () from /lib64/libc..so.6 +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ -- +ï¼ ï¼ï¼ï¼ ï¼ ï¼ View this message in context: +http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html +ï¼ ï¼ï¼ï¼ ï¼ ï¼ Sent from the Developer mailing list archive at Nabble.com. +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ -- +ï¼ ï¼ï¼ï¼ ï¼ Thanks +ï¼ ï¼ï¼ï¼ ï¼ Zhang Chen +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ ï¼ +ï¼ ï¼ï¼ï¼ +ï¼ ï¼ï¼ +ï¼ ï¼ -- +ï¼ ï¼ Dr. David Alan Gilbert / address@hidden / Manchester, UK +ï¼ ï¼ +ï¼ ï¼ . +ï¼ ï¼ +ï¼ + diff --git a/results/classifier/016/none/70868267 b/results/classifier/016/none/70868267 new file mode 100644 index 00000000..3f50c2ef --- /dev/null +++ b/results/classifier/016/none/70868267 @@ -0,0 +1,67 @@ +x86: 0.245 +operating system: 0.079 +files: 0.026 +hypervisor: 0.023 +TCG: 0.023 +debug: 0.020 +network: 0.019 +PID: 0.018 +i386: 0.011 +virtual: 0.008 +register: 0.006 +user-level: 0.004 +ppc: 0.003 +semantic: 0.003 +device: 0.002 +socket: 0.002 +assembly: 0.002 +VMM: 0.002 +kernel: 0.002 +performance: 0.001 +arm: 0.001 +alpha: 0.001 +graphic: 0.001 +vnc: 0.001 +peripherals: 0.001 +architecture: 0.001 +boot: 0.001 +risc-v: 0.001 +permissions: 0.000 +KVM: 0.000 +mistranslation: 0.000 + +[Qemu-devel] [BUG] Failed to compile using gcc7.1 + +Hi all, + +After upgrading gcc from 6.3.1 to 7.1.1, qemu can't be compiled with gcc. + +The error is: + +------ + CC block/blkdebug.o +block/blkdebug.c: In function 'blkdebug_refresh_filename': +block/blkdebug.c:693:31: error: '%s' directive output may be truncated +writing up to 4095 bytes into a region of size 4086 +[-Werror=format-truncation=] +"blkdebug:%s:%s", s->config_file ?: "", + ^~ +In file included from /usr/include/stdio.h:939:0, + from /home/adam/qemu/include/qemu/osdep.h:68, + from block/blkdebug.c:25: +/usr/include/bits/stdio2.h:64:10: note: '__builtin___snprintf_chk' +output 11 or more bytes (assuming 4106) into a destination of size 4096 +return __builtin___snprintf_chk (__s, __n, __USE_FORTIFY_LEVEL - 1, + ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + __bos (__s), __fmt, __va_arg_pack ()); + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +cc1: all warnings being treated as errors +make: *** [/home/adam/qemu/rules.mak:69: block/blkdebug.o] Error 1 +------ + +It seems that gcc 7 is introducing more restrict check for printf. +If using clang, although there are some extra warning, it can at least +pass the compile. +Thanks, +Qu + diff --git a/results/classifier/016/none/80604314 b/results/classifier/016/none/80604314 new file mode 100644 index 00000000..8112f757 --- /dev/null +++ b/results/classifier/016/none/80604314 @@ -0,0 +1,1507 @@ +hypervisor: 0.669 +network: 0.654 +debug: 0.554 +operating system: 0.404 +virtual: 0.190 +files: 0.103 +TCG: 0.102 +PID: 0.097 +boot: 0.095 +device: 0.090 +user-level: 0.089 +VMM: 0.084 +vnc: 0.081 +register: 0.062 +socket: 0.055 +kernel: 0.042 +KVM: 0.021 +risc-v: 0.019 +performance: 0.014 +assembly: 0.011 +semantic: 0.008 +architecture: 0.007 +alpha: 0.004 +ppc: 0.003 +permissions: 0.003 +graphic: 0.003 +peripherals: 0.002 +mistranslation: 0.002 +x86: 0.001 +arm: 0.001 +i386: 0.000 + +[BUG] vhost-vdpa: qemu-system-s390x crashes with second virtio-net-ccw device + +When I start qemu with a second virtio-net-ccw device (i.e. adding +-device virtio-net-ccw in addition to the autogenerated device), I get +a segfault. gdb points to + +#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, + config=0x55d6ad9e3f80 "RT") at /home/cohuck/git/qemu/hw/net/virtio-net.c:146 +146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { + +(backtrace doesn't go further) + +Starting qemu with no additional "-device virtio-net-ccw" (i.e., only +the autogenerated virtio-net-ccw device is present) works. Specifying +several "-device virtio-net-pci" works as well. + +Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net +client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") +works (in-between state does not compile). + +This is reproducible with tcg as well. Same problem both with +--enable-vhost-vdpa and --disable-vhost-vdpa. + +Have not yet tried to figure out what might be special with +virtio-ccw... anyone have an idea? + +[This should probably be considered a blocker?] + +On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +> +When I start qemu with a second virtio-net-ccw device (i.e. adding +> +-device virtio-net-ccw in addition to the autogenerated device), I get +> +a segfault. gdb points to +> +> +#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, +> +config=0x55d6ad9e3f80 "RT") at +> +/home/cohuck/git/qemu/hw/net/virtio-net.c:146 +> +146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { +> +> +(backtrace doesn't go further) +> +> +Starting qemu with no additional "-device virtio-net-ccw" (i.e., only +> +the autogenerated virtio-net-ccw device is present) works. Specifying +> +several "-device virtio-net-pci" works as well. +> +> +Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net +> +client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") +> +works (in-between state does not compile). +Ouch. I didn't test all in-between states :( +But I wish we had a 0-day instrastructure like kernel has, +that catches things like that. + +> +This is reproducible with tcg as well. Same problem both with +> +--enable-vhost-vdpa and --disable-vhost-vdpa. +> +> +Have not yet tried to figure out what might be special with +> +virtio-ccw... anyone have an idea? +> +> +[This should probably be considered a blocker?] + +On Fri, 24 Jul 2020 09:30:58 -0400 +"Michael S. Tsirkin" <mst@redhat.com> wrote: + +> +On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +> +> When I start qemu with a second virtio-net-ccw device (i.e. adding +> +> -device virtio-net-ccw in addition to the autogenerated device), I get +> +> a segfault. gdb points to +> +> +> +> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, +> +> config=0x55d6ad9e3f80 "RT") at +> +> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 +> +> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { +> +> +> +> (backtrace doesn't go further) +The core was incomplete, but running under gdb directly shows that it +is just a bog-standard config space access (first for that device). + +The cause of the crash is that nc->peer is not set... no idea how that +can happen, not that familiar with that part of QEMU. (Should the code +check, or is that really something that should not happen?) + +What I don't understand is why it is set correctly for the first, +autogenerated virtio-net-ccw device, but not for the second one, and +why virtio-net-pci doesn't show these problems. The only difference +between -ccw and -pci that comes to my mind here is that config space +accesses for ccw are done via an asynchronous operation, so timing +might be different. + +> +> +> +> Starting qemu with no additional "-device virtio-net-ccw" (i.e., only +> +> the autogenerated virtio-net-ccw device is present) works. Specifying +> +> several "-device virtio-net-pci" works as well. +> +> +> +> Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net +> +> client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") +> +> works (in-between state does not compile). +> +> +Ouch. I didn't test all in-between states :( +> +But I wish we had a 0-day instrastructure like kernel has, +> +that catches things like that. +Yep, that would be useful... so patchew only builds the complete series? + +> +> +> This is reproducible with tcg as well. Same problem both with +> +> --enable-vhost-vdpa and --disable-vhost-vdpa. +> +> +> +> Have not yet tried to figure out what might be special with +> +> virtio-ccw... anyone have an idea? +> +> +> +> [This should probably be considered a blocker?] +I think so, as it makes s390x unusable with more that one +virtio-net-ccw device, and I don't even see a workaround. + +On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +> +On Fri, 24 Jul 2020 09:30:58 -0400 +> +"Michael S. Tsirkin" <mst@redhat.com> wrote: +> +> +> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +> +> > When I start qemu with a second virtio-net-ccw device (i.e. adding +> +> > -device virtio-net-ccw in addition to the autogenerated device), I get +> +> > a segfault. gdb points to +> +> > +> +> > #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, +> +> > config=0x55d6ad9e3f80 "RT") at +> +> > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 +> +> > 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { +> +> > +> +> > (backtrace doesn't go further) +> +> +The core was incomplete, but running under gdb directly shows that it +> +is just a bog-standard config space access (first for that device). +> +> +The cause of the crash is that nc->peer is not set... no idea how that +> +can happen, not that familiar with that part of QEMU. (Should the code +> +check, or is that really something that should not happen?) +> +> +What I don't understand is why it is set correctly for the first, +> +autogenerated virtio-net-ccw device, but not for the second one, and +> +why virtio-net-pci doesn't show these problems. The only difference +> +between -ccw and -pci that comes to my mind here is that config space +> +accesses for ccw are done via an asynchronous operation, so timing +> +might be different. +Hopefully Jason has an idea. Could you post a full command line +please? Do you need a working guest to trigger this? Does this trigger +on an x86 host? + +> +> > +> +> > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only +> +> > the autogenerated virtio-net-ccw device is present) works. Specifying +> +> > several "-device virtio-net-pci" works as well. +> +> > +> +> > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net +> +> > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") +> +> > works (in-between state does not compile). +> +> +> +> Ouch. I didn't test all in-between states :( +> +> But I wish we had a 0-day instrastructure like kernel has, +> +> that catches things like that. +> +> +Yep, that would be useful... so patchew only builds the complete series? +> +> +> +> +> > This is reproducible with tcg as well. Same problem both with +> +> > --enable-vhost-vdpa and --disable-vhost-vdpa. +> +> > +> +> > Have not yet tried to figure out what might be special with +> +> > virtio-ccw... anyone have an idea? +> +> > +> +> > [This should probably be considered a blocker?] +> +> +I think so, as it makes s390x unusable with more that one +> +virtio-net-ccw device, and I don't even see a workaround. + +On Fri, 24 Jul 2020 11:17:57 -0400 +"Michael S. Tsirkin" <mst@redhat.com> wrote: + +> +On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +> +> On Fri, 24 Jul 2020 09:30:58 -0400 +> +> "Michael S. Tsirkin" <mst@redhat.com> wrote: +> +> +> +> > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +> +> > > When I start qemu with a second virtio-net-ccw device (i.e. adding +> +> > > -device virtio-net-ccw in addition to the autogenerated device), I get +> +> > > a segfault. gdb points to +> +> > > +> +> > > #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, +> +> > > config=0x55d6ad9e3f80 "RT") at +> +> > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 +> +> > > 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { +> +> > > +> +> > > (backtrace doesn't go further) +> +> +> +> The core was incomplete, but running under gdb directly shows that it +> +> is just a bog-standard config space access (first for that device). +> +> +> +> The cause of the crash is that nc->peer is not set... no idea how that +> +> can happen, not that familiar with that part of QEMU. (Should the code +> +> check, or is that really something that should not happen?) +> +> +> +> What I don't understand is why it is set correctly for the first, +> +> autogenerated virtio-net-ccw device, but not for the second one, and +> +> why virtio-net-pci doesn't show these problems. The only difference +> +> between -ccw and -pci that comes to my mind here is that config space +> +> accesses for ccw are done via an asynchronous operation, so timing +> +> might be different. +> +> +Hopefully Jason has an idea. Could you post a full command line +> +please? Do you need a working guest to trigger this? Does this trigger +> +on an x86 host? +Yes, it does trigger with tcg-on-x86 as well. I've been using + +s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on +-m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +-drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +-device +scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 + +-device virtio-net-ccw + +It seems it needs the guest actually doing something with the nics; I +cannot reproduce the crash if I use the old advent calendar moon buggy +image and just add a virtio-net-ccw device. + +(I don't think it's a problem with my local build, as I see the problem +both on my laptop and on an LPAR.) + +> +> +> > > +> +> > > Starting qemu with no additional "-device virtio-net-ccw" (i.e., only +> +> > > the autogenerated virtio-net-ccw device is present) works. Specifying +> +> > > several "-device virtio-net-pci" works as well. +> +> > > +> +> > > Things break with 1e0a84ea49b6 ("vhost-vdpa: introduce vhost-vdpa net +> +> > > client"), 38140cc4d971 ("vhost_net: introduce set_config & get_config") +> +> > > works (in-between state does not compile). +> +> > +> +> > Ouch. I didn't test all in-between states :( +> +> > But I wish we had a 0-day instrastructure like kernel has, +> +> > that catches things like that. +> +> +> +> Yep, that would be useful... so patchew only builds the complete series? +> +> +> +> > +> +> > > This is reproducible with tcg as well. Same problem both with +> +> > > --enable-vhost-vdpa and --disable-vhost-vdpa. +> +> > > +> +> > > Have not yet tried to figure out what might be special with +> +> > > virtio-ccw... anyone have an idea? +> +> > > +> +> > > [This should probably be considered a blocker?] +> +> +> +> I think so, as it makes s390x unusable with more that one +> +> virtio-net-ccw device, and I don't even see a workaround. +> + +On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +On Fri, 24 Jul 2020 11:17:57 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +On Fri, 24 Jul 2020 09:30:58 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +When I start qemu with a second virtio-net-ccw device (i.e. adding +-device virtio-net-ccw in addition to the autogenerated device), I get +a segfault. gdb points to + +#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, + config=0x55d6ad9e3f80 "RT") at +/home/cohuck/git/qemu/hw/net/virtio-net.c:146 +146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { + +(backtrace doesn't go further) +The core was incomplete, but running under gdb directly shows that it +is just a bog-standard config space access (first for that device). + +The cause of the crash is that nc->peer is not set... no idea how that +can happen, not that familiar with that part of QEMU. (Should the code +check, or is that really something that should not happen?) + +What I don't understand is why it is set correctly for the first, +autogenerated virtio-net-ccw device, but not for the second one, and +why virtio-net-pci doesn't show these problems. The only difference +between -ccw and -pci that comes to my mind here is that config space +accesses for ccw are done via an asynchronous operation, so timing +might be different. +Hopefully Jason has an idea. Could you post a full command line +please? Do you need a working guest to trigger this? Does this trigger +on an x86 host? +Yes, it does trigger with tcg-on-x86 as well. I've been using + +s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on +-m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +-drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +-device +scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +-device virtio-net-ccw + +It seems it needs the guest actually doing something with the nics; I +cannot reproduce the crash if I use the old advent calendar moon buggy +image and just add a virtio-net-ccw device. + +(I don't think it's a problem with my local build, as I see the problem +both on my laptop and on an LPAR.) +It looks to me we forget the check the existence of peer. + +Please try the attached patch to see if it works. + +Thanks +0001-virtio-net-check-the-existence-of-peer-before-accesi.patch +Description: +Text Data + +On Sat, 25 Jul 2020 08:40:07 +0800 +Jason Wang <jasowang@redhat.com> wrote: + +> +On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +> +> On Fri, 24 Jul 2020 11:17:57 -0400 +> +> "Michael S. Tsirkin"<mst@redhat.com> wrote: +> +> +> +>> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +> +>>> On Fri, 24 Jul 2020 09:30:58 -0400 +> +>>> "Michael S. Tsirkin"<mst@redhat.com> wrote: +> +>>> +> +>>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +> +>>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding +> +>>>>> -device virtio-net-ccw in addition to the autogenerated device), I get +> +>>>>> a segfault. gdb points to +> +>>>>> +> +>>>>> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, +> +>>>>> config=0x55d6ad9e3f80 "RT") at +> +>>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 +> +>>>>> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { +> +>>>>> +> +>>>>> (backtrace doesn't go further) +> +>>> The core was incomplete, but running under gdb directly shows that it +> +>>> is just a bog-standard config space access (first for that device). +> +>>> +> +>>> The cause of the crash is that nc->peer is not set... no idea how that +> +>>> can happen, not that familiar with that part of QEMU. (Should the code +> +>>> check, or is that really something that should not happen?) +> +>>> +> +>>> What I don't understand is why it is set correctly for the first, +> +>>> autogenerated virtio-net-ccw device, but not for the second one, and +> +>>> why virtio-net-pci doesn't show these problems. The only difference +> +>>> between -ccw and -pci that comes to my mind here is that config space +> +>>> accesses for ccw are done via an asynchronous operation, so timing +> +>>> might be different. +> +>> Hopefully Jason has an idea. Could you post a full command line +> +>> please? Do you need a working guest to trigger this? Does this trigger +> +>> on an x86 host? +> +> Yes, it does trigger with tcg-on-x86 as well. I've been using +> +> +> +> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu +> +> qemu,zpci=on +> +> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +> +> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +> +> -device +> +> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +> +> -device virtio-net-ccw +> +> +> +> It seems it needs the guest actually doing something with the nics; I +> +> cannot reproduce the crash if I use the old advent calendar moon buggy +> +> image and just add a virtio-net-ccw device. +> +> +> +> (I don't think it's a problem with my local build, as I see the problem +> +> both on my laptop and on an LPAR.) +> +> +> +It looks to me we forget the check the existence of peer. +> +> +Please try the attached patch to see if it works. +Thanks, that patch gets my guest up and running again. So, FWIW, + +Tested-by: Cornelia Huck <cohuck@redhat.com> + +Any idea why this did not hit with virtio-net-pci (or the autogenerated +virtio-net-ccw device)? + +On 2020/7/27 ä¸å2:43, Cornelia Huck wrote: +On Sat, 25 Jul 2020 08:40:07 +0800 +Jason Wang <jasowang@redhat.com> wrote: +On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +On Fri, 24 Jul 2020 11:17:57 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +On Fri, 24 Jul 2020 09:30:58 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +When I start qemu with a second virtio-net-ccw device (i.e. adding +-device virtio-net-ccw in addition to the autogenerated device), I get +a segfault. gdb points to + +#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, + config=0x55d6ad9e3f80 "RT") at +/home/cohuck/git/qemu/hw/net/virtio-net.c:146 +146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { + +(backtrace doesn't go further) +The core was incomplete, but running under gdb directly shows that it +is just a bog-standard config space access (first for that device). + +The cause of the crash is that nc->peer is not set... no idea how that +can happen, not that familiar with that part of QEMU. (Should the code +check, or is that really something that should not happen?) + +What I don't understand is why it is set correctly for the first, +autogenerated virtio-net-ccw device, but not for the second one, and +why virtio-net-pci doesn't show these problems. The only difference +between -ccw and -pci that comes to my mind here is that config space +accesses for ccw are done via an asynchronous operation, so timing +might be different. +Hopefully Jason has an idea. Could you post a full command line +please? Do you need a working guest to trigger this? Does this trigger +on an x86 host? +Yes, it does trigger with tcg-on-x86 as well. I've been using + +s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on +-m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +-drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +-device +scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +-device virtio-net-ccw + +It seems it needs the guest actually doing something with the nics; I +cannot reproduce the crash if I use the old advent calendar moon buggy +image and just add a virtio-net-ccw device. + +(I don't think it's a problem with my local build, as I see the problem +both on my laptop and on an LPAR.) +It looks to me we forget the check the existence of peer. + +Please try the attached patch to see if it works. +Thanks, that patch gets my guest up and running again. So, FWIW, + +Tested-by: Cornelia Huck <cohuck@redhat.com> + +Any idea why this did not hit with virtio-net-pci (or the autogenerated +virtio-net-ccw device)? +It can be hit with virtio-net-pci as well (just start without peer). +For autogenerated virtio-net-cww, I think the reason is that it has +already had a peer set. +Thanks + +On Mon, 27 Jul 2020 15:38:12 +0800 +Jason Wang <jasowang@redhat.com> wrote: + +> +On 2020/7/27 ä¸å2:43, Cornelia Huck wrote: +> +> On Sat, 25 Jul 2020 08:40:07 +0800 +> +> Jason Wang <jasowang@redhat.com> wrote: +> +> +> +>> On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +> +>>> On Fri, 24 Jul 2020 11:17:57 -0400 +> +>>> "Michael S. Tsirkin"<mst@redhat.com> wrote: +> +>>> +> +>>>> On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +> +>>>>> On Fri, 24 Jul 2020 09:30:58 -0400 +> +>>>>> "Michael S. Tsirkin"<mst@redhat.com> wrote: +> +>>>>> +> +>>>>>> On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +> +>>>>>>> When I start qemu with a second virtio-net-ccw device (i.e. adding +> +>>>>>>> -device virtio-net-ccw in addition to the autogenerated device), I get +> +>>>>>>> a segfault. gdb points to +> +>>>>>>> +> +>>>>>>> #0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, +> +>>>>>>> config=0x55d6ad9e3f80 "RT") at +> +>>>>>>> /home/cohuck/git/qemu/hw/net/virtio-net.c:146 +> +>>>>>>> 146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { +> +>>>>>>> +> +>>>>>>> (backtrace doesn't go further) +> +>>>>> The core was incomplete, but running under gdb directly shows that it +> +>>>>> is just a bog-standard config space access (first for that device). +> +>>>>> +> +>>>>> The cause of the crash is that nc->peer is not set... no idea how that +> +>>>>> can happen, not that familiar with that part of QEMU. (Should the code +> +>>>>> check, or is that really something that should not happen?) +> +>>>>> +> +>>>>> What I don't understand is why it is set correctly for the first, +> +>>>>> autogenerated virtio-net-ccw device, but not for the second one, and +> +>>>>> why virtio-net-pci doesn't show these problems. The only difference +> +>>>>> between -ccw and -pci that comes to my mind here is that config space +> +>>>>> accesses for ccw are done via an asynchronous operation, so timing +> +>>>>> might be different. +> +>>>> Hopefully Jason has an idea. Could you post a full command line +> +>>>> please? Do you need a working guest to trigger this? Does this trigger +> +>>>> on an x86 host? +> +>>> Yes, it does trigger with tcg-on-x86 as well. I've been using +> +>>> +> +>>> s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu +> +>>> qemu,zpci=on +> +>>> -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +> +>>> -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +> +>>> -device +> +>>> scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +> +>>> -device virtio-net-ccw +> +>>> +> +>>> It seems it needs the guest actually doing something with the nics; I +> +>>> cannot reproduce the crash if I use the old advent calendar moon buggy +> +>>> image and just add a virtio-net-ccw device. +> +>>> +> +>>> (I don't think it's a problem with my local build, as I see the problem +> +>>> both on my laptop and on an LPAR.) +> +>> +> +>> It looks to me we forget the check the existence of peer. +> +>> +> +>> Please try the attached patch to see if it works. +> +> Thanks, that patch gets my guest up and running again. So, FWIW, +> +> +> +> Tested-by: Cornelia Huck <cohuck@redhat.com> +> +> +> +> Any idea why this did not hit with virtio-net-pci (or the autogenerated +> +> virtio-net-ccw device)? +> +> +> +It can be hit with virtio-net-pci as well (just start without peer). +Hm, I had not been able to reproduce the crash with a 'naked' -device +virtio-net-pci. But checking seems to be the right idea anyway. + +> +> +For autogenerated virtio-net-cww, I think the reason is that it has +> +already had a peer set. +Ok, that might well be. + +On 2020/7/27 ä¸å4:41, Cornelia Huck wrote: +On Mon, 27 Jul 2020 15:38:12 +0800 +Jason Wang <jasowang@redhat.com> wrote: +On 2020/7/27 ä¸å2:43, Cornelia Huck wrote: +On Sat, 25 Jul 2020 08:40:07 +0800 +Jason Wang <jasowang@redhat.com> wrote: +On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +On Fri, 24 Jul 2020 11:17:57 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +On Fri, 24 Jul 2020 09:30:58 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +When I start qemu with a second virtio-net-ccw device (i.e. adding +-device virtio-net-ccw in addition to the autogenerated device), I get +a segfault. gdb points to + +#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, + config=0x55d6ad9e3f80 "RT") at +/home/cohuck/git/qemu/hw/net/virtio-net.c:146 +146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { + +(backtrace doesn't go further) +The core was incomplete, but running under gdb directly shows that it +is just a bog-standard config space access (first for that device). + +The cause of the crash is that nc->peer is not set... no idea how that +can happen, not that familiar with that part of QEMU. (Should the code +check, or is that really something that should not happen?) + +What I don't understand is why it is set correctly for the first, +autogenerated virtio-net-ccw device, but not for the second one, and +why virtio-net-pci doesn't show these problems. The only difference +between -ccw and -pci that comes to my mind here is that config space +accesses for ccw are done via an asynchronous operation, so timing +might be different. +Hopefully Jason has an idea. Could you post a full command line +please? Do you need a working guest to trigger this? Does this trigger +on an x86 host? +Yes, it does trigger with tcg-on-x86 as well. I've been using + +s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on +-m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +-drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +-device +scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +-device virtio-net-ccw + +It seems it needs the guest actually doing something with the nics; I +cannot reproduce the crash if I use the old advent calendar moon buggy +image and just add a virtio-net-ccw device. + +(I don't think it's a problem with my local build, as I see the problem +both on my laptop and on an LPAR.) +It looks to me we forget the check the existence of peer. + +Please try the attached patch to see if it works. +Thanks, that patch gets my guest up and running again. So, FWIW, + +Tested-by: Cornelia Huck <cohuck@redhat.com> + +Any idea why this did not hit with virtio-net-pci (or the autogenerated +virtio-net-ccw device)? +It can be hit with virtio-net-pci as well (just start without peer). +Hm, I had not been able to reproduce the crash with a 'naked' -device +virtio-net-pci. But checking seems to be the right idea anyway. +Sorry for being unclear, I meant for networking part, you just need +start without peer, and you need a real guest (any Linux) that is trying +to access the config space of virtio-net. +Thanks +For autogenerated virtio-net-cww, I think the reason is that it has +already had a peer set. +Ok, that might well be. + +On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: +> +> +On 2020/7/27 ä¸å4:41, Cornelia Huck wrote: +> +> On Mon, 27 Jul 2020 15:38:12 +0800 +> +> Jason Wang <jasowang@redhat.com> wrote: +> +> +> +> > On 2020/7/27 ä¸å2:43, Cornelia Huck wrote: +> +> > > On Sat, 25 Jul 2020 08:40:07 +0800 +> +> > > Jason Wang <jasowang@redhat.com> wrote: +> +> > > > On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +> +> > > > > On Fri, 24 Jul 2020 11:17:57 -0400 +> +> > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: +> +> > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +> +> > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400 +> +> > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: +> +> > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +> +> > > > > > > > > When I start qemu with a second virtio-net-ccw device (i.e. +> +> > > > > > > > > adding +> +> > > > > > > > > -device virtio-net-ccw in addition to the autogenerated +> +> > > > > > > > > device), I get +> +> > > > > > > > > a segfault. gdb points to +> +> > > > > > > > > +> +> > > > > > > > > #0 0x000055d6ab52681d in virtio_net_get_config +> +> > > > > > > > > (vdev=<optimized out>, +> +> > > > > > > > > config=0x55d6ad9e3f80 "RT") at +> +> > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 +> +> > > > > > > > > 146 if (nc->peer->info->type == +> +> > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) { +> +> > > > > > > > > +> +> > > > > > > > > (backtrace doesn't go further) +> +> > > > > > > The core was incomplete, but running under gdb directly shows +> +> > > > > > > that it +> +> > > > > > > is just a bog-standard config space access (first for that +> +> > > > > > > device). +> +> > > > > > > +> +> > > > > > > The cause of the crash is that nc->peer is not set... no idea +> +> > > > > > > how that +> +> > > > > > > can happen, not that familiar with that part of QEMU. (Should +> +> > > > > > > the code +> +> > > > > > > check, or is that really something that should not happen?) +> +> > > > > > > +> +> > > > > > > What I don't understand is why it is set correctly for the +> +> > > > > > > first, +> +> > > > > > > autogenerated virtio-net-ccw device, but not for the second +> +> > > > > > > one, and +> +> > > > > > > why virtio-net-pci doesn't show these problems. The only +> +> > > > > > > difference +> +> > > > > > > between -ccw and -pci that comes to my mind here is that config +> +> > > > > > > space +> +> > > > > > > accesses for ccw are done via an asynchronous operation, so +> +> > > > > > > timing +> +> > > > > > > might be different. +> +> > > > > > Hopefully Jason has an idea. Could you post a full command line +> +> > > > > > please? Do you need a working guest to trigger this? Does this +> +> > > > > > trigger +> +> > > > > > on an x86 host? +> +> > > > > Yes, it does trigger with tcg-on-x86 as well. I've been using +> +> > > > > +> +> > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu +> +> > > > > qemu,zpci=on +> +> > > > > -m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +> +> > > > > -drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +> +> > > > > -device +> +> > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +> +> > > > > -device virtio-net-ccw +> +> > > > > +> +> > > > > It seems it needs the guest actually doing something with the nics; +> +> > > > > I +> +> > > > > cannot reproduce the crash if I use the old advent calendar moon +> +> > > > > buggy +> +> > > > > image and just add a virtio-net-ccw device. +> +> > > > > +> +> > > > > (I don't think it's a problem with my local build, as I see the +> +> > > > > problem +> +> > > > > both on my laptop and on an LPAR.) +> +> > > > It looks to me we forget the check the existence of peer. +> +> > > > +> +> > > > Please try the attached patch to see if it works. +> +> > > Thanks, that patch gets my guest up and running again. So, FWIW, +> +> > > +> +> > > Tested-by: Cornelia Huck <cohuck@redhat.com> +> +> > > +> +> > > Any idea why this did not hit with virtio-net-pci (or the autogenerated +> +> > > virtio-net-ccw device)? +> +> > +> +> > It can be hit with virtio-net-pci as well (just start without peer). +> +> Hm, I had not been able to reproduce the crash with a 'naked' -device +> +> virtio-net-pci. But checking seems to be the right idea anyway. +> +> +> +Sorry for being unclear, I meant for networking part, you just need start +> +without peer, and you need a real guest (any Linux) that is trying to access +> +the config space of virtio-net. +> +> +Thanks +A pxe guest will do it, but that doesn't support ccw, right? + +I'm still unclear why this triggers with ccw but not pci - +any idea? + +> +> +> +> +> > For autogenerated virtio-net-cww, I think the reason is that it has +> +> > already had a peer set. +> +> Ok, that might well be. +> +> +> +> + +On 2020/7/27 ä¸å7:43, Michael S. Tsirkin wrote: +On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: +On 2020/7/27 ä¸å4:41, Cornelia Huck wrote: +On Mon, 27 Jul 2020 15:38:12 +0800 +Jason Wang<jasowang@redhat.com> wrote: +On 2020/7/27 ä¸å2:43, Cornelia Huck wrote: +On Sat, 25 Jul 2020 08:40:07 +0800 +Jason Wang<jasowang@redhat.com> wrote: +On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +On Fri, 24 Jul 2020 11:17:57 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +On Fri, 24 Jul 2020 09:30:58 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +When I start qemu with a second virtio-net-ccw device (i.e. adding +-device virtio-net-ccw in addition to the autogenerated device), I get +a segfault. gdb points to + +#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, + config=0x55d6ad9e3f80 "RT") at +/home/cohuck/git/qemu/hw/net/virtio-net.c:146 +146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { + +(backtrace doesn't go further) +The core was incomplete, but running under gdb directly shows that it +is just a bog-standard config space access (first for that device). + +The cause of the crash is that nc->peer is not set... no idea how that +can happen, not that familiar with that part of QEMU. (Should the code +check, or is that really something that should not happen?) + +What I don't understand is why it is set correctly for the first, +autogenerated virtio-net-ccw device, but not for the second one, and +why virtio-net-pci doesn't show these problems. The only difference +between -ccw and -pci that comes to my mind here is that config space +accesses for ccw are done via an asynchronous operation, so timing +might be different. +Hopefully Jason has an idea. Could you post a full command line +please? Do you need a working guest to trigger this? Does this trigger +on an x86 host? +Yes, it does trigger with tcg-on-x86 as well. I've been using + +s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on +-m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +-drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +-device +scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +-device virtio-net-ccw + +It seems it needs the guest actually doing something with the nics; I +cannot reproduce the crash if I use the old advent calendar moon buggy +image and just add a virtio-net-ccw device. + +(I don't think it's a problem with my local build, as I see the problem +both on my laptop and on an LPAR.) +It looks to me we forget the check the existence of peer. + +Please try the attached patch to see if it works. +Thanks, that patch gets my guest up and running again. So, FWIW, + +Tested-by: Cornelia Huck<cohuck@redhat.com> + +Any idea why this did not hit with virtio-net-pci (or the autogenerated +virtio-net-ccw device)? +It can be hit with virtio-net-pci as well (just start without peer). +Hm, I had not been able to reproduce the crash with a 'naked' -device +virtio-net-pci. But checking seems to be the right idea anyway. +Sorry for being unclear, I meant for networking part, you just need start +without peer, and you need a real guest (any Linux) that is trying to access +the config space of virtio-net. + +Thanks +A pxe guest will do it, but that doesn't support ccw, right? +Yes, it depends on the cli actually. +I'm still unclear why this triggers with ccw but not pci - +any idea? +I don't test pxe but I can reproduce this with pci (just start a linux +guest without a peer). +Thanks + +On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote: +> +> +On 2020/7/27 ä¸å7:43, Michael S. Tsirkin wrote: +> +> On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: +> +> > On 2020/7/27 ä¸å4:41, Cornelia Huck wrote: +> +> > > On Mon, 27 Jul 2020 15:38:12 +0800 +> +> > > Jason Wang<jasowang@redhat.com> wrote: +> +> > > +> +> > > > On 2020/7/27 ä¸å2:43, Cornelia Huck wrote: +> +> > > > > On Sat, 25 Jul 2020 08:40:07 +0800 +> +> > > > > Jason Wang<jasowang@redhat.com> wrote: +> +> > > > > > On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +> +> > > > > > > On Fri, 24 Jul 2020 11:17:57 -0400 +> +> > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: +> +> > > > > > > > On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +> +> > > > > > > > > On Fri, 24 Jul 2020 09:30:58 -0400 +> +> > > > > > > > > "Michael S. Tsirkin"<mst@redhat.com> wrote: +> +> > > > > > > > > > On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck +> +> > > > > > > > > > wrote: +> +> > > > > > > > > > > When I start qemu with a second virtio-net-ccw device +> +> > > > > > > > > > > (i.e. adding +> +> > > > > > > > > > > -device virtio-net-ccw in addition to the autogenerated +> +> > > > > > > > > > > device), I get +> +> > > > > > > > > > > a segfault. gdb points to +> +> > > > > > > > > > > +> +> > > > > > > > > > > #0 0x000055d6ab52681d in virtio_net_get_config +> +> > > > > > > > > > > (vdev=<optimized out>, +> +> > > > > > > > > > > config=0x55d6ad9e3f80 "RT") at +> +> > > > > > > > > > > /home/cohuck/git/qemu/hw/net/virtio-net.c:146 +> +> > > > > > > > > > > 146 if (nc->peer->info->type == +> +> > > > > > > > > > > NET_CLIENT_DRIVER_VHOST_VDPA) { +> +> > > > > > > > > > > +> +> > > > > > > > > > > (backtrace doesn't go further) +> +> > > > > > > > > The core was incomplete, but running under gdb directly +> +> > > > > > > > > shows that it +> +> > > > > > > > > is just a bog-standard config space access (first for that +> +> > > > > > > > > device). +> +> > > > > > > > > +> +> > > > > > > > > The cause of the crash is that nc->peer is not set... no +> +> > > > > > > > > idea how that +> +> > > > > > > > > can happen, not that familiar with that part of QEMU. +> +> > > > > > > > > (Should the code +> +> > > > > > > > > check, or is that really something that should not happen?) +> +> > > > > > > > > +> +> > > > > > > > > What I don't understand is why it is set correctly for the +> +> > > > > > > > > first, +> +> > > > > > > > > autogenerated virtio-net-ccw device, but not for the second +> +> > > > > > > > > one, and +> +> > > > > > > > > why virtio-net-pci doesn't show these problems. The only +> +> > > > > > > > > difference +> +> > > > > > > > > between -ccw and -pci that comes to my mind here is that +> +> > > > > > > > > config space +> +> > > > > > > > > accesses for ccw are done via an asynchronous operation, so +> +> > > > > > > > > timing +> +> > > > > > > > > might be different. +> +> > > > > > > > Hopefully Jason has an idea. Could you post a full command +> +> > > > > > > > line +> +> > > > > > > > please? Do you need a working guest to trigger this? Does +> +> > > > > > > > this trigger +> +> > > > > > > > on an x86 host? +> +> > > > > > > Yes, it does trigger with tcg-on-x86 as well. I've been using +> +> > > > > > > +> +> > > > > > > s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg +> +> > > > > > > -cpu qemu,zpci=on +> +> > > > > > > -m 1024 -nographic -device +> +> > > > > > > virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +> +> > > > > > > -drive +> +> > > > > > > file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +> +> > > > > > > -device +> +> > > > > > > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +> +> > > > > > > -device virtio-net-ccw +> +> > > > > > > +> +> > > > > > > It seems it needs the guest actually doing something with the +> +> > > > > > > nics; I +> +> > > > > > > cannot reproduce the crash if I use the old advent calendar +> +> > > > > > > moon buggy +> +> > > > > > > image and just add a virtio-net-ccw device. +> +> > > > > > > +> +> > > > > > > (I don't think it's a problem with my local build, as I see the +> +> > > > > > > problem +> +> > > > > > > both on my laptop and on an LPAR.) +> +> > > > > > It looks to me we forget the check the existence of peer. +> +> > > > > > +> +> > > > > > Please try the attached patch to see if it works. +> +> > > > > Thanks, that patch gets my guest up and running again. So, FWIW, +> +> > > > > +> +> > > > > Tested-by: Cornelia Huck<cohuck@redhat.com> +> +> > > > > +> +> > > > > Any idea why this did not hit with virtio-net-pci (or the +> +> > > > > autogenerated +> +> > > > > virtio-net-ccw device)? +> +> > > > It can be hit with virtio-net-pci as well (just start without peer). +> +> > > Hm, I had not been able to reproduce the crash with a 'naked' -device +> +> > > virtio-net-pci. But checking seems to be the right idea anyway. +> +> > Sorry for being unclear, I meant for networking part, you just need start +> +> > without peer, and you need a real guest (any Linux) that is trying to +> +> > access +> +> > the config space of virtio-net. +> +> > +> +> > Thanks +> +> A pxe guest will do it, but that doesn't support ccw, right? +> +> +> +Yes, it depends on the cli actually. +> +> +> +> +> +> I'm still unclear why this triggers with ccw but not pci - +> +> any idea? +> +> +> +I don't test pxe but I can reproduce this with pci (just start a linux guest +> +without a peer). +> +> +Thanks +> +Might be a good addition to a unit test. Not sure what would the +test do exactly: just make sure guest runs? Looks like a lot of work +for an empty test ... maybe we can poke at the guest config with +qtest commands at least. + +-- +MST + +On 2020/7/27 ä¸å9:16, Michael S. Tsirkin wrote: +On Mon, Jul 27, 2020 at 08:44:09PM +0800, Jason Wang wrote: +On 2020/7/27 ä¸å7:43, Michael S. Tsirkin wrote: +On Mon, Jul 27, 2020 at 04:51:23PM +0800, Jason Wang wrote: +On 2020/7/27 ä¸å4:41, Cornelia Huck wrote: +On Mon, 27 Jul 2020 15:38:12 +0800 +Jason Wang<jasowang@redhat.com> wrote: +On 2020/7/27 ä¸å2:43, Cornelia Huck wrote: +On Sat, 25 Jul 2020 08:40:07 +0800 +Jason Wang<jasowang@redhat.com> wrote: +On 2020/7/24 ä¸å11:34, Cornelia Huck wrote: +On Fri, 24 Jul 2020 11:17:57 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 04:56:27PM +0200, Cornelia Huck wrote: +On Fri, 24 Jul 2020 09:30:58 -0400 +"Michael S. Tsirkin"<mst@redhat.com> wrote: +On Fri, Jul 24, 2020 at 03:27:18PM +0200, Cornelia Huck wrote: +When I start qemu with a second virtio-net-ccw device (i.e. adding +-device virtio-net-ccw in addition to the autogenerated device), I get +a segfault. gdb points to + +#0 0x000055d6ab52681d in virtio_net_get_config (vdev=<optimized out>, + config=0x55d6ad9e3f80 "RT") at +/home/cohuck/git/qemu/hw/net/virtio-net.c:146 +146 if (nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_VDPA) { + +(backtrace doesn't go further) +The core was incomplete, but running under gdb directly shows that it +is just a bog-standard config space access (first for that device). + +The cause of the crash is that nc->peer is not set... no idea how that +can happen, not that familiar with that part of QEMU. (Should the code +check, or is that really something that should not happen?) + +What I don't understand is why it is set correctly for the first, +autogenerated virtio-net-ccw device, but not for the second one, and +why virtio-net-pci doesn't show these problems. The only difference +between -ccw and -pci that comes to my mind here is that config space +accesses for ccw are done via an asynchronous operation, so timing +might be different. +Hopefully Jason has an idea. Could you post a full command line +please? Do you need a working guest to trigger this? Does this trigger +on an x86 host? +Yes, it does trigger with tcg-on-x86 as well. I've been using + +s390x-softmmu/qemu-system-s390x -M s390-ccw-virtio,accel=tcg -cpu qemu,zpci=on +-m 1024 -nographic -device virtio-scsi-ccw,id=scsi0,devno=fe.0.0001 +-drive file=/path/to/image,format=qcow2,if=none,id=drive-scsi0-0-0-0 +-device +scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 +-device virtio-net-ccw + +It seems it needs the guest actually doing something with the nics; I +cannot reproduce the crash if I use the old advent calendar moon buggy +image and just add a virtio-net-ccw device. + +(I don't think it's a problem with my local build, as I see the problem +both on my laptop and on an LPAR.) +It looks to me we forget the check the existence of peer. + +Please try the attached patch to see if it works. +Thanks, that patch gets my guest up and running again. So, FWIW, + +Tested-by: Cornelia Huck<cohuck@redhat.com> + +Any idea why this did not hit with virtio-net-pci (or the autogenerated +virtio-net-ccw device)? +It can be hit with virtio-net-pci as well (just start without peer). +Hm, I had not been able to reproduce the crash with a 'naked' -device +virtio-net-pci. But checking seems to be the right idea anyway. +Sorry for being unclear, I meant for networking part, you just need start +without peer, and you need a real guest (any Linux) that is trying to access +the config space of virtio-net. + +Thanks +A pxe guest will do it, but that doesn't support ccw, right? +Yes, it depends on the cli actually. +I'm still unclear why this triggers with ccw but not pci - +any idea? +I don't test pxe but I can reproduce this with pci (just start a linux guest +without a peer). + +Thanks +Might be a good addition to a unit test. Not sure what would the +test do exactly: just make sure guest runs? Looks like a lot of work +for an empty test ... maybe we can poke at the guest config with +qtest commands at least. +That should work or we can simply extend the exist virtio-net qtest to +do that. +Thanks + |