[Qemu-devel] [BUG] VM abort after migration

Hi guys,

We found a qemu core in our testing environment: the assertion
'assert(bus->irq_count[i] == 0)' in pcibus_reset() was triggered, and
bus->irq_count[i] was '-1'.

Through analysis, we found that it happened after VM migration, and we think
it was caused by the following sequence:

*Migration Source*
1. Save the bus pci.0 state, including irq_count[x] ( =0 , old )
2. Save E1000:
   e1000_pre_save
    e1000_mit_timer
     set_interrupt_cause
      pci_set_irq --> update pci_dev->irq_state to 1 and
                  update bus->irq_count[x] to 1 ( new )
   The irq_state is then sent to the destination.

*Migration Dest*
1. Receive irq_count[x] of pci.0 as 0, but irq_state of e1000 as 1.
2. If the e1000 needs to change its irq line, it calls pci_irq_handler();
   the irq_state may change to 0, and bus->irq_count[x] becomes -1 in
   this situation.
3. Reboot the VM, and the assertion is triggered.

We also found that others have faced a similar problem:
[1] https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02525.html
[2] https://bugs.launchpad.net/qemu/+bug/1702621

Are there any patches to fix this problem?
Can we save the pcibus state after all the PCI devices are saved?

Thanks,
Longpeng(Mike)


* longpeng (address@hidden) wrote:
> Are there any patches to fix this problem?

I don't remember any.

> Can we save the pcibus state after all the PCI devices are saved?

Does this problem only happen with e1000? I think so.
If it's only e1000 I think we should fix it - I think once the VM is
stopped for doing the device migration it shouldn't be raising
interrupts.

Dave

--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


On 2019/7/8 5:47 PM, Dr. David Alan Gilbert wrote:
> Does this problem only happen with e1000? I think so.
> If it's only e1000 I think we should fix it - I think once the VM is
> stopped for doing the device migration it shouldn't be raising
> interrupts.

I wonder if we can simply fix this by not setting ICS in pre_save() and
instead scheduling the mit timer unconditionally in post_load().

Thanks


On 2019/7/10 11:25, Jason Wang wrote:
> I wonder if we can simply fix this by not setting ICS in pre_save() and
> instead scheduling the mit timer unconditionally in post_load().

I also think this is an e1000 bug, because we have found more cores with
the same frame these days.

I'm not familiar with e1000, so I hope someone can fix it, thanks. :)

--
Regards,
Longpeng(Mike)


On 2019/7/10 11:36 AM, Longpeng (Mike) wrote:
> I'm not familiar with e1000, so I hope someone can fix it, thanks. :)

I've drafted a patch in the attachment, please test.

Thanks

0001-e1000-don-t-raise-interrupt-in-pre_save.patch
Description: Text Data


On 2019/7/10 11:57, Jason Wang wrote:
> I've drafted a patch in the attachment, please test.

Thanks. We'll test it for a few weeks and then give you feedback. :)

--
Regards,
Longpeng(Mike)


On 2019/7/10 11:57, Jason Wang wrote:
> I've drafted a patch in the attachment, please test.

Hi Jason,

We've tested the patch for about two weeks and everything went well, thanks!

Feel free to add my:
Reported-and-tested-by: Longpeng

--
Regards,
Longpeng(Mike)


On 2019/7/27 2:10 PM, Longpeng (Mike) wrote:
> We've tested the patch for about two weeks and everything went well, thanks!
>
> Feel free to add my:
> Reported-and-tested-by: Longpeng

Applied.

Thanks