| author | Christian Krinitsin <mail@krinitsin.com> | 2025-07-03 19:39:53 +0200 |
|---|---|---|
| committer | Christian Krinitsin <mail@krinitsin.com> | 2025-07-03 19:39:53 +0200 |
| commit | dee4dcba78baf712cab403d47d9db319ab7f95d6 | (patch) |
| tree | 418478faf06786701a56268672f73d6b0b4eb239 | /results/classifier/014/risc-v |
| parent | 4d9e26c0333abd39bdbd039dcdb30ed429c475ba | (diff) |
| download | emulator-bug-study-dee4dcba78baf712cab403d47d9db319ab7f95d6.tar.gz, emulator-bug-study-dee4dcba78baf712cab403d47d9db319ab7f95d6.zip | |
restructure results
Diffstat (limited to 'results/classifier/014/risc-v')
| -rw-r--r-- | results/classifier/014/risc-v/25892827 | 1104 |
| -rw-r--r-- | results/classifier/014/risc-v/70294255 | 1088 |
| -rw-r--r-- | results/classifier/014/risc-v/74545755 | 371 |
3 files changed, 0 insertions, 2563 deletions
diff --git a/results/classifier/014/risc-v/25892827 b/results/classifier/014/risc-v/25892827
deleted file mode 100644
index 6346b0c1..00000000
--- a/results/classifier/014/risc-v/25892827
+++ /dev/null
@@ -1,1104 +0,0 @@
-risc-v: 0.908
-user-level: 0.889
-permissions: 0.881
-register: 0.876
-KVM: 0.872
-hypervisor: 0.871
-operating system: 0.871
-debug: 0.868
-x86: 0.849
-vnc: 0.846
-mistranslation: 0.842
-boot: 0.839
-network: 0.839
-VMM: 0.839
-device: 0.839
-TCG: 0.837
-virtual: 0.835
-i386: 0.835
-peripherals: 0.833
-graphic: 0.832
-assembly: 0.829
-architecture: 0.825
-semantic: 0.825
-ppc: 0.824
-socket: 0.822
-arm: 0.821
-performance: 0.819
-alpha: 0.816
-kernel: 0.810
-files: 0.804
-PID: 0.792
-
-[Qemu-devel] [BUG/RFC] Two CPUs are not brought up normally in SLES11 SP3 VM after reboot
-
-Hi,
-
-Recently we encountered a problem in our project: 2 CPUs in a VM are not
-brought up normally after a reboot.
-
-Our host is using KVM kmod 3.6 and QEMU 2.1.
-A SLES 11 SP3 VM is configured with 8 vCPUs,
-and the CPU model is configured as 'host-passthrough'.
-
-After the VM's first start-up, everything seemed to be OK;
-then the VM panicked and rebooted.
-After the reboot, only 6 CPUs are brought up in the VM; cpu1 and cpu7 are not online.
-
-This is the only message we can get from the VM.
-The VM dmesg shows:
-[    0.069867] Booting Node 0, Processors #1
-[    5.060042] CPU1: Stuck ??
-[    5.060499] #2
-[    5.088322] kvm-clock: cpu 2, msr 6:3fc90901, secondary cpu clock
-[    5.088335] KVM setup async PF for cpu 2
-[    5.092967] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.094405] #3
-[    5.108324] kvm-clock: cpu 3, msr 6:3fcd0901, secondary cpu clock
-[    5.108333] KVM setup async PF for cpu 3
-[    5.113553] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.114970] #4
-[    5.128325] kvm-clock: cpu 4, msr 6:3fd10901, secondary cpu clock
-[    5.128336] KVM setup async PF for cpu 4
-[    5.134576] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.135998] #5
-[    5.152324] kvm-clock: cpu 5, msr 6:3fd50901, secondary cpu clock
-[    5.152334] KVM setup async PF for cpu 5
-[    5.154764] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.156467] #6
-[    5.172327] kvm-clock: cpu 6, msr 6:3fd90901, secondary cpu clock
-[    5.172341] KVM setup async PF for cpu 6
-[    5.180738] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.182173] #7 Ok.
-[   10.170815] CPU7: Stuck ??
-[   10.171648] Brought up 6 CPUs
-[   10.172394] Total of 6 processors activated (28799.97 BogoMIPS).
-
-From the host, we found that the QEMU vcpu1 thread and vcpu7 thread were not
-consuming any CPU (they should be in the idle state).
-All of the vCPUs' stacks on the host look like the one below:
-
-[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
-[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
-[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
-[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
-[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
-[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
-[<ffffffff81468092>] system_call_fastpath+0x16/0x1b
-[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
-[<ffffffffffffffff>] 0xffffffffffffffff
-
-We looked into the kernel code paths that could lead to the above 'Stuck'
-warning, and found that the only possibility is that the emulation of the
-'cpuid' instruction in kvm/qemu has something wrong.
-But since we can't reproduce this problem, we are not quite sure.
-Is it possible that the cpuid emulation in kvm/qemu has some bug?
-
-Has anyone come across this problem before? Or any idea?
-
-Thanks,
-zhanghailiang
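The stacks above are what an idle, halted vCPU looks like from the host side. As a reading aid, here is a heavily simplified, self-contained C model of what the kvm_vcpu_block frame means (illustrative only — this is not the kernel source; the struct and the schedule() stub are stand-ins):

    #include <stdbool.h>

    struct vcpu_model {
        /* set by interrupt delivery, e.g. the INIT/SIPI that boots an AP */
        bool has_pending_event;
    };

    static void schedule(void) { /* stand-in for the kernel scheduler */ }

    /* The halted vCPU thread parks here and consumes no host CPU time,
     * which matches the "not consuming any cpu" observation above.
     * It only returns once an event marks the vCPU runnable again. */
    static void kvm_vcpu_block_model(struct vcpu_model *vcpu)
    {
        while (!vcpu->has_pending_event) {
            schedule();
        }
    }

So a vCPU parked in this state is consistent both with normal idling and with an AP that halted itself on purpose, which is what the thread goes on to establish.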
-
-On 06/07/2015 09:54, zhanghailiang wrote:
-> From the host, we found that the QEMU vcpu1 thread and vcpu7 thread were not
-> consuming any CPU (they should be in the idle state).
-> All of the vCPUs' stacks on the host look like the one below:
->
-> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
-> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
-> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
-> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
-> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
-> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
-> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
-> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
-> [<ffffffffffffffff>] 0xffffffffffffffff
->
-> We looked into the kernel code paths that could lead to the above 'Stuck'
-> warning, and found that the only possibility is that the emulation of the
-> 'cpuid' instruction in kvm/qemu has something wrong.
-> But since we can't reproduce this problem, we are not quite sure.
-> Is it possible that the cpuid emulation in kvm/qemu has some bug?
-
-Can you explain the relationship to the cpuid emulation?  What do the
-traces say about vcpus 1 and 7?
-
-Paolo
-
-On 2015/7/6 16:45, Paolo Bonzini wrote:
-> Can you explain the relationship to the cpuid emulation?  What do the
-> traces say about vcpus 1 and 7?
-
-OK, we searched the VM's kernel code for the 'Stuck' message, and it is
-located in do_boot_cpu(). It runs in BSP context; the call chain is:
-the BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu()
--> wakeup_secondary_via_INIT() to trigger the APs.
-It waits 5s for the APs to start up; if some AP does not start up normally, it
-prints 'CPU%d Stuck' or 'CPU%d: Not responding'.
-
-If it prints 'Stuck', it means the AP has received the SIPI interrupt and
-begins to execute the code at 'ENTRY(trampoline_data)' (trampoline_64.S), but
-gets stuck somewhere before smp_callin() (smpboot.c).
-The following is the startup process of the BSP and the APs.
-
-BSP:
-start_kernel()
-  ->smp_init()
-    ->smp_boot_cpus()
-      ->do_boot_cpu()
-        ->start_ip = trampoline_address();   // set the address the AP will go to execute
-        ->wakeup_secondary_cpu_via_init();   // kick the secondary CPU
-        ->for (timeout = 0; timeout < 50000; timeout++)
-              if (cpumask_test_cpu(cpu, cpu_callin_mask)) break;   // check whether the AP started up
-
-APs:
-ENTRY(trampoline_data) (trampoline_64.S)
-  ->ENTRY(secondary_startup_64) (head_64.S)
-    ->start_secondary() (smpboot.c)
-      ->cpu_init();
-      ->smp_callin();
-        ->cpumask_set_cpu(cpuid, cpu_callin_mask);   // Note: if the AP gets here, the BSP will not print the error message.
-
-From the above call chain, we can be sure that the AP got stuck between
-trampoline_data and the cpumask_set_cpu() in smp_callin(). We looked through
-these code paths carefully, and only found a 'hlt' instruction that could
-block the process. It is located in trampoline_data():
-
-ENTRY(trampoline_data)
-    ...
-
-    call verify_cpu     # Verify the cpu supports long mode
-    testl %eax, %eax    # Check for return code
-    jnz no_longmode
-
-    ...
-
-no_longmode:
-    hlt
-    jmp no_longmode
-
-In verify_cpu(), the only sensitive instruction we can find that could cause a
-VM exit from non-root mode is 'cpuid'.
-This is why we suspect that the cpuid emulation in KVM/QEMU is wrong, leading
-to the failure in verify_cpu.
-
-From the message in the VM, we know something is wrong with vcpu1 and vcpu7.
-[    5.060042] CPU1: Stuck ??
-[   10.170815] CPU7: Stuck ??
-[   10.171648] Brought up 6 CPUs
-
-Besides, the following is the CPU info obtained from the host.
-80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command instance-0000000
-* CPU #0: pc=0x00007f64160c683d thread_id=68570
-  CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
-  CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
-  CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
-  CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
-  CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
-  CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
-  CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
-
-Oh, I also forgot to mention in the above message that we have bound each vCPU
-to a different physical CPU on the host.
-
-Thanks,
-zhanghailiang
-
-On 06/07/2015 11:59, zhanghailiang wrote:
-> Besides, the following is the CPU info obtained from the host.
-> [...]
-> Oh, I also forgot to mention in the above message that we have bound
-> each vCPU to a different physical CPU on the host.
-
-Can you capture a trace on the host (trace-cmd record -e kvm) and send
-it privately?  Please note which CPUs get stuck, since I guess it's not
-always 1 and 7.
-
-Paolo
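Since the suspicion falls on verify_cpu's long-mode test, it may help to see what that test boils down to. Below is a minimal userspace sketch of the same CPUID check (an illustration, not the kernel's 16/32-bit assembly; __get_cpuid is GCC/Clang's cpuid.h wrapper, and leaf 0x80000001 EDX bit 29 is the architectural long-mode flag):

    #include <stdio.h>
    #include <cpuid.h>

    /* Returns 1 if the CPU advertises long mode (64-bit support).
     * verify_cpu jumps to the no_longmode hlt loop when the equivalent
     * check fails, which would produce exactly the observed "Stuck" AP. */
    static int cpu_has_long_mode(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
            return 0;   /* extended leaf not supported */
        }
        return (edx >> 29) & 1;   /* EDX bit 29 = LM (long mode) */
    }

    int main(void)
    {
        printf("long mode: %s\n", cpu_has_long_mode() ? "yes" : "no");
        return 0;
    }

If the emulated CPUID ever misreported this bit to an AP during early boot, the AP would halt exactly as described; running such a probe on the affected vCPUs once they are online is one way to cross-check the suspicion.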
-
-On Mon, 6 Jul 2015 17:59:10 +0800
-zhanghailiang <address@hidden> wrote:
-
-> On 2015/7/6 16:45, Paolo Bonzini wrote:
->> On 06/07/2015 09:54, zhanghailiang wrote:
->>> From the host, we found that the QEMU vcpu1 thread and vcpu7 thread were
->>> not consuming any CPU (they should be in the idle state).
->>> [...]
->>> We looked into the kernel code paths that could lead to the above 'Stuck'
->>> warning,
-
-in current upstream there isn't any printk(...Stuck...) left, since that code
-path has been reworked.
-I've often seen this on an over-committed host during guest CPU up/down
-torture tests.
-Could you update the guest kernel to upstream and see if the issue reproduces?
-
-> [rest of the quoted thread snipped]
-
-On 2015/7/7 19:23, Igor Mammedov wrote:
-> in current upstream there isn't any printk(...Stuck...) left, since that
-> code path has been reworked.
-> I've often seen this on an over-committed host during guest CPU up/down
-> torture tests.
-> Could you update the guest kernel to upstream and see if the issue reproduces?
-
-Hmm, unfortunately, it is very hard to reproduce, and we are still trying to
-reproduce it.
-
-For your test case, is it a kernel bug?
-Or is there any related patch that could solve your test problem that has been
-merged into upstream?
-
-Thanks,
-zhanghailiang
-
-On Tue, 7 Jul 2015 19:43:35 +0800
-zhanghailiang <address@hidden> wrote:
-
-> For your test case, is it a kernel bug?
-> Or is there any related patch that could solve your test problem that has
-> been merged into upstream?
-
-I don't remember all the prerequisite patches, but you should be able to find
-http://marc.info/?l=linux-kernel&m=140326703108009&w=2
-"x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
-and then look for its dependencies.
-
-On 2015/7/7 20:21, Igor Mammedov wrote:
-> I don't remember all the prerequisite patches, but you should be able to find
-> http://marc.info/?l=linux-kernel&m=140326703108009&w=2
-> "x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
-> and then look for its dependencies.
-
-Er, we have investigated this patch, and it is not related to our problem, :)
-
-Thanks,
-zhanghailiang
-
-[rest of the quoted thread snipped]
-
diff --git a/results/classifier/014/risc-v/70294255 b/results/classifier/014/risc-v/70294255
deleted file mode 100644
index f8e35953..00000000
--- a/results/classifier/014/risc-v/70294255
+++ /dev/null
@@ -1,1088 +0,0 @@
-risc-v: 0.863
-mistranslation: 0.862
-assembly: 0.861
-PID: 0.859
-semantic: 0.858
-socket: 0.858
-device: 0.857
-user-level: 0.857
-graphic: 0.857
-arm: 0.856
-debug: 0.854
-permissions: 0.854
-architecture: 0.851
-performance: 0.850
-kernel: 0.848
-network: 0.846
-operating system: 0.844
-register: 0.842
-vnc: 0.837
-alpha: 0.834
-files: 0.832
-virtual: 0.832
-hypervisor: 0.828
-peripherals: 0.819
-boot: 0.811
-i386: 0.811
-KVM: 0.806
-x86: 0.803
-ppc: 0.800
-TCG: 0.792
-VMM: 0.784
-
-[Qemu-devel] Reply: Re: Reply: Re: Reply: Re: Reply: Re: [BUG] COLO failover hang
-
-hi:
-
-Yes, it is better.
-
-And should we delete
-
-#ifdef WIN32
-    QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL);
-#endif
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already has it.
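For context on the question above: the constructor already performs the per-channel setup, so once the accept path goes through it (as in the fix quoted below), a second copy of the Win32 event creation becomes redundant. A self-contained model of that reasoning (the names mirror QEMU's, but the types here are simplified stand-ins and the feature bit value is illustrative):

    #include <stdlib.h>

    typedef struct {
        int fd;
        unsigned int features;
        void *event;              /* Win32 event handle on WIN32 builds */
    } QIOChannelSocketModel;

    enum { FEATURE_SHUTDOWN = 1 << 1 };   /* illustrative bit value */

    static QIOChannelSocketModel *channel_socket_new_model(void)
    {
        QIOChannelSocketModel *sioc = calloc(1, sizeof(*sioc));
        sioc->fd = -1;
    #ifdef WIN32
        sioc->event = CreateEvent(NULL, FALSE, FALSE, NULL);  /* created once, here */
    #endif
        sioc->features |= FEATURE_SHUTDOWN;                   /* set once, here */
        return sioc;
    }

    static QIOChannelSocketModel *channel_socket_accept_model(void)
    {
        /* Routing accept() through the constructor means accept() must
         * not repeat the CreateEvent/feature setup itself. */
        return channel_socket_new_model();
    }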
-
-Original Mail
-
-From: address@hidden
-To: wangguang 10165992
-Cc: address@hidden, address@hidden, address@hidden, address@hidden
-Date: 2017-03-22 15:03
-Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: Reply: Re: [BUG] COLO failover hang
-
-Hi,
-
-On 2017/3/22 9:42, address@hidden wrote:
-> diff --git a/migration/socket.c b/migration/socket.c
-> index 13966f1..d65a0ea 100644
-> --- a/migration/socket.c
-> +++ b/migration/socket.c
-> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
->      }
->
->      trace_migration_socket_incoming_accepted();
->
->      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
-> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
->      migration_channel_process_incoming(migrate_get_current(),
->                                         QIO_CHANNEL(sioc));
->      object_unref(OBJECT(sioc));
->
-> Is this patch ok?
-
-Yes, i think this works, but a better way may be to call qio_channel_set_feature()
-in qio_channel_socket_accept(); we didn't set the SHUTDOWN feature for the
-socket accept fd. Or fix it like this:
-
-diff --git a/io/channel-socket.c b/io/channel-socket.c
-index f546c68..ce6894c 100644
---- a/io/channel-socket.c
-+++ b/io/channel-socket.c
-@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
-                           Error **errp)
- {
-     QIOChannelSocket *cioc;
-
--    cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
--    cioc->fd = -1;
-+    cioc = qio_channel_socket_new();
-     cioc->remoteAddrLen = sizeof(ioc->remoteAddr);
-     cioc->localAddrLen = sizeof(ioc->localAddr);
-
-Thanks,
-Hailiang
-
-> I have tested it. The test does not hang any more.
-
-> Original Mail
->
-> From: address@hidden
-> To: address@hidden, address@hidden
-> Cc: address@hidden, address@hidden, address@hidden
-> Date: 2017-03-22 09:11
-> Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: [BUG] COLO failover hang
->
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
->> * Hailiang Zhang (address@hidden) wrote:
->>> Hi,
->>>
->>> Thanks for reporting this; I confirmed it in my test, and it is a bug.
->>>
->>> Though we tried to call qemu_file_shutdown() to shut down the related fd,
->>> in case the COLO thread/incoming thread is stuck in read()/write() while
->>> doing failover, it didn't take effect, because all the fds used by COLO
->>> (also migration) have been wrapped by a qio channel, and it will not call
->>> the shutdown API if we didn't qio_channel_set_feature(QIO_CHANNEL(sioc),
->>> QIO_CHANNEL_FEATURE_SHUTDOWN).
->>>
->>> Cc: Dr. David Alan Gilbert <address@hidden>
->>>
->>> I suspected migration cancel has the same problem; it may be stuck in
->>> write() if we tried to cancel migration.
->>>
->>> void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error **errp)
->>> {
->>>     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
->>>     migration_channel_connect(s, ioc, NULL);
->>>     ... ...
->>>
->>> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc),
->>> QIO_CHANNEL_FEATURE_SHUTDOWN) above, and then
->>>
->>> migrate_fd_cancel()
->>> {
->>>     ... ...
->>>     if (s->state == MIGRATION_STATUS_CANCELLING && f) {
->>>         qemu_file_shutdown(f);  // --> This will not take effect. No?
->>>     }
->>> }
->>
->> (cc'd in Daniel Berrange).
->> I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)
->> at the top of qio_channel_socket_new so I think that's safe isn't it?
->
-> Hmm, you are right, this problem only exists for the migration incoming fd,
-> thanks.
->
->> Dave
->
->>> On 2017/3/21 16:10, address@hidden wrote:
->>>> Thank you.
->>>>
->>>> I have tested it already.
->>>>
->>>> When the Primary Node panics, the Secondary Node qemu hangs at the same place.
->>>>
->>>> According to http://wiki.qemu-project.org/Features/COLO, killing the
->>>> Primary Node qemu will not produce the problem, but a Primary Node panic can.
->>>>
->>>> I think it is because the channel does not support
->>>> QIO_CHANNEL_FEATURE_SHUTDOWN:
->>>> on failover, channel_shutdown could not shut down the channel,
->>>> so colo_process_incoming_thread will hang at recvmsg.
->>>>
->>>> I tested a patch [the migration/socket.c patch quoted above].
->>>> My test does not hang any more.
->>>>
->>>> Original Mail
->>>> From: address@hidden
->>>> To: wangguang 10165992, address@hidden
->>>> Cc: address@hidden, address@hidden
->>>> Date: 2017-03-21 15:58
->>>> Subject: Re: [Qemu-devel] Reply: Re: [BUG] COLO failover hang
->>>>
->>>> Hi, Wang.
->>>>
->>>> You can test this branch:
->>>> https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
->>>> and please follow the wiki to ensure your own configuration is correct:
->>>> http://wiki.qemu-project.org/Features/COLO
->>>>
->>>> Thanks
->>>> Zhang Chen
->>>>
->>>> On 03/21/2017 03:27 PM, address@hidden wrote:
->>>>> hi.
->>>>>
->>>>> I tested the git qemu master; it has the same problem.
->>>>>
->>>>> (gdb) bt
->>>>> #0  qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
->>>>>     niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
->>>>> #1  0x00007f658e4aa0c2 in qio_channel_read (...) at io/channel.c:114
->>>>> #2  0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
->>>>>     buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at migration/qemu-file-channel.c:78
->>>>> #3  0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at migration/qemu-file.c:295
->>>>> #4  0x00007f658e3ea2e1 in qemu_peek_byte (...) at migration/qemu-file.c:555
->>>>> #5  0x00007f658e3ea34b in qemu_get_byte (...) at migration/qemu-file.c:568
->>>>> #6  0x00007f658e3ea552 in qemu_get_be32 (...) at migration/qemu-file.c:648
->>>>> #7  0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, ...) at migration/colo.c:244
->>>>> #8  0x00007f658e3e681e in colo_receive_check_message (f=<optimized out>, ...) at migration/colo.c:264
->>>>> #9  0x00007f658e3e740e in colo_process_incoming_thread (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
->>>>> #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
->>>>> #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
->>>>> (gdb) p ioc->name
->>>>> $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
->>>>> (gdb) p ioc->features      <- does not support QIO_CHANNEL_FEATURE_SHUTDOWN
->>>>> $3 = 0
->>>>>
->>>>> (gdb) bt
->>>>> #0  socket_accept_incoming_migration (ioc=0x7fdcceeafa90,
->>>>>     condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
->>>>> #1  0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at gmain.c:3054
->>>>> #2  g_main_context_dispatch (context=<optimized out>, ...) at gmain.c:3630
->>>>> #3  0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
->>>>> #4  os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:258
->>>>> #5  main_loop_wait (...) at util/main-loop.c:506
->>>>> #6  0x00007fdccb526187 in main_loop () at vl.c:1898
->>>>> #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4709
->>>>> (gdb) p ioc->features
->>>>> $1 = 6
->>>>> (gdb) p ioc->name
->>>>> $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
->>>>>
->>>>> Maybe socket_accept_incoming_migration should call
->>>>> qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)??
->>>>>
->>>>> thank you.
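The gdb output above ($3 = 0 for the incoming channel vs. $1 = 6 for the listener) is the whole bug in miniature: a shutdown request is silently refused when the channel lacks the SHUTDOWN feature bit, so the thread blocked in recvmsg is never woken. A self-contained sketch of that gating (simplified from the mechanism the thread describes, not the exact QEMU source; the bit value is illustrative):

    #include <errno.h>
    #include <stdbool.h>

    enum { FEATURE_SHUTDOWN = 1 << 1 };   /* illustrative bit value */

    typedef struct {
        unsigned int features;
        int fd;
    } ChannelModel;

    static bool has_feature(const ChannelModel *ioc, unsigned int f)
    {
        return (ioc->features & f) != 0;
    }

    /* Without the flag, the reader blocked in recvmsg() never sees the
     * socket close, which is exactly the failover hang reported here. */
    static int channel_shutdown_model(ChannelModel *ioc)
    {
        if (!has_feature(ioc, FEATURE_SHUTDOWN)) {
            return -ENOSYS;   /* silently leaves colo_process_incoming_thread stuck */
        }
        /* shutdown(ioc->fd, SHUT_RDWR); would go here */
        return 0;
    }

Setting the feature on the accepted channel, or constructing it via qio_channel_socket_new (which sets it), lets the failover path actually reach shutdown().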
->>>>>
->>>>> Original Mail
->>>>> From: address@hidden
->>>>> To: address@hidden
->>>>> Cc: address@hidden
->>>>> Date: 2017-03-16 14:46
->>>>> Subject: Re: [Qemu-devel] COLO failover hang
->>>>>
->>>>> On 03/15/2017 05:06 PM, wangguang wrote:
->>>>>> I am testing the QEMU COLO feature described here
->>>>>> [QEMU Wiki](http://wiki.qemu-project.org/Features/COLO).
->>>>>>
->>>>>> When the Primary Node panics, the Secondary Node qemu hangs,
->>>>>> at recvmsg in qio_channel_socket_readv.
->>>>>> And when I run { 'execute': 'nbd-server-stop' } and { "execute":
->>>>>> "x-colo-lost-heartbeat" } in the Secondary VM's monitor, the Secondary
->>>>>> Node qemu still hangs at recvmsg.
->>>>>>
->>>>>> I found that COLO in qemu is not complete yet.
->>>>>> Does COLO have any plan for development?
->>>>>
->>>>> Yes, we are developing. You can see some of the patches we are pushing.
->>>>>
->>>>>> Has anyone ever run it successfully? Any help is appreciated!
->>>>>
->>>>> Our internal version can run it successfully.
->>>>> For the failover details you can ask Zhanghailiang for help.
->>>>> Next time if you have some question about COLO,
->>>>> please cc me and zhanghailiang <address@hidden>.
->>>>>
->>>>> Thanks
->>>>> Zhang Chen
->>>>>
->>>>>> centos7.2+qemu2.7.50
->>>>>> (gdb) bt
->>>>>> #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
->>>>>> #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>,
->>>>>>     iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:497
->>>>>> #2  0x00007f3e03329472 in qio_channel_read (...) at io/channel.c:97
->>>>>> #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
->>>>>>     buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at migration/qemu-file-channel.c:78
->>>>>> #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at migration/qemu-file.c:257
->>>>>> #5  0x00007f3e03274a41 in qemu_peek_byte (...) at migration/qemu-file.c:510
->>>>>> #6  0x00007f3e03274aab in qemu_get_byte (...) at migration/qemu-file.c:523
->>>>>> #7  0x00007f3e03274cb2 in qemu_get_be32 (...) at migration/qemu-file.c:603
->>>>>> #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, ...) at migration/colo.c:215
->>>>>> #9  0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48,
->>>>>>     checkpoint_request=<synthetic pointer>, f=<optimized out>) at migration/colo.c:546
->>>>>> #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at migration/colo.c:649
->>>>>> #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
->>>>>> #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
->>>>>>
->>>>>> --
->>>>>> View this message in context: http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
->>>>>> Sent from the Developer mailing list archive at Nabble.com.
->
->> --
->> Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-On 2017/3/22 16:09, address@hidden wrote:
-> hi:
->
-> Yes, it is better.
->
-> And should we delete
->
-> #ifdef WIN32
->     QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL);
-> #endif
->
-> in qio_channel_socket_accept?
->
-> qio_channel_socket_new already has it.
-
-Yes, you are right.
-
-> [rest of the quoted thread snipped]
-ï¼ ï¼ï¼ -ï¼ ï¼ï¼ Though we tried to call qemu_file_shutdown() to shutdown the related fd, in -ï¼ ï¼ï¼ case COLO thread/incoming thread is stuck in read/write() while do -failover, -ï¼ ï¼ï¼ but it didn't take effect, because all the fd used by COLO (also migration) -ï¼ ï¼ï¼ has been wrapped by qio channel, and it will not call the shutdown API if -ï¼ ï¼ï¼ we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN). -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ Cc: Dr. David Alan Gilbert address@hidden -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ I doubted migration cancel has the same problem, it may be stuck in write() -ï¼ ï¼ï¼ if we tried to cancel migration. -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ void fd_start_outgoing_migration(MigrationState *s, const char *fdname, -Error **errp) -ï¼ ï¼ï¼ { -ï¼ ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing") -ï¼ ï¼ï¼ migration_channel_connect(s, ioc, NULL) -ï¼ ï¼ï¼ ... ... -ï¼ ï¼ï¼ We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) above, -ï¼ ï¼ï¼ and the -ï¼ ï¼ï¼ migrate_fd_cancel() -ï¼ ï¼ï¼ { -ï¼ ï¼ï¼ ... ... -ï¼ ï¼ï¼ if (s-ï¼state == MIGRATION_STATUS_CANCELLING && f) { -ï¼ ï¼ï¼ qemu_file_shutdown(f) --ï¼ This will not take effect. No ? -ï¼ ï¼ï¼ } -ï¼ ï¼ï¼ } -ï¼ ï¼ -ï¼ ï¼ (cc'd in Daniel Berrange). -ï¼ ï¼ I see that we call qio_channel_set_feature(ioc, -QIO_CHANNEL_FEATURE_SHUTDOWN) at the -ï¼ ï¼ top of qio_channel_socket_new so I think that's safe isn't it? -ï¼ ï¼ -ï¼ -ï¼ Hmm, you are right, this problem is only exist for the migration incoming fd, -thanks. -ï¼ -ï¼ ï¼ Dave -ï¼ ï¼ -ï¼ ï¼ï¼ Thanks, -ï¼ ï¼ï¼ Hailiang -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ On 2017/3/21 16:10, address@hidden wrote: -ï¼ ï¼ï¼ï¼ Thank youã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I have test areadyã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ When the Primary Node panic,the Secondary Node qemu hang at the same -placeã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Incorrding -http://wiki.qemu-project.org/Features/COLO -ï¼kill Primary Node -qemu will not produce the problem,but Primary Node panic canã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I think due to the feature of channel does not support -QIO_CHANNEL_FEATURE_SHUTDOWN. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ when failover,channel_shutdown could not shut down the channel. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ so the colo_process_incoming_thread will hang at recvmsg. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I test a patch: -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ diff --git a/migration/socket.c b/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ index 13966f1..d65a0ea 100644 -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ --- a/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ +++ b/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ @@ -147,8 +147,9 @@ static gboolean -socket_accept_incoming_migration(QIOChannel *ioc, -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ } -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ trace_migration_socket_incoming_accepted() -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), -"migration-socket-incoming") -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ migration_channel_process_incoming(migrate_get_current(), -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ QIO_CHANNEL(sioc)) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ object_unref(OBJECT(sioc)) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ My test will not hang any more. 
-ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ åå§é®ä»¶ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ åä»¶äººï¼ address@hidden -ï¼ ï¼ï¼ï¼ æ¶ä»¶äººï¼ç广10165992 address@hidden -ï¼ ï¼ï¼ï¼ æéäººï¼ address@hidden address@hidden -ï¼ ï¼ï¼ï¼ æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 -ï¼ ï¼ï¼ï¼ 主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Hi,Wang. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ You can test this branch: -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ and please follow wiki ensure your own configuration correctly. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -http://wiki.qemu-project.org/Features/COLO -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Thanks -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Zhang Chen -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ On 03/21/2017 03:27 PM, address@hidden wrote: -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ hi. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ I test the git qemu master have the same problem. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -ï¼ ï¼ï¼ï¼ ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read -ï¼ ï¼ï¼ï¼ ï¼ (address@hidden, address@hidden "", -ï¼ ï¼ï¼ï¼ ï¼ address@hidden, address@hidden) at io/channel.c:114 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, -ï¼ ï¼ï¼ï¼ ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file-channel.c:78 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:295 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/qemu-file.c:555 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:568 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:648 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/colo.c:244 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized -ï¼ ï¼ï¼ï¼ ï¼ outï¼, address@hidden, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ at migration/colo.c:264 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread -ï¼ ï¼ï¼ï¼ ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $3 = 0 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -ï¼ ï¼ï¼ï¼ ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) 
at -ï¼ ï¼ï¼ï¼ ï¼ gmain.c:3054 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at gmain.c:3630 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at -ï¼ ï¼ï¼ï¼ ï¼ util/main-loop.c:258 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #5 main_loop_wait (address@hidden) at -ï¼ ï¼ï¼ï¼ ï¼ util/main-loop.c:506 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized -ï¼ ï¼ï¼ï¼ ï¼ outï¼) at vl.c:4709 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼features -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $1 = 6 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ May be socket_accept_incoming_migration should -ï¼ ï¼ï¼ï¼ ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ thank you. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ åå§é®ä»¶ -ï¼ ï¼ï¼ï¼ ï¼ address@hidden -ï¼ ï¼ï¼ï¼ ï¼ address@hidden -ï¼ ï¼ï¼ï¼ ï¼ address@hidden@huawei.comï¼ -ï¼ ï¼ï¼ï¼ ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 -ï¼ ï¼ï¼ï¼ ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ On 03/15/2017 05:06 PM, wangguang wrote: -ï¼ ï¼ï¼ï¼ ï¼ ï¼ am testing QEMU COLO feature described here [QEMU -ï¼ ï¼ï¼ï¼ ï¼ ï¼ Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -ï¼ ï¼ï¼ï¼ ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. -ï¼ ï¼ï¼ï¼ ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. -ï¼ ï¼ï¼ï¼ ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": -ï¼ ï¼ï¼ï¼ ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's -ï¼ ï¼ï¼ï¼ ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . -ï¼ ï¼ï¼ï¼ ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ ï¼ I found that the colo in qemu is not complete yet. -ï¼ ï¼ï¼ï¼ ï¼ ï¼ Do the colo have any plan for development? -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ Yes, We are developing. You can see some of patch we pushing. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ In our internal version can run it successfully, -ï¼ ï¼ï¼ï¼ ï¼ The failover detail you can ask Zhanghailiang for help. 
> > Next time, if you have a question about COLO,
> > please cc me and zhanghailiang address@hidden.
> >
> > Thanks
> > Zhang Chen
> >
> > > centos7.2 + qemu 2.7.50
> > > (gdb) bt
> > > #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
> > > #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>,
> > >     iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
> > >     errp=0x0) at io/channel-socket.c:497
> > > #2  0x00007f3e03329472 in qio_channel_read (address@hidden,
> > >     address@hidden "", address@hidden, address@hidden) at io/channel.c:97
> > > #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
> > >     buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at
> > >     migration/qemu-file-channel.c:78
> > > #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at
> > >     migration/qemu-file.c:257
> > > #5  0x00007f3e03274a41 in qemu_peek_byte (address@hidden,
> > >     address@hidden) at migration/qemu-file.c:510
> > > #6  0x00007f3e03274aab in qemu_get_byte (address@hidden) at
> > >     migration/qemu-file.c:523
> > > #7  0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at
> > >     migration/qemu-file.c:603
> > > #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00,
> > >     address@hidden) at migration/colo.c:215
> > > #9  0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48,
> > >     checkpoint_request=<synthetic pointer>, f=<optimized out>) at
> > >     migration/colo.c:546
> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at
> > >     migration/colo.c:649
> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
> > >
> > > --
> > > View this message in context:
> > > http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
> > > Sent from the Developer mailing list archive at Nabble.com.
> >
> > --
> > Thanks
> > Zhang Chen
>
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK
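For context on why the missing feature bit matters: a failover request can only break the secondary out of the blocked recvmsg() by shutting the socket down, and that path is gated on the very flag the gdb sessions above show as unset. Roughly (an illustrative fragment built on the public qio_channel API; the function name is hypothetical, not a quote of the COLO code):

/* Failover side (sketch): interrupt the COLO incoming thread, which
 * is blocked in recvmsg() inside qio_channel_socket_readv(). */
static void colo_failover_interrupt_read(QIOChannel *ioc)
{
    if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)) {
        /* shutdown(2) on the underlying socket makes the blocked
         * recvmsg() return, so the thread can exit cleanly. */
        qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
    }
    /* If the feature bit was never set (ioc->features == 0, as in
     * the backtraces above), nothing interrupts the read and the
     * incoming thread stays stuck -- the reported hang. */
}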
diff --git a/results/classifier/014/risc-v/74545755 b/results/classifier/014/risc-v/74545755
deleted file mode 100644
index d2e98605..00000000
--- a/results/classifier/014/risc-v/74545755
+++ /dev/null
@@ -1,371 +0,0 @@

risc-v: 0.845
user-level: 0.790
operating system: 0.784
register: 0.778
permissions: 0.770
mistranslation: 0.752
debug: 0.740
TCG: 0.722
performance: 0.721
device: 0.720
semantic: 0.669
virtual: 0.667
arm: 0.662
KVM: 0.661
graphic: 0.660
ppc: 0.659
vnc: 0.650
assembly: 0.648
architecture: 0.636
boot: 0.607
VMM: 0.602
files: 0.577
peripherals: 0.566
hypervisor: 0.563
network: 0.550
socket: 0.549
x86: 0.545
alpha: 0.508
PID: 0.479
kernel: 0.452
i386: 0.376

[Bug Report][RFC PATCH 0/1] block: fix failing assert on paused VM migration

There's a bug (a failing assert) which is reproduced during migration of
a paused VM. I am able to reproduce it on a setup with 2 nodes and a common
NFS share, with the VM's disk on that share.

root@fedora40-1-vm:~# virsh domblklist alma8-vm
 Target   Source
------------------------------------------
 sda      /mnt/shared/images/alma8.qcow2

root@fedora40-1-vm:~# df -Th /mnt/shared
Filesystem           Type  Size  Used  Avail  Use%  Mounted on
127.0.0.1:/srv/nfsd  nfs4   63G   16G    48G   25%  /mnt/shared

On the 1st node:

root@fedora40-1-vm:~# virsh start alma8-vm ; virsh suspend alma8-vm
root@fedora40-1-vm:~# virsh migrate --compressed --p2p --persistent --undefinesource --live alma8-vm qemu+ssh://fedora40-2-vm/system

Then on the 2nd node:

root@fedora40-2-vm:~# virsh migrate --compressed --p2p --persistent --undefinesource --live alma8-vm qemu+ssh://fedora40-1-vm/system
error: operation failed: domain is not running

root@fedora40-2-vm:~# tail -3 /var/log/libvirt/qemu/alma8-vm.log
2024-09-19 13:53:33.336+0000: initiating migration
qemu-system-x86_64: ../block.c:6976: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
2024-09-19 13:53:42.991+0000: shutting down, reason=crashed
Backtrace:

(gdb) bt
#0  0x00007f7eaa2f1664 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x00007f7eaa298c4e in raise () at /lib64/libc.so.6
#2  0x00007f7eaa280902 in abort () at /lib64/libc.so.6
#3  0x00007f7eaa28081e in __assert_fail_base.cold () at /lib64/libc.so.6
#4  0x00007f7eaa290d87 in __assert_fail () at /lib64/libc.so.6
#5  0x0000563c38b95eb8 in bdrv_inactivate_recurse (bs=0x563c3b6c60c0) at ../block.c:6976
#6  0x0000563c38b95aeb in bdrv_inactivate_all () at ../block.c:7038
#7  0x0000563c3884d354 in qemu_savevm_state_complete_precopy_non_iterable (f=0x563c3b700c20, in_postcopy=false, inactivate_disks=true)
    at ../migration/savevm.c:1571
#8  0x0000563c3884dc1a in qemu_savevm_state_complete_precopy (f=0x563c3b700c20, iterable_only=false, inactivate_disks=true) at ../migration/savevm.c:1631
#9  0x0000563c3883a340 in migration_completion_precopy (s=0x563c3b4d51f0, current_active_state=<optimized out>) at ../migration/migration.c:2780
#10 migration_completion (s=0x563c3b4d51f0) at ../migration/migration.c:2844
#11 migration_iteration_run (s=0x563c3b4d51f0) at ../migration/migration.c:3270
#12 migration_thread (opaque=0x563c3b4d51f0) at ../migration/migration.c:3536
#13 0x0000563c38dbcf14 in qemu_thread_start (args=0x563c3c2d5bf0) at ../util/qemu-thread-posix.c:541
#14 0x00007f7eaa2ef6d7 in start_thread () at /lib64/libc.so.6
#15 0x00007f7eaa373414 in clone () at /lib64/libc.so.6

What happens here is that after the 1st migration the BDS related to the HDD
remains inactive, as the VM is still paused. Then, when we initiate the 2nd
migration, bdrv_inactivate_all() leads to an attempt to set the
BDRV_O_INACTIVE flag on a node where it is already set, so the assert fails.

The attached patch, which simply skips setting the flag when it's already
set, is more of a kludge than a clean solution. Should we use more
sophisticated logic which allows some of the nodes to be inactive prior to
the migration, and takes them into account during bdrv_inactivate_all()?
Comments would be appreciated.

Andrey

Andrey Drobyshev (1):
  block: do not fail when inactivating node which is inactive

 block.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--
2.39.3
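The flag lifecycle described above can be boiled down to a few lines (a self-contained toy model of the failure, not QEMU code; the flag value is assumed to match block.h, but any non-zero bit would demonstrate the same abort):

#include <assert.h>

#define BDRV_O_INACTIVE 0x0800      /* assumed to match QEMU's block.h */

static int open_flags;              /* stands in for bs->open_flags */

static void inactivate(void)        /* models bdrv_inactivate_recurse() */
{
    assert(!(open_flags & BDRV_O_INACTIVE));    /* the failing assert */
    open_flags |= BDRV_O_INACTIVE;
}

int main(void)
{
    inactivate();   /* 1st migration of the paused VM: flag gets set */
    /* The VM never resumes, so bdrv_activate_all() never runs and the
     * flag is never cleared on what is now the migration source... */
    inactivate();   /* ...and the 2nd migration aborts right here. */
    return 0;
}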
Instead of failing the assert, let's just ignore the fact that the flag is
already set, and return. We assume that this is safe to ignore. Otherwise
the assert fails when migrating a paused VM back and forth.

Ideally we'd like to have a more sophisticated solution, e.g. not even
scan the nodes which should be inactive at this point.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 block.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 7d90007cae..c1dcf906d1 100644
--- a/block.c
+++ b/block.c
@@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK bdrv_inactivate_recurse(BlockDriverState *bs)
         return 0;
     }
 
-    assert(!(bs->open_flags & BDRV_O_INACTIVE));
+    if (bs->open_flags & BDRV_O_INACTIVE) {
+        /*
+         * Return here instead of throwing assert as a workaround to
+         * prevent failure on migrating paused VM.
+         * Here we assume that if we're trying to inactivate BDS that's
+         * already inactive, it's safe to just ignore it.
+         */
+        return 0;
+    }
 
     /* Inactivate this node */
     if (bs->drv->bdrv_inactivate) {
--
2.39.3

[add migration maintainers]

On 24.09.24 15:56, Andrey Drobyshev wrote:
> Instead of failing the assert, let's just ignore the fact that the flag
> is already set, and return. [...]
>
> --- a/block.c
> +++ b/block.c
> @@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK bdrv_inactivate_recurse(BlockDriverState *bs)
>          return 0;
>      }
>
> -    assert(!(bs->open_flags & BDRV_O_INACTIVE));
> +    if (bs->open_flags & BDRV_O_INACTIVE) {
> +        /*
> +         * Return here instead of throwing assert as a workaround to
> +         * prevent failure on migrating paused VM.
> +         * Here we assume that if we're trying to inactivate BDS that's
> +         * already inactive, it's safe to just ignore it.
> +         */
> +        return 0;
> +    }
>
>      /* Inactivate this node */
>      if (bs->drv->bdrv_inactivate) {

I doubt that this is the correct way to go.

As far as I understand, "inactive" actually means "the storage does not
belong to this QEMU, but to someone else (another QEMU process, for
example), and may be changed transparently". In turn, this means that QEMU
should do nothing with inactive disks. So the problem is that nobody called
bdrv_activate_all() on the target, and we shouldn't ignore that.

Hmm, I see that in process_incoming_migration_bh() we do call
bdrv_activate_all(), but only in some scenarios. Maybe the condition should
be less strict here.

Why do we need any condition here at all? Don't we want to activate the
block layer on the target after migration anyway?

--
Best regards,
Vladimir
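For comparison, the unconditional variant Vladimir is hinting at would reduce the destination-side logic to roughly this (a sketch only, assuming process_incoming_migration_bh() as the call site; it deliberately drops the late-block-activate gating, which is exactly what the reply below objects to):

/* process_incoming_migration_bh() (sketch of the unconditional variant):
 * always hand the disks back to this QEMU once incoming migration
 * completes, regardless of autostart or runstate. */
Error *local_err = NULL;

bdrv_activate_all(&local_err);
if (local_err) {
    /* Image locking / metadata reload failed: report the error and
     * refrain from auto-starting the guest. */
    error_report_err(local_err);
    local_err = NULL;
    autostart = false;
}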
-> -> -*/ -> -> -static const qemuMigrationParamsAlwaysOnItem qemuMigrationParamsAlwaysOn[] = -> -{ -> -> -{QEMU_MIGRATION_CAP_PAUSE_BEFORE_SWITCHOVER, -> -> -QEMU_MIGRATION_SOURCE}, -> -> -> -> -{QEMU_MIGRATION_CAP_LATE_BLOCK_ACTIVATE, -> -> -QEMU_MIGRATION_DESTINATION}, -> -> -}; -which means that libvirt always wants LATE_BLOCK_ACTIVATE to be set. - -The code from process_incoming_migration_bh() you're referring to: - -> -/* If capability late_block_activate is set: -> -> -* Only fire up the block code now if we're going to restart the -> -> -* VM, else 'cont' will do it. -> -> -* This causes file locking to happen; so we don't want it to happen -> -> -* unless we really are starting the VM. -> -> -*/ -> -> -if (!migrate_late_block_activate() || -> -> -(autostart && (!global_state_received() || -> -> -runstate_is_live(global_state_get_runstate())))) { -> -> -/* Make sure all file formats throw away their mutable metadata. -> -> -> -* If we get an error here, just don't restart the VM yet. */ -> -> -bdrv_activate_all(&local_err); -> -> -if (local_err) { -> -> -error_report_err(local_err); -> -> -local_err = NULL; -> -> -autostart = false; -> -> -} -> -> -} -It states explicitly that we're either going to start VM right at this -point if (autostart == true), or we wait till "cont" command happens. -None of this is going to happen if we start another migration while -still being in PAUSED state. So I think it seems reasonable to take -such case into account. For instance, this patch does prevent the crash: - -> -diff --git a/migration/migration.c b/migration/migration.c -> -index ae2be31557..3222f6745b 100644 -> ---- a/migration/migration.c -> -+++ b/migration/migration.c -> -@@ -733,7 +733,8 @@ static void process_incoming_migration_bh(void *opaque) -> -*/ -> -if (!migrate_late_block_activate() || -> -(autostart && (!global_state_received() || -> -- runstate_is_live(global_state_get_runstate())))) { -> -+ runstate_is_live(global_state_get_runstate()))) || -> -+ (!autostart && global_state_get_runstate() == RUN_STATE_PAUSED)) { -> -/* Make sure all file formats throw away their mutable metadata. -> -* If we get an error here, just don't restart the VM yet. */ -> -bdrv_activate_all(&local_err); -What are your thoughts on it? - -Andrey - |
