| author | Christian Krinitsin <mail@krinitsin.com> | 2025-07-03 19:39:53 +0200 |
|---|---|---|
| committer | Christian Krinitsin <mail@krinitsin.com> | 2025-07-03 19:39:53 +0200 |
| commit | dee4dcba78baf712cab403d47d9db319ab7f95d6 | (patch) |
| tree | 418478faf06786701a56268672f73d6b0b4eb239 | /results/classifier/014/risc-v |
| parent | 4d9e26c0333abd39bdbd039dcdb30ed429c475ba | (diff) |
| download | emulator-bug-study-dee4dcba78baf712cab403d47d9db319ab7f95d6.tar.gz, emulator-bug-study-dee4dcba78baf712cab403d47d9db319ab7f95d6.zip | |
restructure results
Diffstat (limited to 'results/classifier/014/risc-v')
| -rw-r--r-- | results/classifier/014/risc-v/25892827 | 1104 |
| -rw-r--r-- | results/classifier/014/risc-v/70294255 | 1088 |
| -rw-r--r-- | results/classifier/014/risc-v/74545755 | 371 |
3 files changed, 0 insertions, 2563 deletions
diff --git a/results/classifier/014/risc-v/25892827 b/results/classifier/014/risc-v/25892827
deleted file mode 100644
index 6346b0c1..00000000
--- a/results/classifier/014/risc-v/25892827
+++ /dev/null
@@ -1,1104 +0,0 @@
-risc-v: 0.908
-user-level: 0.889
-permissions: 0.881
-register: 0.876
-KVM: 0.872
-hypervisor: 0.871
-operating system: 0.871
-debug: 0.868
-x86: 0.849
-vnc: 0.846
-mistranslation: 0.842
-boot: 0.839
-network: 0.839
-VMM: 0.839
-device: 0.839
-TCG: 0.837
-virtual: 0.835
-i386: 0.835
-peripherals: 0.833
-graphic: 0.832
-assembly: 0.829
-architecture: 0.825
-semantic: 0.825
-ppc: 0.824
-socket: 0.822
-arm: 0.821
-performance: 0.819
-alpha: 0.816
-kernel: 0.810
-files: 0.804
-PID: 0.792
-
-[Qemu-devel] [BUG/RFC] Two CPUs are not brought up normally in SLES11 SP3 VM after reboot
-
-Hi,
-
-Recently we encountered a problem in our project: 2 CPUs in a VM are not
-brought up normally after a reboot.
-
-Our host is using KVM kmod 3.6 and QEMU 2.1.
-A SLES 11 SP3 VM is configured with 8 vCPUs,
-and the CPU model is configured as 'host-passthrough'.
-
-After the VM's first start-up, everything seemed to be OK;
-then the VM panicked and rebooted.
-After the reboot, only 6 CPUs are brought up in the VM; cpu1 and cpu7 are not online.
-
-This is the only message we can get from the VM.
-The VM dmesg shows:
-[    0.069867] Booting Node 0, Processors #1
-[    5.060042] CPU1: Stuck ??
-[    5.060499] #2
-[    5.088322] kvm-clock: cpu 2, msr 6:3fc90901, secondary cpu clock
-[    5.088335] KVM setup async PF for cpu 2
-[    5.092967] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.094405] #3
-[    5.108324] kvm-clock: cpu 3, msr 6:3fcd0901, secondary cpu clock
-[    5.108333] KVM setup async PF for cpu 3
-[    5.113553] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.114970] #4
-[    5.128325] kvm-clock: cpu 4, msr 6:3fd10901, secondary cpu clock
-[    5.128336] KVM setup async PF for cpu 4
-[    5.134576] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.135998] #5
-[    5.152324] kvm-clock: cpu 5, msr 6:3fd50901, secondary cpu clock
-[    5.152334] KVM setup async PF for cpu 5
-[    5.154764] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.156467] #6
-[    5.172327] kvm-clock: cpu 6, msr 6:3fd90901, secondary cpu clock
-[    5.172341] KVM setup async PF for cpu 6
-[    5.180738] NMI watchdog enabled, takes one hw-pmu counter.
-[    5.182173] #7 Ok.
-[   10.170815] CPU7: Stuck ??
-[   10.171648] Brought up 6 CPUs
-[   10.172394] Total of 6 processors activated (28799.97 BogoMIPS).
-
-From the host, we found that the QEMU vcpu1 thread and vcpu7 thread were not
-consuming any CPU (they should be in the idle state).
-All of the vCPUs' stacks on the host look like the one below:
-
-[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
-[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
-[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
-[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
-[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
-[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
-[<ffffffff81468092>] system_call_fastpath+0x16/0x1b
-[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
-[<ffffffffffffffff>] 0xffffffffffffffff
-
-We looked into the kernel code paths that could lead to the above 'Stuck'
-warning, and found that the only possibility is that the emulation of the
-'cpuid' instruction in kvm/qemu has something wrong.
-But since we can't reproduce this problem, we are not quite sure.
-Is it possible that the cpuid emulation in kvm/qemu has some bug?
-
-Has anyone come across this problem before? Or any idea?
-
-Thanks,
-zhanghailiang
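The stacks above are what an idle, halted vCPU looks like from the host side. As a reading aid, here is a heavily simplified, self-contained C model of what the kvm_vcpu_block frame means (illustrative only — this is not the kernel source; the struct and the schedule() stub are stand-ins):

    #include <stdbool.h>

    struct vcpu_model {
        /* set by interrupt delivery, e.g. the INIT/SIPI that boots an AP */
        bool has_pending_event;
    };

    static void schedule(void) { /* stand-in for the kernel scheduler */ }

    /* The halted vCPU thread parks here and consumes no host CPU time,
     * which matches the "not consuming any cpu" observation above.
     * It only returns once an event marks the vCPU runnable again. */
    static void kvm_vcpu_block_model(struct vcpu_model *vcpu)
    {
        while (!vcpu->has_pending_event) {
            schedule();
        }
    }

So a vCPU parked in this state is consistent both with normal idling and with an AP that halted itself on purpose, which is what the thread goes on to establish.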
-
-On 06/07/2015 09:54, zhanghailiang wrote:
-> From the host, we found that the QEMU vcpu1 thread and vcpu7 thread were not
-> consuming any CPU (they should be in the idle state).
-> All of the vCPUs' stacks on the host look like the one below:
->
-> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
-> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
-> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
-> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
-> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
-> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
-> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
-> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
-> [<ffffffffffffffff>] 0xffffffffffffffff
->
-> We looked into the kernel code paths that could lead to the above 'Stuck'
-> warning, and found that the only possibility is that the emulation of the
-> 'cpuid' instruction in kvm/qemu has something wrong.
-> But since we can't reproduce this problem, we are not quite sure.
-> Is it possible that the cpuid emulation in kvm/qemu has some bug?
-
-Can you explain the relationship to the cpuid emulation?  What do the
-traces say about vcpus 1 and 7?
-
-Paolo
-
-On 2015/7/6 16:45, Paolo Bonzini wrote:
-> Can you explain the relationship to the cpuid emulation?  What do the
-> traces say about vcpus 1 and 7?
-
-OK, we searched the VM's kernel code for the 'Stuck' message, and it is
-located in do_boot_cpu(). It runs in BSP context; the call chain is:
-the BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu()
--> wakeup_secondary_via_INIT() to trigger the APs.
-It waits 5s for the APs to start up; if some AP does not start up normally, it
-prints 'CPU%d Stuck' or 'CPU%d: Not responding'.
-
-If it prints 'Stuck', it means the AP has received the SIPI interrupt and
-begins to execute the code at 'ENTRY(trampoline_data)' (trampoline_64.S), but
-gets stuck somewhere before smp_callin() (smpboot.c).
-The following is the startup process of the BSP and the APs.
-
-BSP:
-start_kernel()
-  ->smp_init()
-    ->smp_boot_cpus()
-      ->do_boot_cpu()
-        ->start_ip = trampoline_address();   // set the address the AP will go to execute
-        ->wakeup_secondary_cpu_via_init();   // kick the secondary CPU
-        ->for (timeout = 0; timeout < 50000; timeout++)
-              if (cpumask_test_cpu(cpu, cpu_callin_mask)) break;   // check whether the AP started up
-
-APs:
-ENTRY(trampoline_data) (trampoline_64.S)
-  ->ENTRY(secondary_startup_64) (head_64.S)
-    ->start_secondary() (smpboot.c)
-      ->cpu_init();
-      ->smp_callin();
-        ->cpumask_set_cpu(cpuid, cpu_callin_mask);   // Note: if the AP gets here, the BSP will not print the error message.
-
-From the above call chain, we can be sure that the AP got stuck between
-trampoline_data and the cpumask_set_cpu() in smp_callin(). We looked through
-these code paths carefully, and only found a 'hlt' instruction that could
-block the process. It is located in trampoline_data():
-
-ENTRY(trampoline_data)
-    ...
-
-    call verify_cpu     # Verify the cpu supports long mode
-    testl %eax, %eax    # Check for return code
-    jnz no_longmode
-
-    ...
-
-no_longmode:
-    hlt
-    jmp no_longmode
-
-In verify_cpu(), the only sensitive instruction we can find that could cause a
-VM exit from non-root mode is 'cpuid'.
-This is why we suspect that the cpuid emulation in KVM/QEMU is wrong, leading
-to the failure in verify_cpu.
-
-From the message in the VM, we know something is wrong with vcpu1 and vcpu7.
-[    5.060042] CPU1: Stuck ??
-[   10.170815] CPU7: Stuck ??
-[   10.171648] Brought up 6 CPUs
-
-Besides, the following is the CPU info obtained from the host.
-80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command instance-0000000
-* CPU #0: pc=0x00007f64160c683d thread_id=68570
-  CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
-  CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
-  CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
-  CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
-  CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
-  CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
-  CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
-
-Oh, I also forgot to mention in the above message that we have bound each vCPU
-to a different physical CPU on the host.
-
-Thanks,
-zhanghailiang
-
-On 06/07/2015 11:59, zhanghailiang wrote:
-> Besides, the following is the CPU info obtained from the host.
-> [...]
-> Oh, I also forgot to mention in the above message that we have bound
-> each vCPU to a different physical CPU on the host.
-
-Can you capture a trace on the host (trace-cmd record -e kvm) and send
-it privately?  Please note which CPUs get stuck, since I guess it's not
-always 1 and 7.
-
-Paolo
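Since the suspicion falls on verify_cpu's long-mode test, it may help to see what that test boils down to. Below is a minimal userspace sketch of the same CPUID check (an illustration, not the kernel's 16/32-bit assembly; __get_cpuid is GCC/Clang's cpuid.h wrapper, and leaf 0x80000001 EDX bit 29 is the architectural long-mode flag):

    #include <stdio.h>
    #include <cpuid.h>

    /* Returns 1 if the CPU advertises long mode (64-bit support).
     * verify_cpu jumps to the no_longmode hlt loop when the equivalent
     * check fails, which would produce exactly the observed "Stuck" AP. */
    static int cpu_has_long_mode(void)
    {
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
            return 0;   /* extended leaf not supported */
        }
        return (edx >> 29) & 1;   /* EDX bit 29 = LM (long mode) */
    }

    int main(void)
    {
        printf("long mode: %s\n", cpu_has_long_mode() ? "yes" : "no");
        return 0;
    }

If the emulated CPUID ever misreported this bit to an AP during early boot, the AP would halt exactly as described; running such a probe on the affected vCPUs once they are online is one way to cross-check the suspicion.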
-
-On Mon, 6 Jul 2015 17:59:10 +0800
-zhanghailiang <address@hidden> wrote:
-
-> On 2015/7/6 16:45, Paolo Bonzini wrote:
->> On 06/07/2015 09:54, zhanghailiang wrote:
->>> From the host, we found that the QEMU vcpu1 thread and vcpu7 thread were
->>> not consuming any CPU (they should be in the idle state).
->>> [...]
->>> We looked into the kernel code paths that could lead to the above 'Stuck'
->>> warning,
-
-in current upstream there isn't any printk(...Stuck...) left, since that code
-path has been reworked.
-I've often seen this on an over-committed host during guest CPU up/down
-torture tests.
-Could you update the guest kernel to upstream and see if the issue reproduces?
-
-> [rest of the quoted thread snipped]
-
-On 2015/7/7 19:23, Igor Mammedov wrote:
-> in current upstream there isn't any printk(...Stuck...) left, since that
-> code path has been reworked.
-> I've often seen this on an over-committed host during guest CPU up/down
-> torture tests.
-> Could you update the guest kernel to upstream and see if the issue reproduces?
-
-Hmm, unfortunately, it is very hard to reproduce, and we are still trying to
-reproduce it.
-
-For your test case, is it a kernel bug?
-Or is there any related patch that could solve your test problem that has been
-merged into upstream?
-
-Thanks,
-zhanghailiang
-
-On Tue, 7 Jul 2015 19:43:35 +0800
-zhanghailiang <address@hidden> wrote:
-
-> For your test case, is it a kernel bug?
-> Or is there any related patch that could solve your test problem that has
-> been merged into upstream?
-
-I don't remember all the prerequisite patches, but you should be able to find
-http://marc.info/?l=linux-kernel&m=140326703108009&w=2
-"x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
-and then look for its dependencies.
-
-On 2015/7/7 20:21, Igor Mammedov wrote:
-> I don't remember all the prerequisite patches, but you should be able to find
-> http://marc.info/?l=linux-kernel&m=140326703108009&w=2
-> "x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
-> and then look for its dependencies.
-
-Er, we have investigated this patch, and it is not related to our problem, :)
-
-Thanks,
-zhanghailiang
-
-[rest of the quoted thread snipped]
-
diff --git a/results/classifier/014/risc-v/70294255 b/results/classifier/014/risc-v/70294255
deleted file mode 100644
index f8e35953..00000000
--- a/results/classifier/014/risc-v/70294255
+++ /dev/null
@@ -1,1088 +0,0 @@
-risc-v: 0.863
-mistranslation: 0.862
-assembly: 0.861
-PID: 0.859
-semantic: 0.858
-socket: 0.858
-device: 0.857
-user-level: 0.857
-graphic: 0.857
-arm: 0.856
-debug: 0.854
-permissions: 0.854
-architecture: 0.851
-performance: 0.850
-kernel: 0.848
-network: 0.846
-operating system: 0.844
-register: 0.842
-vnc: 0.837
-alpha: 0.834
-files: 0.832
-virtual: 0.832
-hypervisor: 0.828
-peripherals: 0.819
-boot: 0.811
-i386: 0.811
-KVM: 0.806
-x86: 0.803
-ppc: 0.800
-TCG: 0.792
-VMM: 0.784
-
-[Qemu-devel] Reply: Re: Reply: Re: Reply: Re: Reply: Re: [BUG] COLO failover hang
-
-hi:
-
-Yes, it is better.
-
-And should we delete
-
-#ifdef WIN32
-    QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL);
-#endif
-
-in qio_channel_socket_accept?
-
-qio_channel_socket_new already has it.
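For context on the question above: the constructor already performs the per-channel setup, so once the accept path goes through it (as in the fix quoted below), a second copy of the Win32 event creation becomes redundant. A self-contained model of that reasoning (the names mirror QEMU's, but the types here are simplified stand-ins and the feature bit value is illustrative):

    #include <stdlib.h>

    typedef struct {
        int fd;
        unsigned int features;
        void *event;              /* Win32 event handle on WIN32 builds */
    } QIOChannelSocketModel;

    enum { FEATURE_SHUTDOWN = 1 << 1 };   /* illustrative bit value */

    static QIOChannelSocketModel *channel_socket_new_model(void)
    {
        QIOChannelSocketModel *sioc = calloc(1, sizeof(*sioc));
        sioc->fd = -1;
    #ifdef WIN32
        sioc->event = CreateEvent(NULL, FALSE, FALSE, NULL);  /* created once, here */
    #endif
        sioc->features |= FEATURE_SHUTDOWN;                   /* set once, here */
        return sioc;
    }

    static QIOChannelSocketModel *channel_socket_accept_model(void)
    {
        /* Routing accept() through the constructor means accept() must
         * not repeat the CreateEvent/feature setup itself. */
        return channel_socket_new_model();
    }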
-
-Original Mail
-
-From: address@hidden
-To: wangguang 10165992
-Cc: address@hidden, address@hidden, address@hidden, address@hidden
-Date: 2017-03-22 15:03
-Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: Reply: Re: [BUG] COLO failover hang
-
-Hi,
-
-On 2017/3/22 9:42, address@hidden wrote:
-> diff --git a/migration/socket.c b/migration/socket.c
-> index 13966f1..d65a0ea 100644
-> --- a/migration/socket.c
-> +++ b/migration/socket.c
-> @@ -147,8 +147,9 @@ static gboolean socket_accept_incoming_migration(QIOChannel *ioc,
->      }
->
->      trace_migration_socket_incoming_accepted();
->
->      qio_channel_set_name(QIO_CHANNEL(sioc), "migration-socket-incoming");
-> +    qio_channel_set_feature(QIO_CHANNEL(sioc), QIO_CHANNEL_FEATURE_SHUTDOWN);
->      migration_channel_process_incoming(migrate_get_current(),
->                                         QIO_CHANNEL(sioc));
->      object_unref(OBJECT(sioc));
->
-> Is this patch ok?
-
-Yes, i think this works, but a better way may be to call qio_channel_set_feature()
-in qio_channel_socket_accept(); we didn't set the SHUTDOWN feature for the
-socket accept fd. Or fix it like this:
-
-diff --git a/io/channel-socket.c b/io/channel-socket.c
-index f546c68..ce6894c 100644
---- a/io/channel-socket.c
-+++ b/io/channel-socket.c
-@@ -330,9 +330,8 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
-                           Error **errp)
- {
-     QIOChannelSocket *cioc;
-
--    cioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
--    cioc->fd = -1;
-+    cioc = qio_channel_socket_new();
-     cioc->remoteAddrLen = sizeof(ioc->remoteAddr);
-     cioc->localAddrLen = sizeof(ioc->localAddr);
-
-Thanks,
-Hailiang
-
-> I have tested it. The test does not hang any more.
-
-> Original Mail
->
-> From: address@hidden
-> To: address@hidden, address@hidden
-> Cc: address@hidden, address@hidden, address@hidden
-> Date: 2017-03-22 09:11
-> Subject: Re: [Qemu-devel] Reply: Re: Reply: Re: [BUG] COLO failover hang
->
-> On 2017/3/21 19:56, Dr. David Alan Gilbert wrote:
->> * Hailiang Zhang (address@hidden) wrote:
->>> Hi,
->>>
->>> Thanks for reporting this; I confirmed it in my test, and it is a bug.
->>>
->>> Though we tried to call qemu_file_shutdown() to shut down the related fd,
->>> in case the COLO thread/incoming thread is stuck in read()/write() while
->>> doing failover, it didn't take effect, because all the fds used by COLO
->>> (also migration) have been wrapped by a qio channel, and it will not call
->>> the shutdown API if we didn't qio_channel_set_feature(QIO_CHANNEL(sioc),
->>> QIO_CHANNEL_FEATURE_SHUTDOWN).
->>>
->>> Cc: Dr. David Alan Gilbert <address@hidden>
->>>
->>> I suspected migration cancel has the same problem; it may be stuck in
->>> write() if we tried to cancel migration.
->>>
->>> void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error **errp)
->>> {
->>>     qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing");
->>>     migration_channel_connect(s, ioc, NULL);
->>>     ... ...
->>>
->>> We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc),
->>> QIO_CHANNEL_FEATURE_SHUTDOWN) above, and then
->>>
->>> migrate_fd_cancel()
->>> {
->>>     ... ...
->>>     if (s->state == MIGRATION_STATUS_CANCELLING && f) {
->>>         qemu_file_shutdown(f);  // --> This will not take effect. No?
->>>     }
->>> }
->>
->> (cc'd in Daniel Berrange).
->> I see that we call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)
->> at the top of qio_channel_socket_new so I think that's safe isn't it?
->
-> Hmm, you are right, this problem only exists for the migration incoming fd,
-> thanks.
->
->> Dave
->
->>> On 2017/3/21 16:10, address@hidden wrote:
->>>> Thank you.
->>>>
->>>> I have tested it already.
->>>>
->>>> When the Primary Node panics, the Secondary Node qemu hangs at the same place.
->>>>
->>>> According to http://wiki.qemu-project.org/Features/COLO, killing the
->>>> Primary Node qemu will not produce the problem, but a Primary Node panic can.
->>>>
->>>> I think it is because the channel does not support
->>>> QIO_CHANNEL_FEATURE_SHUTDOWN:
->>>> on failover, channel_shutdown could not shut down the channel,
->>>> so colo_process_incoming_thread will hang at recvmsg.
->>>>
->>>> I tested a patch [the migration/socket.c patch quoted above].
->>>> My test does not hang any more.
->>>>
->>>> Original Mail
->>>> From: address@hidden
->>>> To: wangguang 10165992, address@hidden
->>>> Cc: address@hidden, address@hidden
->>>> Date: 2017-03-21 15:58
->>>> Subject: Re: [Qemu-devel] Reply: Re: [BUG] COLO failover hang
->>>>
->>>> Hi, Wang.
->>>>
->>>> You can test this branch:
->>>> https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk
->>>> and please follow the wiki to ensure your own configuration is correct:
->>>> http://wiki.qemu-project.org/Features/COLO
->>>>
->>>> Thanks
->>>> Zhang Chen
->>>>
->>>> On 03/21/2017 03:27 PM, address@hidden wrote:
->>>>> hi.
->>>>>
->>>>> I tested the git qemu master; it has the same problem.
->>>>>
->>>>> (gdb) bt
->>>>> #0  qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880,
->>>>>     niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461
->>>>> #1  0x00007f658e4aa0c2 in qio_channel_read (...) at io/channel.c:114
->>>>> #2  0x00007f658e3ea990 in channel_get_buffer (opaque=<optimized out>,
->>>>>     buf=0x7f65907cb838 "", pos=<optimized out>, size=32768) at migration/qemu-file-channel.c:78
->>>>> #3  0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at migration/qemu-file.c:295
->>>>> #4  0x00007f658e3ea2e1 in qemu_peek_byte (...) at migration/qemu-file.c:555
->>>>> #5  0x00007f658e3ea34b in qemu_get_byte (...) at migration/qemu-file.c:568
->>>>> #6  0x00007f658e3ea552 in qemu_get_be32 (...) at migration/qemu-file.c:648
->>>>> #7  0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, ...) at migration/colo.c:244
->>>>> #8  0x00007f658e3e681e in colo_receive_check_message (f=<optimized out>, ...) at migration/colo.c:264
->>>>> #9  0x00007f658e3e740e in colo_process_incoming_thread (opaque=0x7f658eb30360 <mis_current.31286>) at migration/colo.c:577
->>>>> #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0
->>>>> #11 0x00007f65881983ed in clone () from /lib64/libc.so.6
->>>>> (gdb) p ioc->name
->>>>> $2 = 0x7f658ff7d5c0 "migration-socket-incoming"
->>>>> (gdb) p ioc->features      <- does not support QIO_CHANNEL_FEATURE_SHUTDOWN
->>>>> $3 = 0
->>>>>
->>>>> (gdb) bt
->>>>> #0  socket_accept_incoming_migration (ioc=0x7fdcceeafa90,
->>>>>     condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137
->>>>> #1  0x00007fdcc6966350 in g_main_dispatch (context=<optimized out>) at gmain.c:3054
->>>>> #2  g_main_context_dispatch (context=<optimized out>, ...) at gmain.c:3630
->>>>> #3  0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213
->>>>> #4  os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:258
->>>>> #5  main_loop_wait (...) at util/main-loop.c:506
->>>>> #6  0x00007fdccb526187 in main_loop () at vl.c:1898
->>>>> #7  main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4709
->>>>> (gdb) p ioc->features
->>>>> $1 = 6
->>>>> (gdb) p ioc->name
->>>>> $2 = 0x7fdcce1b1ab0 "migration-socket-listener"
->>>>>
->>>>> Maybe socket_accept_incoming_migration should call
->>>>> qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)??
->>>>>
->>>>> thank you.
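The gdb output above ($3 = 0 for the incoming channel vs. $1 = 6 for the listener) is the whole bug in miniature: a shutdown request is silently refused when the channel lacks the SHUTDOWN feature bit, so the thread blocked in recvmsg is never woken. A self-contained sketch of that gating (simplified from the mechanism the thread describes, not the exact QEMU source; the bit value is illustrative):

    #include <errno.h>
    #include <stdbool.h>

    enum { FEATURE_SHUTDOWN = 1 << 1 };   /* illustrative bit value */

    typedef struct {
        unsigned int features;
        int fd;
    } ChannelModel;

    static bool has_feature(const ChannelModel *ioc, unsigned int f)
    {
        return (ioc->features & f) != 0;
    }

    /* Without the flag, the reader blocked in recvmsg() never sees the
     * socket close, which is exactly the failover hang reported here. */
    static int channel_shutdown_model(ChannelModel *ioc)
    {
        if (!has_feature(ioc, FEATURE_SHUTDOWN)) {
            return -ENOSYS;   /* silently leaves colo_process_incoming_thread stuck */
        }
        /* shutdown(ioc->fd, SHUT_RDWR); would go here */
        return 0;
    }

Setting the feature on the accepted channel, or constructing it via qio_channel_socket_new (which sets it), lets the failover path actually reach shutdown().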
->>>>>
->>>>> Original Mail
->>>>> From: address@hidden
->>>>> To: address@hidden
->>>>> Cc: address@hidden
->>>>> Date: 2017-03-16 14:46
->>>>> Subject: Re: [Qemu-devel] COLO failover hang
->>>>>
->>>>> On 03/15/2017 05:06 PM, wangguang wrote:
->>>>>> I am testing the QEMU COLO feature described here
->>>>>> [QEMU Wiki](http://wiki.qemu-project.org/Features/COLO).
->>>>>>
->>>>>> When the Primary Node panics, the Secondary Node qemu hangs,
->>>>>> at recvmsg in qio_channel_socket_readv.
->>>>>> And when I run { 'execute': 'nbd-server-stop' } and { "execute":
->>>>>> "x-colo-lost-heartbeat" } in the Secondary VM's monitor, the Secondary
->>>>>> Node qemu still hangs at recvmsg.
->>>>>>
->>>>>> I found that COLO in qemu is not complete yet.
->>>>>> Does COLO have any plan for development?
->>>>>
->>>>> Yes, we are developing. You can see some of the patches we are pushing.
->>>>>
->>>>>> Has anyone ever run it successfully? Any help is appreciated!
->>>>>
->>>>> Our internal version can run it successfully.
->>>>> For the failover details you can ask Zhanghailiang for help.
->>>>> Next time if you have some question about COLO,
->>>>> please cc me and zhanghailiang <address@hidden>.
->>>>>
->>>>> Thanks
->>>>> Zhang Chen
->>>>>
->>>>>> centos7.2+qemu2.7.50
->>>>>> (gdb) bt
->>>>>> #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
->>>>>> #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>,
->>>>>>     iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:497
->>>>>> #2  0x00007f3e03329472 in qio_channel_read (...) at io/channel.c:97
->>>>>> #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
->>>>>>     buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at migration/qemu-file-channel.c:78
->>>>>> #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at migration/qemu-file.c:257
->>>>>> #5  0x00007f3e03274a41 in qemu_peek_byte (...) at migration/qemu-file.c:510
->>>>>> #6  0x00007f3e03274aab in qemu_get_byte (...) at migration/qemu-file.c:523
->>>>>> #7  0x00007f3e03274cb2 in qemu_get_be32 (...) at migration/qemu-file.c:603
->>>>>> #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00, ...) at migration/colo.c:215
->>>>>> #9  0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48,
->>>>>>     checkpoint_request=<synthetic pointer>, f=<optimized out>) at migration/colo.c:546
->>>>>> #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at migration/colo.c:649
->>>>>> #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
->>>>>> #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
->>>>>>
->>>>>> --
->>>>>> View this message in context: http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
->>>>>> Sent from the Developer mailing list archive at Nabble.com.
->
->> --
->> Dr. David Alan Gilbert / address@hidden / Manchester, UK
-
-On 2017/3/22 16:09, address@hidden wrote:
-> hi:
->
-> Yes, it is better.
->
-> And should we delete
->
-> #ifdef WIN32
->     QIO_CHANNEL(cioc)->event = CreateEvent(NULL, FALSE, FALSE, NULL);
-> #endif
->
-> in qio_channel_socket_accept?
->
-> qio_channel_socket_new already has it.
-
-Yes, you are right.
-
-> [rest of the quoted thread snipped]
-ï¼ ï¼ï¼ -ï¼ ï¼ï¼ Though we tried to call qemu_file_shutdown() to shutdown the related fd, in -ï¼ ï¼ï¼ case COLO thread/incoming thread is stuck in read/write() while do -failover, -ï¼ ï¼ï¼ but it didn't take effect, because all the fd used by COLO (also migration) -ï¼ ï¼ï¼ has been wrapped by qio channel, and it will not call the shutdown API if -ï¼ ï¼ï¼ we didn't qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN). -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ Cc: Dr. David Alan Gilbert address@hidden -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ I doubted migration cancel has the same problem, it may be stuck in write() -ï¼ ï¼ï¼ if we tried to cancel migration. -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ void fd_start_outgoing_migration(MigrationState *s, const char *fdname, -Error **errp) -ï¼ ï¼ï¼ { -ï¼ ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(ioc), "migration-fd-outgoing") -ï¼ ï¼ï¼ migration_channel_connect(s, ioc, NULL) -ï¼ ï¼ï¼ ... ... -ï¼ ï¼ï¼ We didn't call qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) above, -ï¼ ï¼ï¼ and the -ï¼ ï¼ï¼ migrate_fd_cancel() -ï¼ ï¼ï¼ { -ï¼ ï¼ï¼ ... ... -ï¼ ï¼ï¼ if (s-ï¼state == MIGRATION_STATUS_CANCELLING && f) { -ï¼ ï¼ï¼ qemu_file_shutdown(f) --ï¼ This will not take effect. No ? -ï¼ ï¼ï¼ } -ï¼ ï¼ï¼ } -ï¼ ï¼ -ï¼ ï¼ (cc'd in Daniel Berrange). -ï¼ ï¼ I see that we call qio_channel_set_feature(ioc, -QIO_CHANNEL_FEATURE_SHUTDOWN) at the -ï¼ ï¼ top of qio_channel_socket_new so I think that's safe isn't it? -ï¼ ï¼ -ï¼ -ï¼ Hmm, you are right, this problem is only exist for the migration incoming fd, -thanks. -ï¼ -ï¼ ï¼ Dave -ï¼ ï¼ -ï¼ ï¼ï¼ Thanks, -ï¼ ï¼ï¼ Hailiang -ï¼ ï¼ï¼ -ï¼ ï¼ï¼ On 2017/3/21 16:10, address@hidden wrote: -ï¼ ï¼ï¼ï¼ Thank youã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I have test areadyã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ When the Primary Node panic,the Secondary Node qemu hang at the same -placeã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Incorrding -http://wiki.qemu-project.org/Features/COLO -ï¼kill Primary Node -qemu will not produce the problem,but Primary Node panic canã -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I think due to the feature of channel does not support -QIO_CHANNEL_FEATURE_SHUTDOWN. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ when failover,channel_shutdown could not shut down the channel. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ so the colo_process_incoming_thread will hang at recvmsg. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ I test a patch: -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ diff --git a/migration/socket.c b/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ index 13966f1..d65a0ea 100644 -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ --- a/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ +++ b/migration/socket.c -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ @@ -147,8 +147,9 @@ static gboolean -socket_accept_incoming_migration(QIOChannel *ioc, -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ } -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ trace_migration_socket_incoming_accepted() -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ qio_channel_set_name(QIO_CHANNEL(sioc), -"migration-socket-incoming") -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ + qio_channel_set_feature(QIO_CHANNEL(sioc), -QIO_CHANNEL_FEATURE_SHUTDOWN) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ migration_channel_process_incoming(migrate_get_current(), -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ QIO_CHANNEL(sioc)) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ object_unref(OBJECT(sioc)) -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ My test will not hang any more. 
-ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ åå§é®ä»¶ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ åä»¶äººï¼ address@hidden -ï¼ ï¼ï¼ï¼ æ¶ä»¶äººï¼ç广10165992 address@hidden -ï¼ ï¼ï¼ï¼ æéäººï¼ address@hidden address@hidden -ï¼ ï¼ï¼ï¼ æ¥ æ ï¼2017å¹´03æ21æ¥ 15:58 -ï¼ ï¼ï¼ï¼ 主 é¢ ï¼Re: [Qemu-devel] çå¤: Re: [BUG]COLO failover hang -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Hi,Wang. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ You can test this branch: -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -https://github.com/coloft/qemu/tree/colo-v5.1-developing-COLO-frame-v21-with-shared-disk -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ and please follow wiki ensure your own configuration correctly. -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -http://wiki.qemu-project.org/Features/COLO -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Thanks -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ Zhang Chen -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ -ï¼ ï¼ï¼ï¼ On 03/21/2017 03:27 PM, address@hidden wrote: -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ hi. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ I test the git qemu master have the same problem. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #0 qio_channel_socket_readv (ioc=0x7f65911b4e50, iov=0x7f64ef3fd880, -ï¼ ï¼ï¼ï¼ ï¼ niov=1, fds=0x0, nfds=0x0, errp=0x0) at io/channel-socket.c:461 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007f658e4aa0c2 in qio_channel_read -ï¼ ï¼ï¼ï¼ ï¼ (address@hidden, address@hidden "", -ï¼ ï¼ï¼ï¼ ï¼ address@hidden, address@hidden) at io/channel.c:114 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #2 0x00007f658e3ea990 in channel_get_buffer (opaque=ï¼optimized outï¼, -ï¼ ï¼ï¼ï¼ ï¼ buf=0x7f65907cb838 "", pos=ï¼optimized outï¼, size=32768) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file-channel.c:78 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007f658e3e97fc in qemu_fill_buffer (f=0x7f65907cb800) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:295 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #4 0x00007f658e3ea2e1 in qemu_peek_byte (address@hidden, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/qemu-file.c:555 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #5 0x00007f658e3ea34b in qemu_get_byte (address@hidden) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:568 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007f658e3ea552 in qemu_get_be32 (address@hidden) at -ï¼ ï¼ï¼ï¼ ï¼ migration/qemu-file.c:648 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #7 0x00007f658e3e66e5 in colo_receive_message (f=0x7f65907cb800, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at migration/colo.c:244 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #8 0x00007f658e3e681e in colo_receive_check_message (f=ï¼optimized -ï¼ ï¼ï¼ï¼ ï¼ outï¼, address@hidden, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ at migration/colo.c:264 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #9 0x00007f658e3e740e in colo_process_incoming_thread -ï¼ ï¼ï¼ï¼ ï¼ (opaque=0x7f658eb30360 ï¼mis_current.31286ï¼) at migration/colo.c:577 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #10 0x00007f658be09df3 in start_thread () from /lib64/libpthread.so.0 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #11 0x00007f65881983ed in clone () from /lib64/libc.so.6 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7f658ff7d5c0 "migration-socket-incoming" -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼features Do not support QIO_CHANNEL_FEATURE_SHUTDOWN -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $3 = 0 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) bt -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #0 socket_accept_incoming_migration (ioc=0x7fdcceeafa90, -ï¼ ï¼ï¼ï¼ ï¼ condition=G_IO_IN, opaque=0x7fdcceeafa90) at migration/socket.c:137 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #1 0x00007fdcc6966350 in g_main_dispatch (context=ï¼optimized outï¼) 
at -ï¼ ï¼ï¼ï¼ ï¼ gmain.c:3054 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #2 g_main_context_dispatch (context=ï¼optimized outï¼, -ï¼ ï¼ï¼ï¼ ï¼ address@hidden) at gmain.c:3630 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #3 0x00007fdccb8a6dcc in glib_pollfds_poll () at util/main-loop.c:213 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #4 os_host_main_loop_wait (timeout=ï¼optimized outï¼) at -ï¼ ï¼ï¼ï¼ ï¼ util/main-loop.c:258 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #5 main_loop_wait (address@hidden) at -ï¼ ï¼ï¼ï¼ ï¼ util/main-loop.c:506 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #6 0x00007fdccb526187 in main_loop () at vl.c:1898 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ #7 main (argc=ï¼optimized outï¼, argv=ï¼optimized outï¼, envp=ï¼optimized -ï¼ ï¼ï¼ï¼ ï¼ outï¼) at vl.c:4709 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼features -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $1 = 6 -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ (gdb) p ioc-ï¼name -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ $2 = 0x7fdcce1b1ab0 "migration-socket-listener" -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ May be socket_accept_incoming_migration should -ï¼ ï¼ï¼ï¼ ï¼ call qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)?? -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ thank you. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ åå§é®ä»¶ -ï¼ ï¼ï¼ï¼ ï¼ address@hidden -ï¼ ï¼ï¼ï¼ ï¼ address@hidden -ï¼ ï¼ï¼ï¼ ï¼ address@hidden@huawei.comï¼ -ï¼ ï¼ï¼ï¼ ï¼ *æ¥ æ ï¼*2017å¹´03æ16æ¥ 14:46 -ï¼ ï¼ï¼ï¼ ï¼ *主 é¢ ï¼**Re: [Qemu-devel] COLO failover hang* -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ On 03/15/2017 05:06 PM, wangguang wrote: -ï¼ ï¼ï¼ï¼ ï¼ ï¼ am testing QEMU COLO feature described here [QEMU -ï¼ ï¼ï¼ï¼ ï¼ ï¼ Wiki]( -http://wiki.qemu-project.org/Features/COLO -). -ï¼ ï¼ï¼ï¼ ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ ï¼ When the Primary Node panic,the Secondary Node qemu hang. -ï¼ ï¼ï¼ï¼ ï¼ ï¼ hang at recvmsg in qio_channel_socket_readv. -ï¼ ï¼ï¼ï¼ ï¼ ï¼ And I run { 'execute': 'nbd-server-stop' } and { "execute": -ï¼ ï¼ï¼ï¼ ï¼ ï¼ "x-colo-lost-heartbeat" } in Secondary VM's -ï¼ ï¼ï¼ï¼ ï¼ ï¼ monitor,the Secondary Node qemu still hang at recvmsg . -ï¼ ï¼ï¼ï¼ ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ ï¼ I found that the colo in qemu is not complete yet. -ï¼ ï¼ï¼ï¼ ï¼ ï¼ Do the colo have any plan for development? -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ Yes, We are developing. You can see some of patch we pushing. -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ ï¼ Has anyone ever run it successfully? Any help is appreciated! -ï¼ ï¼ï¼ï¼ ï¼ -ï¼ ï¼ï¼ï¼ ï¼ In our internal version can run it successfully, -ï¼ ï¼ï¼ï¼ ï¼ The failover detail you can ask Zhanghailiang for help. 
> > Next time, if you have a question about COLO,
> > please cc me and zhanghailiang address@hidden.
> >
> > Thanks
> > Zhang Chen
> >
> > > centos7.2 + qemu 2.7.50
> > > (gdb) bt
> > > #0  0x00007f3e00cc86ad in recvmsg () from /lib64/libpthread.so.0
> > > #1  0x00007f3e0332b738 in qio_channel_socket_readv (ioc=<optimized out>,
> > >     iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
> > >     errp=0x0) at io/channel-socket.c:497
> > > #2  0x00007f3e03329472 in qio_channel_read (address@hidden,
> > >     address@hidden "", address@hidden, address@hidden) at io/channel.c:97
> > > #3  0x00007f3e032750e0 in channel_get_buffer (opaque=<optimized out>,
> > >     buf=0x7f3e05910f38 "", pos=<optimized out>, size=32768) at
> > >     migration/qemu-file-channel.c:78
> > > #4  0x00007f3e0327412c in qemu_fill_buffer (f=0x7f3e05910f00) at
> > >     migration/qemu-file.c:257
> > > #5  0x00007f3e03274a41 in qemu_peek_byte (address@hidden,
> > >     address@hidden) at migration/qemu-file.c:510
> > > #6  0x00007f3e03274aab in qemu_get_byte (address@hidden) at
> > >     migration/qemu-file.c:523
> > > #7  0x00007f3e03274cb2 in qemu_get_be32 (address@hidden) at
> > >     migration/qemu-file.c:603
> > > #8  0x00007f3e03271735 in colo_receive_message (f=0x7f3e05910f00,
> > >     address@hidden) at migration/colo.c:215
> > > #9  0x00007f3e0327250d in colo_wait_handle_message (errp=0x7f3d62bfaa48,
> > >     checkpoint_request=<synthetic pointer>, f=<optimized out>) at
> > >     migration/colo.c:546
> > > #10 colo_process_incoming_thread (opaque=0x7f3e067245e0) at
> > >     migration/colo.c:649
> > > #11 0x00007f3e00cc1df3 in start_thread () from /lib64/libpthread.so.0
> > > #12 0x00007f3dfc9c03ed in clone () from /lib64/libc.so.6
> > >
> > > --
> > > View this message in context:
> > > http://qemu.11.n7.nabble.com/COLO-failover-hang-tp473250.html
> > > Sent from the Developer mailing list archive at Nabble.com.
> >
> > --
> > Thanks
> > Zhang Chen
>
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK
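For context on why the missing feature bit matters: a failover request can only break the secondary out of the blocked recvmsg() by shutting the socket down, and that path is gated on the very flag the gdb sessions above show as unset. Roughly (an illustrative fragment built on the public qio_channel API; the function name is hypothetical, not a quote of the COLO code):

/* Failover side (sketch): interrupt the COLO incoming thread, which
 * is blocked in recvmsg() inside qio_channel_socket_readv(). */
static void colo_failover_interrupt_read(QIOChannel *ioc)
{
    if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN)) {
        /* shutdown(2) on the underlying socket makes the blocked
         * recvmsg() return, so the thread can exit cleanly. */
        qio_channel_shutdown(ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
    }
    /* If the feature bit was never set (ioc->features == 0, as in
     * the backtraces above), nothing interrupts the read and the
     * incoming thread stays stuck -- the reported hang. */
}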
diff --git a/results/classifier/014/risc-v/74545755 b/results/classifier/014/risc-v/74545755
deleted file mode 100644
index d2e98605..00000000
--- a/results/classifier/014/risc-v/74545755
+++ /dev/null
@@ -1,371 +0,0 @@

risc-v: 0.845
user-level: 0.790
operating system: 0.784
register: 0.778
permissions: 0.770
mistranslation: 0.752
debug: 0.740
TCG: 0.722
performance: 0.721
device: 0.720
semantic: 0.669
virtual: 0.667
arm: 0.662
KVM: 0.661
graphic: 0.660
ppc: 0.659
vnc: 0.650
assembly: 0.648
architecture: 0.636
boot: 0.607
VMM: 0.602
files: 0.577
peripherals: 0.566
hypervisor: 0.563
network: 0.550
socket: 0.549
x86: 0.545
alpha: 0.508
PID: 0.479
kernel: 0.452
i386: 0.376

[Bug Report][RFC PATCH 0/1] block: fix failing assert on paused VM migration

There's a bug (a failing assert) which is reproduced during migration of
a paused VM. I am able to reproduce it on a setup with 2 nodes and a common
NFS share, with the VM's disk on that share.

root@fedora40-1-vm:~# virsh domblklist alma8-vm
 Target   Source
------------------------------------------
 sda      /mnt/shared/images/alma8.qcow2

root@fedora40-1-vm:~# df -Th /mnt/shared
Filesystem           Type  Size  Used  Avail  Use%  Mounted on
127.0.0.1:/srv/nfsd  nfs4   63G   16G    48G   25%  /mnt/shared

On the 1st node:

root@fedora40-1-vm:~# virsh start alma8-vm ; virsh suspend alma8-vm
root@fedora40-1-vm:~# virsh migrate --compressed --p2p --persistent --undefinesource --live alma8-vm qemu+ssh://fedora40-2-vm/system

Then on the 2nd node:

root@fedora40-2-vm:~# virsh migrate --compressed --p2p --persistent --undefinesource --live alma8-vm qemu+ssh://fedora40-1-vm/system
error: operation failed: domain is not running

root@fedora40-2-vm:~# tail -3 /var/log/libvirt/qemu/alma8-vm.log
2024-09-19 13:53:33.336+0000: initiating migration
qemu-system-x86_64: ../block.c:6976: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
2024-09-19 13:53:42.991+0000: shutting down, reason=crashed
Backtrace:

(gdb) bt
#0  0x00007f7eaa2f1664 in __pthread_kill_implementation () at /lib64/libc.so.6
#1  0x00007f7eaa298c4e in raise () at /lib64/libc.so.6
#2  0x00007f7eaa280902 in abort () at /lib64/libc.so.6
#3  0x00007f7eaa28081e in __assert_fail_base.cold () at /lib64/libc.so.6
#4  0x00007f7eaa290d87 in __assert_fail () at /lib64/libc.so.6
#5  0x0000563c38b95eb8 in bdrv_inactivate_recurse (bs=0x563c3b6c60c0) at ../block.c:6976
#6  0x0000563c38b95aeb in bdrv_inactivate_all () at ../block.c:7038
#7  0x0000563c3884d354 in qemu_savevm_state_complete_precopy_non_iterable (f=0x563c3b700c20, in_postcopy=false, inactivate_disks=true)
    at ../migration/savevm.c:1571
#8  0x0000563c3884dc1a in qemu_savevm_state_complete_precopy (f=0x563c3b700c20, iterable_only=false, inactivate_disks=true) at ../migration/savevm.c:1631
#9  0x0000563c3883a340 in migration_completion_precopy (s=0x563c3b4d51f0, current_active_state=<optimized out>) at ../migration/migration.c:2780
#10 migration_completion (s=0x563c3b4d51f0) at ../migration/migration.c:2844
#11 migration_iteration_run (s=0x563c3b4d51f0) at ../migration/migration.c:3270
#12 migration_thread (opaque=0x563c3b4d51f0) at ../migration/migration.c:3536
#13 0x0000563c38dbcf14 in qemu_thread_start (args=0x563c3c2d5bf0) at ../util/qemu-thread-posix.c:541
#14 0x00007f7eaa2ef6d7 in start_thread () at /lib64/libc.so.6
#15 0x00007f7eaa373414 in clone () at /lib64/libc.so.6

What happens here is that after the 1st migration the BDS related to the HDD
remains inactive, as the VM is still paused. Then, when we initiate the 2nd
migration, bdrv_inactivate_all() leads to an attempt to set the
BDRV_O_INACTIVE flag on a node where it is already set, so the assert fails.

The attached patch, which simply skips setting the flag when it's already
set, is more of a kludge than a clean solution. Should we use more
sophisticated logic which allows some of the nodes to be inactive prior to
the migration, and takes them into account during bdrv_inactivate_all()?
Comments would be appreciated.

Andrey

Andrey Drobyshev (1):
  block: do not fail when inactivating node which is inactive

 block.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--
2.39.3
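The flag lifecycle described above can be boiled down to a few lines (a self-contained toy model of the failure, not QEMU code; the flag value is assumed to match block.h, but any non-zero bit would demonstrate the same abort):

#include <assert.h>

#define BDRV_O_INACTIVE 0x0800      /* assumed to match QEMU's block.h */

static int open_flags;              /* stands in for bs->open_flags */

static void inactivate(void)        /* models bdrv_inactivate_recurse() */
{
    assert(!(open_flags & BDRV_O_INACTIVE));    /* the failing assert */
    open_flags |= BDRV_O_INACTIVE;
}

int main(void)
{
    inactivate();   /* 1st migration of the paused VM: flag gets set */
    /* The VM never resumes, so bdrv_activate_all() never runs and the
     * flag is never cleared on what is now the migration source... */
    inactivate();   /* ...and the 2nd migration aborts right here. */
    return 0;
}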
Instead of failing the assert, let's just ignore the fact that the flag is
already set, and return. We assume that this is safe to ignore. Otherwise
the assert fails when migrating a paused VM back and forth.

Ideally we'd like to have a more sophisticated solution, e.g. not even
scan the nodes which should be inactive at this point.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
---
 block.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 7d90007cae..c1dcf906d1 100644
--- a/block.c
+++ b/block.c
@@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK bdrv_inactivate_recurse(BlockDriverState *bs)
         return 0;
     }
 
-    assert(!(bs->open_flags & BDRV_O_INACTIVE));
+    if (bs->open_flags & BDRV_O_INACTIVE) {
+        /*
+         * Return here instead of throwing assert as a workaround to
+         * prevent failure on migrating paused VM.
+         * Here we assume that if we're trying to inactivate BDS that's
+         * already inactive, it's safe to just ignore it.
+         */
+        return 0;
+    }
 
     /* Inactivate this node */
     if (bs->drv->bdrv_inactivate) {
--
2.39.3

[add migration maintainers]

On 24.09.24 15:56, Andrey Drobyshev wrote:
> Instead of failing the assert, let's just ignore the fact that the flag
> is already set, and return. [...]
>
> --- a/block.c
> +++ b/block.c
> @@ -6973,7 +6973,15 @@ static int GRAPH_RDLOCK bdrv_inactivate_recurse(BlockDriverState *bs)
>          return 0;
>      }
>
> -    assert(!(bs->open_flags & BDRV_O_INACTIVE));
> +    if (bs->open_flags & BDRV_O_INACTIVE) {
> +        /*
> +         * Return here instead of throwing assert as a workaround to
> +         * prevent failure on migrating paused VM.
> +         * Here we assume that if we're trying to inactivate BDS that's
> +         * already inactive, it's safe to just ignore it.
> +         */
> +        return 0;
> +    }
>
>      /* Inactivate this node */
>      if (bs->drv->bdrv_inactivate) {

I doubt that this is the correct way to go.

As far as I understand, "inactive" actually means "the storage does not
belong to this QEMU, but to someone else (another QEMU process, for
example), and may be changed transparently". In turn, this means that QEMU
should do nothing with inactive disks. So the problem is that nobody called
bdrv_activate_all() on the target, and we shouldn't ignore that.

Hmm, I see that in process_incoming_migration_bh() we do call
bdrv_activate_all(), but only in some scenarios. Maybe the condition should
be less strict here.

Why do we need any condition here at all? Don't we want to activate the
block layer on the target after migration anyway?

--
Best regards,
Vladimir
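For comparison, the unconditional variant Vladimir is hinting at would reduce the destination-side logic to roughly this (a sketch only, assuming process_incoming_migration_bh() as the call site; it deliberately drops the late-block-activate gating, which is exactly what the reply below objects to):

/* process_incoming_migration_bh() (sketch of the unconditional variant):
 * always hand the disks back to this QEMU once incoming migration
 * completes, regardless of autostart or runstate. */
Error *local_err = NULL;

bdrv_activate_all(&local_err);
if (local_err) {
    /* Image locking / metadata reload failed: report the error and
     * refrain from auto-starting the guest. */
    error_report_err(local_err);
    local_err = NULL;
    autostart = false;
}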
-> -> -*/ -> -> -static const qemuMigrationParamsAlwaysOnItem qemuMigrationParamsAlwaysOn[] = -> -{ -> -> -{QEMU_MIGRATION_CAP_PAUSE_BEFORE_SWITCHOVER, -> -> -QEMU_MIGRATION_SOURCE}, -> -> -> -> -{QEMU_MIGRATION_CAP_LATE_BLOCK_ACTIVATE, -> -> -QEMU_MIGRATION_DESTINATION}, -> -> -}; -which means that libvirt always wants LATE_BLOCK_ACTIVATE to be set. - -The code from process_incoming_migration_bh() you're referring to: - -> -/* If capability late_block_activate is set: -> -> -* Only fire up the block code now if we're going to restart the -> -> -* VM, else 'cont' will do it. -> -> -* This causes file locking to happen; so we don't want it to happen -> -> -* unless we really are starting the VM. -> -> -*/ -> -> -if (!migrate_late_block_activate() || -> -> -(autostart && (!global_state_received() || -> -> -runstate_is_live(global_state_get_runstate())))) { -> -> -/* Make sure all file formats throw away their mutable metadata. -> -> -> -* If we get an error here, just don't restart the VM yet. */ -> -> -bdrv_activate_all(&local_err); -> -> -if (local_err) { -> -> -error_report_err(local_err); -> -> -local_err = NULL; -> -> -autostart = false; -> -> -} -> -> -} -It states explicitly that we're either going to start VM right at this -point if (autostart == true), or we wait till "cont" command happens. -None of this is going to happen if we start another migration while -still being in PAUSED state. So I think it seems reasonable to take -such case into account. For instance, this patch does prevent the crash: - -> -diff --git a/migration/migration.c b/migration/migration.c -> -index ae2be31557..3222f6745b 100644 -> ---- a/migration/migration.c -> -+++ b/migration/migration.c -> -@@ -733,7 +733,8 @@ static void process_incoming_migration_bh(void *opaque) -> -*/ -> -if (!migrate_late_block_activate() || -> -(autostart && (!global_state_received() || -> -- runstate_is_live(global_state_get_runstate())))) { -> -+ runstate_is_live(global_state_get_runstate()))) || -> -+ (!autostart && global_state_get_runstate() == RUN_STATE_PAUSED)) { -> -/* Make sure all file formats throw away their mutable metadata. -> -* If we get an error here, just don't restart the VM yet. */ -> -bdrv_activate_all(&local_err); -What are your thoughts on it? - -Andrey - |
