diff options
Diffstat (limited to 'classification_output/02/other/25892827')
| -rw-r--r-- | classification_output/02/other/25892827 | 1078 |
1 files changed, 0 insertions, 1078 deletions
diff --git a/classification_output/02/other/25892827 b/classification_output/02/other/25892827 deleted file mode 100644 index b33268d01..000000000 --- a/classification_output/02/other/25892827 +++ /dev/null @@ -1,1078 +0,0 @@ -other: 0.892 -instruction: 0.842 -mistranslation: 0.842 -boot: 0.839 -semantic: 0.825 - -[Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot - -Hi, - -Recently we encountered a problem in our project: 2 CPUs in VM are not brought -up normally after reboot. - -Our host is using KVM kmod 3.6 and QEMU 2.1. -A SLES 11 sp3 VM configured with 8 vcpus, -cpu model is configured with 'host-passthrough'. - -After VM's first time started up, everything seems to be OK. -and then VM is paniced and rebooted. -After reboot, only 6 cpus are brought up in VM, cpu1 and cpu7 are not online. - -This is the only message we can get from VM: -VM dmesg shows: -[ 0.069867] Booting Node 0, Processors #1 -[ 5.060042] CPU1: Stuck ?? -[ 5.060499] #2 -[ 5.088322] kvm-clock: cpu 2, msr 6:3fc90901, secondary cpu clock -[ 5.088335] KVM setup async PF for cpu 2 -[ 5.092967] NMI watchdog enabled, takes one hw-pmu counter. -[ 5.094405] #3 -[ 5.108324] kvm-clock: cpu 3, msr 6:3fcd0901, secondary cpu clock -[ 5.108333] KVM setup async PF for cpu 3 -[ 5.113553] NMI watchdog enabled, takes one hw-pmu counter. -[ 5.114970] #4 -[ 5.128325] kvm-clock: cpu 4, msr 6:3fd10901, secondary cpu clock -[ 5.128336] KVM setup async PF for cpu 4 -[ 5.134576] NMI watchdog enabled, takes one hw-pmu counter. -[ 5.135998] #5 -[ 5.152324] kvm-clock: cpu 5, msr 6:3fd50901, secondary cpu clock -[ 5.152334] KVM setup async PF for cpu 5 -[ 5.154764] NMI watchdog enabled, takes one hw-pmu counter. -[ 5.156467] #6 -[ 5.172327] kvm-clock: cpu 6, msr 6:3fd90901, secondary cpu clock -[ 5.172341] KVM setup async PF for cpu 6 -[ 5.180738] NMI watchdog enabled, takes one hw-pmu counter. -[ 5.182173] #7 Ok. -[ 10.170815] CPU7: Stuck ?? -[ 10.171648] Brought up 6 CPUs -[ 10.172394] Total of 6 processors activated (28799.97 BogoMIPS). - -From host, we found that QEMU vcpu1 thread and vcpu7 thread were not consuming -any cpu (Should be in idle state), -All of VCPUs' stacks in host is like bellow: - -[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm] -[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm] -[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm] -[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm] -[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0 -[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0 -[<ffffffff81468092>] system_call_fastpath+0x16/0x1b -[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7 -[<ffffffffffffffff>] 0xffffffffffffffff - -We looked into the kernel codes that could leading to the above 'Stuck' warning, -and found that the only possible is the emulation of 'cpuid' instruct in -kvm/qemu has something wrong. -But since we canât reproduce this problem, we are not quite sure. -Is there any possible that the cupid emulation in kvm/qemu has some bug ? - -Has anyone come across these problem before? Or any idea? - -Thanks, -zhanghailiang - -On 06/07/2015 09:54, zhanghailiang wrote: -> -> -From host, we found that QEMU vcpu1 thread and vcpu7 thread were not -> -consuming any cpu (Should be in idle state), -> -All of VCPUs' stacks in host is like bellow: -> -> -[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm] -> -[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm] -> -[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm] -> -[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm] -> -[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0 -> -[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0 -> -[<ffffffff81468092>] system_call_fastpath+0x16/0x1b -> -[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7 -> -[<ffffffffffffffff>] 0xffffffffffffffff -> -> -We looked into the kernel codes that could leading to the above 'Stuck' -> -warning, -> -and found that the only possible is the emulation of 'cpuid' instruct in -> -kvm/qemu has something wrong. -> -But since we canât reproduce this problem, we are not quite sure. -> -Is there any possible that the cupid emulation in kvm/qemu has some bug ? -Can you explain the relationship to the cpuid emulation? What do the -traces say about vcpus 1 and 7? - -Paolo - -On 2015/7/6 16:45, Paolo Bonzini wrote: -On 06/07/2015 09:54, zhanghailiang wrote: -From host, we found that QEMU vcpu1 thread and vcpu7 thread were not -consuming any cpu (Should be in idle state), -All of VCPUs' stacks in host is like bellow: - -[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm] -[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm] -[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm] -[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm] -[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0 -[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0 -[<ffffffff81468092>] system_call_fastpath+0x16/0x1b -[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7 -[<ffffffffffffffff>] 0xffffffffffffffff - -We looked into the kernel codes that could leading to the above 'Stuck' -warning, -and found that the only possible is the emulation of 'cpuid' instruct in -kvm/qemu has something wrong. -But since we canât reproduce this problem, we are not quite sure. -Is there any possible that the cupid emulation in kvm/qemu has some bug ? -Can you explain the relationship to the cpuid emulation? What do the -traces say about vcpus 1 and 7? -OK, we searched the VM's kernel codes with the 'Stuck' message, and it is -located in -do_boot_cpu(). It's in BSP context, the call process is: -BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() --> wakeup_secondary_via_INIT() to trigger APs. -It will wait 5s for APs to startup, if some AP not startup normally, it will -print 'CPU%d Stuck' or 'CPU%d: Not responding'. - -If it prints 'Stuck', it means the AP has received the SIPI interrupt and -begins to execute the code -'ENTRY(trampoline_data)' (trampoline_64.S) , but be stuck in some places before -smp_callin()(smpboot.c). -The follow is the starup process of BSP and AP. -BSP: -start_kernel() - ->smp_init() - ->smp_boot_cpus() - ->do_boot_cpu() - ->start_ip = trampoline_address(); //set the address that AP will go -to execute - ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU - ->for (timeout = 0; timeout < 50000; timeout++) - if (cpumask_test_cpu(cpu, cpu_callin_mask)) break;// check if AP -startup or not - -APs: -ENTRY(trampoline_data) (trampoline_64.S) - ->ENTRY(secondary_startup_64) (head_64.S) - ->start_secondary() (smpboot.c) - ->cpu_init(); - ->smp_callin(); - ->cpumask_set_cpu(cpuid, cpu_callin_mask); ->Note: if AP comes -here, the BSP will not prints the error message. - -From above call process, we can be sure that, the AP has been stuck between -trampoline_data and the cpumask_set_cpu() in -smp_callin(), we look through these codes path carefully, and only found a -'hlt' instruct that could block the process. -It is located in trampoline_data(): - -ENTRY(trampoline_data) - ... - - call verify_cpu # Verify the cpu supports long mode - testl %eax, %eax # Check for return code - jnz no_longmode - - ... - -no_longmode: - hlt - jmp no_longmode - -For the process verify_cpu(), -we can only find the 'cpuid' sensitive instruct that could lead VM exit from -No-root mode. -This is why we doubt if cpuid emulation is wrong in KVM/QEMU that leading to -the fail in verify_cpu. - -From the message in VM, we know vcpu1 and vcpu7 is something wrong. -[ 5.060042] CPU1: Stuck ?? -[ 10.170815] CPU7: Stuck ?? -[ 10.171648] Brought up 6 CPUs - -Besides, the follow is the cpus message got from host. -80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command -instance-0000000 -* CPU #0: pc=0x00007f64160c683d thread_id=68570 - CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573 - CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575 - CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576 - CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577 - CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578 - CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583 - CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584 - -Oh, i also forgot to mention in the above message that, we have bond each vCPU -to different physical CPU in -host. - -Thanks, -zhanghailiang - -On 06/07/2015 11:59, zhanghailiang wrote: -> -> -> -Besides, the follow is the cpus message got from host. -> -80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh -> -qemu-monitor-command instance-0000000 -> -* CPU #0: pc=0x00007f64160c683d thread_id=68570 -> -CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573 -> -CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575 -> -CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576 -> -CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577 -> -CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578 -> -CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583 -> -CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584 -> -> -Oh, i also forgot to mention in the above message that, we have bond -> -each vCPU to different physical CPU in -> -host. -Can you capture a trace on the host (trace-cmd record -e kvm) and send -it privately? Please note which CPUs get stuck, since I guess it's not -always 1 and 7. - -Paolo - -On Mon, 6 Jul 2015 17:59:10 +0800 -zhanghailiang <address@hidden> wrote: - -> -On 2015/7/6 16:45, Paolo Bonzini wrote: -> -> -> -> -> -> On 06/07/2015 09:54, zhanghailiang wrote: -> ->> -> ->> From host, we found that QEMU vcpu1 thread and vcpu7 thread were not -> ->> consuming any cpu (Should be in idle state), -> ->> All of VCPUs' stacks in host is like bellow: -> ->> -> ->> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm] -> ->> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm] -> ->> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm] -> ->> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm] -> ->> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0 -> ->> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0 -> ->> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b -> ->> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7 -> ->> [<ffffffffffffffff>] 0xffffffffffffffff -> ->> -> ->> We looked into the kernel codes that could leading to the above 'Stuck' -> ->> warning, -in current upstream there isn't any printk(...Stuck...) left since that code -path -has been reworked. -I've often seen this on over-committed host during guest CPUs up/down torture -test. -Could you update guest kernel to upstream and see if issue reproduces? - -> ->> and found that the only possible is the emulation of 'cpuid' instruct in -> ->> kvm/qemu has something wrong. -> ->> But since we canât reproduce this problem, we are not quite sure. -> ->> Is there any possible that the cupid emulation in kvm/qemu has some bug ? -> -> -> -> Can you explain the relationship to the cpuid emulation? What do the -> -> traces say about vcpus 1 and 7? -> -> -OK, we searched the VM's kernel codes with the 'Stuck' message, and it is -> -located in -> -do_boot_cpu(). It's in BSP context, the call process is: -> -BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() -> --> wakeup_secondary_via_INIT() to trigger APs. -> -It will wait 5s for APs to startup, if some AP not startup normally, it will -> -print 'CPU%d Stuck' or 'CPU%d: Not responding'. -> -> -If it prints 'Stuck', it means the AP has received the SIPI interrupt and -> -begins to execute the code -> -'ENTRY(trampoline_data)' (trampoline_64.S) , but be stuck in some places -> -before smp_callin()(smpboot.c). -> -The follow is the starup process of BSP and AP. -> -BSP: -> -start_kernel() -> -->smp_init() -> -->smp_boot_cpus() -> -->do_boot_cpu() -> -->start_ip = trampoline_address(); //set the address that AP will -> -go to execute -> -->wakeup_secondary_cpu_via_init(); // kick the secondary CPU -> -->for (timeout = 0; timeout < 50000; timeout++) -> -if (cpumask_test_cpu(cpu, cpu_callin_mask)) break;// check if -> -AP startup or not -> -> -APs: -> -ENTRY(trampoline_data) (trampoline_64.S) -> -->ENTRY(secondary_startup_64) (head_64.S) -> -->start_secondary() (smpboot.c) -> -->cpu_init(); -> -->smp_callin(); -> -->cpumask_set_cpu(cpuid, cpu_callin_mask); ->Note: if AP -> -comes here, the BSP will not prints the error message. -> -> -From above call process, we can be sure that, the AP has been stuck between -> -trampoline_data and the cpumask_set_cpu() in -> -smp_callin(), we look through these codes path carefully, and only found a -> -'hlt' instruct that could block the process. -> -It is located in trampoline_data(): -> -> -ENTRY(trampoline_data) -> -... -> -> -call verify_cpu # Verify the cpu supports long mode -> -testl %eax, %eax # Check for return code -> -jnz no_longmode -> -> -... -> -> -no_longmode: -> -hlt -> -jmp no_longmode -> -> -For the process verify_cpu(), -> -we can only find the 'cpuid' sensitive instruct that could lead VM exit from -> -No-root mode. -> -This is why we doubt if cpuid emulation is wrong in KVM/QEMU that leading to -> -the fail in verify_cpu. -> -> -From the message in VM, we know vcpu1 and vcpu7 is something wrong. -> -[ 5.060042] CPU1: Stuck ?? -> -[ 10.170815] CPU7: Stuck ?? -> -[ 10.171648] Brought up 6 CPUs -> -> -Besides, the follow is the cpus message got from host. -> -80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh -> -qemu-monitor-command instance-0000000 -> -* CPU #0: pc=0x00007f64160c683d thread_id=68570 -> -CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573 -> -CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575 -> -CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576 -> -CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577 -> -CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578 -> -CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583 -> -CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584 -> -> -Oh, i also forgot to mention in the above message that, we have bond each -> -vCPU to different physical CPU in -> -host. -> -> -Thanks, -> -zhanghailiang -> -> -> -> -> --- -> -To unsubscribe from this list: send the line "unsubscribe kvm" in -> -the body of a message to address@hidden -> -More majordomo info at -http://vger.kernel.org/majordomo-info.html - -On 2015/7/7 19:23, Igor Mammedov wrote: -On Mon, 6 Jul 2015 17:59:10 +0800 -zhanghailiang <address@hidden> wrote: -On 2015/7/6 16:45, Paolo Bonzini wrote: -On 06/07/2015 09:54, zhanghailiang wrote: -From host, we found that QEMU vcpu1 thread and vcpu7 thread were not -consuming any cpu (Should be in idle state), -All of VCPUs' stacks in host is like bellow: - -[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm] -[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm] -[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm] -[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm] -[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0 -[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0 -[<ffffffff81468092>] system_call_fastpath+0x16/0x1b -[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7 -[<ffffffffffffffff>] 0xffffffffffffffff - -We looked into the kernel codes that could leading to the above 'Stuck' -warning, -in current upstream there isn't any printk(...Stuck...) left since that code -path -has been reworked. -I've often seen this on over-committed host during guest CPUs up/down torture -test. -Could you update guest kernel to upstream and see if issue reproduces? -Hmm, Unfortunately, it is very hard to reproduce, and we are still trying to -reproduce it. - -For your test case, is it a kernel bug? -Or is there any related patch could solve your test problem been merged into -upstream ? - -Thanks, -zhanghailiang -and found that the only possible is the emulation of 'cpuid' instruct in -kvm/qemu has something wrong. -But since we canât reproduce this problem, we are not quite sure. -Is there any possible that the cupid emulation in kvm/qemu has some bug ? -Can you explain the relationship to the cpuid emulation? What do the -traces say about vcpus 1 and 7? -OK, we searched the VM's kernel codes with the 'Stuck' message, and it is -located in -do_boot_cpu(). It's in BSP context, the call process is: -BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() --> wakeup_secondary_via_INIT() to trigger APs. -It will wait 5s for APs to startup, if some AP not startup normally, it will -print 'CPU%d Stuck' or 'CPU%d: Not responding'. - -If it prints 'Stuck', it means the AP has received the SIPI interrupt and -begins to execute the code -'ENTRY(trampoline_data)' (trampoline_64.S) , but be stuck in some places before -smp_callin()(smpboot.c). -The follow is the starup process of BSP and AP. -BSP: -start_kernel() - ->smp_init() - ->smp_boot_cpus() - ->do_boot_cpu() - ->start_ip = trampoline_address(); //set the address that AP will -go to execute - ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU - ->for (timeout = 0; timeout < 50000; timeout++) - if (cpumask_test_cpu(cpu, cpu_callin_mask)) break;// check if -AP startup or not - -APs: -ENTRY(trampoline_data) (trampoline_64.S) - ->ENTRY(secondary_startup_64) (head_64.S) - ->start_secondary() (smpboot.c) - ->cpu_init(); - ->smp_callin(); - ->cpumask_set_cpu(cpuid, cpu_callin_mask); ->Note: if AP -comes here, the BSP will not prints the error message. - - From above call process, we can be sure that, the AP has been stuck between -trampoline_data and the cpumask_set_cpu() in -smp_callin(), we look through these codes path carefully, and only found a -'hlt' instruct that could block the process. -It is located in trampoline_data(): - -ENTRY(trampoline_data) - ... - - call verify_cpu # Verify the cpu supports long mode - testl %eax, %eax # Check for return code - jnz no_longmode - - ... - -no_longmode: - hlt - jmp no_longmode - -For the process verify_cpu(), -we can only find the 'cpuid' sensitive instruct that could lead VM exit from -No-root mode. -This is why we doubt if cpuid emulation is wrong in KVM/QEMU that leading to -the fail in verify_cpu. - - From the message in VM, we know vcpu1 and vcpu7 is something wrong. -[ 5.060042] CPU1: Stuck ?? -[ 10.170815] CPU7: Stuck ?? -[ 10.171648] Brought up 6 CPUs - -Besides, the follow is the cpus message got from host. -80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command -instance-0000000 -* CPU #0: pc=0x00007f64160c683d thread_id=68570 - CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573 - CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575 - CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576 - CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577 - CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578 - CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583 - CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584 - -Oh, i also forgot to mention in the above message that, we have bond each vCPU -to different physical CPU in -host. - -Thanks, -zhanghailiang - - - - --- -To unsubscribe from this list: send the line "unsubscribe kvm" in -the body of a message to address@hidden -More majordomo info at -http://vger.kernel.org/majordomo-info.html -. - -On Tue, 7 Jul 2015 19:43:35 +0800 -zhanghailiang <address@hidden> wrote: - -> -On 2015/7/7 19:23, Igor Mammedov wrote: -> -> On Mon, 6 Jul 2015 17:59:10 +0800 -> -> zhanghailiang <address@hidden> wrote: -> -> -> ->> On 2015/7/6 16:45, Paolo Bonzini wrote: -> ->>> -> ->>> -> ->>> On 06/07/2015 09:54, zhanghailiang wrote: -> ->>>> -> ->>>> From host, we found that QEMU vcpu1 thread and vcpu7 thread were not -> ->>>> consuming any cpu (Should be in idle state), -> ->>>> All of VCPUs' stacks in host is like bellow: -> ->>>> -> ->>>> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm] -> ->>>> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm] -> ->>>> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm] -> ->>>> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm] -> ->>>> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0 -> ->>>> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0 -> ->>>> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b -> ->>>> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7 -> ->>>> [<ffffffffffffffff>] 0xffffffffffffffff -> ->>>> -> ->>>> We looked into the kernel codes that could leading to the above 'Stuck' -> ->>>> warning, -> -> in current upstream there isn't any printk(...Stuck...) left since that -> -> code path -> -> has been reworked. -> -> I've often seen this on over-committed host during guest CPUs up/down -> -> torture test. -> -> Could you update guest kernel to upstream and see if issue reproduces? -> -> -> -> -Hmm, Unfortunately, it is very hard to reproduce, and we are still trying to -> -reproduce it. -> -> -For your test case, is it a kernel bug? -> -Or is there any related patch could solve your test problem been merged into -> -upstream ? -I don't remember all prerequisite patches but you should be able to find -http://marc.info/?l=linux-kernel&m=140326703108009&w=2 -"x86/smpboot: Initialize secondary CPU only if master CPU will wait for it" -and then look for dependencies. - - -> -> -Thanks, -> -zhanghailiang -> -> ->>>> and found that the only possible is the emulation of 'cpuid' instruct in -> ->>>> kvm/qemu has something wrong. -> ->>>> But since we canât reproduce this problem, we are not quite sure. -> ->>>> Is there any possible that the cupid emulation in kvm/qemu has some bug ? -> ->>> -> ->>> Can you explain the relationship to the cpuid emulation? What do the -> ->>> traces say about vcpus 1 and 7? -> ->> -> ->> OK, we searched the VM's kernel codes with the 'Stuck' message, and it is -> ->> located in -> ->> do_boot_cpu(). It's in BSP context, the call process is: -> ->> BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> -> ->> do_boot_cpu() -> wakeup_secondary_via_INIT() to trigger APs. -> ->> It will wait 5s for APs to startup, if some AP not startup normally, it -> ->> will print 'CPU%d Stuck' or 'CPU%d: Not responding'. -> ->> -> ->> If it prints 'Stuck', it means the AP has received the SIPI interrupt and -> ->> begins to execute the code -> ->> 'ENTRY(trampoline_data)' (trampoline_64.S) , but be stuck in some places -> ->> before smp_callin()(smpboot.c). -> ->> The follow is the starup process of BSP and AP. -> ->> BSP: -> ->> start_kernel() -> ->> ->smp_init() -> ->> ->smp_boot_cpus() -> ->> ->do_boot_cpu() -> ->> ->start_ip = trampoline_address(); //set the address that AP -> ->> will go to execute -> ->> ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU -> ->> ->for (timeout = 0; timeout < 50000; timeout++) -> ->> if (cpumask_test_cpu(cpu, cpu_callin_mask)) break;// -> ->> check if AP startup or not -> ->> -> ->> APs: -> ->> ENTRY(trampoline_data) (trampoline_64.S) -> ->> ->ENTRY(secondary_startup_64) (head_64.S) -> ->> ->start_secondary() (smpboot.c) -> ->> ->cpu_init(); -> ->> ->smp_callin(); -> ->> ->cpumask_set_cpu(cpuid, cpu_callin_mask); ->Note: if AP -> ->> comes here, the BSP will not prints the error message. -> ->> -> ->> From above call process, we can be sure that, the AP has been stuck -> ->> between trampoline_data and the cpumask_set_cpu() in -> ->> smp_callin(), we look through these codes path carefully, and only found a -> ->> 'hlt' instruct that could block the process. -> ->> It is located in trampoline_data(): -> ->> -> ->> ENTRY(trampoline_data) -> ->> ... -> ->> -> ->> call verify_cpu # Verify the cpu supports long mode -> ->> testl %eax, %eax # Check for return code -> ->> jnz no_longmode -> ->> -> ->> ... -> ->> -> ->> no_longmode: -> ->> hlt -> ->> jmp no_longmode -> ->> -> ->> For the process verify_cpu(), -> ->> we can only find the 'cpuid' sensitive instruct that could lead VM exit -> ->> from No-root mode. -> ->> This is why we doubt if cpuid emulation is wrong in KVM/QEMU that leading -> ->> to the fail in verify_cpu. -> ->> -> ->> From the message in VM, we know vcpu1 and vcpu7 is something wrong. -> ->> [ 5.060042] CPU1: Stuck ?? -> ->> [ 10.170815] CPU7: Stuck ?? -> ->> [ 10.171648] Brought up 6 CPUs -> ->> -> ->> Besides, the follow is the cpus message got from host. -> ->> 80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh -> ->> qemu-monitor-command instance-0000000 -> ->> * CPU #0: pc=0x00007f64160c683d thread_id=68570 -> ->> CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573 -> ->> CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575 -> ->> CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576 -> ->> CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577 -> ->> CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578 -> ->> CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583 -> ->> CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584 -> ->> -> ->> Oh, i also forgot to mention in the above message that, we have bond each -> ->> vCPU to different physical CPU in -> ->> host. -> ->> -> ->> Thanks, -> ->> zhanghailiang -> ->> -> ->> -> ->> -> ->> -> ->> -- -> ->> To unsubscribe from this list: send the line "unsubscribe kvm" in -> ->> the body of a message to address@hidden -> ->> More majordomo info at -http://vger.kernel.org/majordomo-info.html -> -> -> -> -> -> . -> -> -> -> -> - -On 2015/7/7 20:21, Igor Mammedov wrote: -On Tue, 7 Jul 2015 19:43:35 +0800 -zhanghailiang <address@hidden> wrote: -On 2015/7/7 19:23, Igor Mammedov wrote: -On Mon, 6 Jul 2015 17:59:10 +0800 -zhanghailiang <address@hidden> wrote: -On 2015/7/6 16:45, Paolo Bonzini wrote: -On 06/07/2015 09:54, zhanghailiang wrote: -From host, we found that QEMU vcpu1 thread and vcpu7 thread were not -consuming any cpu (Should be in idle state), -All of VCPUs' stacks in host is like bellow: - -[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm] -[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm] -[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm] -[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm] -[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0 -[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0 -[<ffffffff81468092>] system_call_fastpath+0x16/0x1b -[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7 -[<ffffffffffffffff>] 0xffffffffffffffff - -We looked into the kernel codes that could leading to the above 'Stuck' -warning, -in current upstream there isn't any printk(...Stuck...) left since that code -path -has been reworked. -I've often seen this on over-committed host during guest CPUs up/down torture -test. -Could you update guest kernel to upstream and see if issue reproduces? -Hmm, Unfortunately, it is very hard to reproduce, and we are still trying to -reproduce it. - -For your test case, is it a kernel bug? -Or is there any related patch could solve your test problem been merged into -upstream ? -I don't remember all prerequisite patches but you should be able to find -http://marc.info/?l=linux-kernel&m=140326703108009&w=2 -"x86/smpboot: Initialize secondary CPU only if master CPU will wait for it" -and then look for dependencies. -Er, we have investigated this patch, and it is not related to our problem, :) - -Thanks. -Thanks, -zhanghailiang -and found that the only possible is the emulation of 'cpuid' instruct in -kvm/qemu has something wrong. -But since we canât reproduce this problem, we are not quite sure. -Is there any possible that the cupid emulation in kvm/qemu has some bug ? -Can you explain the relationship to the cpuid emulation? What do the -traces say about vcpus 1 and 7? -OK, we searched the VM's kernel codes with the 'Stuck' message, and it is -located in -do_boot_cpu(). It's in BSP context, the call process is: -BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() --> wakeup_secondary_via_INIT() to trigger APs. -It will wait 5s for APs to startup, if some AP not startup normally, it will -print 'CPU%d Stuck' or 'CPU%d: Not responding'. - -If it prints 'Stuck', it means the AP has received the SIPI interrupt and -begins to execute the code -'ENTRY(trampoline_data)' (trampoline_64.S) , but be stuck in some places before -smp_callin()(smpboot.c). -The follow is the starup process of BSP and AP. -BSP: -start_kernel() - ->smp_init() - ->smp_boot_cpus() - ->do_boot_cpu() - ->start_ip = trampoline_address(); //set the address that AP will -go to execute - ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU - ->for (timeout = 0; timeout < 50000; timeout++) - if (cpumask_test_cpu(cpu, cpu_callin_mask)) break;// check if -AP startup or not - -APs: -ENTRY(trampoline_data) (trampoline_64.S) - ->ENTRY(secondary_startup_64) (head_64.S) - ->start_secondary() (smpboot.c) - ->cpu_init(); - ->smp_callin(); - ->cpumask_set_cpu(cpuid, cpu_callin_mask); ->Note: if AP -comes here, the BSP will not prints the error message. - - From above call process, we can be sure that, the AP has been stuck between -trampoline_data and the cpumask_set_cpu() in -smp_callin(), we look through these codes path carefully, and only found a -'hlt' instruct that could block the process. -It is located in trampoline_data(): - -ENTRY(trampoline_data) - ... - - call verify_cpu # Verify the cpu supports long mode - testl %eax, %eax # Check for return code - jnz no_longmode - - ... - -no_longmode: - hlt - jmp no_longmode - -For the process verify_cpu(), -we can only find the 'cpuid' sensitive instruct that could lead VM exit from -No-root mode. -This is why we doubt if cpuid emulation is wrong in KVM/QEMU that leading to -the fail in verify_cpu. - - From the message in VM, we know vcpu1 and vcpu7 is something wrong. -[ 5.060042] CPU1: Stuck ?? -[ 10.170815] CPU7: Stuck ?? -[ 10.171648] Brought up 6 CPUs - -Besides, the follow is the cpus message got from host. -80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command -instance-0000000 -* CPU #0: pc=0x00007f64160c683d thread_id=68570 - CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573 - CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575 - CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576 - CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577 - CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578 - CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583 - CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584 - -Oh, i also forgot to mention in the above message that, we have bond each vCPU -to different physical CPU in -host. - -Thanks, -zhanghailiang - - - - --- -To unsubscribe from this list: send the line "unsubscribe kvm" in -the body of a message to address@hidden -More majordomo info at -http://vger.kernel.org/majordomo-info.html -. -. - |