commit 256709d2eb3fd80d768a99964be5caa61effa2a0
parent 2ab14fa96a6c5484b5e4ba8337551bb8dcc79cc5
Author: Christian Krinitsin <mail@krinitsin.com>
Date:   2025-06-03 12:04:13 +0000

    add new classifier result

Diffstat (limited to 'results/classifier/002/other/25892827'):
 results/classifier/002/other/25892827 | 1078 +
 1 file changed, 1078 insertions, 0 deletions
diff --git a/results/classifier/002/other/25892827 b/results/classifier/002/other/25892827 (new file)

other: 0.892
instruction: 0.842
mistranslation: 0.842
boot: 0.839
semantic: 0.825

[Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot

Hi,

Recently we encountered a problem in our project: 2 CPUs in a VM are not
brought up normally after a reboot.

Our host uses KVM kmod 3.6 and QEMU 2.1.
The guest is a SLES 11 SP3 VM configured with 8 vCPUs and the
'host-passthrough' CPU model.

After the VM's first startup, everything seemed to be OK.
The VM then panicked and rebooted.
After the reboot, only 6 CPUs were brought up in the VM; cpu1 and cpu7 were
not online.

This is the only message we can get from the VM. Its dmesg shows:

[ 0.069867] Booting Node 0, Processors #1
[ 5.060042] CPU1: Stuck ??
[ 5.060499] #2
[ 5.088322] kvm-clock: cpu 2, msr 6:3fc90901, secondary cpu clock
[ 5.088335] KVM setup async PF for cpu 2
[ 5.092967] NMI watchdog enabled, takes one hw-pmu counter.
[ 5.094405] #3
[ 5.108324] kvm-clock: cpu 3, msr 6:3fcd0901, secondary cpu clock
[ 5.108333] KVM setup async PF for cpu 3
[ 5.113553] NMI watchdog enabled, takes one hw-pmu counter.
[ 5.114970] #4
[ 5.128325] kvm-clock: cpu 4, msr 6:3fd10901, secondary cpu clock
[ 5.128336] KVM setup async PF for cpu 4
[ 5.134576] NMI watchdog enabled, takes one hw-pmu counter.
[ 5.135998] #5
[ 5.152324] kvm-clock: cpu 5, msr 6:3fd50901, secondary cpu clock
[ 5.152334] KVM setup async PF for cpu 5
[ 5.154764] NMI watchdog enabled, takes one hw-pmu counter.
[ 5.156467] #6
[ 5.172327] kvm-clock: cpu 6, msr 6:3fd90901, secondary cpu clock
[ 5.172341] KVM setup async PF for cpu 6
[ 5.180738] NMI watchdog enabled, takes one hw-pmu counter.
[ 5.182173] #7 Ok.
[ 10.170815] CPU7: Stuck ??
[ 10.171648] Brought up 6 CPUs
[ 10.172394] Total of 6 processors activated (28799.97 BogoMIPS).

On the host we found that the QEMU vcpu1 and vcpu7 threads were not consuming
any CPU (they should be in the idle state).
All of the vCPUs' host-side stacks look like this:

[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
[<ffffffff81468092>] system_call_fastpath+0x16/0x1b
[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
[<ffffffffffffffff>] 0xffffffffffffffff

We looked into the guest kernel code that could lead to the 'Stuck' warning
above, and found that the only possibility is that the emulation of the
'cpuid' instruction in KVM/QEMU has something wrong.
But since we can't reproduce this problem, we are not quite sure.
Is it possible that the cpuid emulation in KVM/QEMU has a bug?

Has anyone come across this problem before? Or any idea?
Thanks,
zhanghailiang

On 06/07/2015 09:54, zhanghailiang wrote:
> On the host we found that the QEMU vcpu1 and vcpu7 threads were not
> consuming any CPU (they should be in the idle state).
> All of the vCPUs' host-side stacks look like this:
>
> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
> [...]
>
> We looked into the guest kernel code that could lead to the 'Stuck' warning
> above, and found that the only possibility is that the emulation of the
> 'cpuid' instruction in KVM/QEMU has something wrong.
> But since we can't reproduce this problem, we are not quite sure.
> Is it possible that the cpuid emulation in KVM/QEMU has a bug?

Can you explain the relationship to the cpuid emulation?  What do the
traces say about vcpus 1 and 7?
Paolo

On 2015/7/6 16:45, Paolo Bonzini wrote:
> On 06/07/2015 09:54, zhanghailiang wrote:
>> [...]
>> Is it possible that the cpuid emulation in KVM/QEMU has a bug?
>
> Can you explain the relationship to the cpuid emulation?  What do the
> traces say about vcpus 1 and 7?

OK, we searched the VM's kernel code for the 'Stuck' message, and it is
located in do_boot_cpu(), which runs in BSP context. The call chain is:
BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu()
-> wakeup_secondary_via_INIT() to trigger the APs.
It will wait 5 s for the APs to start up; if some AP does not start up
normally, it prints 'CPU%d Stuck' or 'CPU%d: Not responding'.

If it prints 'Stuck', the AP has received the SIPI interrupt and has begun to
execute the code at 'ENTRY(trampoline_data)' (trampoline_64.S), but got stuck
somewhere before smp_callin() (smpboot.c).
The following is the startup process of the BSP and the APs.
BSP:
start_kernel()
  ->smp_init()
    ->smp_boot_cpus()
      ->do_boot_cpu()
        ->start_ip = trampoline_address(); // set the address the AP will go to execute
        ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU
        ->for (timeout = 0; timeout < 50000; timeout++)
              if (cpumask_test_cpu(cpu, cpu_callin_mask)) break; // check whether the AP started up

APs:
ENTRY(trampoline_data) (trampoline_64.S)
  ->ENTRY(secondary_startup_64) (head_64.S)
    ->start_secondary() (smpboot.c)
      ->cpu_init();
      ->smp_callin();
        ->cpumask_set_cpu(cpuid, cpu_callin_mask); // Note: if the AP gets here, the BSP will not print the error message.

From the call chain above we can be sure that the AP got stuck between
trampoline_data and the cpumask_set_cpu() in smp_callin(). We looked through
this code path carefully and found only one 'hlt' instruction that could
block the process. It is located in trampoline_data():

ENTRY(trampoline_data)
    ...

    call    verify_cpu      # Verify the cpu supports long mode
    testl   %eax, %eax      # Check for return code
    jnz     no_longmode

    ...

no_longmode:
    hlt
    jmp no_longmode

In verify_cpu() the only sensitive instruction we can find that could cause a
VM exit from non-root mode is 'cpuid'.
This is why we suspect that the cpuid emulation in KVM/QEMU is wrong, leading
to the failure in verify_cpu.

From the messages in the VM, we know something is wrong with vcpu1 and vcpu7:
[ 5.060042] CPU1: Stuck ??
[ 10.170815] CPU7: Stuck ??
[ 10.171648] Brought up 6 CPUs

Besides, the following is the CPU state obtained from the host.
80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command instance-0000000
* CPU #0: pc=0x00007f64160c683d thread_id=68570
  CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
  CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
  CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
  CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
  CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
  CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
  CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584

Oh, I also forgot to mention above that we have bound each vCPU to a
different physical CPU on the host.

Thanks,
zhanghailiang

On 06/07/2015 11:59, zhanghailiang wrote:
> Besides, the following is the CPU state obtained from the host.
> 80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh
> qemu-monitor-command instance-0000000
> * CPU #0: pc=0x00007f64160c683d thread_id=68570
> [...]
> CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
>
> Oh, I also forgot to mention above that we have bound each vCPU to a
> different physical CPU on the host.

Can you capture a trace on the host (trace-cmd record -e kvm) and send
it privately?  Please note which CPUs get stuck, since I guess it's not
always 1 and 7.
Paolo

On Mon, 6 Jul 2015 17:59:10 +0800
zhanghailiang <address@hidden> wrote:

> On 2015/7/6 16:45, Paolo Bonzini wrote:
>> Can you explain the relationship to the cpuid emulation?  What do the
>> traces say about vcpus 1 and 7?
>
> OK, we searched the VM's kernel code for the 'Stuck' message, and it is
> located in do_boot_cpu(), which runs in BSP context.
> [...]

in current upstream there isn't any printk(...Stuck...) left since that code
path has been reworked.
I've often seen this on an over-committed host during guest CPU up/down
torture tests.
Could you update the guest kernel to upstream and see if the issue reproduces?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to address@hidden
More majordomo info at http://vger.kernel.org/majordomo-info.html

On 2015/7/7 19:23, Igor Mammedov wrote:
> in current upstream there isn't any printk(...Stuck...) left since that
> code path has been reworked.
> I've often seen this on an over-committed host during guest CPU up/down
> torture tests.
> Could you update the guest kernel to upstream and see if the issue
> reproduces?

Hmm, unfortunately, it is very hard to reproduce, and we are still trying to
reproduce it.

For your test case, is it a kernel bug?
Or has any related patch that could solve your test problem been merged into
upstream?

Thanks,
zhanghailiang

On Tue, 7 Jul 2015 19:43:35 +0800
zhanghailiang <address@hidden> wrote:

> Hmm, unfortunately, it is very hard to reproduce, and we are still trying
> to reproduce it.
>
> For your test case, is it a kernel bug?
> Or has any related patch that could solve your test problem been merged
> into upstream?

I don't remember all the prerequisite patches, but you should be able to find
http://marc.info/?l=linux-kernel&m=140326703108009&w=2
"x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
and then look for its dependencies.

On 2015/7/7 20:21, Igor Mammedov wrote:
> I don't remember all the prerequisite patches, but you should be able to
> find
> http://marc.info/?l=linux-kernel&m=140326703108009&w=2
> "x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
> and then look for its dependencies.

Er, we have investigated this patch, and it is not related to our problem, :)

Thanks,
zhanghailiang