diff options
Diffstat (limited to 'results/classifier/108/other/1618431')
| -rw-r--r-- | results/classifier/108/other/1618431 | 275 |
1 files changed, 275 insertions, 0 deletions
diff --git a/results/classifier/108/other/1618431 b/results/classifier/108/other/1618431 new file mode 100644 index 00000000..9521df68 --- /dev/null +++ b/results/classifier/108/other/1618431 @@ -0,0 +1,275 @@ +vnc: 0.748 +KVM: 0.742 +graphic: 0.681 +other: 0.622 +performance: 0.611 +device: 0.603 +permissions: 0.580 +network: 0.575 +boot: 0.561 +debug: 0.547 +socket: 0.545 +semantic: 0.535 +PID: 0.515 +files: 0.491 + +windows hangs after live migration with virtio + +Several of our users reported problems with windows machines hanging +after live migrations. The common denominator _seems_ to be virtio +devices. +I've managed to reproduce this reliably on a windows 10 (+ +virtio-win-0.1.118) guest, always within 1 to 5 migrations, with a +virtio-scsi hard drive and a virtio-net network device. (When I +replace the virtio-net device with an e1000 it takes 10 or more +migrations, and without virtio devices I have not (yet) been able to +reproduce this problem. I also could not reproduce this with a linux +guest. Also spice seems to improve the situation, but doesn't solve +it completely). + +I've tested quite a few tags from qemu-git (v2.2.0 through v2.6.1, +and 2.6.1 with the patches mentioned on qemu-stable by Peter Lieven) +and the behavior is the same everywhere. + +The reproducibility seems to be somewhat dependent on the host +hardware, which makes investigating this issue that much harder. + +Symptoms: +After the migration the windows graphics stack just hangs. +Background processes are still running (eg. after installing an ssh +server I could still login and get a command prompt after the hang was +triggered... not that I'd know what to do with that on a windows +machine...) - commands which need no GUI access work, the rest just +hangs there on the command line, too. +It's also capable of responding to an NMI sent via the qemu monitor: +it then seems to "recover" and manages to show the blue sad-face +screen that something happened, reboots successfully and is usable +again without restarting the qemu process in between. +From there whole the process can be repeated. + +Here's what our command line usually looks like: + +/usr/bin/qemu -daemonize \ + -enable-kvm \ + -chardev socket,id=qmp,path=/var/run/qemu-server/101.qmp,server,nowait -mon chardev=qmp,mode=control \ + -pidfile /var/run/qemu-server/101.pid \ + -smbios type=1,uuid=07fc916e-24c2-4eef-9827-4ab4026501d4 \ + -name win10 \ + -smp 6,sockets=1,cores=6,maxcpus=6 \ + -nodefaults \ + -boot menu=on,strict=on,reboot-timeout=1000 \ + -vga std \ + -vnc unix:/var/run/qemu-server/101.vnc \ + -no-hpet \ + -cpu kvm64,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,+lahf_lm,+sep,+kvm_pv_unhalt,+kvm_pv_eoi,enforce \ + -m 2048 \ + -device pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f \ + -device pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e \ + -device piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2 \ + -device usb-tablet,id=tablet,bus=uhci.0,port=1 \ + -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \ + -iscsi initiator-name=iqn.1993-08.org.debian:01:1ba48d46fb8 \ + -drive if=none,id=drive-ide0,media=cdrom,aio=threads \ + -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0,id=ide0,bootindex=200 \ + -device virtio-scsi-pci,id=scsihw0,bus=pci.0,addr=0x5 \ + -drive file=/mnt/pve/data1/images/101/vm-101-disk-1.qcow2,if=none,id=drive-scsi0,cache=writeback,discard=on,format=qcow2,aio=threads,detect-zeroes=unmap \ + -device scsi-hd,bus=scsihw0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,bootindex=100 \ + -netdev type=tap,id=net0,ifname=tap101i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown,vhost=on \ + -device virtio-net-pci,mac=F2:2B:20:37:E6:D7,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=300 \ + -rtc driftfix=slew,base=localtime \ + -global kvm-pit.lost_tick_policy=discard + +I'm not sure it's virtio - I've got similar cases which happen even without any virtio; for me it goes away if you enable hpet or switch you kvm-put.lost_tick_policy=delay. + +Dave + +As the virtio related parts aren't the ones hanging (network and disks +still work...) it's unlikely, but it makes a night and day difference. + +Removing -no-hpet as suggested does seem to make a difference, too. +(Changing the tick policy doesn't, for me.) +However, I've found that there are various options which when changed +can massively influence the likelihood of hangs - but it's not always +the same options for all VMs. +With the difference being hangups after 1 to at most 2 migrations with +one setting, or the VMs still being alive and kicking after 20 and +more migrations with the other. +However the options I've tested appear to be unrelated. Eq. in my test +setups this happened with VNC settings, CPU types, toggling our +backend's ssh tunnel for encryption (which should cause nothing but +changes in timing from the perspective of qemu); and of course +replacing virtio devices always had this effect in my tests. +All this might point to some kind of race condition or time keeping +problem, but I can't seem to pinpoint it. + +Enabling hpet isn't a good option btw., since #599958 [Timedrift +problems with Win7: hpet missing time drift fixups] appears to +still be an open issue. => https://bugs.launchpad.net/qemu/+bug/599958 +(This entry is from 2010 :-( ) + +I can reproduce this bug also on Ubuntu 16.04 with libvirt. +The interesting thing is that this bug triggers faster, +if I use tunneled migration instead direct. +Using the virt-manager for migration. + +The test VM is a Win10 with virtio driver from fedora 0.1.118. + +<!-- +WARNING: THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE +OVERWRITTEN AND LOST. Changes to this xml configuration should be made using: + virsh edit win10-1 +or other application using the libvirt API. +--> + +<domain type='kvm'> + <name>win10-1</name> + <uuid>4b3533c1-20d4-4556-9d99-4fb3d04b19dc</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static'>6</vcpu> + <os> + <type arch='x86_64' machine='pc-i440fx-wily'>hvm</type> + </os> + <features> + <acpi/> + <apic/> + <hyperv> + <relaxed state='on'/> + <vapic state='on'/> + <spinlocks state='on' retries='8191'/> + </hyperv> + </features> + <cpu mode='custom' match='exact'> + <model fallback='allow'>Haswell-noTSX</model> + <topology sockets='1' cores='6' threads='1'/> + </cpu> + <clock offset='localtime'> + <timer name='rtc' tickpolicy='catchup'/> + <timer name='pit' tickpolicy='delay'/> + <timer name='hpet' present='no'/> + <timer name='hypervclock' present='yes'/> + </clock> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>restart</on_crash> + <pm> + <suspend-to-mem enabled='no'/> + <suspend-to-disk enabled='no'/> + </pm> + <devices> + <emulator>/usr/bin/kvm-spice</emulator> + <disk type='file' device='disk'> + <driver name='qemu' type='qcow2' cache='none' io='threads'/> + <source file='/mnt/traini3/vm-win10-1.qcow2'/> + <target dev='sda' bus='scsi'/> + <boot order='1'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <disk type='file' device='cdrom'> + <driver name='qemu' type='raw'/> + <source file='/mnt/nasi/template/iso/Win10_EnglishInternational_x64.iso'/> + <target dev='hdb' bus='ide'/> + <readonly/> + <boot order='2'/> + <address type='drive' controller='0' bus='0' target='0' unit='1'/> + </disk> + <disk type='file' device='cdrom'> + <driver name='qemu' type='raw' cache='none'/> + <source file='/mnt/nasi/template/iso/virtio-win-0.1.118.iso'/> + <target dev='hdc' bus='ide'/> + <readonly/> + <boot order='3'/> + <address type='drive' controller='0' bus='1' target='0' unit='0'/> + </disk> + <controller type='usb' index='0' model='ich9-ehci1'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x7'/> + </controller> + <controller type='usb' index='0' model='ich9-uhci1'> + <master startport='0'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/> + </controller> + <controller type='usb' index='0' model='ich9-uhci2'> + <master startport='2'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/> + </controller> + <controller type='usb' index='0' model='ich9-uhci3'> + <master startport='4'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x2'/> + </controller> + <controller type='scsi' index='0' model='virtio-scsi'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </controller> + <controller type='ide' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <interface type='bridge'> + <mac address='52:54:00:2e:4f:ea'/> + <source bridge='vmbr0'/> + <model type='virtio'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </interface> + <serial type='pty'> + <target port='0'/> + </serial> + <console type='pty'> + <target type='serial' port='0'/> + </console> + <input type='tablet' bus='usb'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <graphics type='vnc' port='-1' autoport='yes' listen='0.0.0.0'> + <listen type='address' address='0.0.0.0'/> + </graphics> + <video> + <model type='cirrus' vram='16384' heads='1'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> + </video> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> + </memballoon> + </devices> +</domain> + + +When I set -rtc clock=vm the mig problem is solved, but what does this flag exactly? + +In the docu there is only + +" If you want to isolate the guest time from the host, you can set clock to "rt" instead. + To even prevent it from progressing during suspension, you can set it to "vm"." + +what does this means? + +Will the VM never be synced with the HW clock and so it will run slower on cpu load on normal running? + +Hmm, I'd not tried that one; I don't think that should change the behaviour during normal running, but the behaviour on pause and interactions with things like host ntp clock syncing is probably different - how different I'd have to dig in a bit more. + +However, we've done two patches this week that help windows migration - I'd be interested if either of them help your case; + +https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg02658.html is a qemu fix (now in current head qemu) that I wrote that helps one windows migration test case. + +https://lkml.org/lkml/2016/9/14/857 is a kernel fix that fixes some related problems. + +If one or both of these fixes together help I'd love to know either way! + +Dave + +Thank I test the 2 patches and they worked for me. +It works also if you apply only the qemu patch, +in combination the ubuntu kernel 4.4.0-38.57 and qemu 2.6.1. + +Excellent news; thanks for testing! + +Hi WOLI, + Note, if you pick up a new (4.8 ish) kernel you'll probably find you'll need to also pick up two patches that we've just posted to the qemu list: + + target-i386: introduce kvm_put_one_msr + kvm: apic: set APIC base as part of kvm_apic_put + +otherwise you get weird reboot hangs with Linux guests. + +Dave + +The patches should be part of QEMU v2.8 ==> Fix released + |