graphic: 0.948
instruction: 0.941
other: 0.941
assembly: 0.937
semantic: 0.935
socket: 0.933
device: 0.932
vnc: 0.899
KVM: 0.879
boot: 0.867
network: 0.859
mistranslation: 0.835
Windows (10?) guest freezes entire host on shutdown if using PCI passthrough
Problem: after a Windows VM that uses PCI passthrough (as we do for gaming graphics cards, sound cards, and in my case, a USB card) has been running for long enough, shutting down that guest freezes the host computer right as the guest finishes shutting down, and the host then requires a hard reboot. The threshold is somewhere between 1 and 2 hours (it's not consistent exactly how long), and the freeze happens for any uptime longer than that. Once the VM has been running that long, unbinding (or in the other user's case, unbinding and THEN binding) any PCI device in sysfs, even one that has nothing to do with the VM, has the same effect as shutting down the VM. So, it's probably an issue related to unbinding and binding PCI devices.
There's a lot of info on this problem over at https://bbs.archlinux.org/viewtopic.php?id=206050
Here's a better-organized list of main details:
-at least 2 confirmed victims of this bug; 2 (including me) have provided lots of info in the link
-I'm on Arch Linux and the other one is on Gentoo (distro-nonspecific)
-issue affects my Windows 10 guest and others' Windows guests, but not my Arch Linux guest (the others don't have non-Windows guests to test)
-I'm using libvirt but the other user is not, so it's not an issue with libvirt
-It seems to be version non-specific, too. The oldest versions on which I have either noticed it or confirmed it by testing are Linux 4.1 and qemu 2.4.0. It still persists in all releases of both since, including the newest ones.
-I can't track down exactly what package downgrade can fix it, as downgrading further than Linux 4.1 and qemu 2.4.0 requires Herculean and system-destroying changes such as downgrading ncurses, meaning I don't know whether it's a bug in QEMU, the Linux kernel, or some weird seemingly unrelated thing.
-According to the other user, "graphics intensive gameplay (GTA V) can cause the crash to happen sooner," as soon as "15 minutes"
-Also, "bringing up a second passthrough VM with separate hardware will cause the same crash," and "bringing up another VM before the two-hour mark will not result in a crash," further cementing that it's triggered by the un/binding of PCI devices.
-This is NOT related to the very similar bug that can be worked around by not passing through the HDMI device or sound card. Even when we removed all traces of any sort of sound card from the VM, it still had the same behavior.
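For anyone who wants to reproduce the sysfs trigger described above without shutting down a VM, here is a minimal sketch. It assumes the standard sysfs PCI interface (the "driver" symlink under /sys/bus/pci/devices/<address>/ and the per-driver "bind" and "unbind" files). The function names and the SYSFS_ROOT override are hypothetical helpers for illustration, not something taken from any poster's setup. On an affected host with a long-running VM this is expected to freeze the machine, so only try it when a hard reboot is acceptable.

```shell
#!/bin/bash
# Sketch: unbind a PCI device from its current driver and bind it back.
# SYSFS_ROOT is an assumption added so the functions can be pointed at a
# fake directory tree; on a real host it stays at /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the driver a device is bound to (prints nothing if unbound).
pci_current_driver() {
    local addr="$1"
    local link="$SYSFS_ROOT/bus/pci/devices/$addr/driver"
    if [ -e "$link" ]; then
        basename "$(readlink -f "$link")"
    fi
}

# Unbind the device from whatever driver currently owns it.
# Prints the driver name so the caller can bind it back later.
pci_unbind() {
    local addr="$1" drv
    drv="$(pci_current_driver "$addr")"
    if [ -z "$drv" ]; then
        echo "$addr is not bound to any driver" >&2
        return 1
    fi
    echo "$addr" > "$SYSFS_ROOT/bus/pci/drivers/$drv/unbind"
    echo "$drv"
}

# Bind the device to the named driver.
pci_bind() {
    local addr="$1" drv="$2"
    echo "$addr" > "$SYSFS_ROOT/bus/pci/drivers/$drv/bind"
}
```

As root, something like drv="$(pci_unbind 0000:0c:00.0)" followed by pci_bind 0000:0c:00.0 "$drv" would exercise both steps; the address is only a placeholder for whichever device is being tested.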
I am seeing this issue on Arch also. I also tried Fedora 24 to see if it was an Arch-only issue.
If I start a VM and stop it shortly after, everything works fine.
If I start a VM and game for a while, the host totally locks up on VM shutdown. Tailing the journal to see if anything gets logged shows nothing (it hangs before any errors are logged). I have to hard power cycle the PC to regain use.
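Since a tailed journal shows nothing before the hang, one thing that might help is forcing each kernel message to disk the moment it appears. This is only a sketch under assumptions: the helper name log_synced and the output path in the usage note are hypothetical, and if the host locks up before the kernel emits anything, this will capture nothing either. In that case a serial console or netconsole may be a better bet.

```shell
#!/bin/bash
# Sketch: append every line read from stdin to a file and flush to disk
# after each line, so the last messages before a hard freeze have a
# chance of surviving the power cycle.
log_synced() {
    local out="$1" line
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$out"
        sync    # flush all pending writes; slow, but that is the point here
    done
}
```

A possible use, run as root before shutting the guest down, is dmesg --follow | log_synced /root/freeze-debug.log (the path is just an example).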
I'm willing to do any test to try to figure this out.
Hardware details:
i7-5820K 3.3 GHz (hex core)
12g ram
ASRock X99 Extreme4 LGA2011
GTX 970 nvidia drivers (pass thru card) using Display port
Asus Rog Swift 27"
Oh, I should post my hardware:
i7-5820K (also) (4/6 cores (8/12 threads) being passed to VMs)
12GB RAM (also) (8GB being passed to VMs)
MSI X99 SLI Plus (though I don't use SLI)
NVidia GTX 960 2GB pass-thru (also had this problem on a GTX 660 before that died)
GT 740 host card, using nouveau when VMs are running
We have some pretty similar hardware there.
Here is my startup script.
#!/bin/bash
echo "Starting virtual machine..."
# Start each boot from a fresh copy of the OVMF variable store.
cp /usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd /tmp/my_vars.fd
# The two vfio-pci entries (01:00.0 and 01:00.1) are the host PCI functions
# passed through to the guest.
# The last option must not end with a backslash, or the following
# "exit 0" line would be passed to QEMU as extra arguments.
sudo \
qemu-system-x86_64 \
-name "Windows 10" \
-enable-kvm \
-m 12288 \
-cpu host,kvm=off \
-smp threads=2,cores=4,sockets=1 \
-vga none \
-soundhw hda \
-net nic -net bridge,br=br0 \
-usb -usbdevice host:1af3:0001 -usbdevice host:04d9:2221 -usbdevice host:046d:0a4d \
-device vfio-pci,host=01:00.0,multifunction=on \
-device vfio-pci,host=01:00.1 \
-drive if=pflash,format=raw,readonly,file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd \
-drive if=pflash,format=raw,file=/tmp/my_vars.fd \
-boot order=cd \
-device virtio-scsi-pci,id=scsi \
-drive file=/home/jason/kvm/win.img,id=disk,format=qcow2,if=none,cache=writeback -device scsi-hd,drive=disk
exit 0
I should also post my "scripts" (libvirt XML files in my case):
But, since the Windows VM and Linux VM are completely identical beyond the OS that's installed, I don't think our VM configurations have anything to do with this bug. I mean, they aren't completely identical right now because I removed the HDMI sound card from the Linux VM in favor of PulseAudio "network" streaming, but I did that recently and they had the same behavior or lack thereof before I did that.
Also, yeah, the Linux one is called SteamOS, but it is actually just an almost identical install of Arch. SteamOS wasn't playing nice with most of my hardware when I tried to install it.
I think this is what's happening to me on my Windows 8.1 VM, although it might be slightly different.
Just about everything you guys talked about applies, except I don't have to shut down for it to freeze up in my case (although if it's on for long enough and I shut it off, it freezes). It freezes up on its own, seemingly at random, taking the host with it.
It first happened to me on a freshly installed Arch (Antergos); then I tried it on Debian after updating my kernel from 4.3 to 4.5 (there was a bug that made the VM excruciatingly slow before 4.4) and it happened again.
My hardware:
i7 5820k
8GB Ram (Upgrading to 32GB when the ram I ordered gets here)
MSI X99S SLI Plus
AMD Radeon R9-270X (Host GPU using "radeon" drivers)
AMD Radeon HD 6950 1GB (Passthrough GPU)
Interesting that aside from the GPUs (which I'm pretty sure aren't the problem) we all have very similar hardware.
When I get some free time I'll try to replicate this bug on another OS but I have a feeling I'll just get the same result. I just want to see if it'll happen no matter what distro I use.
I doubt you have a different issue. My VM has randomly hung my computer without a shutdown a few times during the life of this bug, and there are two very plausible ways it could happen: the VM suddenly crashed, creating a situation similar to it shutting down, or something on your host caused some PCI device to be bound to or unbound from a driver.
I see, it's definitely the same issue then.
Could it be something to do with how our hardware unbinds and binds PCI devices, or something of the sort? I sort of doubt it, but it is strange that someone with a more different CPU/mobo combo hasn't reported this problem yet.
That being said, we have a very small sample size so I don't know if that means anything.
Whoops, I clicked the wrong button and added the wrong thing for Arch Linux, and I don't know how to delete it. (new to launchpad here)
OK, I figured out how to delete it.
I am having the exact same issue!
My Setup:
Model: unRaid 6.2 Beta
M/B: ASUSTeK Computer INC. - Z8P(N)E-D12(X)
CPU: Intel® Xeon® CPU X5690 @ 3.47GHz
HVM: Enabled
IOMMU: Enabled
Cache: 384 kB, 1536 kB, 12288 kB
Memory: 32768 MB (max. installable capacity 96 GB)
Network: bond0: fault-tolerance (active-backup), mtu 1500
eth0: 100Mb/s, Full Duplex, mtu 1500
eth1: 1000Mb/s, Full Duplex, mtu 1500
Kernel: Linux 4.4.6-unRAID x86_64
OpenSSL: 1.0.2g
Hardware inventory (key entries):
System: Desktop Computer, ASUSTeK Computer INC. Z8P(N)E-D12(X) motherboard, SMBIOS version 2.6, DMI version 2.6
BIOS: American Megatrends Inc., dated 06/25/2012
CPU: 2 x Intel(R) Xeon(R) CPU X5690 @ 3.47GHz (cpu@0 and cpu@1), HyperThreading; per CPU: L1 cache 393216 bytes, L2 cache 1572864 bytes, L3 cache 12582912 bytes
System Memory: 34359738368 bytes, multi-bit ECC, DDR3 1333 MHz; 8589934592-byte modules in DIMM_A1, DIMM_B1, DIMM_D1 and DIMM_E1, all other slots empty
Chipset: Intel 5520 I/O Hub with PCI Express Root Ports 1-10 (pci@0000:00:01.0 to 00:0a.0), QuickData Technology devices (00:16.0 to 00:16.7), 82801JI (ICH10 Family) USB UHCI/EHCI controllers (00:1a.x, 00:1d.x), HD Audio (00:1b.0), SATA AHCI (00:1f.2), SMBus (00:1f.3), Xeon 5600 Series QPI and memory controller registers (buses fe and ff)
pci@0000:10:00.0: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
pci@0000:0c:00.0: Etron Technology, Inc. EJ188/EJ198 USB 3.0 Host Controller
pci@0000:0a:00.0: NVIDIA Corporation GM204 [GeForce GTX 970]
pci@0000:0a:00.1: NVIDIA Corporation GM204 High Definition Audio Controller
pci@0000:04:00.0, 04:00.1: NEC Corporation uPD720400 PCI Express - PCI/PCI-X Bridge
pci@0000:03:00.0: Intel Corporation 82574L Gigabit Network Connection (eth0)
pci@0000:02:00.0: Intel Corporation 82574L Gigabit Network Connection (eth1)
pci@0000:01:01.0: ASPEED Technology, Inc. ASPEED Graphics Family
pci@0000:01:04.0: Integrated Technology Express, Inc. IT8213 IDE Controller
Disks on the SAS controller: Seagate ST1000LX001-1EM1 (/dev/sdg, /dev/sdl, /dev/sdm, /dev/sdn), Seagate ST9750420AS (/dev/sdh, /dev/sdi, /dev/sdj, /dev/sdk), each with an MS-DOS partition table and one Linux filesystem partition
/dev/sda: SanDisk Cruzer Fit (USB, removable), FAT32 SYSLINUX partition mounted at /boot
/dev/sdb: WD My Book 1140 (USB), HPFS/NTFS partition
/dev/sdc: KINGSTON SV300S3, Linux filesystem partition mounted at /mnt/cache
/dev/sdd: Samsung SSD 850, GUID partition table, Windows NTFS and FAT32 partitions
/dev/sde: SanDisk Ultra II, GUID partition table, FAT32 EFI partition, EXT2/EXT3 partition, LVM Physical Volume
/dev/sdf: SanDisk Ultra II, MS-DOS partition table, Windows NTFS partition
Network interfaces: vnet0, virbr0-nic, gretap0, docker0, bond0, br0, virbr0, vnet1, veth5321ae1
I have 2 running virtual machines.
1. Ubuntu Server 16.04 acting as a headless game server
2. Windows 10 Pro used for gaming and other daily activities
I too can start/stop the Win 10 VM for a period of time after a cold boot, but if it has been logged in for a certain period of time, the entire system freezes when I go to shut it down. I can reboot the Ubuntu server at will. It too has an SSD being passed through.
Win 10 VM
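Name: csmccarronwx00
UUID: 82c5e4f6-6991-cd5f-8207-49db04386cc9
Description: csmccarronwx00 i440fx-2.5 OVMF
Memory: 11010048 (current memory: 11010048)
vCPUs: 12
Resource partition: /machine
OS type: hvm
Loader: /usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd
NVRAM: /etc/libvirt/qemu/nvram/82c5e4f6-6991-cd5f-8207-49db04386cc9_VARS-pure-efi.fd
On poweroff: destroy, on reboot: restart, on crash: restart
Emulator: /usr/local/sbin/qemu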
Ubuntu Server VM
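Name: Ubuntu Server
UUID: 232de5eb-2276-0762-2e29-29dc917ef34d
Description: Ubuntu Server Q35 OVMF
Memory: 11010048 (current memory: 11010048)
vCPUs: 4
Resource partition: /machine
OS type: hvm
Loader: /usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd
NVRAM: /etc/libvirt/qemu/nvram/232de5eb-2276-0762-2e29-29dc917ef34d_VARS-pure-efi.fd
On poweroff: destroy, on reboot: restart, on crash: restart
Emulator: /usr/local/sbin/qemu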
Well, now we finally know that it isn't the i7-5820K's or X99 chipset's or LGA 2011 socket's faults.
I have tried everything to keep it from happening but have had no success. The likelihood of an entire system lockup depends on how long the Win 10 VM has been on. I have not timed it personally, but I can usually shut down or restart without problems for about an hour, maybe more.
My Ubuntu VM is not affected by this issue. I am passing through 4 vCPUs and an SSD that the VM boots from.
What can we do to help troubleshoot this issue? I find it strange that the problem happens at VM power-off and not while the VM is in use. What happens at VM power-off that can lock up the entire system and cause CPU stall errors?
The posts in syslog vary from time to time but they all end in cpu stalls.
Additional syslog image
I am not having any issues with my drives during normal operation on the server. I only see the ata errors when the system locks up.
If there is something I can do please let me know. I have been trying to figure this out for over a month now but have had no luck.
Remember, I think we've done enough testing to know that it isn't specifically the VM shutting down that causes this, but the binding or unbinding of PCI devices in sysfs, which is something a VM will do on shutdown if you're passing hardware into it. It *is* caused by the VM running for more than an hour, but it is *not* technically caused by the shutdown itself. I titled it as a shutdown issue because that's pretty much the only situation anybody's going to notice this problem, and we need to be Google-friendly.
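For anyone who wants to reproduce the trigger without involving a VM at all, here is a minimal sketch of what "binding or unbinding in sysfs" means in practice. It assumes the standard `/sys/bus/pci` layout; the helper names `unbind` and `bind`, the example address, and the `root` parameter (there only so the functions can be tried against a scratch directory) are my own and not something from this thread. On an affected host, running this as root after the VM has been up long enough may well freeze the machine, so treat it as a reproduction aid only.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci")  # standard sysfs location on Linux


def unbind(address: str, root: Path = SYSFS_PCI) -> None:
    """Detach the device at `address` (e.g. "0000:04:00.0") from its driver."""
    driver_link = root / "devices" / address / "driver"
    if not driver_link.exists():
        return  # no driver is bound, so there is nothing to unbind
    # Equivalent to: echo <address> > /sys/bus/pci/devices/<address>/driver/unbind
    (driver_link / "unbind").write_text(address)


def bind(address: str, driver: str, root: Path = SYSFS_PCI) -> None:
    """Attach the device at `address` to `driver` (e.g. "vfio-pci")."""
    # Equivalent to: echo <address> > /sys/bus/pci/drivers/<driver>/bind
    (root / "drivers" / driver / "bind").write_text(address)
```

Calling `unbind("0000:04:00.0")` followed by `bind("0000:04:00.0", "vfio-pci")` mirrors the unbind-then-bind sequence mentioned in the original report; the address is a placeholder for whatever `lspci -D` shows on your system.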
Has anyone found a way to shut down or restart the VM without causing a system lockup, or is this just the way it is until a fix is found?
I've got the same issue. Pretty much just as it has been described by everyone else. Same on shutdown or certain events. Same for delay. Similar setups and hardware/software. (X99, Arch, Qemu, libvirt, pcie passthrough, windows 10, etc...) I've attached my system info (Hardware, lscpu, Archlinux package versions, qemu/libvirt xml files).
Brand new pc build, super fresh and clean system and images. Run 2 different Windows 10 vms, and occasionally another Arch vm for some game server stuffs.
What is the proper way of going about troubleshooting such things? Is there a way to enable a kernel debug mode or anything? I develop software and hardware, and am a novice linux user, just haven't ever troubleshot a hard lock like this. Willing to help if anyone can give me some direction. :)
Unsure how to edit a post.
Also wanted to say, I can provide BIOS settings later, and any kernel logs if anyone wants. Wanted to note though that I am using UEFI with GPT-style partitioning. I'm using btrfs for the host fs and OVMF for guests (see the package list in my system info for versioning). Guest main drive images are qcow2. Some SATA hard drives with NTFS partitions are passed through to each guest as additional storage. systemd-boot is the boot manager.
Can't think of much else, but hoping to get this fixed up.
Well, that's a bunch more stuff ruled out. My host is a BIOS with MBR partitioning, using ext4, and the images are all raw. For each guest, there's an image of the OS (so the C: drive on Windows and the / partition on Linux) on my SSD, and Windows also has a bigger image on my HDD (drive D:). I don't pass in any storage media; just the video card, its HDMI soundcard, and a USB card.
Jimi, does your HDMI sound lag? I am using a USB sound card and tried switching to the GTX 970 sound, and I got horrible lag; it sounds like the audio is in slow motion. It was completely unusable.
Chris
I know it didn't with the GTX 660. It worked perfectly fine. But, I went fully into Steam streaming everything before I got the 960, so the 960 could have that issue for all I know.
I have been able to stop this from happening by recompiling my kernel without SND support. If you can live without sound on your host (it is still there in your guest if you pass through the sound device of your card), then try removing SND support from your host's kernel. If snd and snd-hda-intel are built as modules, you can also try blacklisting them instead of removing them from your kernel. I have not had a crash from a shutdown in a couple of months after removing SND from my host's kernel. In my mind, that points more of a finger at the idea that the root of the problem has to do with binding/unbinding of the device.
Chris, for your HDMI sound issue there are a couple of things that might help. I would have that issue immediately if I was using a certain virtual network card in the guest. Using virtio as your network driver helps quite a bit, however it would still mess up on me every now and again. In order to fix everything, I switched it over to MSI signalling from IRQ on the sound device in Windows 10. I also switched the graphics card driver over to MSI and have to switch them each time one of the nVidia drivers gets an update.
Hm. Sound was the issue in that other bug. Have you already confirmed that you don't have that other, similar bug? If you undo all the other fixes you've done, including enabling SND again, does the VM still crash if you have NO sound device assigned to it at all, whether it be a pass-thru device or a virtual one?
I'm not really sure what the other similar bug was, but what I was experiencing was a Win10 VM locking up the host machine upon shutdown of the VM after several minutes of gaming (or even several hours of YouTube/Netflix). It didn't happen all of the time, but most of the time after the VM had been up for a while.
I am positive that recompiling without SND support is when the host stopped crashing upon shutdown of the Windows 10 VM as I was only doing one change at a time. I had the issue for many months before removing CONFIG_SND. Since then, 2 months ago, I've upgraded qemu, libvirt, the kernel and win10 updates, including the nVidia drivers. I'm not really wanting to compile SND back in as my server is also doing a lot more than just hosting a Win10 VM and I don't want it to crash without anyone else trying the fix. If others try removing SND and continue to have the issue, I will recompile to help troubleshoot but I am very confident that is what stopped my system from locking up when shutting down a Windows 10 VM. If I were to take a guess, my guess is that just removing snd-hda-intel would do the trick.
My hardware is an X99 board, an i7-5820K, and an nVidia 980 graphics card being passed through to the guest. The host video card is a cheap 1x Radeon with HDMI sound.
I will try to blacklist the sound module in the unRaid kernel. Waiting on instructions on how to do it.
Chris
If your Windows VM has, and always has had, a sound card passed in (like the .1 address of your video card), then we can't know for sure that you don't have that other bug. In that other bug, you can fix the crash by not passing in any sound cards, real or virtual, to the VM. It's definitely not the same bug as this one.
Well for now my issue is resolved. This morning when I was shutting down my unRaid server to blacklist the intel sound module, snd-hda-intel, I first stopped my ubuntu vm and my two dockers then logged out of unraid. I then proceeded to shutdown my Windows 10 VM and like magic it shutdown nicely without locking up the entire system. Also, I found out from unRaid tech support that the unRaid kernel does not include any sound modules and it was not necessary to blacklist them.
So this is what I have changed since the last lockup last Thursday night.
1. Removed the NVIDIA Audio hardware from the VM Setup. I did this because the sound was lagging horribly and I could not figure out how to fix it. So I removed the sound hardware and I am now using a USB sound card that is plugged into the USB3 PCI-Express card that is being passed to the VM.
2. I enabled MSI Interrupts on the GPU using this URL as my reference.
http://lime-technology.com/wiki/index.php/UnRAID_6/VM_Guest_Support#Enable_MSI_for_Interrupts_to_Fix_HDMI_Audio_Support
I should also mention that while I have the system NIC, USB1, and USB2 virtual modules mapped, they are disabled in the VM. I did this to improve latency issues inside the VM. I am using a wireless NIC plugged into the USB3 PCI-Express card and I do not require USB1 or USB2. These changes were made on Thursday prior to the last lockup, so while I do believe they have helped overall latency, they had no effect on the system locking up.
USB3 card is handling Logitech G910 keyboard, WOW MMO Legendary Gaming Mouse, ASUS XONARU3 Sound Card, ASUS USB-AC56 Wireless NIC, and a USB Mouse.
I still would like to add the NVIDIA sound card back into the VM, and when I do I will enable MSI interrupts. My goal is to not have to use the USB sound card.
See next post for current VM setup.
Current VM Config
csmccarronwx0082c5e4f6-6991-cd5f-8207-49db04386cc9csmccarronwx00 i440fx-2.5 OVMF104857601048576012/machinehvm/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd/etc/libvirt/qemu/nvram/82c5e4f6-6991-cd5f-8207-49db04386cc9_VARS-pure-efi.fddestroyrestartrestart/usr/local/sbin/qemu
SYSLINUX.CFG
default /syslinux/menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS
kernel /bzimage
append isolcpus=4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 pci-stub.ids=1b6f:7052,10de:13c2,10de:0fbb intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream initrd=/bzroot
label unRAID OS GUI Mode
menu default
kernel /bzimage
append isolcpus=4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 pci-stub-ids=1b6f:7052,10de:13c2,10de:0fbb intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
kernel /bzimage
append initrd=/bzroot unraidsafemode
label Memtest86+
kernel /memtest
pci-stub-ids=1b6f:7052,10de:13c2,10de:0fbb
1b6f:7052 = Etron Technology, Inc. EJ188/EJ198 USB 3.0 Host Controller
10de:13c2 = NVIDIA Corporation GM204 [GeForce GTX 970]
10de:0fbb = NVIDIA Corporation GM204 High Definition Audio Controller
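One thing worth double-checking in the config above: the first boot entry spells the parameter `pci-stub.ids=`, while the GUI Mode entry spells it `pci-stub-ids=`. As far as I understand kernel parameter parsing, dashes and underscores are interchangeable in names but the module/parameter separator has to be a dot, so the second spelling would probably not be picked up by pci-stub. That is my reading, not something confirmed in this thread. A small sketch for checking an `append` line (the function name `stub_ids` is made up for illustration):

```python
import re


def stub_ids(append_line: str) -> list[tuple[str, str]]:
    """Return (vendor, device) pairs from a pci-stub.ids= kernel parameter."""
    for token in append_line.split():
        # Accept pci-stub.ids= or pci_stub.ids=, but require the dot
        # between the module name and the parameter name.
        match = re.fullmatch(r"pci[-_]stub\.ids=(.+)", token)
        if match:
            pairs = []
            for entry in match.group(1).split(","):
                vendor, device = entry.split(":")[:2]
                pairs.append((vendor, device))
            return pairs
    return []  # parameter absent or misspelled
```

Fed the first `append` line it returns the three vendor:device pairs listed above; fed the GUI Mode line it returns an empty list.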
So guys, new information.
I was having trouble getting the HTC Vive passed through in host mode. The thing shows up as 10+ devices! I've also some logitech webcams that don't seem to work via usb host passthrough. So I gave windows my entire usb controller (only 1 for all my ports on this mobo). Since then, I haven't noticed an issue. Furthermore, waaaay more stable overall. I used to get random blue screens.
I'm going to order a usb3 pcie card for my other windows host. For now, I'm using a remote desktop connection to it for IO.
Anyway, still tinkering. I'm curious if anyone having the issues would try with no usb 'host' passthrough?
I've been not using USB host passthrough this whole time, as my PCI USB3 card covers that need pretty well. Speaking of those cards, for those of you who also use one, does it work perfectly? If so, I'd like to know its model so I can go buy it, because while my card works, about 50% of the time I try to use it, I get some bad output when I run "dmesg | grep -i vfio" (the standard spam when a device doesn't get passed through properly that's full of messages related to power management) and the VM doesn't seem to have any access to it. When this happens, I have to restart the whole host to get another 50% chance at using the card.
FYI I had a similar issue years ago until I figured out that adding the vgarom file fixes it, eg.:
-device vfio-pci,host=04:00.0,bus=root.1,multifunction=on,x-vga=on,addr=0.0,romfile=Sapphire.R7260X.1024.131106.rom
For radeon, you can look in /sys. eg. we see /sys/devices/pci0000:00/0000:00:0b.0/0000:04:00.0/rom, and first we `echo 1 > rom` to prevent "invalid argument" error, and then `cat rom > ~/yourfile.rom` and you have it.
For nouveau, you have to bind nouveau driver (rather than vfio-pci) and you can find it somewhere like /sys/kernel/debug/dri/0/vbios.rom
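The sysfs steps for the Radeon case can be sketched roughly like this. The two function names are hypothetical, and the path you pass in is whatever your own card's `rom` file is. Writing "0" back afterwards is not mentioned in the comment above; it follows the kernel's sysfs-pci documentation as I remember it. The signature check relies on PCI option ROMs starting with the bytes 0x55 0xAA, which is a quick sanity test and not a guarantee the dump is usable.

```python
from pathlib import Path


def dump_rom(rom: Path, out: Path) -> bytes:
    """Copy a PCI device's option ROM from its sysfs `rom` file to `out`."""
    rom.write_text("1")  # same as `echo 1 > rom`; avoids "invalid argument"
    try:
        data = rom.read_bytes()  # same as `cat rom`
    finally:
        rom.write_text("0")  # assumption: disable ROM access again when done
    out.write_bytes(data)
    return data


def has_rom_signature(data: bytes) -> bool:
    """True if the dump starts with the 0x55 0xAA option ROM signature."""
    return data[:2] == b"\x55\xaa"
```

Something like `has_rom_signature(dump_rom(Path(".../0000:04:00.0/rom"), Path("card.rom")))` (run as root, with the elided part filled in for your system) would tell you whether the file is at least plausibly a ROM before you hand it to `romfile=`.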
Can someone else please confirm that? I can't test it because nouveau doesn't support the GTX 960 yet. If it turns out solid, then I could just ask EVGA support for the rom file.
I just added the romfile argument to mine, will report back later tonight. (Don't want to reboot now, as my machine will hang and I'm at work)
I got impatient and got the rom file from EVGA and loaded it in, but for me and my GTX 960, I get no graphical output when it's loaded. I don't know anything beyond that. I don't get any error messages in dmesg or anything--just no video output whatsoever. It was also strangely booting into the Tianocore UEFI command line instead of Windows, so there could be something else going on here for me that stayed broken after I removed the romfile option.
I managed to fix that issue and properly load the VM with the rom file (what had gone wrong was it inexplicably acted like it had no hard drives, until I restored the libvirt XML file from a backup). I got a good test out of it: played video games in Windows for 2 hours, with the rom file loaded. It still froze on shutdown. So that's confirmedly not a fix.
My system has been behaving well the last couple of weeks. I can reboot at will with no lockups. I am still not passing the NVIDIA sound card to the VM and have the GPU configured to use MSI interrupts. I am not passing the ROM for my GTX 970 GPU.
I know this is not related, but I was able to lock up the entire system by installing the BOINC software and configuring it to use 100% of the CPUs and CPU time. I backed those 2 settings down to 90% and there were no more lockups.
What are MSI interrupts and how did you configure your card to use them?
Apparently passthrough devices work better when using an MSI interrupt instead of a traditional interrupt.
See post 32 https://bugs.launchpad.net/qemu/+bug/1580459/comments/32 item 2.
2. I enabled MSI Interrupts on the GPU using this URL as my reference.
http://lime-technology.com/wiki/index.php/UnRAID_6/VM_Guest_Support#Enable_MSI_for_Interrupts_to_Fix_HDMI_Audio_Support
Chris
I enabled MSI interrupts, and now for 2 nights in a row I gamed 2 hours straight and shut down the Windows VM without a freeze. Never in my 7 months of living with this bug have I gotten no freeze twice in a row. I think the MSI interrupts have fixed it for me, and no, I did not remove my HDMI sound card from the VM, so that wasn't part of the issue and should be safe to leave in for those who needed this fix. That's 2 people this fix has worked for now. Hopefully it'll work for the rest of you, too. I'll post back if I ever get this freeze again, after confirming it hasn't suddenly switched my hardware off MSI interrupts or anything.
Note: I didn't just make my video card use MSI interrupts. Most of the VM's hardware was already set to use them by default--namely the VirtIO stuff--and I set EVERYTHING else to also use it, which is the video card, its HDMI, the USB3 card, and the virtual USB2 controller that I don't need but libvirt refuses to remove. I figured that'd work out because the USB3 card is also PCIe, which works better with MSI, and the USB2 controller doesn't matter. So, if this doesn't fix it for you, try making every last MSI-capable device use MSI interrupts.
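For reference, the switch to MSI is a per-device registry value inside the Windows guest. The sketch below uses the key layout I know from Microsoft's driver documentation (`MSISupported` under `MessageSignaledInterruptProperties`); the thread itself does not spell the key out, so treat that as my assumption. The device instance path is a placeholder you would copy from Device Manager, and both function names are made up for illustration. `enable_msi` only works on Windows, from an elevated prompt, and the guest needs a reboot afterwards.

```python
ENUM_PCI = r"SYSTEM\CurrentControlSet\Enum\PCI"


def msi_key_path(device_instance: str) -> str:
    """Build the HKLM-relative key that holds a PCI device's MSI setting."""
    return (
        f"{ENUM_PCI}\\{device_instance}\\Device Parameters"
        "\\Interrupt Management\\MessageSignaledInterruptProperties"
    )


def enable_msi(device_instance: str) -> None:
    """Set MSISupported=1 for one device (Windows guest only)."""
    import winreg  # standard library, but only available on Windows

    path = msi_key_path(device_instance)
    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, path, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, "MSISupported", 0, winreg.REG_DWORD, 1)
```

To follow the "make every MSI-capable device use MSI" advice, you would call `enable_msi` once per passed-through device (video card, its HDMI audio function, the USB3 card), each with its own instance path.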
That's good to know. I want to re-enable my Nvidia sound card as well.
Note: When you update the video card driver, it will disable the MSI interrupt so you will have to reenable it.
I was also experiencing the host hard-locking when shutting down a Windows 10 guest with an Nvidia GPU passed through, but the issue appears to be completely solved after switching the card to MSI mode in the Windows guest.
However, I would be interested in understanding *why* using the card in line-interrupt mode in the guest causes the host to lockup when the guest relinquishes control of the device. Is it a bug in qemu or vfio, or even the Linux kernel?
I don't know if it's relevant, but I've noticed that when the card is not being used by the guest, it is listed as MSI: Enable- by lspci, suggesting that vfio is keeping the card in line-interrupt mode when not in use.
Oh, that is interesting. Using lspci -v on my computer reveals that Linux tends to default to enabling MSI on my PCIe devices that support it (since the common opinion is that it's better for PCIe), including all my graphics cards, so the fact that vfio-pci and Windows 10 both default to disabling it is pretty odd indeed.
(Forgot to clarify: yes, vfio-pci devices disable MSI by default for me just like for Clif Houck, but all other PCIe devices have it enabled.)
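If anyone wants to compare MSI state across devices on their own host, here is a rough sketch that reads `lspci -v` output and reports the `MSI: Enable+` / `MSI: Enable-` flag per device. The function name `msi_state` is made up, and it deliberately ignores `MSI-X:` lines, which are a separate capability. Capability lines may be missing unless lspci is run as root.

```python
import re

# Device header lines look like "01:00.0 VGA compatible controller: ..."
# (optionally prefixed with a PCI domain such as "0000:").
HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ", re.I)
# Matches "MSI: Enable+" or "MSI: Enable-", but not "MSI-X: Enable+".
MSI = re.compile(r"\bMSI: Enable([+-])")


def msi_state(lspci_text: str) -> dict[str, bool]:
    """Map each device address to True if MSI is enabled, False if disabled."""
    state: dict[str, bool] = {}
    slot = None
    for line in lspci_text.splitlines():
        header = HEADER.match(line)
        if header:
            slot = header.group(1)
            continue
        flag = MSI.search(line)
        if flag and slot is not None:
            state[slot] = flag.group(1) == "+"
    return state
```

Running it over output captured once while the guest is up and once after it shuts down would show whether a vfio-pci device really drops back to `Enable-` when idle, as described above.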
Hi guys, not sure if I'm on the right track here but I think I'm experiencing the same issue. My install might be a bit of a mess combining bits from the VFIO Tips site and Ubuntu guides on GPU passthrough, but I *did* have it all working for a few hours at a stretch before I got this lock up.
The trouble with this is that after the host lockup, the Windows VM seems to corrupt the EFI config or something like that as I can never get it to boot again properly, even though the main partition seems fine when tested in a bootable WinPE distro.
I'd be happy to supply versions and configs to help if it's related however.
Enabling MSI interrupts works for me. One note is that Windows updates will sometimes revert the changes so if this starts breaking after an update you may need to re-apply the registry changes.
Updating NVIDIA drivers in the guest also seems to disable MSI for some reason. Oddly enough I did not run into the host hard locking though.
I haven't remembered to reset those interrupts in a year, but I also haven't remembered to update my drivers in about as long, so I could be still on the right setting. I've also been on AMD for that year, and I don't remember whether this bug applies to modern AMD cards.
I've been experiencing something that sounds very similar to what has been described in this issue, and I want to see if you guys think it's the same one. For me, everything is fine for a while after a cold boot, and I can restart my VM and such just fine. But after a long time, or after stressful work such as mining or gaming, if I shut down my VM, the host displays all go to sleep and the system locks up, which I had been assuming was a display driver crash. I can also sometimes trigger the exact same lockup by calling lspci. Once such a lockup has happened, I have to hard reset. Where this gets even weirder is that after this happens, I get the same lockup during the startup process, around when Xorg loads. When this happens, I either have to leave my computer alone for around 30 minutes to an hour, or I can get it to boot by disabling the IOMMU with iommu=off as a kernel param; if I then wait around 30 minutes to an hour, I can restart and it will boot fine again with iommu=pt (I get a kernel panic if I don't use iommu=pt).
Hardware
Ryzen R5 1600
asrock ab350m pro4
32gb ram
Host gpu RX580
Guest gpu GTX1070
Looking through old bug tickets... can you still reproduce this issue with the latest available versions? Or could we close this ticket nowadays?
I am no longer having any issues at all. I am using the NVidia Sound Card as well.
My hardware and the way I run my VM are both now very different from back then, and I haven't had the issue described here for years. So either it was fixed or I'm no longer an accurate test subject.
Ok, thanks for answering! So I'm closing this issue now. In case anybody still has similar issues, please open a new bug ticket instead.