PCIe cards passthrough to TCG guest works on 2GB of guest memory but fails on 4GB (vfio_dma_map invalid arg) During one meeting coworker asked "did someone tried to passthrough PCIe card to other arch guest?" and I decided to check it. Plugged SATA and USB3 controllers into spare slots on mainboard and started playing. On 1GB VM instance it worked (both cold- and hot-plugged). On 4GB one it did not: Błąd podczas uruchamiania domeny: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22 2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument) Traceback (most recent call last): File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper callback(asyncjob, *args, **kwargs) File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb callback(*args, **kwargs) File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 66, in newfn ret = fn(self, *args, **kwargs) File "/usr/share/virt-manager/virtManager/object/domain.py", line 1279, in startup self._backend.create() File "/usr/lib64/python3.8/site-packages/libvirt.py", line 1234, in create if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self) libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22 2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument) I played with memory and 3054 MB is maximum value possible to boot VM with coldplugged host PCIe cards. Qemu command line for booting VM was generated by libvirt: /usr/bin/qemu-system-aarch64 -name guest=fedora-aarch64-pcie,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-fedora-aarch64-pcie/master-key.aes -blockdev {"driver":"file","filename":"/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"} -blockdev {"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/fedora-aarch64-pcie_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"} -machine virt-4.2,accel=tcg,usb=off,dump-guest-core=off,gic-version=2,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format -cpu cortex-a57 -m 2048 -overcommit mem-lock=off -smp 3,sockets=3,cores=1,threads=1 -uuid 139dc97a-1511-480d-b215-c58a5c80e646 -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=32,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 -device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 -device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 -device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 -device pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4 -device pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5 -device pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6 -device pcie-root-port,port=0xf,chassis=8,id=pci.8,bus=pcie.0,addr=0x1.0x7 -device virtio-serial-pci,id=virtio-serial0,bus=pci.2,addr=0x0 -netdev tap,fd=29,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:87:3e:d3,bus=pci.1,addr=0x0 -chardev pty,id=charserial0 -serial chardev:charserial0 -chardev socket,id=charchannel0,fd=33,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on -device virtio-gpu-pci,id=video0,max_outputs=1,bus=pci.7,addr=0x0 -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0 -device vfio-pci,host=0000:28:00.0,id=hostdev1,bus=pci.4,addr=0x0 -device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on lspci -vvv output of used cards (host side): 28:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03) (prog-if 30 [XHCI]) DeviceName: RTL8111EPV Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- hrw: under /sys/kernel/iommu_groups/ there's a reserved_regions file for every group. cat the ones associated with the groups for these devices 14:59 < hrw> 14:58 (0s) hrw@puchatek:28$ cat reserved_regions 14:59 < hrw> 0x00000000fee00000 0x00000000feefffff msi 14:59 < hrw> 0x000000fd00000000 0x000000ffffffffff reserved 14:59 < hrw> 14:59 (2s) hrw@puchatek:27$ cat reserved_regions 14:59 < hrw> 0x00000000fee00000 0x00000000feefffff msi 14:59 < hrw> 0x000000fd00000000 0x000000ffffffffff reserved 15:00 < aw> of course, you're on an x86 host, arm has no concept of not mapping memory at 0xfee00000 Summary: ARM64 TCG VM on x86 host has GPA overlapping host reserved IOVA regions, vfio cannot map these. The VM needs holes in the GPA to account for this. The 2nd and 3rd args of the vfio_dma_map error report are the starting IOVA address and length respectively. My (limited) understanding of PCI was that the board and the PCI controller emulation define what the windows are where PCI BARs can live, and that PCI cards go there. (This certainly works for emulated PCI cards.) It's not clear to me why the host system imposes restrictions on the memory layout of the VM... This is not related to the BARs, the mapping of the BARs into the guest is purely virtual and controlled by the guest. The issue is that the device needs to be able to DMA into guest RAM, and to do that transparently (ie. the guest doesn't know it's being virtualized), we need to map GPAs into the host IOMMU such that the guest interacts with the device in terms of GPAs, the host IOMMU translates that to HPAs. Thus the IOMMU needs to support GPA range of the guest as IOVA. However, there are ranges of IOVA space that the host IOMMU cannot map, for example the MSI range here is handled by the interrupt remmapper, not the DMA translation portion of the IOMMU (on physical ARM systems these are one-in-the-same, on x86 they are different components, using different mapping interfaces of the IOMMU). Therefore if the guest programmed the device to perform a DMA to 0xfee00000, the host IOMMU would see that as an MSI, not a DMA. When we do an x86 VM on and x86 host, both the host and the guest have complimentary reserved regions, which avoids this issue. Also, to expand on what I mentioned on IRC, every x86 host is going to have some reserved range below 4G for this purpose, but if the aarch64 VM has no requirements for memory below 4G, the starting GPA for the VM could be at or above 4G and avoid this issue. That's an awkward flaw in the IOMMU design :-( I wrote blog post about it: https://marcin.juszkiewicz.com.pl/2020/03/25/sharing-pcie-cards-across-architectures/ I wanted to try the same on machine without MSI. But my desktop refuses to boot into sane environment with pci=nomsi kernel option. Do we need something like a table of excluded IOVA regions in ACPI or somewhere; in a similar way we have a region of exluded physical ram areas? Or is the range of excluded IOVA's constant on any one architecture so it doesn't normally need to worry about it? Yes, to support this the vfio driver would need to be able to exclude ranges of guest GPA space. Recent kernels already expose an IOVA list via the vfio API. Clearly, excluding GPA ranges has implications for hotplug. On x86 the ranges are pretty consistent, but IIRC differ between Intel and AMD. The vfio IOVA list was originally developed for an ARM use case though, and I imagine there's little or no consistency there. Re: pci=nomsi, the reserved range exists regardless of whether MSI is actually used. I am experiencing the same behaviour for x86_64 guest on x86_64 host to which I'm attempting to efi boot (not hotplug) with a pcie gpu passthrough This discussion (https://www.spinics.net/lists/iommu/msg40613.html) suggests a change in drivers/iommu/intel-iommu.c but it appears that in the kernel I tried, the change it is already implemented (linux-image-5.4.0-39-generic) hardware is a hp microserver gen8 with conrep physical slot excluded in bios (https://www.jimmdenton.com/proliant-intel-dpdk/) and the kernel is rebuild with rmrr patch (https://forum.proxmox.com/threads/compile-proxmox-ve-with-patched-intel-iommu-driver-to-remove-rmrr-check.36374/) also an user complains that on the same hardware it used to work with kernel 5.3 + rmrr patch (https://forum.level1techs.com/t/looking-for-vfio-wizards-to-troubleshoot-error-vfio-dma-map-22/153539) but it stopped working on the 5.4 kernel. is this the same issue I'm observing? my qemu complains with the similar message: -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.4,addr=0x0: vfio_dma_map(0x556eb57939f0, 0xc0000, 0x3ff40000, 0x7f6fc7ec0000) = -22 (Invalid argument) /sys/kernel/iommu_groups/1/reserved_regions shows: 0x00000000000e8000 0x00000000000e8fff direct 0x00000000000f4000 0x00000000000f4fff direct 0x00000000d5f7e000 0x00000000d5f94fff direct 0x00000000fee00000 0x00000000feefffff msi except that in my case the vm does not boot at all no matter how less memory I allocate to it. You'd need to create a 160KB VM in order to stay below the required direct map memory regions imposed by the system firmware (e8000 - c0000). Non-upstream bypasses of those restrictions are clearly not supported. You can find more details regarding the RMRR restriction here: https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf QEMU currently has no support for creating a VM memory map based on the restrictions of the host platform IOMMU requirements. Alex, thanks for the quick answer, but sadly I still do not fully understand the implications, even if I read the pdf paper on RH website you mention, as well as the vendor advisory at https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c04781229 When you say "qemu has no support", do you actually mean "qemu people are unable to help you if you break things by bypassing the in-place restrictions", or "qemu is designed to not work when restrictions are bypassed"? Do I understand correctly that the BIOS can modify portions of the system usable RAM, so the vendor specific software tools can read those addresses, and if yes, does this mean is there a risk for data corruption if the RMRR restrictions are bypassed? I have eventually managed to passthrough an nvidia card in the microserver gen8 to a windows vm using patched kernel 5.3, along with the vendor instructions to exclude the pcie slot aka the conrep solution but for it to work it still needed the "rmrr patch" aka removing the "return -EPERM" line below the "Device is ineligible [...]" in drivers/iommu/intel-iommu.c However applying the same modification to kernel 5.4 leads to the "VFIO_MAP_DMA: -22" error. Is there other place in the kernel 5.4 source that must be modified to bring back the v5.3 kernel behaviour? (ie. I have a stable home windows vm with the gpu passthrough despite all) > When you say "qemu has no support", do you actually mean "qemu people > are unable to help you if you break things by bypassing the in-place > restrictions", or "qemu is designed to not work when restrictions are > bypassed"? The former. There are two aspects to this. The first is that the device has address space restrictions which therefore impose address space restrictions on the VM. That makes things like hotplug difficult or impossible to support. That much is something that could be considered a feature which QEMU has not yet implemented. The more significant aspect when RMRRs are involved in this restriction is that an RMRR is essentially the platform firmware dictating that the host OS must maintain an identity map between the device and a range of physical address space. We don't know the purpose of that mapping, but we can assume that it allows the device to provide ongoing data for platform firmware to consume. This data might included health or sensor information that's vital to the operation of the system. It's therefore not simply a matter that QEMU needs to avoid RMRR ranges, we need to maintain the required identity maps while also manipulating the VM address space, but the former requirement implies that a user owns a device that has DMA access to a range of host memory that's been previously defined as vital to the operation of the platform and therefore likely exploitable by the user. The configuration you've achieved appears to have disabled the host kernel restrictions preventing RMRR encumbered devices from participating in the IOMMU API, but left in place the VM address space implications of those RMRRs. This means that once the device is opened by the user, that firmware mandated identity mapping is removed and whatever health or sensor data was being reported by the device to that range is no longer available to the host firmware, which might adversely affect the behavior of the system. Upstream put this restriction in place as the safe thing to do to honor the firmware mapping requirement and you've circumvented it, therefore you are your own support. > Do I understand correctly that the BIOS can modify portions of the > system usable RAM, so the vendor specific software tools can read > those addresses, and if yes, does this mean is there a risk for > data corruption if the RMRR restrictions are bypassed? RMRRs used for devices other than IGD or USB are often associated with reserved memory regions to prevent the host OS from making use of those ranges. It is possible that privileged utilities might interact with these ranges, but AIUI the main use case is for the device to interact with the range, which firmware then consumes. If you were to ignore the RMRR mapping altogether, there is a risk that the device will continue to write whatever health or sensor data it's programmed to report to that IOVA mapping, which could be a guest mapping and cause data corruption. > Is there other place in the kernel 5.4 source that must be modified > to bring back the v5.3 kernel behaviour? (ie. I have a stable home > windows vm with the gpu passthrough despite all) I think the scenarios is that previously the RMRR patch worked because the vfio IOMMU backend was not imposing the IOMMU reserved region mapping restrictions, meaning that it was sufficient to simply allow the device to participate in the IOMMU API and the remaining restrictions were ignored. Now the vfio IOMMU backend recognizes the address space mapping restrictions and disallows creating those mappings that I describe above as a potential source of data corruption. Sorry, you are your own support for this. The system is not fit for this use case due to the BIOS imposed restrictions. @alex-l-williamson: is there any safe(ish) way to ignore RMRR coming from BIOS? I don't know how IOMMU actually works in the kernel but theoretically if kernel had a flag forcing it to ignore certain RMRRs? If I understand this correctly ignoring an RMRR entry may cause two things: 1) DMA failure if remapping is attempted 2) If something (e.g. KVM) touches that region because we ignored RMRR the device memory may get corrupted Linux already has mechanisms to ignore stubborn BIOSes (e.g. disabled x2APIC with no option to enable it in the BIOS). The only thing I'm worried about is the thing you said: > The more significant aspect when RMRRs are involved in this restriction is that an RMRR is > essentially the platform firmware dictating that the host OS must maintain an identity map > between the device and a range of physical address space. We don't know the purpose of that > mapping, but we can assume that it allows the device to provide ongoing data for platform > firmware to consume. Does this mean that if a kernel is "blind" to a given RMRR region something else may break because these regions need to be treated in some special manner outside of not touching them for IOMMU? The QEMU project is currently moving its bug tracking to another system. For this we need to know which bugs are still valid and which could be closed already. Thus we are setting the bug state to "Incomplete" now. If the bug has already been fixed in the latest upstream version of QEMU, then please close this ticket as "Fix released". If it is not fixed yet and you think that this bug report here is still valid, then you have two options: 1) If you already have an account on gitlab.com, please open a new ticket for this problem in our new tracker here: https://gitlab.com/qemu-project/qemu/-/issues and then close this ticket here on Launchpad (or let it expire auto- matically after 60 days). Please mention the URL of this bug ticket on Launchpad in the new ticket on GitLab. 2) If you don't have an account on gitlab.com and don't intend to get one, but still would like to keep this ticket opened, then please switch the state back to "New" or "Confirmed" within the next 60 days (other- wise it will get closed as "Expired"). We will then eventually migrate the ticket automatically to the new system (but you won't be the reporter of the bug in the new system and thus you won't get notified on changes anymore). Thank you and sorry for the inconvenience. [Expired for QEMU because there has been no activity for 60 days.]