PCIe card passthrough to a TCG guest works with 2 GB of guest memory but fails with 4 GB (vfio_dma_map invalid argument)
During a meeting a coworker asked "has anyone tried passing a PCIe card through to a guest of another architecture?", so I decided to check.
I plugged SATA and USB 3.0 controllers into spare slots on the mainboard and started experimenting. With a 1 GB VM instance it worked (both cold- and hot-plugged). With a 4 GB one it did not:
Error starting domain: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)
Traceback (most recent call last):
File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper
callback(asyncjob, *args, **kwargs)
File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb
callback(*args, **kwargs)
File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 66, in newfn
ret = fn(self, *args, **kwargs)
File "/usr/share/virt-manager/virtManager/object/domain.py", line 1279, in startup
self._backend.create()
File "/usr/lib64/python3.8/site-packages/libvirt.py", line 1234, in create
if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)
I experimented with the memory size: 3054 MB is the maximum value with which the VM boots with cold-plugged host PCIe cards.
The QEMU command line for booting the VM was generated by libvirt:
/usr/bin/qemu-system-aarch64
-name guest=fedora-aarch64-pcie,debug-threads=on
-S
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-fedora-aarch64-pcie/master-key.aes
-blockdev {"driver":"file","filename":"/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}
-blockdev {"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}
-blockdev {"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/fedora-aarch64-pcie_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}
-blockdev {"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}
-machine virt-4.2,accel=tcg,usb=off,dump-guest-core=off,gic-version=2,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format
-cpu cortex-a57
-m 2048
-overcommit mem-lock=off
-smp 3,sockets=3,cores=1,threads=1
-uuid 139dc97a-1511-480d-b215-c58a5c80e646
-no-user-config
-nodefaults
-chardev socket,id=charmonitor,fd=32,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control
-rtc base=utc
-no-shutdown
-boot strict=on
-device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1
-device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1
-device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2
-device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3
-device pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4
-device pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5
-device pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6
-device pcie-root-port,port=0xf,chassis=8,id=pci.8,bus=pcie.0,addr=0x1.0x7
-device virtio-serial-pci,id=virtio-serial0,bus=pci.2,addr=0x0
-netdev tap,fd=29,id=hostnet0
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:87:3e:d3,bus=pci.1,addr=0x0
-chardev pty,id=charserial0
-serial chardev:charserial0
-chardev socket,id=charchannel0,fd=33,server,nowait
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
-spice port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on
-device virtio-gpu-pci,id=video0,max_outputs=1,bus=pci.7,addr=0x0
-device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0
-device vfio-pci,host=0000:28:00.0,id=hostdev1,bus=pci.4,addr=0x0
-device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0
-object rng-random,id=objrng0,filename=/dev/urandom
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
-msg timestamp=on
lspci -vvv output of the cards used (host side):
28:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03) (prog-if 30 [XHCI])
DeviceName: RTL8111EPV
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 42
Region 0: Memory at f7700000 (64-bit, non-prefetchable) [size=8K]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
Vector table: BAR=0 offset=00001000
PBA: BAR=0 offset=00001080
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP-, LTR+
10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Latency Tolerance Reporting
Max snoop latency: 1048576ns
Max no snoop latency: 1048576ns
Kernel driver in use: xhci_hcd
29:00.0 Mass storage controller: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller (rev 01)
Subsystem: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 66
Region 0: Memory at f7684000 (64-bit, non-prefetchable) [size=128]
Region 2: Memory at f7680000 (64-bit, non-prefetchable) [size=16K]
Region 4: I/O ports at c000 [size=128]
Expansion ROM at f7600000 [disabled] [size=512K]
Capabilities: [54] Power Management version 2
Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [5c] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v1) Legacy Endpoint, MSI 00
DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Exit Latency L0s unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Kernel driver in use: sata_sil24
Kernel modules: sata_sil24
I am attaching it as a file as well, since it wraps in an ugly way when pasted here.
Hotplug to a VM with 3055 MB of memory ends the same way:
internal error: unable to execute QEMU command 'device_add': vfio 0000:28:00.0: failed to setup container for group 27: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x55c1aca5c5c0, 0x40000000, 0xbef00000, 0x7f3549000000) = -22 (Invalid argument)
Traceback (most recent call last):
File "/usr/share/virt-manager/virtManager/addhardware.py", line 1327, in _add_device
self.vm.attach_device(dev)
File "/usr/share/virt-manager/virtManager/object/domain.py", line 920, in attach_device
self._backend.attachDevice(devxml)
File "/usr/lib64/python3.8/site-packages/libvirt.py", line 606, in attachDevice
if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'device_add': vfio 0000:28:00.0: failed to setup container for group 27: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x55c1aca5c5c0, 0x40000000, 0xbef00000, 0x7f3549000000) = -22 (Invalid argument)
14:57 < aw> hrw: under /sys/kernel/iommu_groups/ there's a reserved_regions file for every group. cat the ones associated with the groups for these devices
14:59 < hrw> 14:58 (0s) hrw@puchatek:28$ cat reserved_regions
14:59 < hrw> 0x00000000fee00000 0x00000000feefffff msi
14:59 < hrw> 0x000000fd00000000 0x000000ffffffffff reserved
14:59 < hrw> 14:59 (2s) hrw@puchatek:27$ cat reserved_regions
14:59 < hrw> 0x00000000fee00000 0x00000000feefffff msi
14:59 < hrw> 0x000000fd00000000 0x000000ffffffffff reserved
15:00 < aw> of course, you're on an x86 host, arm has no concept of not mapping memory at 0xfee00000
Summary: an ARM64 TCG VM on an x86 host has GPA ranges that overlap host reserved IOVA regions, and vfio cannot map these. The VM needs holes in its GPA space to account for this. The 2nd and 3rd arguments of the vfio_dma_map error report are the starting IOVA and the length, respectively.
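As a back-of-the-envelope check (a rough sketch only; the 0x40000000 RAM base and the 0xfee00000 MSI window are taken from the error messages and the reserved_regions output quoted above), the 3054 MB limit falls straight out of those two numbers:

ram_base   = 0x40000000   # start of mach-virt.ram, per the vfio_dma_map arguments
msi_window = 0xfee00000   # start of the host's reserved MSI range
# Largest guest RAM size that still ends below the MSI window:
print((msi_window - ram_base) // (1024 * 1024))    # -> 3054 (MiB)
# The failing 3055 MB hotplug mapped 0xbef00000 bytes starting at 0x40000000,
# ending at 0xfef00000, i.e. inside the MSI window:
print(hex(ram_base + 0xbef00000))                  # -> 0xfef00000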
My (limited) understanding of PCI was that the board and the PCI controller emulation define the windows where PCI BARs can live, and that PCI cards go there. (This certainly works for emulated PCI cards.) It's not clear to me why the host system imposes restrictions on the memory layout of the VM...
This is not related to the BARs; the mapping of the BARs into the guest is purely virtual and controlled by the guest. The issue is that the device needs to be able to DMA into guest RAM, and to do that transparently (i.e. the guest doesn't know it's being virtualized), we need to map GPAs into the host IOMMU such that the guest interacts with the device in terms of GPAs and the host IOMMU translates them to HPAs. Thus the IOMMU needs to support the GPA range of the guest as IOVA. However, there are ranges of IOVA space that the host IOMMU cannot map; for example, the MSI range here is handled by the interrupt remapper, not the DMA translation portion of the IOMMU (on physical ARM systems these are one and the same, on x86 they are different components, using different mapping interfaces of the IOMMU). Therefore if the guest programmed the device to perform a DMA to 0xfee00000, the host IOMMU would see that as an MSI, not a DMA. When we run an x86 VM on an x86 host, both the host and the guest have complementary reserved regions, which avoids this issue.
Also, to expand on what I mentioned on IRC, every x86 host is going to have some reserved range below 4G for this purpose, but if the aarch64 VM has no requirements for memory below 4G, the starting GPA for the VM could be at or above 4G and avoid this issue.
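For anyone who wants to check this on their own host before defining a guest, a minimal sketch along these lines should work (Python; the sysfs paths are the standard iommu_groups layout, and the 0x40000000 RAM base is an assumption specific to QEMU's mach-virt machine):

#!/usr/bin/env python3
# Check whether a guest RAM range would collide with the host IOMMU
# reserved regions of the device to be passed through.
import os, sys

def reserved_regions(bdf):
    """Yield (start, end, type) for the IOMMU group of PCI device <bdf>."""
    group = os.path.basename(os.readlink(f"/sys/bus/pci/devices/{bdf}/iommu_group"))
    with open(f"/sys/kernel/iommu_groups/{group}/reserved_regions") as f:
        for line in f:
            start, end, kind = line.split()
            yield int(start, 16), int(end, 16), kind

def ram_fits(bdf, ram_base, ram_size):
    ram_end = ram_base + ram_size - 1
    ok = True
    for start, end, kind in reserved_regions(bdf):
        if start <= ram_end and ram_base <= end:
            print(f"guest RAM overlaps {kind} region {start:#x}-{end:#x}")
            ok = False
    return ok

if __name__ == "__main__":
    # usage: ./check.py 0000:29:00.0 4096   (guest RAM size in MiB)
    bdf, mib = sys.argv[1], int(sys.argv[2])
    base = 0x40000000                       # mach-virt RAM base (assumed)
    print("fits" if ram_fits(bdf, base, mib * 1024 * 1024) else "does not fit")

Run against group 28 from this report, 4096 MiB would flag the msi region at 0xfee00000 (matching the failure above), while 3054 MiB would pass.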
That's an awkward flaw in the IOMMU design :-(
I wrote a blog post about it: https://marcin.juszkiewicz.com.pl/2020/03/25/sharing-pcie-cards-across-architectures/
I wanted to try the same on a machine without MSI, but my desktop refuses to boot into a sane environment with the pci=nomsi kernel option.
Do we need something like a table of excluded IOVA regions in ACPI or somewhere, similar to the way we have a table of excluded physical RAM areas?
Or is the range of excluded IOVAs constant on any one architecture, so it doesn't normally need to be worried about?
Yes, to support this the vfio driver would need to be able to exclude ranges of guest GPA space. Recent kernels already expose an IOVA list via the vfio API. Clearly, excluding GPA ranges has implications for hotplug. On x86 the ranges are pretty consistent, but IIRC differ between Intel and AMD. The vfio IOVA list was originally developed for an ARM use case though, and I imagine there's little or no consistency there.
Re: pci=nomsi, the reserved range exists regardless of whether MSI is actually used.
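A quick way to see how much consistency there is on a given host is to dump every group's reserved regions (a small helper of my own, using the same sysfs layout as above):

import glob

# Print the reserved regions of every IOMMU group on the host.
paths = glob.glob("/sys/kernel/iommu_groups/*/reserved_regions")
for path in sorted(paths, key=lambda p: int(p.split("/")[4])):
    group = path.split("/")[4]
    with open(path) as f:
        for line in f:
            print(f"group {group}: {line.strip()}")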
I am experiencing the same behaviour with an x86_64 guest on an x86_64 host, which I'm attempting to EFI boot (not hotplug) with a PCIe GPU passed through.
This discussion (https://www.spinics.net/lists/iommu/msg40613.html) suggests a change in drivers/iommu/intel-iommu.c, but it appears that in the kernel I tried the change is already implemented (linux-image-5.4.0-39-generic).
The hardware is an HP MicroServer Gen8 with the physical slot excluded via conrep in the BIOS (https://www.jimmdenton.com/proliant-intel-dpdk/), and the kernel is rebuilt with the RMRR patch (https://forum.proxmox.com/threads/compile-proxmox-ve-with-patched-intel-iommu-driver-to-remove-rmrr-check.36374/).
Another user reports that on the same hardware it used to work with kernel 5.3 + the RMRR patch (https://forum.level1techs.com/t/looking-for-vfio-wizards-to-troubleshoot-error-vfio-dma-map-22/153539) but stopped working on the 5.4 kernel.
Is this the same issue I'm observing? My QEMU complains with a similar message:
-device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.4,addr=0x0: vfio_dma_map(0x556eb57939f0, 0xc0000, 0x3ff40000, 0x7f6fc7ec0000) = -22 (Invalid argument)
/sys/kernel/iommu_groups/1/reserved_regions shows:
0x00000000000e8000 0x00000000000e8fff direct
0x00000000000f4000 0x00000000000f4fff direct
0x00000000d5f7e000 0x00000000d5f94fff direct
0x00000000fee00000 0x00000000feefffff msi
except that in my case the VM does not boot at all, no matter how little memory I allocate to it.
You'd need to create a 160KB VM in order to stay below the required direct map memory regions imposed by the system firmware (e8000 - c0000). Non-upstream bypasses of those restrictions are clearly not supported. You can find more details regarding the RMRR restriction here:
https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf
QEMU currently has no support for creating a VM memory map based on the restrictions of the host platform IOMMU requirements.
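For what it's worth, the 160 KB figure follows directly from the numbers quoted above (rough arithmetic only; it assumes the guest RAM chunk really starts at GPA 0xc0000, as the second vfio_dma_map argument suggests):

# Only the space between the start of the failing mapping (0xc0000) and the
# first firmware 'direct' region (0xe8000) is usable as guest RAM there.
print((0xe8000 - 0xc0000) // 1024)   # -> 160 (KiB)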
Alex, thanks for the quick answer, but sadly I still do not fully understand the implications, even after reading the PDF paper on the Red Hat website that you mention, as well as the vendor advisory at https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c04781229
When you say "qemu has no support", do you actually mean "qemu people are unable to help you if you break things by bypassing the in-place restrictions", or "qemu is designed to not work when restrictions are bypassed"?
Do I understand correctly that the BIOS can modify portions of the system's usable RAM so that vendor-specific software tools can read those addresses, and if so, is there a risk of data corruption if the RMRR restrictions are bypassed?
I have eventually managed to pass through an NVIDIA card in the MicroServer Gen8 to a Windows VM using a patched kernel 5.3, along with the vendor instructions to exclude the PCIe slot (the conrep solution), but for it to work it still needed the "RMRR patch", i.e. removing the "return -EPERM" line below the "Device is ineligible [...]" message in drivers/iommu/intel-iommu.c.
However, applying the same modification to kernel 5.4 leads to the "VFIO_MAP_DMA: -22" error.
Is there any other place in the kernel 5.4 source that must be modified to bring back the v5.3 kernel behaviour? (i.e. I have a stable home Windows VM with GPU passthrough despite all this.)
> When you say "qemu has no support", do you actually mean "qemu people
> are unable to help you if you break things by bypassing the in-place
> restrictions", or "qemu is designed to not work when restrictions are
> bypassed"?
The former. There are two aspects to this. The first is that the device has address space restrictions which therefore impose address space restrictions on the VM. That makes things like hotplug difficult or impossible to support. That much is something that could be considered a feature which QEMU has not yet implemented.
The more significant aspect when RMRRs are involved in this restriction is that an RMRR is essentially the platform firmware dictating that the host OS must maintain an identity map between the device and a range of physical address space. We don't know the purpose of that mapping, but we can assume that it allows the device to provide ongoing data for platform firmware to consume. This data might include health or sensor information that's vital to the operation of the system. It's therefore not simply a matter of QEMU needing to avoid RMRR ranges; we need to maintain the required identity maps while also manipulating the VM address space, and the former requirement implies that a user owns a device that has DMA access to a range of host memory that's been previously defined as vital to the operation of the platform and is therefore likely exploitable by the user.
The configuration you've achieved appears to have disabled the host kernel restrictions preventing RMRR encumbered devices from participating in the IOMMU API, but left in place the VM address space implications of those RMRRs. This means that once the device is opened by the user, that firmware mandated identity mapping is removed and whatever health or sensor data was being reported by the device to that range is no longer available to the host firmware, which might adversely affect the behavior of the system. Upstream put this restriction in place as the safe thing to do to honor the firmware mapping requirement and you've circumvented it, therefore you are your own support.
> Do I understand correctly that the BIOS can modify portions of the
> system's usable RAM so that vendor-specific software tools can read
> those addresses, and if so, is there a risk of data corruption if
> the RMRR restrictions are bypassed?
RMRRs used for devices other than IGD or USB are often associated with reserved memory regions to prevent the host OS from making use of those ranges. It is possible that privileged utilities might interact with these ranges, but AIUI the main use case is for the device to interact with the range, which firmware then consumes. If you were to ignore the RMRR mapping altogether, there is a risk that the device will continue to write whatever health or sensor data it's programmed to report to that IOVA mapping, which could be a guest mapping and cause data corruption.
> Is there any other place in the kernel 5.4 source that must be modified
> to bring back the v5.3 kernel behaviour? (i.e. I have a stable home
> Windows VM with GPU passthrough despite all this.)
I think the scenario is that previously the RMRR patch worked because the vfio IOMMU backend was not imposing the IOMMU reserved region mapping restrictions, meaning that it was sufficient to simply allow the device to participate in the IOMMU API and the remaining restrictions were ignored. Now the vfio IOMMU backend recognizes the address space mapping restrictions and disallows creating those mappings that I describe above as a potential source of data corruption. Sorry, you are your own support for this. The system is not fit for this use case due to the BIOS-imposed restrictions.
@alex-l-williamson: is there any safe(ish) way to ignore RMRRs coming from the BIOS?
I don't know how the IOMMU actually works in the kernel, but theoretically, what if the kernel had a flag forcing it to ignore certain RMRRs? If I understand this correctly, ignoring an RMRR entry may cause two things:
1) DMA failure if remapping is attempted
2) If something (e.g. KVM) touches that region because we ignored the RMRR, the device memory may get corrupted
Linux already has mechanisms to ignore stubborn BIOSes (e.g. an x2APIC that is disabled with no option to enable it in the BIOS).
The only thing I'm worried about is the thing you said:
> The more significant aspect when RMRRs are involved in this restriction is that an RMRR is
> essentially the platform firmware dictating that the host OS must maintain an identity map
> between the device and a range of physical address space. We don't know the purpose of that
> mapping, but we can assume that it allows the device to provide ongoing data for platform
> firmware to consume.
Does this mean that if the kernel is "blind" to a given RMRR region, something else may break because these regions need to be treated in some special manner beyond simply not mapping them in the IOMMU?
The QEMU project is currently moving its bug tracking to another system. For this we need to know which bugs are still valid and which could be closed already. Thus we are setting the bug state to "Incomplete" now.
If the bug has already been fixed in the latest upstream version of QEMU, then please close this ticket as "Fix released".
If it is not fixed yet and you think that this bug report here is still valid, then you have two options:
1) If you already have an account on gitlab.com, please open a new ticket for this problem in our new tracker here: https://gitlab.com/qemu-project/qemu/-/issues and then close this ticket here on Launchpad (or let it expire automatically after 60 days). Please mention the URL of this bug ticket on Launchpad in the new ticket on GitLab.
2) If you don't have an account on gitlab.com and don't intend to get one, but still would like to keep this ticket opened, then please switch the state back to "New" or "Confirmed" within the next 60 days (otherwise it will get closed as "Expired"). We will then eventually migrate the ticket automatically to the new system (but you won't be the reporter of the bug in the new system and thus you won't get notified on changes anymore).
Thank you and sorry for the inconvenience.
[Expired for QEMU because there has been no activity for 60 days.]