|
xen/pt: Incorrect register mask for PCI passthrough prevents Linux guest from completing boot process
Description of problem:
In brief, PCI/GPU passthrough works normally with Xen/Qemu when the Xen HVM guest is Windows, but a Linux guest does not complete the boot process: it never reaches the systemd targets that start the GUI environment and allow login to the desktop environment. The cause is a bug in the way Qemu initializes the PCI status register of passed through devices, which leaves the PCI capabilities list bit of that register disabled instead of enabled. This in turn disables the MSI-X interrupt handling capability of the passed through PCI devices. I think only Linux guests are affected because Linux and Windows guests use different methods of delivering interrupts from the passed through PCI devices to the guest: the method used by Windows does not require the MSI-X capability of the devices, but the method used by Linux does. I explain this further, with logs and other relevant information, in the additional information section.
Steps to reproduce:
1. It might only be reproducible on specific hardware. It is very reproducible on my system, an ASRock B85M-Pro4 with BIOS version P2.50 and a Haswell core i5-4590S CPU.
2. Configure the system to pass through the Intel integrated graphics device (IGD), the on-board USB 3 controller, and the onboard PCI audio device to a Windows Xen HVM guest with Qemu running as the device model for the Windows guest in Dom0 using the Xen xl toolstack, and verify that the Windows guest boots and functions properly. This is not trivial and can probably only be done by persons familiar with Xen and its PCI and VGA/GPU passthrough feature. Here is the xl domain configuration file that the Xen xl toolstack used to create and boot the working Windows HVM domain with passthrough of three PCI devices on my hardware:
```
builder = 'hvm'
bios = 'seabios'
memory = '3072'
vcpus = '4'
device_model_version = 'qemu-xen'
disk = ['/dev/systems/windows,,xvda,w']
name = 'bullseye'
vif = [ 'model=e1000,script=vif-route,ip=<redacted>' ]
on_poweroff = 'destroy'
on_reboot = 'restart'
on_crash = 'destroy'
boot = 'c'
acpi = '1'
apic = '1'
viridian = '1'
xen_platform_pci = '1'
serial = 'pty'
vga = 'none'
sdl = '0'
vnc = '0'
gfx_passthru = '1'
pci = [ '00:1b.0', '00:14.0,rdm_policy=relaxed', '00:02.0' ]
```
3. Shut down the working Windows Xen HVM and replace it with a Linux Xen HVM disk image and try to boot that in place of Windows, keeping all other configuration options the same as with the working Windows guest. To create and boot the non-working Linux HVM domain, I used the same xl domain configuration as for Windows with the exception that the disk line was replaced with:
```
disk = ['/dev/systems/linux,,xvda,w']
```
which points to a virtual disk that boots Linux instead of Windows. A Linux guest, such as Debian bullseye, Debian buster, or Debian sid, will not boot properly and instead exhibits the problem handling IRQs from the passed through PCI devices, as discussed above.
Additional information:
This problem is known to QubesOS, and they have been using a patch to fix it since 2017, but they give very few details about the problem in their commit messages:
https://github.com/QubesOS/qubes-vmm-xen-stubdom-linux/pull/3/commits/ab2b4c2ad02827a73c52ba561e9a921cc4bb227c
That same patch to hw/xen/xen_pt_config_init.c also fixes the problem on my system.
Some logs:
Without the QubesOS patch, I get error messages indicating problems handling IRQs like this in the Dom0:
May 10 08:50:03 bullseye kernel: [79077.644346] pciback 0000:00:1b.0: xen_pciback: vpci: assign to virtual slot 0
May 10 08:50:03 bullseye kernel: [79077.644478] pciback 0000:00:1b.0: registering for 16
May 10 08:50:03 bullseye kernel: [79077.644732] pciback 0000:00:14.0: xen_pciback: vpci: assign to virtual slot 1
May 10 08:50:03 bullseye kernel: [79077.644874] pciback 0000:00:14.0: registering for 16
May 10 08:50:03 bullseye kernel: [79077.645024] pciback 0000:00:02.0: xen_pciback: vpci: assign to virtual slot 2
May 10 08:50:03 bullseye kernel: [79077.645107] pciback 0000:00:02.0: registering for 16
May 10 08:50:30 bullseye kernel: [79105.273876] vif vif-16-0 vif16.0: Guest Rx ready
May 10 08:50:30 bullseye kernel: [79105.273893] IPv6: ADDRCONF(NETDEV_CHANGE): vif16.0: link becomes ready
May 10 08:50:30 bullseye kernel: [79105.278023] xen-blkback: backend/vbd/16/51712: using 4 queues, protocol 1 (x86_64-abi) persistent grants
May 10 08:50:44 bullseye kernel: [79119.104937] irq 16: nobody cared (try booting with the "irqpoll" option)
May 10 08:50:44 bullseye kernel: [79119.104973] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
May 10 08:50:44 bullseye kernel: [79119.104976] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B85M Pro4, BIOS P2.50 12/11/2015
May 10 08:50:44 bullseye kernel: [79119.104979] Call Trace:
May 10 08:50:44 bullseye kernel: [79119.104984] <IRQ>
May 10 08:50:44 bullseye kernel: [79119.104998] dump_stack+0x6b/0x83
May 10 08:50:44 bullseye kernel: [79119.105008] __report_bad_irq+0x35/0xa7
May 10 08:50:44 bullseye kernel: [79119.105014] note_interrupt.cold+0xb/0x61
May 10 08:50:44 bullseye kernel: [79119.105024] handle_irq_event+0xa8/0xb0
May 10 08:50:44 bullseye kernel: [79119.105030] handle_fasteoi_irq+0x78/0x1c0
May 10 08:50:44 bullseye kernel: [79119.105037] generic_handle_irq+0x47/0x50
May 10 08:50:44 bullseye kernel: [79119.105044] __evtchn_fifo_handle_events+0x175/0x190
May 10 08:50:44 bullseye kernel: [79119.105054] __xen_evtchn_do_upcall+0x66/0xb0
May 10 08:50:44 bullseye kernel: [79119.105063] __xen_pv_evtchn_do_upcall+0x11/0x20
May 10 08:50:44 bullseye kernel: [79119.105069] asm_call_irq_on_stack+0x12/0x20
May 10 08:50:44 bullseye kernel: [79119.105072] </IRQ>
May 10 08:50:44 bullseye kernel: [79119.105079] xen_pv_evtchn_do_upcall+0xa2/0xc0
May 10 08:50:44 bullseye kernel: [79119.105084] exc_xen_hypervisor_callback+0x8/0x10
May 10 08:50:44 bullseye kernel: [79119.105091] RIP: e030:xen_hypercall_sched_op+0xa/0x20
May 10 08:50:44 bullseye kernel: [79119.105097] Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
May 10 08:50:44 bullseye kernel: [79119.105100] RSP: e02b:ffffffff82603de8 EFLAGS: 00000246
May 10 08:50:44 bullseye kernel: [79119.105106] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810023aa
May 10 08:50:44 bullseye kernel: [79119.105108] RDX: 0000000009d62df2 RSI: 0000000000000000 RDI: 0000000000000001
May 10 08:50:44 bullseye kernel: [79119.105111] RBP: ffffffff82613940 R08: 00000066a1715350 R09: 000047f57b235dc9
May 10 08:50:44 bullseye kernel: [79119.105114] R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
May 10 08:50:44 bullseye kernel: [79119.105117] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 10 08:50:44 bullseye kernel: [79119.105124] ? xen_hypercall_sched_op+0xa/0x20
May 10 08:50:44 bullseye kernel: [79119.105133] ? xen_safe_halt+0xc/0x20
May 10 08:50:44 bullseye kernel: [79119.105140] ? default_idle+0xa/0x10
May 10 08:50:44 bullseye kernel: [79119.105145] ? default_idle_call+0x38/0xc0
May 10 08:50:44 bullseye kernel: [79119.105152] ? do_idle+0x208/0x2b0
May 10 08:50:44 bullseye kernel: [79119.105158] ? cpu_startup_entry+0x19/0x20
May 10 08:50:44 bullseye kernel: [79119.105164] ? start_kernel+0x587/0x5a8
May 10 08:50:44 bullseye kernel: [79119.105170] ? xen_start_kernel+0x625/0x631
May 10 08:50:44 bullseye kernel: [79119.105180] ? startup_xen+0x3e/0x3e
May 10 08:50:44 bullseye kernel: [79119.105184] handlers:
May 10 08:50:44 bullseye kernel: [79119.105222] [<000000005d228d5f>] usb_hcd_irq [usbcore]
May 10 08:50:44 bullseye kernel: [79119.105245] [<00000000e534b010>] ath_isr [ath9k]
May 10 08:50:44 bullseye kernel: [79119.105257] Disabling IRQ #16
Also, without the patch, I get error messages about failure to handle IRQs in the Linux Xen HVM guest:
Oct 23 18:50:32 domU kernel: irq 36: nobody cared (try booting with the "irqpoll" option)
Oct 23 18:50:32 domU kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1
Oct 23 18:50:32 domU kernel: Hardware name: Xen HVM domU, BIOS 4.14.3 10/22/2021
Oct 23 18:50:32 domU kernel: Call Trace:
Oct 23 18:50:32 domU kernel: <IRQ>
Oct 23 18:50:32 domU kernel: dump_stack+0x6b/0x83
Oct 23 18:50:32 domU kernel: __report_bad_irq+0x35/0xa7
Oct 23 18:50:32 domU kernel: note_interrupt.cold+0xb/0x61
Oct 23 18:50:32 domU kernel: handle_irq_event+0xa8/0xb0
Oct 23 18:50:32 domU kernel: handle_fasteoi_irq+0x78/0x1c0
Oct 23 18:50:32 domU kernel: generic_handle_irq+0x47/0x50
Oct 23 18:50:32 domU kernel: __evtchn_fifo_handle_events+0x175/0x190
Oct 23 18:50:32 domU kernel: __xen_evtchn_do_upcall+0x66/0xb0
Oct 23 18:50:32 domU kernel: __sysvec_xen_hvm_callback+0x22/0x30
Oct 23 18:50:32 domU kernel: asm_call_irq_on_stack+0x12/0x20
Oct 23 18:50:32 domU kernel: </IRQ>
Oct 23 18:50:32 domU kernel: sysvec_xen_hvm_callback+0x72/0x80
Oct 23 18:50:32 domU kernel: asm_sysvec_xen_hvm_callback+0x12/0x20
Oct 23 18:50:32 domU kernel: RIP: 0010:native_safe_halt+0xe/0x10
Oct 23 18:50:32 domU kernel: Code: 02 20 48 8b 00 a8 08 75 c4 e9 7b ff ff ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d a6 6f 54 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 96 6f 54 00 f4 c3 cc cc 0f 1f 44 00
Oct 23 18:50:32 domU kernel: RSP: 0018:ffffffff89003e48 EFLAGS: 00000246
Oct 23 18:50:32 domU kernel: RAX: 0000000000004000 RBX: 0000000000000001 RCX: ffff8dbb7cc2c9c0
Oct 23 18:50:32 domU kernel: RDX: ffff8dbb7cc00000 RSI: ffff8dbaf55b1400 RDI: ffff8dbaf55b1464
Oct 23 18:50:32 domU kernel: RBP: ffff8dbaf55b1464 R08: ffffffff891b9120 R09: 0000000000000008
Oct 23 18:50:32 domU kernel: R10: 000000000000000e R11: 000000000000000d R12: 0000000000000001
Oct 23 18:50:32 domU kernel: R13: ffffffff891b91a0 R14: 0000000000000001 R15: 0000000000000000
Oct 23 18:50:32 domU kernel: ? xen_sched_clock+0x11/0x20
Oct 23 18:50:32 domU kernel: acpi_idle_do_entry+0x46/0x50
Oct 23 18:50:32 domU kernel: acpi_idle_enter+0x86/0xc0
Oct 23 18:50:32 domU kernel: cpuidle_enter_state+0x89/0x350
Oct 23 18:50:32 domU kernel: cpuidle_enter+0x29/0x40
Oct 23 18:50:32 domU kernel: do_idle+0x1ef/0x2b0
Oct 23 18:50:32 domU kernel: cpu_startup_entry+0x19/0x20
Oct 23 18:50:32 domU kernel: start_kernel+0x587/0x5a8
Oct 23 18:50:32 domU kernel: secondary_startup_64_no_verify+0xb0/0xbb
Oct 23 18:50:32 domU kernel: handlers:
Oct 23 18:50:32 domU kernel: [<000000007d3a0964>] usb_hcd_irq [usbcore]
Oct 23 18:50:32 domU kernel: Disabling IRQ #36
Oct 23 18:50:32 domU kernel: PM: Image not found (code -22)
Oct 23 18:50:32 domU kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:45:pipe A] flip_done timed out
To prove the cause of the bug, I compare some logs without the patch and with the patch that fixes it.
First, relevant logs generated by Qemu in Dom0, for existing Qemu without the patch. On Debian these logs are located in /var/log/xen in the Dom0:
[00:06.0] xen_pt_realize: Assigning real physical device 00:14.0 to devfn 0x30
[...]
[00:06.0] xen_pt_config_reg_init: Offset 0x0006 mismatch! Emulated=0x0010, host=0x0290, syncing to 0x0280.
[...]
[00:06.0] xen_pt_realize: Real physical device 00:14.0 registered successfully
[00:02.0] xen_pt_realize: Assigning real physical device 00:02.0 to devfn 0x10
[...]
[00:02.0] xen_pt_config_reg_init: Offset 0x0006 mismatch! Emulated=0x0010, host=0x0090, syncing to 0x0080.
[...]
[00:02.0] xen_pt_realize: Real physical device 00:02.0 registered successfully
Next, the same logs, but now using a version of Qemu with the patch that fixes the bug:
[00:06.0] xen_pt_realize: Assigning real physical device 00:14.0 to devfn 0x30
[...]
[00:06.0] xen_pt_config_reg_init: Offset 0x0006 mismatch! Emulated=0x0010, host=0x0290, syncing to 0x0290.
[...]
[00:06.0] xen_pt_realize: Real physical device 00:14.0 registered successfully
[00:02.0] xen_pt_realize: Assigning real physical device 00:02.0 to devfn 0x10
[...]
[00:02.0] xen_pt_config_reg_init: Offset 0x0006 mismatch! Emulated=0x0010, host=0x0090, syncing to 0x0090.
[...]
[00:02.0] xen_pt_realize: Real physical device 00:02.0 registered successfully
To decipher what is happening here, one must refer to the definitions in the pci/header.h file from PCI Utilities, which in Debian is in the libpci-dev package and is probably in similarly named packages on other distros.
The offset of 0x0006 corresponds to the 16-bit PCI_STATUS register of the passed through device, and the Emulated value of 0x0010 sets the desired emulated value of the PCI_STATUS_CAP_LIST bit to 1 in the PCI_STATUS register. The host values of 0x0290 and 0x0090 are the settings of the register in the physical device for real devices 00:14.0 and 00:02.0, respectively. The "syncing to" value is the value of the register that Qemu exposes to the guest. Notice that without the patch, the PCI_STATUS_CAP_LIST bit is turned off for the two PCI devices (register value = 0x0280 and 0x0080 for real devices 00:14.0 and 00:02.0, respectively), but with the patch the bit is turned on (0x0290 and 0x0090). With the capabilities list enabled, the guest can use the MSI-X capability of the devices, but with the capabilities list disabled, it cannot.
That explains why this patch is needed in Qemu to fix this problem and enable the Linux guest to use the MSI-X capability of the passed through PCI devices.
This is the QubesOS patch that fixes it:
```
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -1969,7 +1969,7 @@
/* Mask out host (including past size). */
new_val = val & host_mask;
/* Merge emulated ones (excluding the non-emulated ones). */
- new_val |= data & host_mask;
+ new_val |= data & reg->emu_mask;
/* Leave intact host and emulated values past the size - even though
* we do not care as we write per reg->size granularity, but for the
* logging below lets have the proper value. */
```
The QubesOS patch that fixes it in Debian's Qemu 7.0.0 build is also attached as a file. [xen-fix-emu-mask.patch](/uploads/3bef189175549cd9854f8dc3d1affc88/xen-fix-emu-mask.patch)
~~I will not officially submit it as a patch because I am not its author.~~
~~I do not know why QubesOS never officially requested that this fix
be committed to Qemu upstream, but I hope after review by the
maintainers of the code touched by this patch it will be recognized
as a necessary fix to a mistake that causes the desired merge of
the host and emulated values to be incorrect.~~
For reference, the commit that is fixed by the QubesOS patch is:
Fixes: 2e87512eccf3c5e40f3142ff5a763f4f850839f4 (xen/pt: Sync up the dev.config and data values.)
I think that commit and the patched file might need some other cleanup, because it is possible that the simple QubesOS patch is not suitable as the correct fix. So I might try my hand at officially submitting a patch to Qemu that fixes this issue on my hardware without breaking something else.
But before I do that, I wish to make one more comment. In my logs, the only register other than PCI_STATUS that is affected by the QubesOS patch is the PCI_HEADER_TYPE register. Without the patch, the register's value is always exposed to the guest as 0x80, and with the patch, it is always exposed as 0x00 (PCI_HEADER_TYPE_NORMAL as defined in pci/header.h). That is because Qemu sets the initial emulated value of the PCI_HEADER_TYPE register to 0x80 but sets its emu_mask to 0x00, so once the QubesOS patch corrects the merging of the host and emulated values, the value is synced to 0x00 instead of 0x80. What I don't understand is why the register is initialized with 0x80 when the emu_mask is 0x00. Shouldn't the emu_mask be 0x80, to pass through the initial emulated value of 0x80? ~~Also, I don't know why the initial emulated value of PCI_HEADER_TYPE is set to 0x80 but I will assume that is the correct emulated value that should be exposed to the guest.~~ Update: After doing some research, I discovered that the bit set in the PCI_HEADER_TYPE register (0x80) because of this issue is the bit that defines the device as a multifunction device. None of my devices are multifunction, and the fact that the multifunction bit is incorrectly set on my passed through devices because of this issue seems to have no effect on the operation of the devices or the guest. Apparently the author of the code that initializes the PCI_HEADER_TYPE register intended to initialize every passed through device as a multifunction device, but is this needed? My testing indicates it is not needed on my system.
I am referring to this code in hw/xen/xen_pt_config_init.c:
```
static int xen_pt_header_type_reg_init(XenPCIPassthroughState *s,
                                       XenPTRegInfo *reg, uint32_t real_offset,
                                       uint32_t *data)
{
    /* read PCI_HEADER_TYPE */
    *data = reg->init_val | 0x80;
    return 0;
}
[...]
    /* Header Type reg */
    {
        .offset     = PCI_HEADER_TYPE,
        .size       = 1,
        .init_val   = 0x00,
        .ro_mask    = 0xFF,
        .emu_mask   = 0x00,
        .init       = xen_pt_header_type_reg_init,
        .u.b.read   = xen_pt_byte_reg_read,
        .u.b.write  = xen_pt_byte_reg_write,
    },
```
I would appreciate any guidance that experienced Qemu or Xen contributors can give me about this question. ~~If no one gives me any guidance here in a timely manner, I plan to propose my own fix officially to Qemu as the QubesOS patch plus changing the emu_mask value of the PCI_HEADER_TYPE register from 0x00 to 0x80. I verified that fixes the problem I am seeing in the PCI_STATUS register without also causing the change that the QubesOS patch makes to the PCI_HEADER_TYPE register.~~
I plan to submit a patch to fix this issue, noting the effect the patch has on the PCI_HEADER_TYPE register in the commit message.
|