1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
|
device: 0.951
register: 0.951
boot: 0.943
debug: 0.942
graphic: 0.942
architecture: 0.940
permissions: 0.936
performance: 0.927
semantic: 0.924
assembly: 0.919
PID: 0.914
kernel virtual machine: 0.907
arm: 0.906
risc-v: 0.901
network: 0.894
other: 0.894
socket: 0.882
files: 0.878
TCG: 0.860
vnc: 0.853
mistranslation: 0.826
x86: 0.783
[BUG] AArch64 boot hang with -icount and -smp >1 (iothread locking issue?)
Hello,
I am encountering one or more bugs when using -icount and -smp >1 that I am
attempting to sort out. My current theory is that it is an iothread locking
issue.
I am using a command-line like the following where $kernel is a recent upstream
AArch64 Linux kernel Image (I can provide a binary if that would be helpful -
let me know how is best to post):
qemu-system-aarch64 \
-M virt -cpu cortex-a57 -m 1G \
-nographic \
-smp 2 \
-icount 0 \
-kernel $kernel
For any/all of the symptoms described below, they seem to disappear when I
either remove `-icount 0` or change smp to `-smp 1`. In other words, it is the
combination of `-smp >1` and `-icount` which triggers what I'm seeing.
I am seeing two different (but seemingly related) behaviors. The first (and
what I originally started debugging) shows up as a boot hang. When booting
using the above command after Peter's "icount: Take iothread lock when running
QEMU timers" patch [1], The kernel boots for a while and then hangs after:
>
...snip...
>
[ 0.010764] Serial: AMBA PL011 UART driver
>
[ 0.016334] 9000000.pl011: ttyAMA0 at MMIO 0x9000000 (irq = 13, base_baud
>
= 0) is a PL011 rev1
>
[ 0.016907] printk: console [ttyAMA0] enabled
>
[ 0.017624] KASLR enabled
>
[ 0.031986] HugeTLB: registered 16.0 GiB page size, pre-allocated 0 pages
>
[ 0.031986] HugeTLB: 16320 KiB vmemmap can be freed for a 16.0 GiB page
>
[ 0.031986] HugeTLB: registered 512 MiB page size, pre-allocated 0 pages
>
[ 0.031986] HugeTLB: 448 KiB vmemmap can be freed for a 512 MiB page
>
[ 0.031986] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
>
[ 0.031986] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
When it hangs here, I drop into QEMU's console, attach to the gdbserver, and it
always reports that it is at address 0xffff800008dc42e8 (as shown below from an
objdump of the vmlinux). I note this is in the middle of messing with timer
system registers - which makes me suspect we're attempting to take the iothread
lock when its already held:
>
ffff800008dc42b8 <arch_timer_set_next_event_virt>:
>
ffff800008dc42b8: d503201f nop
>
ffff800008dc42bc: d503201f nop
>
ffff800008dc42c0: d503233f paciasp
>
ffff800008dc42c4: d53be321 mrs x1, cntv_ctl_el0
>
ffff800008dc42c8: 32000021 orr w1, w1, #0x1
>
ffff800008dc42cc: d5033fdf isb
>
ffff800008dc42d0: d53be042 mrs x2, cntvct_el0
>
ffff800008dc42d4: ca020043 eor x3, x2, x2
>
ffff800008dc42d8: 8b2363e3 add x3, sp, x3
>
ffff800008dc42dc: f940007f ldr xzr, [x3]
>
ffff800008dc42e0: 8b020000 add x0, x0, x2
>
ffff800008dc42e4: d51be340 msr cntv_cval_el0, x0
>
* ffff800008dc42e8: 927ef820 and x0, x1, #0xfffffffffffffffd
>
ffff800008dc42ec: d51be320 msr cntv_ctl_el0, x0
>
ffff800008dc42f0: d5033fdf isb
>
ffff800008dc42f4: 52800000 mov w0, #0x0
>
// #0
>
ffff800008dc42f8: d50323bf autiasp
>
ffff800008dc42fc: d65f03c0 ret
The second behavior is that prior to Peter's "icount: Take iothread lock when
running QEMU timers" patch [1], I observe the following message (same command
as above):
>
ERROR:../accel/tcg/tcg-accel-ops.c:79:tcg_handle_interrupt: assertion failed:
>
(qemu_mutex_iothread_locked())
>
Aborted (core dumped)
This is the same behavior described in Gitlab issue 1130 [0] and addressed by
[1]. I bisected the appearance of this assertion, and found it was introduced
by Pavel's "replay: rewrite async event handling" commit [2]. Commits prior to
that one boot successfully (neither assertions nor hangs) with `-icount 0 -smp
2`.
I've looked over these two commits ([1], [2]), but it is not obvious to me
how/why they might be interacting to produce the boot hangs I'm seeing and
I welcome any help investigating further.
Thanks!
-Aaron Lindsay
[0] -
https://gitlab.com/qemu-project/qemu/-/issues/1130
[1] -
https://gitlab.com/qemu-project/qemu/-/commit/c7f26ded6d5065e4116f630f6a490b55f6c5f58e
[2] -
https://gitlab.com/qemu-project/qemu/-/commit/60618e2d77691e44bb78e23b2b0cf07b5c405e56
On Fri, 21 Oct 2022 at 16:48, Aaron Lindsay
<aaron@os.amperecomputing.com> wrote:
>
>
Hello,
>
>
I am encountering one or more bugs when using -icount and -smp >1 that I am
>
attempting to sort out. My current theory is that it is an iothread locking
>
issue.
Weird coincidence, that is a bug that's been in the tree for months
but was only reported to me earlier this week. Try reverting
commit a82fd5a4ec24d923ff1e -- that should fix it.
CAFEAcA_i8x00hD-4XX18ySLNbCB6ds1-DSazVb4yDnF8skjd9A@mail.gmail.com
/">https://lore.kernel.org/qemu-devel/
CAFEAcA_i8x00hD-4XX18ySLNbCB6ds1-DSazVb4yDnF8skjd9A@mail.gmail.com
/
has the explanation.
thanks
-- PMM
On Oct 21 17:00, Peter Maydell wrote:
>
On Fri, 21 Oct 2022 at 16:48, Aaron Lindsay
>
<aaron@os.amperecomputing.com> wrote:
>
>
>
> Hello,
>
>
>
> I am encountering one or more bugs when using -icount and -smp >1 that I am
>
> attempting to sort out. My current theory is that it is an iothread locking
>
> issue.
>
>
Weird coincidence, that is a bug that's been in the tree for months
>
but was only reported to me earlier this week. Try reverting
>
commit a82fd5a4ec24d923ff1e -- that should fix it.
I can confirm that reverting a82fd5a4ec24d923ff1e fixes it for me.
Thanks for the help and fast response!
-Aaron
|