[RESEND][BUG FIX HELP] QEMU main thread endlessly hangs in __ppoll()
Hi Genius,
I am a user of QEMU v4.2.0 and am stuck on an interesting bug, which may
still exist in the mainline.
Thanks in advance to any heroes who can take a look and share their
understanding.
The QEMU main thread hangs endlessly while handling the QMP statement:
{'execute': 'human-monitor-command', 'arguments':{ 'command-line':
'drive_del replication0' } }
and the call trace looks like this:
#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1,
timeout=<optimized out>, timeout@entry=0x7ffc56c66db0,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0,
__nfds=<optimized out>, __fds=<optimized out>)
at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=<optimized out>) at util/qemu-timer.c:348
#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0,
blocking=blocking@entry=true) at util/aio-posix.c:669
#4 0x000055561019268d in bdrv_do_drained_begin (poll=true,
ignore_bds_parents=false, parent=0x0, recursive=false,
bs=0x55561138b0a0) at block/io.c:430
#5 bdrv_do_drained_begin (bs=0x55561138b0a0, recursive=<optimized out>,
parent=0x0, ignore_bds_parents=<optimized out>,
poll=<optimized out>) at block/io.c:396
#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0,
child=0x7f36dc0ce380, errp=<optimized out>)
at block/quorum.c:1063
#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120
"colo-disk0", has_child=<optimized out>,
child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0,
errp=0x7ffc56c66f98) at blockdev.c:4494
#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized
out>, ret=<optimized out>, errp=0x7ffc56c67018)
at qapi/qapi-commands-block-core.c:1538
#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010,
allow_oob=<optimized out>, request=<optimized out>,
cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132
#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized
out>, allow_oob=<optimized out>)
at qapi/qmp-dispatch.c:175
#11 0x00005556100d4c4d in monitor_qmp_dispatch (mon=0x5556113a6f40,
req=<optimized out>) at monitor/qmp.c:145
#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized out>)
at monitor/qmp.c:234
#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164b0) at
util/async.c:117
#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117
#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at
util/aio-posix.c:459
#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>,
callback=<optimized out>, user_data=<optimized out>)
at util/async.c:260
#17 0x00007f3c22302fbd in g_main_context_dispatch () from
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219
#19 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:242
#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518
#21 0x000055560ff600fe in main_loop () at vl.c:1814
#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized out>,
envp=<optimized out>) at vl.c:4503
We found that we're doing an endless check at this line in
block/io.c:bdrv_do_drained_begin():
BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent));
and it turns out that bdrv_drain_poll() always returns true from:
- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents)
- AND atomic_read(&bs->in_flight)
I personally think this is a deadlock issue in the QEMU block layer
(as we know, there are some FIXME comments in the related code, such as
the block permission update).
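To illustrate the hang, here is a simplified sketch of what the drain
loop is waiting for (paraphrased from block/io.c of v4.2.0, not the
verbatim source; the locking and cross-thread notification inside
BDRV_POLL_WHILE(), as well as the recursive case, are elided):

    /* Simplified sketch: the condition the drain keeps polling.
     * Names follow block/io.c, but this is a paraphrase. */
    static bool drain_poll_sketch(BlockDriverState *bs, BdrvChild *ignore_parent)
    {
        /* The drain can only finish once no parent reports pending
         * activity and no request on this node is still in flight. */
        return bdrv_parent_drained_poll(bs, ignore_parent, false) ||
               atomic_read(&bs->in_flight) > 0;
    }

    static void drained_begin_sketch(BlockDriverState *bs, BdrvChild *parent)
    {
        /* bdrv_do_drained_begin() effectively spins here: aio_poll()
         * is called over and over until the condition turns false.
         * In our hang, bs->in_flight never drops to zero, so the main
         * thread sits in ppoll() forever. */
        while (drain_poll_sketch(bs, parent)) {
            aio_poll(bdrv_get_aio_context(bs), true /* blocking */);
        }
    }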
Any comments are welcome and appreciated.
---
thx,likexu
On 2/28/21 9:39 PM, Like Xu wrote:
Hi Genius,
I am a user of QEMU v4.2.0 and am stuck on an interesting bug, which may
still exist in the mainline.
Thanks in advance to any heroes who can take a look and share their
understanding.
Do you have a test case that reproduces on 5.2? It'd be nice to know if
it was still a problem in the latest source tree or not.
--js
The QEMU main thread hangs endlessly while handling the QMP statement:
{'execute': 'human-monitor-command', 'arguments':{ 'command-line':
'drive_del replication0' } }
and the call trace looks like this:
#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1,
timeout=<optimized out>, timeout@entry=0x7ffc56c66db0,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0,
__nfds=<optimized out>, __fds=<optimized out>)
at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=<optimized out>) at util/qemu-timer.c:348
#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0,
blocking=blocking@entry=true) at util/aio-posix.c:669
#4 0x000055561019268d in bdrv_do_drained_begin (poll=true,
ignore_bds_parents=false, parent=0x0, recursive=false,
bs=0x55561138b0a0) at block/io.c:430
#5 bdrv_do_drained_begin (bs=0x55561138b0a0, recursive=<optimized out>,
parent=0x0, ignore_bds_parents=<optimized out>,
poll=<optimized out>) at block/io.c:396
#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0,
child=0x7f36dc0ce380, errp=<optimized out>)
at block/quorum.c:1063
#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120
"colo-disk0", has_child=<optimized out>,
child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0,
errp=0x7ffc56c66f98) at blockdev.c:4494
#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized
out>, ret=<optimized out>, errp=0x7ffc56c67018)
at qapi/qapi-commands-block-core.c:1538
#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010,
allow_oob=<optimized out>, request=<optimized out>,
cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132
#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized
out>, allow_oob=<optimized out>)
at qapi/qmp-dispatch.c:175
#11 0x00005556100d4c4d in monitor_qmp_dispatch (mon=0x5556113a6f40,
req=<optimized out>) at monitor/qmp.c:145
#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized
out>) at monitor/qmp.c:234
#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164b0) at
util/async.c:117
#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117
#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at
util/aio-posix.c:459
#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>,
callback=<optimized out>, user_data=<optimized out>)
at util/async.c:260
#17 0x00007f3c22302fbd in g_main_context_dispatch () from
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219
#19 os_host_main_loop_wait (timeout=<optimized out>) at
util/main-loop.c:242
#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518
#21 0x000055560ff600fe in main_loop () at vl.c:1814
#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at vl.c:4503
We found that we're doing an endless check at this line in
block/io.c:bdrv_do_drained_begin():
    BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent));
and it turns out that bdrv_drain_poll() always returns true from:
- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents)
- AND atomic_read(&bs->in_flight)
I personally think this is a deadlock issue in the QEMU block layer
(as we know, there are some FIXME comments in the related code, such as
the block permission update).
Any comments are welcome and appreciated.
---
thx,likexu
Hi John,
Thanks for your comment.
On 2021/3/5 7:53, John Snow wrote:
On 2/28/21 9:39 PM, Like Xu wrote:
Hi Genius,
I am a user of QEMU v4.2.0 and am stuck on an interesting bug, which may
still exist in the mainline.
Thanks in advance to any heroes who can take a look and share their
understanding.
Do you have a test case that reproduces on 5.2? It'd be nice to know if it
was still a problem in the latest source tree or not.
We narrowed down the source of the bug, which basically comes from
the following QMP usage:
{'execute': 'human-monitor-command', 'arguments':{ 'command-line':
'drive_del replication0' } }
One of the test cases is the COLO usage (docs/colo-proxy.txt).
This issue is sporadic; the probability may be about 1/15 for an
I/O-heavy guest.
I believe it's reproducible on 5.2 and the latest tree.
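For anyone who wants to poke at this by hand, the hanging command can
be sent directly over a QMP socket. Below is a minimal sketch; the
socket path /tmp/qmp.sock and the drive id replication0 are assumptions
from our setup, so adjust them to the actual -qmp option and drive name:

    /* qmp-drive-del.c: connect to a QMP UNIX socket and issue the
     * human-monitor-command that triggers the hang.
     * Build: gcc -o qmp-drive-del qmp-drive-del.c */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    static void send_line(int fd, const char *s)
    {
        if (write(fd, s, strlen(s)) < 0) {
            perror("write");
        }
    }

    static void dump_reply(int fd)
    {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("%s", buf);
        }
    }

    int main(void)
    {
        /* Assumed socket path; start QEMU with e.g.
         * -qmp unix:/tmp/qmp.sock,server,nowait */
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        strncpy(addr.sun_path, "/tmp/qmp.sock", sizeof(addr.sun_path) - 1);
        if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }

        dump_reply(fd);                       /* QMP greeting banner */
        send_line(fd, "{\"execute\": \"qmp_capabilities\"}\n");
        dump_reply(fd);
        /* The command from the report; if the main thread deadlocks in
         * bdrv_do_drained_begin(), this reply never arrives. */
        send_line(fd, "{\"execute\": \"human-monitor-command\", "
                      "\"arguments\": {\"command-line\": "
                      "\"drive_del replication0\"}}\n");
        dump_reply(fd);

        close(fd);
        return 0;
    }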
--js
The QEMU main thread hangs endlessly while handling the QMP statement:
{'execute': 'human-monitor-command', 'arguments':{ 'command-line':
'drive_del replication0' } }
and the call trace looks like this:
#0 0x00007f3c22045bf6 in __ppoll (fds=0x555611328410, nfds=1,
timeout=<optimized out>, timeout@entry=0x7ffc56c66db0,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1 0x000055561021f415 in ppoll (__ss=0x0, __timeout=0x7ffc56c66db0,
__nfds=<optimized out>, __fds=<optimized out>)
at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=<optimized out>) at util/qemu-timer.c:348
#3 0x0000555610221430 in aio_poll (ctx=ctx@entry=0x5556113010f0,
blocking=blocking@entry=true) at util/aio-posix.c:669
#4 0x000055561019268d in bdrv_do_drained_begin (poll=true,
ignore_bds_parents=false, parent=0x0, recursive=false,
bs=0x55561138b0a0) at block/io.c:430
#5 bdrv_do_drained_begin (bs=0x55561138b0a0, recursive=<optimized out>,
parent=0x0, ignore_bds_parents=<optimized out>,
poll=<optimized out>) at block/io.c:396
#6 0x000055561017b60b in quorum_del_child (bs=0x55561138b0a0,
child=0x7f36dc0ce380, errp=<optimized out>)
at block/quorum.c:1063
#7 0x000055560ff5836b in qmp_x_blockdev_change (parent=0x555612373120
"colo-disk0", has_child=<optimized out>,
child=0x5556112df3e0 "children.1", has_node=<optimized out>, node=0x0,
errp=0x7ffc56c66f98) at blockdev.c:4494
#8 0x00005556100f8f57 in qmp_marshal_x_blockdev_change (args=<optimized
out>, ret=<optimized out>, errp=0x7ffc56c67018)
at qapi/qapi-commands-block-core.c:1538
#9 0x00005556101d8290 in do_qmp_dispatch (errp=0x7ffc56c67010,
allow_oob=<optimized out>, request=<optimized out>,
cmds=0x5556109c69a0 <qmp_commands>) at qapi/qmp-dispatch.c:132
#10 qmp_dispatch (cmds=0x5556109c69a0 <qmp_commands>, request=<optimized
out>, allow_oob=<optimized out>)
at qapi/qmp-dispatch.c:175
#11 0x00005556100d4c4d in monitor_qmp_dispatch (mon=0x5556113a6f40,
req=<optimized out>) at monitor/qmp.c:145
#12 0x00005556100d5437 in monitor_qmp_bh_dispatcher (data=<optimized
out>) at monitor/qmp.c:234
#13 0x000055561021dbec in aio_bh_call (bh=0x5556112164b0) at
util/async.c:117
#14 aio_bh_poll (ctx=ctx@entry=0x5556112151b0) at util/async.c:117
#15 0x00005556102212c4 in aio_dispatch (ctx=0x5556112151b0) at
util/aio-posix.c:459
#16 0x000055561021dab2 in aio_ctx_dispatch (source=<optimized out>,
callback=<optimized out>, user_data=<optimized out>)
at util/async.c:260
#17 0x00007f3c22302fbd in g_main_context_dispatch () from
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x0000555610220358 in glib_pollfds_poll () at util/main-loop.c:219
#19 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:242
#20 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518
#21 0x000055560ff600fe in main_loop () at vl.c:1814
#22 0x000055560fddbce9 in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at vl.c:4503
We found that we're doing an endless check at this line in
block/io.c:bdrv_do_drained_begin():
     BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent));
and it turns out that bdrv_drain_poll() always returns true from:
- bdrv_parent_drained_poll(bs, ignore_parent, ignore_bds_parents)
- AND atomic_read(&bs->in_flight)
I personally think this is a deadlock issue in the QEMU block layer
(as we know, there are some FIXME comments in the related code, such as
the block permission update).
Any comments are welcome and appreciated.
---
thx,likexu
On 3/4/21 10:08 PM, Like Xu wrote:
Hi John,
Thanks for your comment.
On 2021/3/5 7:53, John Snow wrote:
On 2/28/21 9:39 PM, Like Xu wrote:
Hi Genius,
I am a user of QEMU v4.2.0 and am stuck on an interesting bug, which may
still exist in the mainline.
Thanks in advance to any heroes who can take a look and share their
understanding.
Do you have a test case that reproduces on 5.2? It'd be nice to know
if it was still a problem in the latest source tree or not.
We narrowed down the source of the bug, which basically comes from
the following QMP usage:
{'execute': 'human-monitor-command', 'arguments':{ 'command-line':
'drive_del replication0' } }
One of the test cases is the COLO usage (docs/colo-proxy.txt).
This issue is sporadic; the probability may be about 1/15 for an
I/O-heavy guest.
I believe it's reproducible on 5.2 and the latest tree.
Can you please test and confirm that this is the case, and then file a
bug report on the LP:
https://launchpad.net/qemu
and include:
- The exact commit you used (current origin/master debug build would be
the most ideal.)
- Which QEMU binary you are using (qemu-system-x86_64?)
- The shortest command line you are aware of that reproduces the problem
- The host OS and kernel version
- An updated call trace
- Any relevant commands issued prior to the one that caused the hang; or
detailed reproduction steps if possible.
Thanks,
--js