qemu-1.4.0 and onwards, linux kernel 3.2.x, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process
Hi,
after some testing I tried to narrow down a problem which was initially reported by some users.
Seen on different distros - Debian 7.1, Ubuntu 12.04 LTS and IPFire 2.3 reported so far.
All are running some flavour of a linux-3.2.x kernel.
Under Ubuntu, for example, upgrading the guest to "Linux 3.8.0-27-generic x86_64" solves the problem.
The problem can be triggered with a workload like:
spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
and, in parallel, some apt-get install/remove/whatever.
That results in a more or less stuck qemu session with the dreaded "kernel_hung_task..." messages.
A typical command-line is as follows:
/usr/local/qemu-1.6.0/bin/qemu-system-x86_64 -usbdevice tablet -enable-kvm -daemonize -pidfile /var/run/qemu-server/760.pid -monitor unix:/var/run/qemu-server/760.mon,server,nowait -vnc unix:/var/run/qemu-server/760.vnc,password -qmp unix:/var/run/qemu-server/760.qmp,server,nowait -nodefaults -serial none -parallel none -device virtio-net-pci,mac=00:F1:70:00:2F:80,netdev=vlan0d0 -netdev type=tap,id=vlan0d0,ifname=tap760i0d0,script=/etc/fcms/add_if.sh,downscript=/etc/fcms/downscript.sh -name 1155823384-4 -m 512 -vga cirrus -k de -smp sockets=1,cores=1 -device virtio-blk-pci,drive=virtio0 -drive format=raw,file=rbd:1155823384/vm-760-disk-1.rbd:rbd_cache=false,cache=writeback,if=none,id=virtio0,media=disk,index=0,aio=native -drive format=raw,file=rbd:1155823384/vm-760-swap-1.rbd:rbd_cache=false,cache=writeback,if=virtio,media=disk,index=1,aio=native -drive if=ide,media=cdrom,id=ide1-cd0,readonly=on -drive if=ide,media=cdrom,id=ide1-cd1,readonly=on -boot order=dc
No "system_reset", "sendkey ctrl-alt-delete" or "q" in the monitor session is accepted; the process has to be hard-killed.
Please advise on what to do for tracing/debugging, because the number of tickets here is rising, and no one knows what users are doing inside their VMs.
Kind regards,
Oliver Francke.
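For reference, the reproduction recipe above can be sketched as a single script to run inside the guest. This is a sketch, not the reporter's script: the dd fallback (used when spew is not installed) and the loop structure are my additions; the spew invocation and package names are taken verbatim from the report.

```shell
#!/bin/sh
# Inside the guest: generate sustained random write I/O while package
# operations run in parallel. 'spew' is the tool from the report;
# dd is a rough stand-in if spew is not installed.
TARGET=/tmp/doof.dat

io_load() {
    if command -v spew >/dev/null 2>&1; then
        spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G "$TARGET"
    else
        # Fallback: write 1 GiB in 4k blocks (262144 * 4096 bytes),
        # syncing at the end to force the data to disk.
        dd if=/dev/urandom of="$TARGET" bs=4k count=262144 conv=fsync
    fi
}

io_load &
IO_PID=$!

# In parallel, churn the package database, as in the report.
while kill -0 "$IO_PID" 2>/dev/null; do
    apt-get install -y ntp libopts25
    apt-get remove  -y ntp libopts25
done
```

With the affected 3.2.x guest kernels this should eventually produce the hung-task warnings; on 3.8.x it reportedly does not.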
On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
> after some testing I tried to narrow down a problem, which was initially reported by some users.
> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
>
> All using some flavour of linux-3.2.x kernel.
>
> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.
Is that a guest kernel upgrade?
> Problem could be triggert with some workload ala:
>
> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
> and in parallel do some apt-get install/remove/whatever.
>
> That results in a somewhat stuck qemu-session with the bad
> "kernel_hung_task..." messages.
>
> A typical command-line is as follows:
>
> [QEMU command line snipped; quoted in full above]
>
> no "system_reset", "sendkey ctrl-alt-delete" or "q" in monitoring-
> session is accepted, need to hard-kill the process.
Yesterday I saw a possibly related report on IRC. It was a Windows
guest running under OpenStack with images on Ceph.
They reported that the QEMU process would lock up - ping would not work
and their management tools showed 0 CPU activity for the guest.
However, they were able to "kick" the guest by taking a VNC screenshot
(I think). Then it would come back to life.
If you have a Linux guest that is reporting kernel_hung_task, then it
could be a similar scenario.
Please confirm that the hung task message is from inside the guest.
If you are able to reproduce this and have an alternative non-Ceph
storage pool, please try that since Ceph is common to both these bug
reports.
Stefan
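To check Stefan's question, the hung-task warnings come from the guest kernel's khungtaskd thread, so they can be confirmed from inside the guest. A minimal sketch (these are the standard Linux sysctl and log-message forms, not something quoted from this thread):

```shell
# Inside the guest: show the hung-task timeout in effect
# (120 seconds by default) and any warnings logged so far.
cat /proc/sys/kernel/hung_task_timeout_secs
dmesg | grep -E 'hung_task|blocked for more than'
```

If the same grep on the *host's* dmesg is empty while the guest's is not, the message is from inside the guest.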
Hi Stefan,
Am 02.08.2013 um 17:24 schrieb Stefan Hajnoczi <email address hidden>:
> On Fri, Aug 02, 2013 at 09:58:29AM -0000, Oliver Francke wrote:
>> after some testing I tried to narrow down a problem, which was initially reported by some users.
>> Seen on different distros - debian 7.1, ubuntu 12.04 LTS, IPFire-2.3 as reported by now.
>>
>> All using some flavour of linux-3.2.x kernel.
>>
>> Tried e.g. under Ubuntu an upgrade to "Linux 3.8.0-27-generic x86_64" which solves the problem.
>
> Is that a guest kernel upgrade?
yeah, sorry if that was not clear enough.
>
>> Problem could be triggert with some workload ala:
>>
>> spew -v --raw -P -t -i 3 -b 4k -p random -B 4k 1G /tmp/doof.dat
>> and in parallel do some apt-get install/remove/whatever.
>>
>> That results in a somewhat stuck qemu-session with the bad
>> "kernel_hung_task..." messages.
>>
>> A typical command-line is as follows:
>>
>> [QEMU command line snipped; quoted in full above]
>>
>> no "system_reset", "sendkey ctrl-alt-delete" or "q" in monitoring-
>> session is accepted, need to hard-kill the process.
>
> Yesterday I saw a possibly related report on IRC. It was a Windows
> guest running under OpenStack with images on Ceph.
>
> They reported that the QEMU process would lock up - ping would not work
> and their management tools showed 0 CPU activity for the guest.
> However, they were able to "kick" the guest by taking a VNC screenshot
> (I think). Then it would come back to life.
>
> If you have a Linux guest that is reporting kernel_hung_task, then it
> could be a similar scenario.
>
> Please confirm that the hung task message is from inside the guest.
>
confirmed.
> If you are able to reproduce this and have an alternative non-Ceph
> storage pool, please try that since Ceph is common to both these bug
> reports.
>
I can reproduce it with kernel 3.2.something + qemu-1.[456] (never spent much time on 1.3) and high I/O.
Later that day I took this VM and converted it to local qcow2 storage - no problem with any kernel. I have already asked on the ceph-users list for assistance, especially from Josh (if he's not on summer holiday ;) )
What is strange: I have a session open via the VNC console running a loop like:
while true; do apt-get install -y ntp libopts25; apt-get remove -y ntp libopts25; done
and spew running in parallel as described; the apt "session" dies and one can see the hung-task messages, but I can still restart the spew test.
Just for completeness.
Thanks for your attention,
Oliver.
> Stefan
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1207686
Hi,
I opened a ticket with the Ceph guys, and it turned out to be a bug in the librados aio flush.
With the latest "wip-librados-aio-flush (bobtail)" branch I get no errors even under _very_ high load.
Thanks for the attention ;)
Oliver.
Closing as "Invalid" since this was not a QEMU bug according to comment #3.