High TLB IRQ rate causes network interruptions
We are having a problem on our hosts: all the VMs running on them suddenly lose network connectivity for a few seconds.
The root cause appears to be a spike in the TLB IRQ count from low values (fewer than 20) to more than 100k; the spike only lasts a few seconds and then everything goes back to normal.
I've uploaded a collectd screenshot for one hypervisor here:
http://zumbi.com.ar/tmp/irq-tlb.png
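If anyone wants to watch this counter directly on a hypervisor, something like the following should work (a rough sketch; it assumes the standard x86 "TLB" row in /proc/interrupts and a one-second sample interval chosen just as an example):

# Print the cumulative TLB shootdown IRQ count (summed over all CPUs) once per second;
# a jump of tens of thousands between samples matches the spike described above.
while true; do
    awk '/TLB/ { for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i } END { print sum }' /proc/interrupts
    sleep 1
done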
We have hosts running Precise (QEMU 1.5, OVS 2.0.2, libvirt 1.2.2 and kernel 3.13) where the issue is frequent. We also have a small percentage of our fleet running Trusty (QEMU 2.0.0, OVS 2.0.2, libvirt 1.2.2 and kernel 3.16) where the problem seemed to be nonexistent until today.
The issue seems to be isolated to fewer than 10% of our hypervisors; some hypervisors hit it every few days, others only once or twice. Our VMs are a black box to us, we don't know what users run on them, but the workloads are mostly CPU- and network-bound.
Most of our guests run CentOS 6.5 (kernel 2.6.32).
VMs are attached to a Linux bridge, which is wired with a veth pair to an OVS switch (Neutron Open vSwitch agent setup).
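For context, the wiring can be inspected with something like this (a sketch; the commands just list whatever bridges, veth pairs and OVS ports exist, no specific interface names are assumed):

brctl show                  # Linux bridges and the tap/veth interfaces attached to them
ip -d link show type veth   # veth pairs carrying traffic from the Linux bridge to OVS
ovs-vsctl show              # OVS bridges/ports the veth peers are plugged into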
Maybe the first part was not clear, so here it is again:
This happens on some hypervisors at random times, not on all hypervisors at the same time, and it affects all VMs on the affected hypervisor.
The overcommit ratio on the latest server where I hit the problem is 3.6 (3.6 vCPUs per physical CPU). Could that be part of the problem? I see other servers with overcommit ratios as high as 4.1 that never had it.
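For anyone wanting to check their own ratio, it can be computed roughly like this (a sketch; it assumes virsh and bc are installed, libvirt manages all VMs on the host, and nproc's logical CPU count is used as the denominator):

# Sum the vCPU count of every running domain and divide by the host's logical CPU count.
total_vcpus=0
for dom in $(virsh list --name); do
    vcpus=$(virsh dominfo "$dom" | awk '/^CPU\(s\)/ {print $2}')
    total_vcpus=$((total_vcpus + vcpus))
done
echo "overcommit ratio: $(echo "scale=1; $total_vcpus / $(nproc)" | bc)"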
Seeing the same here; it also happens on overcommitted hypervisors.
Just one or two hosts show this behaviour.
We are using:
qemu-kvm 2.0.0+dfsg-2ubuntu1.25
libvirt-bin 1.2.9
kernel 3.13.0-92-generic
We are using Contrail as an SDN.
It looks like it started after upgrading a bunch of packages, including the kernel (we came from 3.13.0-83-generic).
Disabling transparent huge pages seems to help.
Strangely, this should in theory make the issue worse, but so far we have not seen the problem after disabling THP.
(We have not seen high load spikes in a week, but that might also be holiday-related.)
So other people can try it out:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
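To confirm the setting stuck and to keep an eye on THP/compaction activity afterwards, something like this can be used (a sketch; the thp_* and compact_* names are the standard counters exposed in /proc/vmstat):

cat /sys/kernel/mm/transparent_hugepage/enabled   # "never" should appear in brackets
cat /sys/kernel/mm/transparent_hugepage/defrag    # "never" should appear in brackets
grep -E '^(thp_|compact_)' /proc/vmstat           # THP allocation and memory-compaction counters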
Looking through old bug tickets... can you still reproduce this issue with the latest version of QEMU? Or could we close this ticket nowadays?
[Expired for QEMU because there has been no activity for 60 days.]