graphic: 0.921
hypervisor: 0.872
performance: 0.792
network: 0.704
architecture: 0.606
virtual: 0.585
KVM: 0.531
kernel: 0.517
ppc: 0.507
semantic: 0.500
device: 0.498
assembly: 0.480
peripherals: 0.477
debug: 0.462
i386: 0.387
x86: 0.380
files: 0.366
vnc: 0.354
PID: 0.347
register: 0.333
user-level: 0.312
mistranslation: 0.269
socket: 0.267
arm: 0.267
risc-v: 0.238
permissions: 0.209
VMM: 0.195
TCG: 0.162
boot: 0.140

high IRQ-TLB generates network interruptions

We are having a problem on our hosts: all the VMs running on a host suddenly lose network connectivity for a few seconds.

The root cause appears to be a jump in irq-tlb from low values (less than 20) to more than 100k. The spike only lasts a few seconds, then everything goes back to normal.

I've uploaded a screenshot of collectd for one hypervisor here:
http://zumbi.com.ar/tmp/irq-tlb.png
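For anyone who wants to watch for the spike without collectd, here is a rough sketch. It assumes the irq-tlb metric comes from the "TLB:" (TLB shootdowns) row of /proc/interrupts on x86. The function names tlb_total and tlb_rate are made up for this example.

```shell
# Sum the per-CPU counters on the "TLB:" row of /proc/interrupts.
# An alternative file can be passed as $1 (handy for saved snapshots).
tlb_total() {
    awk '$1 == "TLB:" {
        sum = 0
        # Only the numeric fields are per-CPU counters; the trailing
        # "TLB shootdowns" description is skipped.
        for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) sum += $i
        print sum
    }' "${1:-/proc/interrupts}"
}

# Print the approximate TLB shootdowns per second across all CPUs,
# measured over an interval given in whole seconds (default 1).
tlb_rate() {
    tlb_interval="${1:-1}"
    tlb_before=$(tlb_total)
    sleep "$tlb_interval"
    tlb_after=$(tlb_total)
    echo $(( (tlb_after - tlb_before) / tlb_interval ))
}
```

Running tlb_rate in a loop on an affected hypervisor and logging the output with a timestamp should show whether the jump lines up with the network drops.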


We have hosts running precise (qemu 1.5, ovs 2.0.2, libvirt 1.2.2 and kernel 3.13) where the issue is frequent. We also have a small % of our fleet running trusty (qemu 2.0.0, ovs 2.0.2, libvirt 1.2.2 and kernel 3.16) where the problem seemed to be nonexistent until today.

The issue seems to be isolated to < 10% of our hypervisors. Some hypervisors have had this problem every few days, others only once or twice. Our VMs are a black box to us: we don't know what users run on them, but it is mostly CPU- and network-bound workloads.
Most of our guests run centos 6.5 (kernel 2.6.32).

VMs are bridged to a linuxbridge, then wired via veth to an ovs switch (neutron openvswitch agent setup).


Maybe the first part is not clear, so here it goes again:

This happens on some hypervisors at random times, not on all hypervisors at the same time, and it affects all VMs on the hypervisor.

The overcommit ratio on the latest server where I had the problem is 3.6 (3.6 vcpu for each cpu). Would that be part of the problem? I see other servers that never had the problem with overcommit ratios as high as 4.1.

Seeing the same here, also happens on overbooked hypervisors.

Just one or two hosts have this behaviour.

We are using:
qemu-kvm                             2.0.0+dfsg-2ubuntu1.25
libvirt-bin                          1.2.9
kernel  3.13.0-92-generic

We are using contrail as an SDN.

It looks like it started after upgrading a bunch of packages including kernel (we came from 3.13.0-83-generic)


Disabling huge pages seems to help.
Strangely, this should theoretically make the issue worse, but so far we have not seen issues after disabling THP.
(We have not seen high load spikes in a week, but this might also be holiday related.)

So other people can try it out:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
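
To confirm the setting took effect, a small sketch like the following can read it back. It relies on the kernel marking the active THP mode with square brackets in these sysfs files (for example "always madvise [never]"). The helper name thp_active is hypothetical.

```shell
# Print the active value (the one in square brackets) from a THP sysfs file.
# Defaults to the "enabled" file; pass another path as $1 to check "defrag".
thp_active() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' \
        "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Example (run on the hypervisor):
#   thp_active
#   thp_active /sys/kernel/mm/transparent_hugepage/defrag
```

Note that values written with echo are lost on reboot, so they have to be reapplied at boot (for example from an init script) if the workaround turns out to help.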


Looking through old bug tickets... can you still reproduce this issue with the latest version of QEMU? Or could we close this ticket nowadays?

[Expired for QEMU because there has been no activity for 60 days.]