high IRQ-TLB generates network interruptions

We are having a problem on our hosts: all the VMs running on a host suddenly lose network connectivity for a few seconds.

The root cause appears to be a spike in irq-tlb from low values (under 20) to more than 100k. The spike lasts only a few seconds, then everything goes back to normal.
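For anyone who wants to watch for the spike without collectd, here is a minimal sketch. It assumes the irq-tlb metric is derived from the "TLB:" (TLB shootdowns) row of /proc/interrupts. The function names tlb_total and tlb_rate are made up for illustration.

```shell
# Sum the per-CPU counters on the "TLB:" row of an interrupts table.
# The optional path argument is an assumption for illustration; default is /proc/interrupts.
tlb_total() {
    awk '$1 == "TLB:" {
             # Add up numeric columns only; stop at the trailing description text.
             for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) s += $i
         }
         END { printf "%.0f\n", s }' "${1:-/proc/interrupts}"
}

# Print how many TLB shootdowns happened during an interval (default: 1 second).
tlb_rate() {
    interval="${1:-1}"
    before=$(tlb_total)
    sleep "$interval"
    after=$(tlb_total)
    echo $((after - before))
}
```

Running tlb_rate 1 in a loop on an affected hypervisor should show whether the counter jumps during a connectivity loss.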

I've uploaded a collectd screenshot for one hypervisor here:
http://zumbi.com.ar/tmp/irq-tlb.png


We have hosts running precise (qemu 1.5, ovs 2.0.2, libvirt 1.2.2 and kernel 3.13) where the issue is frequent. We also have a small percentage of our fleet running trusty (qemu 2.0.0, ovs 2.0.2, libvirt 1.2.2 and kernel 3.16), where the problem seemed to be nonexistent until today.

The issue seems to be isolated to fewer than 10% of our hypervisors. Some hypervisors hit this problem every few days, others only once or twice. Our VMs are a black box to us: we don't know what users run on them, but it is mostly CPU- and network-bound workloads.
Most of our guests run centos 6.5 (kernel 2.6.32).

VMs are bridged to a linuxbridge, which is then wired via veth to an ovs switch (neutron openvswitch agent setup).


Maybe the first part was not clear, so here it goes again:

This happens on some hypervisors at random times, not on all hypervisors at the same time, and it affects all VMs on the affected hypervisor.

The overcommit ratio on the latest server where I had the problem is 3.6 (3.6 vCPUs for each CPU). Could that be part of the problem? I see other servers that never had the problem with overcommit ratios as high as 4.1.
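To compare hosts on the same footing, the ratio is simply total guest vCPUs divided by host CPUs. A minimal sketch follows. The function name overcommit_ratio is made up, and the counts in the comments are hypothetical examples, not values from these hosts.

```shell
# Print the vCPU overcommit ratio (allocated guest vCPUs per host CPU), one decimal.
# Arguments: total guest vCPUs, host CPU count. Fails if the CPU count is not positive.
overcommit_ratio() {
    awk -v v="$1" -v c="$2" 'BEGIN { if (c <= 0) exit 1; printf "%.1f\n", v / c }'
}

# Hypothetical example: 72 vCPUs allocated on a 20-CPU host.
# overcommit_ratio 72 20
# The host CPU count could come from nproc:
# overcommit_ratio 72 "$(nproc)"
```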

Seeing the same here; it also happens on overbooked hypervisors.

Just one or two hosts have this behaviour.

We are using:
qemu-kvm                             2.0.0+dfsg-2ubuntu1.25
libvirt-bin                          1.2.9
kernel  3.13.0-92-generic

We are using Contrail as an SDN.

It looks like it started after upgrading a bunch of packages, including the kernel (we came from 3.13.0-83-generic).


Disabling transparent huge pages (THP) seems to help.
Strangely, this should in theory make the issue worse, but so far we have not seen it again after disabling THP.
(We have not seen high load spikes in a week, but this might also be holiday related.)

So other people can try it out:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
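To confirm the change took effect, the active value can be read back from the same sysfs files, where the kernel marks the current setting with square brackets (for example "always madvise [never]"). A minimal sketch, with a made-up helper name thp_active:

```shell
# Print the active value (the one in square brackets) from a THP sysfs file.
thp_active() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Report the current THP settings if the files exist on this host.
for f in enabled defrag; do
    path=/sys/kernel/mm/transparent_hugepage/$f
    if [ -r "$path" ]; then
        echo "$f: $(thp_active "$path")"
    fi
done
```

Values written with echo do not survive a reboot, so they need to be reapplied at boot if the workaround turns out to help.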


Looking through old bug tickets... can you still reproduce this issue with the latest version of QEMU? Or could we close this ticket nowadays?

[Expired for QEMU because there has been no activity for 60 days.]