qemu nested virtualization is not working with Ubuntu 16.04 + Intel CPU
# 1 What am I trying to do? #
I want to use `libvirt` `qemu/KVM` with **nested virtualization** as described
in [1] and [2].
**But it does not work with Ubuntu 16.04.** It worked some time ago, but not
anymore.
I want 2 levels of virtualization like this:
* L0 – the bare metal host, running KVM on `Ubuntu 16.04`
* L1 – an `Ubuntu 16.04` VM running on L0; also called the "guest hypervisor",
as it is itself capable of running KVM
* L2 – an `Ubuntu 16.04` VM running on L1, also called the "nested guest"
[1] https://docs.fedoraproject.org/en-US/quick-docs/using-nested-virtualization-in-kvm/
[2] https://www.linux-kvm.org/page/Nested_Guests
My goal is to deploy an `OpenStack` environment on top of VMs rather than on
bare metal hosts, for the convenience of a lab experiment. As a result, the
`OpenStack` nodes are L1 VMs. The compute nodes are L1 VMs as well, and the VMs
created with `OpenStack`, which run on the compute nodes, are L2 VMs.
# 2 What is my problem? #
I can **not** run my 2nd level of virtualization on 16.04:
* L0 is just fine: running `Ubuntu 16.04.5 LTS`, installed with the `.iso` image
* L1: I install `libvirt` + `KVM` on L0. I can run VMs, like the `Ubuntu 16.04`
cloud image, on L0.
* L2: I install `libvirt` + `KVM` on L1 as well. But I **cannot** run VMs on
L1: I get a `kernel panic` or a `general protection fault`.
**But if I do the same with Ubuntu 18.04** (on the same hardware) instead of
`Ubuntu 16.04`, it works without faults.
I don't change the configuration or the `virt-install` scripts (other than using
the 18.04 .iso and cloud image).
# 3 My libvirt installation for Ubuntu16.04 #
I install `libvirt` + `KVM` in both L0 and L1 using a custom repository [3] from
the `OpenStack` team, because the version of `libvirt` in this repo is newer than
the one in the official Ubuntu 16.04 repo, and it matches the version of `libvirt`
in Ubuntu 18.04.
[3] https://wiki.ubuntu.com/OpenStack/CloudArchive
# 4 Hardware and CPU #
CPU is:
> Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Intel virtualization (VT-x) is enabled in the BIOS/UEFI.
The rest is standard HDD, standard I/O...
# 5 .iso and cloud image #
I download the .iso for the L0 bare metal server and the cloud images
for the L1/L2 VMs from the official repositories:
Ubuntu 16.04
* http://releases.ubuntu.com/16.04/
* https://cloud-images.ubuntu.com/releases/16.04/release/
Ubuntu 18.04
* http://releases.ubuntu.com/bionic/
* https://cloud-images.ubuntu.com/releases/18.04/release/
# 6 Details #
## Details about L0 Ubuntu 16.04 bare metal host ##
L0 is running `Ubuntu 16.04.5 LTS` installed with the .iso.
**kernel**
```
user@L0:~$ uname -a
Linux L0 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
```
**libvirt version** running on L0
```
user@L0:~$ virsh version
Compiled against library: libvirt 4.0.0
Using library: libvirt 4.0.0
Using API: QEMU 4.0.0
Running hypervisor: QEMU 2.11.1
```
**qemu version detail**
```
ukvm2@kvm2:~$ qemu-system-x86_64 --version
QEMU emulator version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.5~cloud0)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
```
**KVM acceleration**
```
user@L0:~$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
```
**nested parameter**
```
user@L0:~$ cat /sys/module/kvm_intel/parameters/nested
Y
```
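(For anyone trying to reproduce this: nested support was already on here, but if it is not, this is the usual way to enable it persistently on an Intel host; the file name under `/etc/modprobe.d/` is arbitrary.)
```
# enable nested VMX persistently on the L0 host (Intel)
echo "options kvm_intel nested=1" | sudo tee /etc/modprobe.d/kvm-nested.conf
# reload the module (only possible with no VMs running) or simply reboot
sudo modprobe -r kvm_intel && sudo modprobe kvm_intel
cat /sys/module/kvm_intel/parameters/nested   # should print Y
```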
**number of CPUs**
```
user@L0:~$ egrep -c '(vmx|svm)' /proc/cpuinfo
48
```
## Details about an L1 Ubuntu 16.04 VM ##
An L1 VM (running on L0) that runs `Ubuntu 16.04.5 LTS`,
installed from a cloud image.
**kernel**
```
user@L1-VM:~$ uname -a
Linux L1 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
```
**libvirt version** running on the L1 VM
```
user@L1-VM:~$ sudo virsh version
Compiled against library: libvirt 4.0.0
Using library: libvirt 4.0.0
Using API: QEMU 4.0.0
Running hypervisor: QEMU 2.11.1
```
**qemu version detail**
```
user@L1-VM:~$ qemu-system-x86_64 --version
QEMU emulator version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.5~cloud0)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
```
**KVM acceleration**
```
user@L1-VM:~$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
```
**nested parameter**
```
user@L1-VM:~$ cat /sys/module/kvm_intel/parameters/nested
Y
```
**number of CPUs**, i.e. the vCPUs given by L0 to the L1 VM; I give it 20 vCPUs.
```
user@L1-VM:~$ egrep -c '(vmx|svm)' /proc/cpuinfo
20
```
## L1 VM virt-install script parameters ##
If you want to reproduce an L1 VM, here is the `virt-install` invocation I use,
following [4] (example variable values are given below):
```
virt-install \
--connect=qemu:///system \
--name $VMName \
--memory $RAM \
--vcpus $VCPUS \
--cpu host \
--metadata description=$DESCRIPTION \
--os-type linux \
--os-variant ubuntu16.04 \
--disk $DISK_PATH/$VMName.$DISK_FORMAT,size=$DISK_SIZE,bus=virtio \
--disk $CFGIMG_PATH/config_$VMName.$DISK_FORMAT,device=cdrom \
--network bridge=virbr0 \
--graphics none \
--console pty,target_type=serial \
--hvm
```
[4] https://youth2009.org/post/kvm-with-ubuntu-cloud-image/
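For illustration, the variables in the script above could be filled in like this (hypothetical values, adjust to your environment; the second `--disk` is the cloud-init seed image for the cloud image):
```
VMName=l1-vm
DESCRIPTION=l1-guest-hypervisor
RAM=16384                             # MiB
VCPUS=20
DISK_PATH=/var/lib/libvirt/images     # root disk created from the Ubuntu cloud image
DISK_FORMAT=qcow2
DISK_SIZE=40                          # GiB
CFGIMG_PATH=/var/lib/libvirt/images   # cloud-init seed: config_$VMName.qcow2
```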
## Details about an L2 VM ##
I want to create an L2 `Ubuntu 16.04.5 LTS` VM, installed from a cloud image,
within my L1 `KVM` VM. But whatever I do, my L2 VM crashes before it finishes
being instantiated: I get a `kernel panic` or a `general protection fault`.
Here is the log of an L2 VM after the instantiation failed:
```
user@L1-VM:~$ less /var/log/libvirt/qemu/VMNAME.log
2018-10-11T07:40:45.837151Z qemu-system-x86_64: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)
2018-10-11T07:40:45.844279Z qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.invpcid [bit 10]
2018-10-11T07:40:45.848532Z qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.07H:EBX.invpcid [bit 10]
```
If you want to reproduce an L2 VM running on L1, follow [4].
**However**, a Cirros OS image does run as an L2 guest on the L1 VM!
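For completeness, these are the kinds of sanity checks that can be run inside the L1 VM before attempting the L2 install (nothing exotic, just confirming what the L1 guest actually exposes):
```
# inside the L1 VM: is VT-x exposed, is invpcid exposed, is KVM usable?
egrep -c '(vmx|svm)' /proc/cpuinfo
grep -c invpcid /proc/cpuinfo
lsmod | grep kvm
kvm-ok
```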
# 7 Thoughts #
I think this is a bug in either `Ubuntu 16.04` or `libvirt`.
All the information needed to reproduce the bug should be here, I think.
If I do the same with `Ubuntu 18.04`, on the same hardware, following the same
steps but with the Ubuntu 18.04 .iso and cloud image, it works.
It works if:
* L0 = Ubuntu18.04 (.iso) + qemu/KVM
* L1 = Ubuntu18.04 (cloud image) + qemu/KVM
* L2 = Ubuntu18.04 (cloud image)
It also works if:
* L0 = Ubuntu18.04 (.iso) + qemu/KVM
* L1 = Ubuntu18.04 (cloud image) + qemu/KVM
* L2 = Ubuntu16.04 (cloud image)
Thank you for taking the time to read this!
--
nico
[update]
I tested some new combinations, #1 and #2 (see the attachment), with
Ubuntu 18.04 and 16.04.
For the moment, as long as it fits my needs, I will stick to
combinations #1 and/or #2.
Hi Nicolas, interesting.
Seeing CPUID.07H:EBX.invpcid makes me wonder - IIRC that was a speedup feature long neglected by everyone, but it suddenly became important in the context of Meltdown avoidance. Maybe it wasn't passed/emulated in the older qemu, but the guest now insists on it or misdetects it?
This would also sort of match your statement "It worked some time ago, but not anymore", as the Meltdown fixes obviously came after the release of 16.04.
Let me try to recreate your initial case first with 16.04->16.04->16.04.
I took a freshly deployed Xenial host and deployed a Xenial guest at lvl1:
$ sudo apt install uvtool-libvirt
$ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 release=xenial label=daily
$ uvt-kvm create --memory 4096 --disk 30 --cpu 4 --password ubuntu xenial-guest-lvl1 arch=amd64 release=xenial label=daily
And then, in the lvl1 guest, I did the same to spawn a smaller lvl2 guest:
...
# note: back then (16.04) the default libvirt network inside the nested guest needed to be brought up manually before the next command (see below)
$ uvt-kvm create --password ubuntu xenial-guest-lvl2 arch=amd64 release=xenial label=daily
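(The manual step for the default network mentioned in the note above was roughly the following inside the lvl1 guest, assuming the stock "default" NAT network shipped with libvirt:)
```
# run via sudo or as a member of the libvirtd group
virsh net-start default
virsh net-autostart default
```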
That guest runs just fine and is happy.
So it has to be part of your guest configuration in some way.
$ cat /proc/cpuinfo | grep invpcid
reports that it is available on the host (lvl0) but on none of the guests (lvl1/lvl2).
I did not see a warning like yours about CPUID.07H:EBX.invpcid (on any of the lvls).
My guests are defined in the "most default" way possible, leaving most of the CPU construction to the defaults of libvirt/qemu.
virsh dumpxml content:
lvl1 => http://paste.ubuntu.com/p/fH57d5prmS/
lvl2 => http://paste.ubuntu.com/p/vQbcgfmfVv/
I wonder if your way of setting up the guests uses special CPU types that define the Meltdown-related features - like the -IBRS types - or even adds features like those mentioned in [1].
E.g. Virt-manager would default to "Haswell-noTSX-IBRS" on my system with the virt stack of 16.04.
If I use that model in my guest definition (on both levels), i.e. set the guest CPU model to
Haswell-noTSX-IBRS,
I now get invpcid from `$ cat /proc/cpuinfo | grep invpcid` in the lvl1 guest.
But since this type lacks the KVM features, I'll stop assuming and instead wait for your reply on how the guest CPU is modelled in your case.
But in general, I could see this being a potential source of trouble for (x86) nesting, which is generally known as "working great until it doesn't".
Waiting for your feedback on guest CPU definitions in your case.
[1]: https://www.berrange.com/posts/2018/06/29/cpu-model-configuration-for-qemu-kvm-on-x86-hosts/
Hi Christian,
First, I tried to create an lvl2 VM using your suggestion with `uvtool-libvirt`.
I tried this on one of my lvl1 VMs, which is an OpenStack compute node.
```
compute@L1: $ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 release=xenial label=daily
compute@L1: $ uvt-kvm create --memory 4096 --disk 30 --cpu 4 --password ubuntu xenial-guest-lvl1 arch=amd64 release=xenial label=daily
```
And this way, it works!
However, if I use the OpenStack API to create an lvl2 VM on this same compute
node, the OpenStack Nova VM fails.
.
.
.
.
To recap, the 3 options tested here to create an lvl2 VM:
1. OpenStack API -> FAIL
   The lvl2 VM is stuck, for example at:
   "Starting Update UTMP about System Boot/Shutdown..."
2. virt-install "by hand" -> FAIL
   If I do it as in [1], the VM generates the same error as in my 1st post.
3. uvtool-libvirt -> SUCCESS
   Your example works just fine.
[1] https://youth2009.org/post/kvm-with-ubuntu-cloud-image/
.
.
.
.
You say:
> That guest runs just fine and is happy.
> So it has to be part of your guest configuration in some way.
I agree: maybe I should look further into the options of `virt-install`.
What I gave in my first post (the virt-install script) is my way of creating an lvl1 VM.
NB: it seems that, for the moment, `--os-variant` has no `ubuntu18.04` value.
I keep this parameter set to ubuntu16.04, even when I want to create an 18.04 VM.
The difference between Ubuntu 16.04 and 18.04 regarding `virt-install`:
* 16.04: virt-install --version is 1.3.2
* 18.04: virt-install --version is 1.5.1
So maybe the problem comes from `virt-install` and the way I configure a VM.
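If the CPU modelling turns out to be the culprit, `virt-install` can also request a named model, or keep a model but drop a single feature, instead of `--cpu host`. A sketch based on the virt-install documentation (not tested here yet; the model name is the one virt-manager picked on Christian's system):
```
# in the virt-install script from my first post, replace the line
#   --cpu host \
# with e.g. a named model:
  --cpu Haswell-noTSX-IBRS \
# or keep a model but drop the feature from the invpcid warning:
  --cpu Haswell-noTSX-IBRS,disable=invpcid \
```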
.
.
.
.
However, when going through the OpenStack API, I am not the one who provides
the guest configuration. I give the OpenStack API the info it needs to
create a new OpenStack instance (i.e. flavor, image type, cloud-init config,
etc.), and the API then ~converts~ this description to instantiate the
OpenStack instance on the compute node running qemu/KVM.
I am not sure what the OpenStack API uses to do that. I assume it uses
python-libvirt [2], but I may be wrong.
[2] https://libvirt.org/docs/libvirt-appdev-guide-python/en-US/html/libvirt_application_development_guide_using_python-Guest_Domains-Lifecycle_Control.html#libvirt_application_development_guide_using_python-Guest_Domains-Lifecycle-Provisioning_and_Starting
.
.
.
.
You say:
> I wonder if your way to setup the guests uses special CPU types [...]
>
> Waiting for your feedback on guest CPU definitions in your case.
My CPU is an Intel Xeon Broadwell.
On lvl0, which has 48 cores:
```
baremetal@L0:cat /proc/cpuinfo | grep invpcid | wc -l
48
```
On lvl1, which is a VM with 20 vCPUs:
```
compute@L1:$ cat /proc/cpuinfo | grep invpcid | wc -l
20
```
.
.
.
.
Dumpxml of a working lvl1 VM "compute41":
https://paste.ubuntu.com/p/KMrCKGgvRg/
Dumpxml of a failing lvl2 VM created by the OpenStack/Nova API on "compute41":
https://paste.ubuntu.com/p/9FrhMWWgVk/
Dumpxml of a working lvl2 VM created by uvtool-libvirt:
https://paste.ubuntu.com/p/4CztPDW7fM/
One difference I see with your dumpxml is in the os part:
machine='pc-i440fx-bionic' for me
machine='pc-i440fx-xenial' for you
Maybe this is due to the way I install qemu with cloud-archive:queens [3].
[3] https://wiki.ubuntu.com/OpenStack/CloudArchive
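For reference, the relevant sections can be pulled out of the dumps, and the machine types a given qemu build knows about can be listed directly (the VM name here is just my compute node example):
```
# extract the <os> and <cpu> sections from a domain definition
virsh dumpxml compute41 | sed -n -e '/<os>/,/<\/os>/p' -e '/<cpu[ >]/,/<\/cpu>/p'
# list the machine types this qemu build supports;
# pc-i440fx-bionic presumably comes with the cloud-archive qemu 2.11
qemu-system-x86_64 -machine help | grep i440fx
```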
.
.
.
.
qemu logs for the lvl2 VM created by uvtool-libvirt:
```
compute@L1:$ cat /var/log/libvirt/qemu/xenial-guest-lvl2.log
[...]
2018-10-12T14:28:03.317760Z qemu-system-x86_64: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
```
If I missed something, let me know!
--
Nicolas
In my previous comment, I am in the "16.04 > 16.04 > 16.04" situation.
I mention this especially for my 3 dumpxml files.
Ok,
at the lvl1 definition OpenStack came up with its own CPU modelling, which in this case is actually:
cpu mode='host-passthrough'
+ a bunch of required features.
That is what gives your LVL1 the invpcid feature (so far so good).
At lvl2 we then have the Nova-generated CPU definition vs. the uvtool-generated one (see the dumpxml pastes above).
Thanks for the data Nicolas!
With that in mind I have set my LVL1 to run the same host-passthrough config that you have reported.
Then, in turn, I configured LVL2 to run the same host-model config.
Note: "my 16.04" would not allow "check='partial'", so I dropped it.
What version of libvirt is running in your lvl1 (or rather at all levels)?
The current Xenial version is 1.3.1-1ubuntu10.24.
I was glad to see that the uvtool-style guests work for you, as I assumed they would.
But even with the same CPU definitions used in my case, it works for me.
That is for "16.04 > 16.04 > 16.04" as well.
x86 nested virt is never really supported, just "as good as it happens to work". I wonder if this is one of those cases.
My chip is a somewhat older 12 core "Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz".
We have plenty of SW workarounds already, but if you can spend the time I wonder if you can:
- reproduce the same on a different host CPU
- to confirm our current theory that the usage/emulation/nesting of invpcid is the root cause: in the failing case, add an entry disabling invpcid to the cpu section of the LVL2 definition (see the sketch below). That would keep the rest as-is, but remove that feature.
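Concretely, that would be something like this (a sketch; LVL2-GUEST is a placeholder for the failing guest's name):
```
# edit the failing lvl2 guest and, inside its existing <cpu> element, add:
#     <feature policy='disable' name='invpcid'/>
# then start it again and watch the qemu log
virsh edit LVL2-GUEST
virsh start LVL2-GUEST
tail -f /var/log/libvirt/qemu/LVL2-GUEST.log
```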
NB-reply: there was no --os-variant value for 18.04 released back then, but since nothing has changed compared to former releases it doesn't matter - the only drawback is that people sometimes wonder whether it is missing.
You say:
> What version of libvirt is running in your lvl1 (or all levels)?
cf my 1st post and below:
On my Ubuntu16.04 bare metal host:
```
libvirt-bin 4.0.0-1ubuntu8.5~cloud0
qemu-system-x86 1:2.11+dfsg-1ubuntu7.6~cloud0
```
On one Ubuntu16.04 lvl1 VM:
```
libvirt-bin 4.0.0-1ubuntu8.3~cloud0
qemu-system-x86 1:2.11+dfsg-1ubuntu7.4~cloud0
```
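(Package versions can be listed like this, for anyone who wants to compare; the filter is just illustrative:)
```
dpkg -l | egrep 'libvirt-bin|qemu-system-x86'
```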
.
.
.
.
You say:
> - [can you] reproduce the same on a different host CPU
If I can, I will try on a much 'smaller' device (not a Xeon CPU).
Maybe tomorrow.
.
.
.
.
You say:
> To confirm our current theory that the usage/emulation/nesting of invpcid
> is the root cause [...]
Dumpxml of my lvl2 VM with the invpcid feature disabled in its cpu section:
https://paste.ubuntu.com/p/WxvfBcHnF2/
And here is the boot log of the lvl2 VM with invpcid disabled; the VM still failed to boot (maybe I missed something):
https://paste.ubuntu.com/p/bkqDsT8VTy/
I followed [1]:
[1] https://youth2009.org/post/kvm-with-ubuntu-cloud-image/
.
.
.
.
You say:
> NB-reply: there was no --os-variant 18.04 released back then
No big deal, it is just frustrating to instantiate an Ubuntu 18.04 VM and have
to set this parameter to something else!
.
.
.
.
You say:
> x86 nested virt is never really supported, just "as good as it happens to work". I wonder if that is one of those cases.
At first I thought nested virt could make my life easier regarding what I wanted
to achieve. There are several ways of testing and deploying OpenStack.
With the way I chose, I could 'simulate' a multi-host environment like in [2],
with several compute nodes, etc.
[2] https://github.com/nuagenetworks/nuage-openstack-ansible/wiki/Configure-OSA-Multi-node-Environment
But now I understand that nested virt is maybe still too much of a beta.
I will try with "Ubuntu 18.04 > Ubuntu 16/18 > {whatever OS}".
Maybe this is patched in Ubuntu 18.04.
And if it does not suit my needs, I will figure out something else (and more
bare metal ^^).
Metal as a Service (Ubuntu MAAS) looks good, but it is too much for me.
.
.
.
.
Thank you for your help. I think this bug is hard to identify and maybe harder to
patch. And I am not a virtualization, qemu, or OpenStack expert
(for the moment!?), so I can't help you more.
[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]
FWIW, bumping the kernel on the host (and most likely on the L1 VMs too) should work.
The HWE kernel in Xenial is the same version (4.15) as the kernel used by Bionic (18.04), so this should fix the problem:
$ apt install linux-generic-hwe-16.04
$ reboot
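After the reboot you can confirm the HWE kernel is running:
```
$ uname -r    # should now report a 4.15.x kernel
```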
BR,
Alex
Thank you for your suggestion, @Alexandru!
.
.
.
I cannot try this fix because I have since moved on: I use Ubuntu 18.04 for my L0 hypervisor, and I have also tried with Ubuntu 18.04 on the L1 VMs.
.
.
.
However, this is very interesting. On my previous Ubuntu 16.04 hosts, I believe I used "linux-image-4.4.0-XYZ-generic".
I think I have encountered exactly the same issue as you, Nico. We have very similar setups.
Upgrading the kernel did not help.
The only thing that helped was setting OpenStack to use qemu instead of kvm for the L2 VMs, with the performance cost associated with doing that :(
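For reference, that switch is the Nova `virt_type` option on the (L1) compute nodes; roughly the following, assuming the stock /etc/nova/nova.conf layout:
```
# on each (L1) compute node, in /etc/nova/nova.conf set:
#   [libvirt]
#   virt_type = qemu
# then restart the compute service
sudo systemctl restart nova-compute
```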