Diffstat (limited to 'docs')

 docs/amd-memory-encryption.txt     |  89
 docs/devel/index.rst               |   1
 docs/devel/vfio-migration.rst      | 150
 docs/interop/firmware.json         |  47
 docs/system/cpu-models-x86-abi.csv |  67
 docs/system/cpu-models-x86.rst.inc |  22
 6 files changed, 350 insertions, 26 deletions
diff --git a/docs/amd-memory-encryption.txt b/docs/amd-memory-encryption.txt
index 145896aec7..ffca382b5f 100644
--- a/docs/amd-memory-encryption.txt
+++ b/docs/amd-memory-encryption.txt
@@ -1,38 +1,48 @@
 Secure Encrypted Virtualization (SEV) is a feature found on AMD processors.
 
 SEV is an extension to the AMD-V architecture which supports running encrypted
-virtual machine (VMs) under the control of KVM. Encrypted VMs have their pages
+virtual machines (VMs) under the control of KVM. Encrypted VMs have their pages
 (code and data) secured such that only the guest itself has access to the
 unencrypted version. Each encrypted VM is associated with a unique encryption
-key; if its data is accessed to a different entity using a different key the
+key; if its data is accessed by a different entity using a different key the
 encrypted guests data will be incorrectly decrypted, leading to unintelligible
 data.
 
-The key management of this feature is handled by separate processor known as
-AMD secure processor (AMD-SP) which is present in AMD SOCs. Firmware running
-inside the AMD-SP provide commands to support common VM lifecycle. This
+Key management for this feature is handled by a separate processor known as the
+AMD secure processor (AMD-SP), which is present in AMD SOCs. Firmware running
+inside the AMD-SP provides commands to support a common VM lifecycle. This
 includes commands for launching, snapshotting, migrating and debugging the
-encrypted guest. Those SEV command can be issued via KVM_MEMORY_ENCRYPT_OP
+encrypted guest. These SEV commands can be issued via KVM_MEMORY_ENCRYPT_OP
 ioctls.
 
+Secure Encrypted Virtualization - Encrypted State (SEV-ES) builds on the SEV
+support to additionally protect the guest register state. In order to allow a
+hypervisor to perform functions on behalf of a guest, there is architectural
+support for notifying a guest's operating system when certain types of VMEXITs
+are about to occur. This allows the guest to selectively share information with
+the hypervisor to satisfy the requested function.
+
 Launching
 ---------
-Boot images (such as bios) must be encrypted before guest can be booted.
-MEMORY_ENCRYPT_OP ioctl provides commands to encrypt the images :LAUNCH_START,
+Boot images (such as bios) must be encrypted before a guest can be booted. The
+MEMORY_ENCRYPT_OP ioctl provides commands to encrypt the images: LAUNCH_START,
 LAUNCH_UPDATE_DATA, LAUNCH_MEASURE and LAUNCH_FINISH. These four commands
 together generate a fresh memory encryption key for the VM, encrypt the boot
-images and provide a measurement than can be used as an attestation of the
+images and provide a measurement that can be used as an attestation of a
 successful launch.
 
+For a SEV-ES guest, the LAUNCH_UPDATE_VMSA command is also used to encrypt the
+guest register state, or VM save area (VMSA), for all of the guest vCPUs.
+
 LAUNCH_START is called first to create a cryptographic launch context within
-the firmware. To create this context, guest owner must provides guest policy,
+the firmware. To create this context, the guest owner must provide a guest policy,
 its public Diffie-Hellman key (PDH) and session parameters. These inputs
-should be treated as binary blob and must be passed as-is to the SEV firmware.
+should be treated as a binary blob and must be passed as-is to the SEV firmware.
 
-The guest policy is passed as plaintext and hypervisor may able to read it
+The guest policy is passed as plaintext. A hypervisor may choose to read it,
 but should not modify it (any modification of the policy bits will result in
 bad measurement). The guest policy is a 4-byte data structure containing
-several flags that restricts what can be done on running SEV guest.
+several flags that restrict what can be done on a running SEV guest.
 See KM Spec section 3 and 6.2 for more details.
 
 The guest policy can be provided via the 'policy' property (see below)
@@ -40,31 +50,42 @@ The guest policy can be provided via the 'policy' property (see below)
 
 # ${QEMU} \
      sev-guest,id=sev0,policy=0x1...\
 
-Guest owners provided DH certificate and session parameters will be used to
+Setting the "SEV-ES required" policy bit (bit 2) will launch the guest as a
+SEV-ES guest (see below)
+
+# ${QEMU} \
+     sev-guest,id=sev0,policy=0x5...\
+
+The guest owner provided DH certificate and session parameters will be used to
 establish a cryptographic session with the guest owner to negotiate keys used
 for the attestation.
 
-The DH certificate and session blob can be provided via 'dh-cert-file' and
-'session-file' property (see below
+The DH certificate and session blob can be provided via the 'dh-cert-file' and
+'session-file' properties (see below)
 
 # ${QEMU} \
      sev-guest,id=sev0,dh-cert-file=<file1>,session-file=<file2>
 
 LAUNCH_UPDATE_DATA encrypts the memory region using the cryptographic context
-created via LAUNCH_START command. If required, this command can be called
+created via the LAUNCH_START command. If required, this command can be called
 multiple times to encrypt different memory regions. The command also calculates
 the measurement of the memory contents as it encrypts.
 
-LAUNCH_MEASURE command can be used to retrieve the measurement of encrypted
-memory. This measurement is a signature of the memory contents that can be
-sent to the guest owner as an attestation that the memory was encrypted
+LAUNCH_UPDATE_VMSA encrypts all the vCPU VMSAs for a SEV-ES guest using the
+cryptographic context created via the LAUNCH_START command. The command also
+calculates the measurement of the VMSAs as it encrypts them.
+
+LAUNCH_MEASURE can be used to retrieve the measurement of encrypted memory and,
+for a SEV-ES guest, encrypted VMSAs. This measurement is a signature of the
+memory contents and, for a SEV-ES guest, the VMSA contents, that can be sent
+to the guest owner as an attestation that the memory and VMSAs were encrypted
 correctly by the firmware. The guest owner may wait to provide the guest
 confidential information until it can verify the attestation measurement.
 Since the guest owner knows the initial contents of the guest at boot, the
 attestation measurement can be verified by comparing it to what the guest owner
 expects.
 
-LAUNCH_FINISH command finalizes the guest launch and destroy's the cryptographic
+LAUNCH_FINISH finalizes the guest launch and destroys the cryptographic
 context.
 
 See SEV KM API Spec [1] 'Launching a guest' usage flow (Appendix A) for the
@@ -76,12 +97,28 @@ To launch a SEV guest
     -machine ...,confidential-guest-support=sev0 \
     -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1
 
+To launch a SEV-ES guest
+
+# ${QEMU} \
+    -machine ...,confidential-guest-support=sev0 \
+    -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1,policy=0x5
+
+A SEV-ES guest has some restrictions as compared to a SEV guest. Because the
+guest register state is encrypted and cannot be updated by the VMM/hypervisor,
+a SEV-ES guest:
+ - Does not support SMM - SMM support requires updating the guest register
+   state.
+ - Does not support reboot - a system reset requires updating the guest
+   register state.
+ - Requires in-kernel irqchip - the burden is placed on the hypervisor to
+   manage booting APs.
+
 Debugging
 -----------
-Since memory contents of SEV guest is encrypted hence hypervisor access to the
-guest memory will get a cipher text. If guest policy allows debugging, then
-hypervisor can use DEBUG_DECRYPT and DEBUG_ENCRYPT commands access the guest
-memory region for debug purposes. This is not supported in QEMU yet.
+Since the memory contents of a SEV guest are encrypted, hypervisor access to
+the guest memory will return cipher text. If the guest policy allows debugging,
+then a hypervisor can use the DEBUG_DECRYPT and DEBUG_ENCRYPT commands to access
+the guest memory region for debug purposes. This is not supported in QEMU yet.
 
 Snapshot/Restore
 -----------------
@@ -102,8 +139,10 @@ Secure Encrypted Virtualization Key Management:
 
 KVM Forum slides:
 http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
+https://www.linux-kvm.org/images/9/94/Extending-Secure-Encrypted-Virtualization-with-SEV-ES-Thomas-Lendacky-AMD.pdf
 
 AMD64 Architecture Programmer's Manual:
 http://support.amd.com/TechDocs/24593.pdf
 SME is section 7.10
 SEV is section 15.34
+SEV-ES is section 15.35
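To make the command flow above concrete, here is a minimal C sketch of how a
VMM might drive a SEV-ES launch through KVM_MEMORY_ENCRYPT_OP. The command ids
and payload structs are the ones declared in linux/kvm.h; the fd plumbing, the
pinned image buffer, the policy value and the error handling are illustrative
assumptions only, and a real VMM (such as QEMU) would first issue KVM_SEV_INIT
or, for SEV-ES, KVM_SEV_ES_INIT:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Every SEV command is a kvm_sev_cmd issued against the VM fd,
     * carrying the /dev/sev fd and a command-specific payload. */
    static int sev_ioctl(int vm_fd, int sev_fd, uint32_t id, void *data)
    {
        struct kvm_sev_cmd cmd = {
            .id = id,
            .data = (uintptr_t)data,
            .sev_fd = sev_fd,
        };

        return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
    }

    static int sev_es_launch(int vm_fd, int sev_fd, void *image, uint32_t len)
    {
        /* policy=0x5: bit 0 forbids debugging, bit 2 requires SEV-ES */
        struct kvm_sev_launch_start start = { .policy = 0x5 };
        struct kvm_sev_launch_update_data update = {
            .uaddr = (uintptr_t)image,  /* boot image is encrypted in place */
            .len = len,
        };

        if (sev_ioctl(vm_fd, sev_fd, KVM_SEV_LAUNCH_START, &start) < 0 ||
            sev_ioctl(vm_fd, sev_fd, KVM_SEV_LAUNCH_UPDATE_DATA, &update) < 0 ||
            sev_ioctl(vm_fd, sev_fd, KVM_SEV_LAUNCH_UPDATE_VMSA, NULL) < 0) {
            return -1;
        }
        /* LAUNCH_MEASURE would be issued here so the guest owner can verify
         * the attestation before LAUNCH_FINISH seals the launch context. */
        return sev_ioctl(vm_fd, sev_fd, KVM_SEV_LAUNCH_FINISH, NULL);
    }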
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index 791925dcda..977c3893bd 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -44,3 +44,4 @@ Contents:
    block-coroutine-wrapper
    multi-process
    ebpf_rss
+   vfio-migration
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index 0000000000..9ff6163c88
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,150 @@
+=====================
+VFIO device Migration
+=====================
+
+Migration of a virtual machine involves saving the state of each device that
+the guest is running on the source host and restoring this saved state on the
+destination host. This document details how saving and restoring of VFIO
+devices is done in QEMU.
+
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and makes it
+possible to accommodate VFIO devices that have a large amount of data that
+needs to be transferred. The iterative pre-copy phase allows the guest to
+continue running whilst the VFIO device state is transferred to the
+destination; this helps to reduce the total downtime of the VM. VFIO devices
+can choose to skip the pre-copy phase of migration by returning pending_bytes
+as zero during the pre-copy phase.
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment for the ``vfio_device_migration_info`` structure in the header
+file linux-headers/linux/vfio.h.
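For quick reference while reading the hooks below, the structure at the heart
of that UAPI is reproduced here in condensed form (types from <linux/types.h>;
field comments abbreviated, see the kernel header for authoritative
semantics):

    struct vfio_device_migration_info {
        __u32 device_state;      /* _RUNNING/_SAVING/_RESUMING bit flags */
    #define VFIO_DEVICE_STATE_STOP      (0)
    #define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
    #define VFIO_DEVICE_STATE_SAVING    (1 << 1)
    #define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
        __u32 reserved;
        __u64 pending_bytes;     /* data the vendor driver still has to save;
                                    reported as 0 to skip pre-copy */
        __u64 data_offset;       /* offset of the data section within the
                                    migration region */
        __u64 data_size;         /* size of the current data chunk */
    };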
+
+VFIO implements the device hooks for the iterative approach as follows (a
+condensed sketch of the corresponding handler registration follows this
+list):
+
+* A ``save_setup`` function that sets up the migration region and sets the
+  _SAVING flag in the VFIO device state.
+
+* A ``load_setup`` function that sets up the migration region on the
+  destination and sets the _RESUMING flag in the VFIO device state.
+
+* A ``save_live_pending`` function that reads pending_bytes from the vendor
+  driver, which indicates the amount of data that the vendor driver has yet to
+  save for the VFIO device.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver through the migration region during the iterative phase.
+
+* A ``save_state`` function to save the device config space if it is present.
+
+* A ``save_live_complete_precopy`` function that clears the _RUNNING flag from
+  the VFIO device state and iteratively copies the remaining data for the VFIO
+  device until the vendor driver indicates that no data remains (pending bytes
+  is zero).
+
+* A ``load_state`` function that loads the config section and the data
+  sections that are generated by the save functions above.
+
+* ``cleanup`` functions for both save and load that perform any migration
+  related cleanup, including unmapping the migration region.
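These hooks are wired into QEMU's migration core through SaveVMHandlers. The
following is a condensed, illustrative sketch in the style of
hw/vfio/migration.c (handler bodies omitted; exact function names can vary
between QEMU versions):

    static SaveVMHandlers savevm_vfio_handlers = {
        .save_setup = vfio_save_setup,
        .save_cleanup = vfio_save_cleanup,
        .save_live_pending = vfio_save_pending,
        .save_live_iterate = vfio_save_iterate,
        .save_live_complete_precopy = vfio_save_complete_precopy,
        .save_state = vfio_save_state,
        .load_setup = vfio_load_setup,
        .load_cleanup = vfio_load_cleanup,
        .load_state = vfio_load_state,
    };

    /* Each VFIO device registers itself as an iterable migration client;
     * vbasedev is the VFIODevice handed back to the hooks as opaque data. */
    register_savevm_live("vfio", VMSTATE_INSTANCE_ID_ANY, 1,
                         &savevm_vfio_handlers, vbasedev);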
+
+
+The VFIO migration code uses a VM state change handler to change the VFIO
+device state when the VM state changes from running to not-running, and
+vice versa.
+
+Similarly, a migration state change handler is used to trigger a transition of
+the VFIO device state when certain changes of the migration state occur. For
+example, the VFIO device state is transitioned back to _RUNNING in case a
+migration failed or was canceled.
+
+System memory dirty pages tracking
+----------------------------------
+
+The ``log_global_start`` and ``log_global_stop`` memory listener callbacks
+inform the VFIO IOMMU module to start and stop dirty page tracking. A
+``log_sync`` memory listener callback marks as dirty those system memory pages
+that are used for DMA by the VFIO device. The dirty pages bitmap is queried
+per container. All pages pinned by the vendor driver through external APIs
+have to be marked as dirty during migration. When there are CPU writes, CPU
+dirty page tracking can identify dirtied pages, but any page pinned by the
+vendor driver can also be written by the device. There is currently no device
+or IOMMU support for dirty page tracking in hardware.
+
+By default, dirty pages are tracked during the pre-copy as well as the
+stop-and-copy phase. So, a page pinned by the vendor driver will be copied to
+the destination in both phases. Copying dirty pages in the pre-copy phase
+helps QEMU to predict if it can achieve its downtime tolerances. If QEMU keeps
+finding dirty pages continuously during the pre-copy phase, it can infer that
+it is likely to find dirty pages in the stop-and-copy phase as well, and
+predict the downtime accordingly.
+
+QEMU also provides a per-device opt-out option ``pre-copy-dirty-page-tracking``
+which disables querying the dirty bitmap during the pre-copy phase. If it is
+set to off, all dirty pages will be copied to the destination in the
+stop-and-copy phase only.
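As an illustrative sketch of how those listener callbacks are attached, in the
style of hw/vfio/common.c (bodies reduced to comments describing what the real
implementation does):

    static void vfio_listener_log_global_start(MemoryListener *listener)
    {
        /* Ask the VFIO IOMMU module (e.g. via the VFIO_IOMMU_DIRTY_PAGES
         * ioctl with the _START flag) to begin dirty page tracking. */
    }

    static void vfio_listener_log_global_stop(MemoryListener *listener)
    {
        /* Symmetrically stop dirty page tracking. */
    }

    static void vfio_listener_log_sync(MemoryListener *listener,
                                       MemoryRegionSection *section)
    {
        /* Fetch the per-container dirty bitmap for this section and mark
         * the reported pages dirty in QEMU's RAM dirty bitmap. */
    }

    static const MemoryListener vfio_memory_listener = {
        .log_global_start = vfio_listener_log_global_start,
        .log_global_stop = vfio_listener_log_global_stop,
        .log_sync = vfio_listener_log_sync,
    };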
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in the
+pre-copy phase of migration. In that case, the unmap ioctl returns any dirty
+pages in that range and QEMU reports the corresponding guest physical pages
+as dirty. During the stop-and-copy phase, an IOMMU notifier is used to get a
+callback for mapped pages and then the dirty pages bitmap is fetched from the
+VFIO IOMMU module for those mapped ranges.
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the flow of state changes during live migration.
+The values in brackets represent the VM state, the migration state, and
+the VFIO device state, respectively.
+
+Live migration save path
+------------------------
+
+::
+
+                        QEMU normal running state
+                        (RUNNING, _NONE, _RUNNING)
+                                    |
+                   migrate_init spawns migration_thread
+            Migration thread then calls each device's .save_setup()
+                     (RUNNING, _SETUP, _RUNNING|_SAVING)
+                                    |
+                     (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+         If device is active, get pending_bytes by .save_live_pending()
+      If total pending_bytes >= threshold_size, call .save_live_iterate()
+               Data of VFIO device for pre-copy phase is copied
+     Iterate till total pending bytes converge and are less than threshold
+                                    |
+  On migration completion, vCPU stops and calls .save_live_complete_precopy
+  for each active device. The VFIO device is then transitioned into _SAVING
+  state
+                    (FINISH_MIGRATE, _DEVICE, _SAVING)
+                                    |
+      For the VFIO device, iterate in .save_live_complete_precopy until
+                           pending data is 0
+                    (FINISH_MIGRATE, _DEVICE, _STOPPED)
+                                    |
+                  (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+             Migration thread schedules cleanup bottom half and exits
+
+Live migration resume path
+--------------------------
+
+::
+
+            Incoming migration calls .load_setup for each device
+                    (RESTORE_VM, _ACTIVE, _STOPPED)
+                                    |
+    For each device, .load_state is called for that device section data
+                    (RESTORE_VM, _ACTIVE, _RESUMING)
+                                    |
+  At the end, .load_cleanup is called for each device and vCPUs are started
+                        (RUNNING, _NONE, _RUNNING)
+
+Postcopy
+========
+
+Postcopy migration is currently not supported for VFIO devices.
diff --git a/docs/interop/firmware.json b/docs/interop/firmware.json
index 9d94ccafa9..8d8b0be030 100644
--- a/docs/interop/firmware.json
+++ b/docs/interop/firmware.json
@@ -115,6 +115,12 @@
 #     this feature are documented in
 #     "docs/amd-memory-encryption.txt".
 #
+# @amd-sev-es: The firmware supports running under AMD Secure Encrypted
+#              Virtualization - Encrypted State, as specified in the AMD64
+#              Architecture Programmer's Manual. QEMU command line options
+#              related to this feature are documented in
+#              "docs/amd-memory-encryption.txt".
+#
 # @enrolled-keys: The variable store (NVRAM) template associated with
 #                 the firmware binary has the UEFI Secure Boot
 #                 operational mode turned on, with certificates
@@ -179,7 +185,7 @@
 # Since: 3.0
 ##
 { 'enum' : 'FirmwareFeature',
-  'data' : [ 'acpi-s3', 'acpi-s4', 'amd-sev', 'enrolled-keys',
+  'data' : [ 'acpi-s3', 'acpi-s4', 'amd-sev', 'amd-sev-es', 'enrolled-keys',
              'requires-smm', 'secure-boot', 'verbose-dynamic',
              'verbose-static' ] }
@@ -504,6 +510,45 @@
 #     }
 #
 #     {
+#       "description": "OVMF with SEV-ES support",
+#       "interface-types": [
+#         "uefi"
+#       ],
+#       "mapping": {
+#         "device": "flash",
+#         "executable": {
+#           "filename": "/usr/share/OVMF/OVMF_CODE.fd",
+#           "format": "raw"
+#         },
+#         "nvram-template": {
+#           "filename": "/usr/share/OVMF/OVMF_VARS.fd",
+#           "format": "raw"
+#         }
+#       },
+#       "targets": [
+#         {
+#           "architecture": "x86_64",
+#           "machines": [
+#             "pc-q35-*"
+#           ]
+#         }
+#       ],
+#       "features": [
+#         "acpi-s3",
+#         "amd-sev",
+#         "amd-sev-es",
+#         "verbose-dynamic"
+#       ],
+#       "tags": [
+#         "-a X64",
+#         "-p OvmfPkg/OvmfPkgX64.dsc",
+#         "-t GCC48",
+#         "-b DEBUG",
+#         "-D FD_SIZE_4MB"
+#       ]
+#     }
+#
+#     {
 #       "description": "UEFI firmware for ARM64 virtual machines",
 #       "interface-types": [
 #         "uefi"
diff --git a/docs/system/cpu-models-x86-abi.csv b/docs/system/cpu-models-x86-abi.csv
new file mode 100644
index 0000000000..f3f3b60be1
--- /dev/null
+++ b/docs/system/cpu-models-x86-abi.csv
@@ -0,0 +1,67 @@
+Model,baseline,v2,v3,v4
+486-v1,,,,
+Broadwell-v1,✅,✅,✅,
+Broadwell-v2,✅,✅,✅,
+Broadwell-v3,✅,✅,✅,
+Broadwell-v4,✅,✅,✅,
+Cascadelake-Server-v1,✅,✅,✅,✅
+Cascadelake-Server-v2,✅,✅,✅,✅
+Cascadelake-Server-v3,✅,✅,✅,✅
+Cascadelake-Server-v4,✅,✅,✅,✅
+Conroe-v1,✅,,,
+Cooperlake-v1,✅,✅,✅,✅
+Denverton-v1,✅,✅,,
+Denverton-v2,✅,✅,,
+Dhyana-v1,✅,✅,✅,
+EPYC-Milan-v1,✅,✅,✅,
+EPYC-Rome-v1,✅,✅,✅,
+EPYC-Rome-v2,✅,✅,✅,
+EPYC-v1,✅,✅,✅,
+EPYC-v2,✅,✅,✅,
+EPYC-v3,✅,✅,✅,
+Haswell-v1,✅,✅,✅,
+Haswell-v2,✅,✅,✅,
+Haswell-v3,✅,✅,✅,
+Haswell-v4,✅,✅,✅,
+Icelake-Client-v1,✅,✅,✅,
+Icelake-Client-v2,✅,✅,✅,
+Icelake-Server-v1,✅,✅,✅,✅
+Icelake-Server-v2,✅,✅,✅,✅
+Icelake-Server-v3,✅,✅,✅,✅
+Icelake-Server-v4,✅,✅,✅,✅
+IvyBridge-v1,✅,✅,,
+IvyBridge-v2,✅,✅,,
+KnightsMill-v1,✅,✅,✅,
+Nehalem-v1,✅,✅,,
+Nehalem-v2,✅,✅,,
+Opteron_G1-v1,✅,,,
+Opteron_G2-v1,✅,,,
+Opteron_G3-v1,✅,,,
+Opteron_G4-v1,✅,✅,,
+Opteron_G5-v1,✅,✅,,
+Penryn-v1,✅,,,
+SandyBridge-v1,✅,✅,,
+SandyBridge-v2,✅,✅,,
+Skylake-Client-v1,✅,✅,✅,
+Skylake-Client-v2,✅,✅,✅,
+Skylake-Client-v3,✅,✅,✅,
+Skylake-Server-v1,✅,✅,✅,✅
+Skylake-Server-v2,✅,✅,✅,✅
+Skylake-Server-v3,✅,✅,✅,✅
+Skylake-Server-v4,✅,✅,✅,✅
+Snowridge-v1,✅,✅,,
+Snowridge-v2,✅,✅,,
+Westmere-v1,✅,✅,,
+Westmere-v2,✅,✅,,
+athlon-v1,,,,
+core2duo-v1,✅,,,
+coreduo-v1,,,,
+kvm32-v1,,,,
+kvm64-v1,✅,,,
+n270-v1,,,,
+pentium-v1,,,,
+pentium2-v1,,,,
+pentium3-v1,,,,
+phenom-v1,✅,,,
+qemu32-v1,,,,
+qemu64-v1,✅,,,
diff --git a/docs/system/cpu-models-x86.rst.inc b/docs/system/cpu-models-x86.rst.inc
index 867c8216b5..f40ee03ecc 100644
--- a/docs/system/cpu-models-x86.rst.inc
+++ b/docs/system/cpu-models-x86.rst.inc
@@ -39,6 +39,28 @@ CPU, as they would with "Host passthrough", but gives much of the benefit
 of passthrough, while making live migration safe.
 
+ABI compatibility levels for CPU models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The x86_64 architecture has a number of `ABI compatibility levels`_
+defined. Traditionally most operating systems and toolchains would
+only target the original baseline ABI, but it is expected that
+future OSes and toolchains will target newer ABIs. The
+table that follows illustrates which ABI compatibility levels
+can be satisfied by the QEMU CPU models. Note that the table only
+lists the long-term stable CPU model versions (e.g. Haswell-v4).
+In addition to what is listed, there are also many CPU model
+aliases which resolve to a different CPU model version,
+depending on the machine type in use.
+
+.. _ABI compatibility levels: https://gitlab.com/x86-psABIs/x86-64-ABI/
+
+.. csv-table:: x86-64 ABI compatibility levels
+   :file: cpu-models-x86-abi.csv
+   :widths: 40,15,15,15,15
+   :header-rows: 2
+
+
 Preferred CPU models for Intel x86 hosts
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
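To see which column of the table a given configuration actually lands in, a
guest can probe the ABI level at runtime. A minimal sketch, assuming a
compiler new enough (GCC 12 or later) to accept the x86-64 ABI level names in
__builtin_cpu_supports():

    #include <stdio.h>

    int main(void)
    {
        /* Each level implies the previous one; the CSV's "baseline" column
         * corresponds to plain x86-64. */
        printf("x86-64-v2: %s\n",
               __builtin_cpu_supports("x86-64-v2") ? "yes" : "no");
        printf("x86-64-v3: %s\n",
               __builtin_cpu_supports("x86-64-v3") ? "yes" : "no");
        printf("x86-64-v4: %s\n",
               __builtin_cpu_supports("x86-64-v4") ? "yes" : "no");
        return 0;
    }

Booting a guest with a given model (for example -cpu Skylake-Server) and
running this inside it should agree with the corresponding row of the table.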