summary refs log tree commit diff stats
path: root/target/i386/kvm/kvm.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* target/i386: SEV: use KVM_SEV_INIT2 if possiblePaolo Bonzini2024-04-231-0/+2
| | | | | | | | | | | | | | | | | | | | | Implement support for the KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM virtual machine types, and the KVM_SEV_INIT2 function of KVM_MEMORY_ENCRYPT_OP. These replace the KVM_SEV_INIT and KVM_SEV_ES_INIT functions, and have several advantages: - sharing the initialization sequence with SEV-SNP and TDX - allowing arguments including the set of desired VMSA features - protection against invalid use of KVM_GET/SET_* ioctls for guests with encrypted state If the KVM_X86_SEV_VM and KVM_X86_SEV_ES_VM types are not supported, fall back to KVM_SEV_INIT and KVM_SEV_ES_INIT (which use the default x86 VM type). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: Implement mc->kvm_type() to get VM typePaolo Bonzini2024-04-231-0/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM is introducing a new API to create confidential guests, which will be used by TDX and SEV-SNP but is also available for SEV and SEV-ES. The API uses the VM type argument to KVM_CREATE_VM to identify which confidential computing technology to use. Since there are no other expected uses of VM types, delegate mc->kvm_type() for x86 boards to the confidential-guest-support object pointed to by ms->cgs. For example, if a sev-guest object is specified to confidential-guest-support, like, qemu -machine ...,confidential-guest-support=sev0 \ -object sev-guest,id=sev0,... it will check if a VM type KVM_X86_SEV_VM or KVM_X86_SEV_ES_VM is supported, and if so use them together with the KVM_SEV_INIT2 function of the KVM_MEMORY_ENCRYPT_OP ioctl. If not, it will fall back to KVM_SEV_INIT and KVM_SEV_ES_INIT. This is a preparatory work towards TDX and SEV-SNP support, but it will also enable support for VMSA features such as DebugSwap, which are only available via KVM_SEV_INIT2. Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: remove kvm_arch_cpu_check_are_resettablePaolo Bonzini2024-04-231-5/+0
| | | | | | | | | | Board reset requires writing a fresh CPU state. As far as KVM is concerned, the only thing that blocks reset is that CPU state is encrypted; therefore, kvm_cpus_are_resettable() can simply check if that is the case. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/sev: Switch to use confidential_guest_kvm_init()Xiaoyao Li2024-04-231-4/+6
| | | | | | | | | | | | | Use confidential_guest_kvm_init() instead of calling SEV specific sev_kvm_init(). This allows the introduction of multiple confidential-guest-support subclasses for different x86 vendors. As a bonus, stubs are not needed anymore since there is no direct call from target/i386/kvm/kvm.c to SEV code. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20240229060038.606591-1-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/kvm: Move architectural CPUID leaf generation to separate helperSean Christopherson2024-04-231-206/+211
| | | | | | | | | | | | | | Move the architectural (for lack of a better term) CPUID leaf generation to a separate helper so that the generation code can be reused by TDX, which needs to generate a canonical VM-scoped configuration. For now this is just a cleanup, so keep the function static. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240229063726.610065-23-xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* vmbus: Print a warning when enabled without the recommended set of featuresMaciej S. Szmigiero2024-03-081-0/+7
| | | | | | | | | | | | | | | Some Windows versions crash at boot or fail to enable the VMBus device if they don't see the expected set of Hyper-V features (enlightenments). Since this provides poor user experience let's warn user if the VMBus device is enabled without the recommended set of Hyper-V features. The recommended set is the minimum set of Hyper-V features required to make the VMBus device work properly in Windows Server versions 2016, 2019 and 2022. Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
* i386/cpuid: Move leaf 7 to correct groupXiaoyao Li2024-02-161-1/+1
| | | | | | | | | | | | | | | | | | CPUID leaf 7 was grouped together with SGX leaf 0x12 by commit b9edbadefb9e ("i386: Propagate SGX CPUID sub-leafs to KVM") by mistake. SGX leaf 0x12 has its specific logic to check if subleaf (starting from 2) is valid or not by checking the bit 0:3 of corresponding EAX is 1 or not. Leaf 7 follows the logic that EAX of subleaf 0 enumerates the maximum valid subleaf. Fixes: b9edbadefb9e ("i386: Propagate SGX CPUID sub-leafs to KVM") Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20240125024016.2521244-4-xiaoyao.li@intel.com> Cc: qemu-stable@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/cpuid: Remove subleaf constraint on CPUID leaf 1FXiaoyao Li2024-02-161-4/+0
| | | | | | | | | No such constraint that subleaf index needs to be less than 64. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by:Yang Weijiang <weijiang.yang@intel.com> Message-ID: <20240125024016.2521244-3-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/cpuid: Decrease cpuid_i when skipping CPUID leaf 1FXiaoyao Li2024-02-161-0/+1
| | | | | | | | | | | | | | Existing code misses a decrement of cpuid_i when skip leaf 0x1F. There's a blank CPUID entry(with leaf, subleaf as 0, and all fields stuffed 0s) left in the CPUID array. It conflicts with correct CPUID leaf 0. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by:Yang Weijiang <weijiang.yang@intel.com> Message-ID: <20240125024016.2521244-2-xiaoyao.li@intel.com> Cc: qemu-stable@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* system/cpus: rename qemu_mutex_lock_iothread() to bql_lock()Stefan Hajnoczi2024-01-081-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Big QEMU Lock (BQL) has many names and they are confusing. The actual QemuMutex variable is called qemu_global_mutex but it's commonly referred to as the BQL in discussions and some code comments. The locking APIs, however, are called qemu_mutex_lock_iothread() and qemu_mutex_unlock_iothread(). The "iothread" name is historic and comes from when the main thread was split into into KVM vcpu threads and the "iothread" (now called the main loop thread). I have contributed to the confusion myself by introducing a separate --object iothread, a separate concept unrelated to the BQL. The "iothread" name is no longer appropriate for the BQL. Rename the locking APIs to: - void bql_lock(void) - void bql_unlock(void) - bool bql_locked(void) There are more APIs with "iothread" in their names. Subsequent patches will rename them. There are also comments and documentation that will be updated in later patches. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Fabiano Rosas <farosas@suse.de> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Cédric Le Goater <clg@kaod.org> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Harsh Prateek Bora <harshpb@linux.ibm.com> Acked-by: Hyman Huang <yong.huang@smartx.com> Reviewed-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-id: 20240102153529.486531-2-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* i386/sev: Avoid SEV-ES crash due to missing MSR_EFER_LMA bitMichael Roth2023-12-061-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7191f24c7fcf ("accel/kvm/kvm-all: Handle register access errors") added error checking for KVM_SET_SREGS/KVM_SET_SREGS2. In doing so, it exposed a long-running bug in current KVM support for SEV-ES where the kernel assumes that MSR_EFER_LMA will be set explicitly by the guest kernel, in which case EFER write traps would result in KVM eventually seeing MSR_EFER_LMA get set and recording it in such a way that it would be subsequently visible when accessing it via KVM_GET_SREGS/etc. However, guest kernels currently rely on MSR_EFER_LMA getting set automatically when MSR_EFER_LME is set and paging is enabled via CR0_PG_MASK. As a result, the EFER write traps don't actually expose the MSR_EFER_LMA bit, even though it is set internally, and when QEMU subsequently tries to pass this EFER value back to KVM via KVM_SET_SREGS* it will fail various sanity checks and return -EINVAL, which is now considered fatal due to the aforementioned QEMU commit. This can be addressed by inferring the MSR_EFER_LMA bit being set when paging is enabled and MSR_EFER_LME is set, and synthesizing it to ensure the expected bits are all present in subsequent handling on the host side. Ultimately, this handling will be implemented in the host kernel, but to avoid breaking QEMU's SEV-ES support when using older host kernels, the same handling can be done in QEMU just after fetching the register values via KVM_GET_SREGS*. Implement that here. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Akihiko Odaki <akihiko.odaki@daynix.com> Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: Lara Lazier <laramglazier@gmail.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: <kvm@vger.kernel.org> Fixes: 7191f24c7fcf ("accel/kvm/kvm-all: Handle register access errors") Signed-off-by: Michael Roth <michael.roth@amd.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-ID: <20231206155821.1194551-1-michael.roth@amd.com>
* i386/xen: advertise XEN_HVM_CPUID_UPCALL_VECTOR in CPUIDDavid Woodhouse2023-11-071-0/+4
| | | | | | | | This will allow Linux guests (since v6.0) to use the per-vCPU upcall vector delivered as MSI through the local APIC. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* kvm: i8254: require KVM_CAP_PIT2 and KVM_CAP_PIT_STATE2Paolo Bonzini2023-10-251-7/+0
| | | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: i386: require KVM_CAP_SET_IDENTITY_MAP_ADDRPaolo Bonzini2023-10-251-13/+7
| | | | | | This was introduced in KVM in Linux 2.6.32, we can require it unconditionally. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: i386: require KVM_CAP_ADJUST_CLOCKPaolo Bonzini2023-10-251-5/+1
| | | | | | | | | This was introduced in KVM in Linux 2.6.33, we can require it unconditionally. KVM_CLOCK_TSC_STABLE was only added in Linux 4.9, for now do not require it (though it would allow the removal of some pretty yucky code). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: i386: require KVM_CAP_MCEPaolo Bonzini2023-10-251-10/+4
| | | | | | This was introduced in KVM in Linux 2.6.34, we can require it unconditionally. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: i386: require KVM_CAP_SET_VCPU_EVENTS and KVM_CAP_X86_ROBUST_SINGLESTEPPaolo Bonzini2023-10-251-90/+2
| | | | Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: i386: require KVM_CAP_XSAVEPaolo Bonzini2023-10-251-68/+2
| | | | | | | | This was introduced in KVM in Linux 2.6.36, and could already be used at the time to save/restore FPU data even on older processor. We can require it unconditionally and stop using KVM_GET/SET_FPU. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: i386: require KVM_CAP_DEBUGREGSPaolo Bonzini2023-10-251-8/+1
| | | | | | This was introduced in KVM in Linux 2.6.35, we can require it unconditionally. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: i386: move KVM_CAP_IRQ_ROUTING detection to kvm_arch_required_capabilitiesPaolo Bonzini2023-10-251-5/+1
| | | | | | Simple code cleanup. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: require KVM_CAP_SIGNAL_MSIPaolo Bonzini2023-10-251-0/+1
| | | | | | | | | | | | | This was introduced in KVM in Linux 3.5, we can require it unconditionally in kvm_irqchip_send_msi(). However, not all architectures have to implement it so check it only in x86, the only architecture that ever had MSI injection but not KVM_CAP_SIGNAL_MSI. ARM uses it to detect the presence of the ITS emulation in the kernel, introduced in Linux 4.8. Assume that it's there and possibly fail when realizing the arm-its-kvm device. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* migration: simplify blockersSteve Sistare2023-10-201-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | Modify migrate_add_blocker and migrate_del_blocker to take an Error ** reason. This allows migration to own the Error object, so that if an error occurs in migrate_add_blocker, migration code can free the Error and clear the client handle, simplifying client code. It also simplifies the migrate_del_blocker call site. In addition, this is a pre-requisite for a proposed future patch that would add a mode argument to migration requests to support live update, and maintain a list of blockers for each mode. A blocker may apply to a single mode or to multiple modes, and passing Error** will allow one Error object to be registered for multiple modes. No functional change. Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Tested-by: Michael Galaxy <mgalaxy@akamai.com> Reviewed-by: Michael Galaxy <mgalaxy@akamai.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Juan Quintela <quintela@redhat.com> Signed-off-by: Juan Quintela <quintela@redhat.com> Message-ID: <1697634216-84215-1-git-send-email-steven.sistare@oracle.com>
* target/i386/cpu: Fix CPUID_HT exposureXiaoyao Li2023-10-171-0/+2
| | | | | | | | | | | | | | | | When explicitly booting a multiple vcpus vm with "-cpu +ht", it gets warning of warning: host doesn't support requested feature: CPUID.01H:EDX.ht [bit 28] Make CPUID_HT as supported unconditionally can resolve the warning. However it introduces another issue that it also expose CPUID_HT to guest when "-cpu host/max" with only 1 vcpu. To fix this, need mark CPUID_HT as the no_autoenable_flags. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20231010060539.210258-1-xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: eliminate shadowed local variablesPaolo Bonzini2023-09-261-6/+1
| | | | | | These are harmless are they die immediately after their use. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386: spelling fixesMichael Tokarev2023-09-201-2/+2
| | | | | Signed-off-by: Michael Tokarev <mjt@tls.msk.ru> Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
* sysemu/kvm: Restrict kvm_has_pit_state2() to x86 targetsPhilippe Mathieu-Daudé2023-09-071-2/+2
| | | | | | | | | | kvm_has_pit_state2() is only defined for x86 targets (in target/i386/kvm/kvm.c). Its declaration is pointless on all other targets. Have it return a boolean. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20230904124325.79040-13-philmd@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* sysemu/kvm: Use vaddr for kvm_arch_[insert|remove]_hw_breakpointAnton Johansson2023-08-241-5/+3
| | | | | | | | | | | | Changes the signature of the target-defined functions for inserting/removing kvm hw breakpoints. The address and length arguments are now of vaddr type, which both matches the type used internally in accel/kvm/kvm-all.c and makes the api target-agnostic. Signed-off-by: Anton Johansson <anjo@rev.ng> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Message-Id: <20230807155706.9580-4-anjo@rev.ng> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
* kvm: Introduce kvm_arch_get_default_type hookAkihiko Odaki2023-08-221-0/+5
| | | | | | | | | | | | | | | | | | | kvm_arch_get_default_type() returns the default KVM type. This hook is particularly useful to derive a KVM type that is valid for "none" machine model, which is used by libvirt to probe the availability of KVM. For MIPS, the existing mips_kvm_type() is reused. This function ensures the availability of VZ which is mandatory to use KVM on the current QEMU. Cc: qemu-stable@nongnu.org Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Message-id: 20230727073134.134102-2-akihiko.odaki@daynix.com Reviewed-by: Peter Maydell <peter.maydell@linaro.org> [PMM: added doc comment for new function] Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
* target/i386: Allow MCDT_NO if host supportsTao Su2023-07-071-0/+4
| | | | | | | | | | | | MCDT_NO bit indicates HW contains the security fix and doesn't need to be mitigated to avoid data-dependent behaviour for certain instructions. It needs no hypervisor support. Treat it as supported regardless of what KVM reports. Signed-off-by: Tao Su <tao1.su@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20230706054949.66556-4-tao1.su@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Fix build without CONFIG_XEN_EMUMiroslav Rezanina2023-03-151-0/+2
| | | | | | | | | | | | | Upstream commit ddf0fd9ae1 "hw/xen: Support HVM_PARAM_CALLBACK_TYPE_GSI callback" added kvm_xen_maybe_deassert_callback usage to target/i386/kvm/kvm.c file without conditional preprocessing check. This breaks any build not using CONFIG_XEN_EMU. Protect call by conditional preprocessing to allow build without CONFIG_XEN_EMU. Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230308130557.2420-1-mrezanin@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/i386: Add xen-evtchn-max-pirq propertyDavid Woodhouse2023-03-011-0/+34
| | | | | | | | The default number of PIRQs is set to 256 to avoid issues with 32-bit MSI devices. Allow it to be increased if the user desires. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* hw/xen: Support MSI mapping to PIRQDavid Woodhouse2023-03-011-2/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The way that Xen handles MSI PIRQs is kind of awful. There is a special MSI message which targets a PIRQ. The vector in the low bits of data must be zero. The low 8 bits of the PIRQ# are in the destination ID field, the extended destination ID field is unused, and instead the high bits of the PIRQ# are in the high 32 bits of the address. Using the high bits of the address means that we can't intercept and translate these messages in kvm_send_msi(), because they won't be caught by the APIC — addresses like 0x1000fee46000 aren't in the APIC's range. So we catch them in pci_msi_trigger() instead, and deliver the event channel directly. That isn't even the worst part. The worst part is that Xen snoops on writes to devices' MSI vectors while they are *masked*. When a MSI message is written which looks like it targets a PIRQ, it remembers the device and vector for later. When the guest makes a hypercall to bind that PIRQ# (snooped from a marked MSI vector) to an event channel port, Xen *unmasks* that MSI vector on the device. Xen guests using PIRQ delivery of MSI don't ever actually unmask the MSI for themselves. Now that this is working we can finally enable XENFEAT_hvm_pirqs and let the guest use it all. Tested with passthrough igb and emulated e1000e + AHCI. CPU0 CPU1 0: 65 0 IO-APIC 2-edge timer 1: 0 14 xen-pirq 1-ioapic-edge i8042 4: 0 846 xen-pirq 4-ioapic-edge ttyS0 8: 1 0 xen-pirq 8-ioapic-edge rtc0 9: 0 0 xen-pirq 9-ioapic-level acpi 12: 257 0 xen-pirq 12-ioapic-edge i8042 24: 9600 0 xen-percpu -virq timer0 25: 2758 0 xen-percpu -ipi resched0 26: 0 0 xen-percpu -ipi callfunc0 27: 0 0 xen-percpu -virq debug0 28: 1526 0 xen-percpu -ipi callfuncsingle0 29: 0 0 xen-percpu -ipi spinlock0 30: 0 8608 xen-percpu -virq timer1 31: 0 874 xen-percpu -ipi resched1 32: 0 0 xen-percpu -ipi callfunc1 33: 0 0 xen-percpu -virq debug1 34: 0 1617 xen-percpu -ipi callfuncsingle1 35: 0 0 xen-percpu -ipi spinlock1 36: 8 0 xen-dyn -event xenbus 37: 0 6046 xen-pirq -msi ahci[0000:00:03.0] 38: 1 0 xen-pirq -msi-x ens4 39: 0 73 xen-pirq -msi-x ens4-rx-0 40: 14 0 xen-pirq -msi-x ens4-rx-1 41: 0 32 xen-pirq -msi-x ens4-tx-0 42: 47 0 xen-pirq -msi-x ens4-tx-1 Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* kvm/i386: Add xen-gnttab-max-frames propertyDavid Woodhouse2023-03-011-0/+34
| | | | | Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* hw/xen: Support HVM_PARAM_CALLBACK_TYPE_GSI callbackDavid Woodhouse2023-03-011-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The GSI callback (and later PCI_INTX) is a level triggered interrupt. It is asserted when an event channel is delivered to vCPU0, and is supposed to be cleared when the vcpu_info->evtchn_upcall_pending field for vCPU0 is cleared again. Thankfully, Xen does *not* assert the GSI if the guest sets its own evtchn_upcall_pending field; we only need to assert the GSI when we have delivered an event for ourselves. So that's the easy part, kind of. There's a slight complexity in that we need to hold the BQL before we can call qemu_set_irq(), and we definitely can't do that while holding our own port_lock (because we'll need to take that from the qemu-side functions that the PV backend drivers will call). So if we end up wanting to set the IRQ in a context where we *don't* already hold the BQL, defer to a BH. However, we *do* need to poll for the evtchn_upcall_pending flag being cleared. In an ideal world we would poll that when the EOI happens on the PIC/IOAPIC. That's how it works in the kernel with the VFIO eventfd pairs — one is used to trigger the interrupt, and the other works in the other direction to 'resample' on EOI, and trigger the first eventfd again if the line is still active. However, QEMU doesn't seem to do that. Even VFIO level interrupts seem to be supported by temporarily unmapping the device's BARs from the guest when an interrupt happens, then trapping *all* MMIO to the device and sending the 'resample' event on *every* MMIO access until the IRQ is cleared! Maybe in future we'll plumb the 'resample' concept through QEMU's irq framework but for now we'll do what Xen itself does: just check the flag on every vmexit if the upcall GSI is known to be asserted. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* i386/xen: handle VCPUOP_register_vcpu_infoJoao Martins2023-03-011-0/+17
| | | | | | | | | | | | | | | | Handle the hypercall to set a per vcpu info, and also wire up the default vcpu_info in the shared_info page for the first 32 vCPUs. To avoid deadlock within KVM a vCPU thread must set its *own* vcpu_info rather than it being set from the context in which the hypercall is invoked. Add the vcpu_info (and default) GPA to the vmstate_x86_cpu for migration, and restore it in kvm_arch_put_registers() appropriately. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* i386/xen: handle guest hypercallsJoao Martins2023-03-011-0/+5
| | | | | | | | | | | This means handling the new exit reason for Xen but still crashing on purpose. As we implement each of the hypercalls we will then return the right return code. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> [dwmw2: Add CPL to hypercall tracing, disallow hypercalls from CPL > 0] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* i386/kvm: Set Xen vCPU ID in KVMDavid Woodhouse2023-03-011-0/+5
| | | | | | | | | | | | | | | | | | | | | | | There are (at least) three different vCPU ID number spaces. One is the internal KVM vCPU index, based purely on which vCPU was chronologically created in the kernel first. If userspace threads are all spawned and create their KVM vCPUs in essentially random order, then the KVM indices are basically random too. The second number space is the APIC ID space, which is consistent and useful for referencing vCPUs. MSIs will specify the target vCPU using the APIC ID, for example, and the KVM Xen APIs also take an APIC ID from userspace whenever a vCPU needs to be specified (as opposed to just using the appropriate vCPU fd). The third number space is not normally relevant to the kernel, and is the ACPI/MADT/Xen CPU number which corresponds to cs->cpu_index. But Xen timer hypercalls use it, and Xen timer hypercalls *really* want to be accelerated in the kernel rather than handled in userspace, so the kernel needs to be told. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* i386/kvm: handle Xen HVM cpuid leavesJoao Martins2023-03-011-2/+75
| | | | | | | | | | | Introduce support for emulating CPUID for Xen HVM guests. It doesn't make sense to advertise the KVM leaves to a Xen guest, so do Xen unconditionally when the xen-version machine property is set. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> [dwmw2: Obtain xen_version from KVM property, make it automatic] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* i386/kvm: Add xen-version KVM accelerator property and init KVM Xen supportDavid Woodhouse2023-03-011-0/+59
| | | | | | | | | | | | | This just initializes the basic Xen support in KVM for now. Only permitted on TYPE_PC_MACHINE because that's where the sysbus devices for Xen heap overlay, event channel, grant tables and other stuff will exist. There's no point having the basic hypercall support if nothing else works. Provide sysemu/kvm_xen.h and a kvm_xen_get_caps() which will be used later by support devices. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org>
* target/i386: KVM: allow fast string operations if host supports themPaolo Bonzini2023-02-271-1/+16
| | | | | | | | | | | These are just a flag that documents the performance characteristic of an instruction; it needs no hypervisor support. So include them even if KVM does not show them. In particular, FZRM/FSRS/FSRC have only been added very recently, but they are available on Sapphire Rapids processors. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* include/block: Untangle inclusion loopsMarkus Armbruster2023-01-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | We have two inclusion loops: block/block.h -> block/block-global-state.h -> block/block-common.h -> block/blockjob.h -> block/block.h block/block.h -> block/block-io.h -> block/block-common.h -> block/blockjob.h -> block/block.h I believe these go back to Emanuele's reorganization of the block API, merged a few months ago in commit d7e2fe4aac8. Fortunately, breaking them is merely a matter of deleting unnecessary includes from headers, and adding them back in places where they are now missing. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20221221133551.3967339-2-armbru@redhat.com>
* qapi: Use returned bool to check for failure (again)Markus Armbruster2022-12-141-4/+1
| | | | | | | | | | | | | | | | | | | | | Commit 012d4c96e2 changed the visitor functions taking Error ** to return bool instead of void, and the commits following it used the new return value to simplify error checking. Since then a few more uses in need of the same treatment crept in. Do that. All pretty mechanical except for * balloon_stats_get_all() This is basically the same transformation commit 012d4c96e2 applied to the virtual walk example in include/qapi/visitor.h. * set_max_queue_size() Additionally replace "goto end of function" by return. Signed-off-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20221121085054.683122-10-armbru@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
* target/i386: Set maximum APIC ID to KVM prior to vCPU creationZeng Guang2022-10-311-0/+5
| | | | | | | | | | | | | | | | | | Specify maximum possible APIC ID assigned for current VM session to KVM prior to the creation of vCPUs. By this setting, KVM can set up VM-scoped data structure indexed by the APIC ID, e.g. Posted-Interrupt Descriptor pointer table to support Intel IPI virtualization, with the most optimal memory footprint. It can be achieved by calling KVM_ENABLE_CAP for KVM_CAP_MAX_VCPU_ID capability once KVM has enabled it. Ignoring the return error if KVM doesn't support this capability yet. Signed-off-by: Zeng Guang <guang.zeng@intel.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <20220825025246.26618-1-guang.zeng@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Drop useless casts from g_malloc() & friends to pointerMarkus Armbruster2022-10-221-2/+1
| | | | | | | | | | | | These memory allocation functions return void *, and casting to another pointer type is useless clutter. Drop these casts. If you really want another pointer type, consider g_new(). Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Laurent Vivier <laurent@vivier.eu> Message-Id: <20220923120025.448759-3-armbru@redhat.com> Signed-off-by: Laurent Vivier <laurent@vivier.eu>
* hyperv: fix SynIC SINT assertion failure on guest resetMaciej S. Szmigiero2022-10-181-7/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Resetting a guest that has Hyper-V VMBus support enabled triggers a QEMU assertion failure: hw/hyperv/hyperv.c:131: synic_reset: Assertion `QLIST_EMPTY(&synic->sint_routes)' failed. This happens both on normal guest reboot or when using "system_reset" HMP command. The failing assertion was introduced by commit 64ddecc88bcf ("hyperv: SControl is optional to enable SynIc") to catch dangling SINT routes on SynIC reset. The root cause of this problem is that the SynIC itself is reset before devices using SINT routes have chance to clean up these routes. Since there seems to be no existing mechanism to force reset callbacks (or methods) to be executed in specific order let's use a similar method that is already used to reset another interrupt controller (APIC) after devices have been reset - by invoking the SynIC reset from the machine reset handler via a new x86_cpu_after_reset() function co-located with the existing x86_cpu_reset() in target/i386/cpu.c. Opportunistically move the APIC reset handler there, too. Fixes: 64ddecc88bcf ("hyperv: SControl is optional to enable SynIc") # exposed the bug Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <cb57cee2e29b20d06f81dce054cbcea8b5d497e8.1664552976.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: Implement MSR_CORE_THREAD_COUNT MSRAlexander Graf2022-10-111-0/+21
| | | | | | | | | | | | | | The MSR_CORE_THREAD_COUNT MSR describes CPU package topology, such as number of threads and cores for a given package. This is information that QEMU has readily available and can provide through the new user space MSR deflection interface. This patch propagates the existing hvf logic from patch 027ac0cb516 ("target/i386/hvf: add rdmsr 35H MSR_CORE_THREAD_COUNT") to KVM. Signed-off-by: Alexander Graf <agraf@csgraf.de> Message-Id: <20221004225643.65036-4-agraf@csgraf.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386: kvm: Add support for MSR filteringAlexander Graf2022-10-111-0/+123
| | | | | | | | | | | | | | KVM has grown support to deflect arbitrary MSRs to user space since Linux 5.10. For now we don't expect to make a lot of use of this feature, so let's expose it the easiest way possible: With up to 16 individually maskable MSRs. This patch adds a kvm_filter_msr() function that other code can call to install a hook on KVM MSR reads or writes. Signed-off-by: Alexander Graf <agraf@csgraf.de> Message-Id: <20221004225643.65036-3-agraf@csgraf.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386: add notify VM exit supportChenyi Qiang2022-10-111-0/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are cases that malicious virtual machine can cause CPU stuck (due to event windows don't open up), e.g., infinite loop in microcode when nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and IRQ) can be delivered. It leads the CPU to be unavailable to host or other VMs. Notify VM exit is introduced to mitigate such kind of attacks, which will generate a VM exit if no event window occurs in VM non-root mode for a specified amount of time (notify window). A new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT is exposed to user space so that the user can query the capability and set the expected notify window when creating VMs. The format of the argument when enabling this capability is as follows: Bit 63:32 - notify window specified in qemu command Bit 31:0 - some flags (e.g. KVM_X86_NOTIFY_VMEXIT_ENABLED is set to enable the feature.) Users can configure the feature by a new (x86 only) accel property: qemu -accel kvm,notify-vmexit=run|internal-error|disable,notify-window=n The default option of notify-vmexit is run, which will enable the capability and do nothing if the exit happens. The internal-error option raises a KVM internal error if it happens. The disable option does not enable the capability. The default value of notify-window is 0. It is valid only when notify-vmexit is not disabled. The valid range of notify-window is non-negative. It is even safe to set it to zero since there's an internal hardware threshold to be added to ensure no false positive. Because a notify VM exit may happen with VM_CONTEXT_INVALID set in exit qualification (no cases are anticipated that would set this bit), which means VM context is corrupted. It would be reflected in the flags of KVM_EXIT_NOTIFY exit. If KVM_NOTIFY_CONTEXT_INVALID bit is set, raise a KVM internal error unconditionally. Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20220929072014.20705-5-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: allow target-specific accelerator propertiesPaolo Bonzini2022-10-101-0/+4
| | | | | | | | | | | | | Several hypervisor capabilities in KVM are target-specific. When exposed to QEMU users as accelerator properties (i.e. -accel kvm,prop=value), they should not be available for all targets. Add a hook for targets to add their own properties to -accel kvm, for now no such property is defined. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220929072014.20705-3-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386: kvm: extend kvm_{get, put}_vcpu_events to support pending triple faultChenyi Qiang2022-10-101-0/+20
| | | | | | | | | | | | | | | | For the direct triple faults, i.e. hardware detected and KVM morphed to VM-Exit, KVM will never lose them. But for triple faults sythesized by KVM, e.g. the RSM path, if KVM exits to userspace before the request is serviced, userspace could migrate the VM and lose the triple fault. A new flag KVM_VCPUEVENT_VALID_TRIPLE_FAULT is defined to signal that the event.triple_fault_pending field contains a valid state if the KVM_CAP_X86_TRIPLE_FAULT_EVENT capability is enabled. Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20220929072014.20705-2-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>