summary refs log tree commit diff stats
path: root/target/i386/kvm/kvm.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* i386/kvm: Drop KVM_CAP_X86_SMM check in kvm_arch_init()Xiaoyao Li2025-09-171-2/+1
| | | | | | | | | | | x86_machine_is_smm_enabled() checks the KVM_CAP_X86_SMM for KVM case. No need to check KVM_CAP_X86_SMM in kvm_arch_init(). So just drop the check of KVM_CAP_X86_SMM to simplify the code. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250729062014.1669578-3-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: Define enum X86ASIdx for x86's address spacesXiaoyao Li2025-09-171-2/+2
| | | | | | | | | | | | Define X86ASIdx as enum, like ARM's ARMASIdx, so that it's clear index 0 is for memory and index 1 is for SMM. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Tested-By: Kirill Martynov <stdcalllevi@yandex-team.ru> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250730095253.1833411-3-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/cpu: Enable SMM cpu address space under KVMXiaoyao Li2025-09-171-0/+5
| | | | | | | | | | | | | | | | | | | | | | Kirill Martynov reported assertation in cpu_asidx_from_attrs() being hit when x86_cpu_dump_state() is called to dump the CPU state[*]. It happens when the CPU is in SMM and KVM emulation failure due to misbehaving guest. The root cause is that QEMU i386 never enables the SMM address space for cpu since KVM SMM support has been added. Enable the SMM cpu address space under KVM when the SMM is enabled for the x86machine. [*] https://lore.kernel.org/qemu-devel/20250523154431.506993-1-stdcalllevi@yandex-team.ru/ Reported-by: Kirill Martynov <stdcalllevi@yandex-team.ru> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Kirill Martynov <stdcalllevi@yandex-team.ru> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250730095253.1833411-2-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* accel: use atomic accesses for exit_requestPaolo Bonzini2025-09-171-3/+3
| | | | | | | | | | | | | | | CPU threads write exit_request as a "note to self" that they need to go out to a slow path. This write happens out of the BQL and can be a data race with another threads' cpu_exit(); use atomic accesses consistently. While at it, change the source argument from int ("1") to bool ("true"). Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* treewide: clear bits of cs->interrupt_request with cpu_reset_interrupt()Paolo Bonzini2025-09-171-7/+7
| | | | | | | | | | | Open coding cpu_reset_interrupt() can cause bugs if the BQL is not taken, for example i386 has the call chain kvm_cpu_exec() -> kvm_put_vcpu_events() -> kvm_arch_put_registers(). Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: i386: irqchip: take BQL only if there is an interruptIgor Mammedov2025-08-291-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when kernel-irqchip=split is used, QEMU still hits BQL contention issue when reading ACPI PM/HPET timers (despite of timer[s] access being lock-less). So Windows with more than 255 cpus is still not able to boot (since it requires iommu -> split irqchip). Problematic path is in kvm_arch_pre_run() where BQL is taken unconditionally when split irqchip is in use. There are a few parts that BQL protects there: 1. interrupt check and injecting however we do not take BQL when checking for pending interrupt (even within the same function), so the patch takes the same approach for cpu->interrupt_request checks and takes BQL only if there is a job to do. 2. request_interrupt_window access CPUState::kvm_run::request_interrupt_window doesn't need BQL as it's accessed by its own vCPU thread. 3. cr8/cpu_get_apic_tpr access the same (as #2) applies to CPUState::kvm_run::cr8, and APIC registers are also cached/synced (get/put) within the vCPU thread it belongs to. Taking BQL only when is necessary, eleminates BQL bottleneck on IO/MMIO only exit path, improoving latency by 80% on HPET micro benchmark. This lets Windows to boot succesfully (in case hv-time isn't used) when more than 255 vCPUs are in use. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20250814160600.2327672-8-imammedo@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* add cpu_test_interrupt()/cpu_set_interrupt() helpers and use them tree wideIgor Mammedov2025-08-291-17/+17
| | | | | | | | | | | | | | | | The helpers form load-acquire/store-release pair and ensure that appropriate barriers are in place in case checks happen outside of BQL. Use them to replace open-coded checkers/setters across the code, to make sure that barriers are not missed. Helpers also make code a bit more readable. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason J. Herne <jjherne@linux.ibm.com> Link: https://lore.kernel.org/r/20250821155603.2422553-1-imammedo@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: do not expose ARCH_CAPABILITIES on AMD CPUPaolo Bonzini2025-07-171-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM emulates the ARCH_CAPABILITIES on x86 for both Intel and AMD cpus, although the IA32_ARCH_CAPABILITIES MSR is an Intel-specific MSR and it makes no sense to emulate it on AMD. As a consequence, VMs created on AMD with qemu -cpu host and using KVM will advertise the ARCH_CAPABILITIES feature and provide the IA32_ARCH_CAPABILITIES MSR. This can cause issues (like Windows BSOD) as the guest OS might not expect this MSR to exist on such cpus (the AMD documentation specifies that ARCH_CAPABILITIES feature and MSR are not defined on the AMD architecture). A fix was proposed in KVM code, however KVM maintainers don't want to change this behavior that exists for 6+ years and suggest changes to be done in QEMU instead. Therefore, hide the bit from "-cpu host": migration of -cpu host guests is only possible between identical host kernel and QEMU versions, therefore this is not a problematic breakage. If a future AMD machine does include the MSR, that would re-expose the Windows guest bug; but it would not be KVM/QEMU's problem at that point, as we'd be following a genuine physical CPU impl. Reported-by: Alexandre Chartre <alexandre.chartre@oracle.com> Suggested-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/cpu: Unify family, model and stepping calculation for x86 CPUXiaoyao Li2025-07-121-1/+1
| | | | | | | | | | | | | | | | | | | | There are multiple places where CPUID family/model/stepping info are retrieved from env->cpuid_version. Besides, the calculation of family and model inside host_cpu_vendor_fms() doesn't comply to what Intel and AMD define. For family, both Intel and AMD define that Extended Family ID needs to be counted only when (base) Family is 0xF. For model, Intel counts Extended Model when (base) Family is 0x6 or 0xF, while AMD counts EXtended MOdel when (base) Family is 0xF. Introduce generic helper functions to get family, model and stepping from the EAX value of CPUID leaf 1, with the correct calculation formula. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250630080610.3151956-5-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: handle TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPTXiaoyao Li2025-07-121-0/+3
| | | | | | | | | | | | Record the interrupt vector and the apic id of the vcpu that calls TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT. Inject the interrupt to TD guest to notify the completion of <GetQuote> when notify interrupt vector is valid. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250703024021.3559286-5-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: handle TDG.VP.VMCALL<GetQuote>Isaku Yamahata2025-06-201-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add property "quote-generation-socket" to tdx-guest, which is a property of type SocketAddress to specify Quote Generation Service(QGS). On request of GetQuote, it connects to the QGS socket, read request data from shared guest memory, send the request data to the QGS, and store the response into shared guest memory, at last notify TD guest by interrupt. command line example: qemu-system-x86_64 \ -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type":"unix", "path":"/var/run/tdx-qgs/qgs.socket"}}' \ -machine confidential-guest-support=tdx0 Note, above example uses the unix socket. It can be other types, like vsock, which depends on the implementation of QGS. To avoid no response from QGS server, setup a timer for the transaction. If timeout, make it an error and interrupt guest. Define the threshold of time to 30s at present, maybe change to other value if not appropriate. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: handle TDG.VP.VMCALL<GetTdVmCallInfo>Binbin Wu2025-06-201-0/+12
| | | | | Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/kvm: Prefault memory on page state changeTom Lendacky2025-06-061-5/+26
| | | | | | | | | | | | | | | | | | | | A page state change is typically followed by an access of the page(s) and results in another VMEXIT in order to map the page into the nested page table. Depending on the size of page state change request, this can generate a number of additional VMEXITs. For example, under SNP, when Linux is utilizing lazy memory acceptance, memory is typically accepted in 4M chunks. A page state change request is submitted to mark the pages as private, followed by validation of the memory. Since the guest_memfd currently only supports 4K pages, each page validation will result in VMEXIT to map the page, resulting in 1024 additional exits. When performing a page state change, invoke KVM_PRE_FAULT_MEMORY for the size of the page state change in order to pre-map the pages and avoid the additional VMEXITs. This helps speed up boot times. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/f5411c42340bd2f5c14972551edb4e959995e42b.1743193824.git.thomas.lendacky@amd.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/cgs: Introduce x86_confidential_guest_check_features()Xiaoyao Li2025-05-281-0/+8
| | | | | | | | | | | To do cgs specific feature checking. Note the feature checking in x86_cpu_filter_features() is valid for non-cgs VMs. For cgs VMs like TDX, what features can be supported has more restrictions. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-51-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Implement adjust_cpuid_features() for TDXXiaoyao Li2025-05-281-1/+1
| | | | | | | | | | | | | | | | | | | | | Maintain a TDX specific supported CPUID set, and use it to mask the common supported CPUID value of KVM. It can avoid newly added supported features (reported via KVM_GET_SUPPORTED_CPUID) for common VMs being falsely reported as supported for TDX. As the first step, initialize the TDX supported CPUID set with all the configurable CPUID bits. It's not complete because there are other CPUID bits are supported for TDX but not reported as directly configurable. E.g. the XFAM related bits, attribute related bits and fixed-1 bits. They will be handled in the future. Also, what matters are the CPUID bits related to QEMU's feature word. Only mask the CPUID leafs which are feature word leaf. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-45-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/cgs: Rename *mask_cpuid_features() to *adjust_cpuid_features()Xiaoyao Li2025-05-281-1/+1
| | | | | | | | | | | Because for TDX case, there are also fixed-1 bits that enforced by TDX module. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-44-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Only configure MSR_IA32_UCODE_REV in kvm_init_msrs() for TDsXiaoyao Li2025-05-281-19/+21
| | | | | | | | | | | | | | For TDs, only MSR_IA32_UCODE_REV in kvm_init_msrs() can be configured by VMM, while the features enumerated/controlled by other MSRs except MSR_IA32_UCODE_REV in kvm_init_msrs() are not under control of VMM. Only configure MSR_IA32_UCODE_REV for TDs. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-41-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Don't synchronize guest tsc for TDsIsaku Yamahata2025-05-281-1/+1
| | | | | | | | | | | | | | TSC of TDs is not accessible and KVM doesn't allow access of MSR_IA32_TSC for TDs. To avoid the assert() in kvm_get_tsc, make kvm_synchronize_all_tsc() noop for TDs, Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Connor Kuehl <ckuehl@redhat.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-40-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/cpu: Introduce enable_cpuid_0x1f to force exposing CPUID 0x1fXiaoyao Li2025-05-281-1/+1
| | | | | | | | | | | | | | | | | | | | | Currently, QEMU exposes CPUID 0x1f to guest only when necessary, i.e., when topology level that cannot be enumerated by leaf 0xB, e.g., die or module level, are configured for the guest, e.g., -smp xx,dies=2. However, TDX architecture forces to require CPUID 0x1f to configure CPU topology. Introduce a bool flag, enable_cpuid_0x1f, in CPU for the case that requires CPUID leaf 0x1f to be exposed to guest. Introduce a new function x86_has_cpuid_0x1f(), which is the wrapper of cpu->enable_cpuid_0x1f and x86_has_extended_topo() to check if it needs to enable cpuid leaf 0x1f for the guest. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-34-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Handle KVM_SYSTEM_EVENT_TDX_FATALXiaoyao Li2025-05-281-0/+10
| | | | | | | | | | | | | | TD guest can use TDG.VP.VMCALL<REPORT_FATAL_ERROR> to request termination. KVM translates such request into KVM_EXIT_SYSTEM_EVENT with type of KVM_SYSTEM_EVENT_TDX_FATAL. Add hanlder for such exit. Parse and print the error message, and terminate the TD guest in the handler. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-29-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Implement user specified tsc frequencyXiaoyao Li2025-05-281-0/+9
| | | | | | | | | | | | | | | Reuse "-cpu,tsc-frequency=" to get user wanted tsc frequency and call VM scope VM_SET_TSC_KHZ to set the tsc frequency of TD before KVM_TDX_INIT_VM. Besides, sanity check the tsc frequency to be in the legal range and legal granularity (required by TDX module). Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-16-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Initialize TDX before creating TD vcpusXiaoyao Li2025-05-281-6/+10
| | | | | | | | | | | | | | | | | | | Invoke KVM_TDX_INIT_VM in kvm_arch_pre_create_vcpu() that KVM_TDX_INIT_VM configures global TD configurations, e.g. the canonical CPUID config, and must be executed prior to creating vCPUs. Use kvm_x86_arch_cpuid() to setup the CPUID settings for TDX VM. Note, this doesn't address the fact that QEMU may change the CPUID configuration when creating vCPUs, i.e. punts on refactoring QEMU to provide a stable CPUID config prior to kvm_arch_init(). Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Acked-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-9-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: Introduce kvm_arch_pre_create_vcpu()Xiaoyao Li2025-05-281-0/+5
| | | | | | | | | | | | | | | Introduce kvm_arch_pre_create_vcpu(), to perform arch-dependent work prior to create any vcpu. This is for i386 TDX because it needs call TDX_INIT_VM before creating any vcpu. The specific implementation for i386 will be added in the future patch. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: Gerd Hoffmann <kraxel@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-8-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIESXiaoyao Li2025-05-281-2/+0
| | | | | | | | | | | | | | | | | KVM provides TDX capabilities via sub command KVM_TDX_CAPABILITIES of IOCTL(KVM_MEMORY_ENCRYPT_OP). Get the capabilities when initializing TDX context. It will be used to validate user's setting later. Since there is no interface reporting how many cpuid configs contains in KVM_TDX_CAPABILITIES, QEMU chooses to try starting with a known number and abort when it exceeds KVM_MAX_CPUID_ENTRIES. Besides, introduce the interfaces to invoke TDX "ioctls" at VCPU scope in preparation. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-6-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Implement tdx_kvm_init() to initialize TDX VM contextXiaoyao Li2025-05-281-10/+1
| | | | | | | | | | | | | | Implement TDX specific ConfidentialGuestSupportClass::kvm_init() callback, tdx_kvm_init(). Mark guest state is proctected for TDX VM. More TDX specific initialization will be added later. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-5-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* i386/tdx: Implement tdx_kvm_type() for TDXXiaoyao Li2025-05-281-0/+1
| | | | | | | | | | | TDX VM requires VM type to be KVM_X86_TDX_VM. Implement tdx_kvm_type() as X86ConfidentialGuestClass->kvm_type. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20250508150002.689633-4-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* exec/cpu-all: remove exec/target_page includePierrick Bouvier2025-04-231-0/+1
| | | | | | Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
* i386/cpu: Extract a common fucntion to setup value of MSR_CORE_THREAD_COUNTXiaoyao Li2025-01-101-4/+1
| | | | | | | | | | There are duplicated code to setup the value of MSR_CORE_THREAD_COUNT. Extract a common function for it. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20241219110125.1266461-2-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Replace ARRAY_SIZE(msr_handlers) with KVM_MSR_FILTER_MAX_RANGESPaolo Bonzini2025-01-101-1/+2
| | | | | | | | | | | | | | kvm_install_msr_filters() uses KVM_MSR_FILTER_MAX_RANGES as the bound when traversing msr_handlers[], while other places still compute the size by ARRAY_SIZE(msr_handlers). In fact, msr_handlers[] is an array with the fixed size KVM_MSR_FILTER_MAX_RANGES, and this has to be true because kvm_install_msr_filters copies from one array to the other. For code consistency, assert that they match and use ARRAY_SIZE(msr_handlers) everywehere. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Clean up error handling in kvm_arch_init()Zhao Liu2025-01-101-9/+16
| | | | | | | | | | | | | | | | | Currently, there're following incorrect error handling cases in kvm_arch_init(): * Missed to handle failure of kvm_get_supported_feature_msrs(). * Missed to return when kvm_vm_enable_disable_exits() fails. * MSR filter related cases called exit() directly instead of returning to kvm_init(). (The caller of kvm_arch_init() - kvm_init() - needs to know if kvm_arch_init() fails in order to perform cleanup). Fix the above cases. Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Zide Chen <zide.chen@intel.com> Link: https://lore.kernel.org/r/20241106030728.553238-11-zhao1.liu@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Return -1 when kvm_msr_energy_thread_init() failsZhao Liu2025-01-101-18/+11
| | | | | | | | | | | | | It is common practice to return a negative value (like -1) to indicate an error, and other functions in kvm_arch_init() follow this style. To avoid confusion (sometimes returned -1 indicates failure, and sometimes -1, in a same function), return -1 when kvm_msr_energy_thread_init() fails. Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20241106030728.553238-10-zhao1.liu@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Clean up return values of MSR filter related functionsZhao Liu2025-01-101-43/+44
| | | | | | | | | | | | | | | | | | | | | Before commit 0cc42e63bb54 ("kvm/i386: refactor kvm_arch_init and split it into smaller functions"), error_report() attempts to print the error code from kvm_filter_msr(). However, printing error code does not work due to kvm_filter_msr() returns bool instead int. 0cc42e63bb54 fixed the error by removing error code printing, but this lost useful error messages. Bring it back by making kvm_filter_msr() return int. This also makes the function call chain processing clearer, allowing for better handling of error result propagation from kvm_filter_msr() to kvm_arch_init(), preparing for the subsequent cleanup work of error handling in kvm_arch_init(). Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Zide Chen <zide.chen@intel.com> Link: https://lore.kernel.org/r/20241106030728.553238-9-zhao1.liu@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Drop workaround for KVM_X86_DISABLE_EXITS_HTL typoZhao Liu2025-01-101-4/+1
| | | | | | | | | | | | | The KVM_X86_DISABLE_EXITS_HTL typo has been fixed in commit 77d361b13c19 ("linux-headers: Update to kernel mainline commit b357bf602"). Drop the related workaround. Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Zide Chen <zide.chen@intel.com> Link: https://lore.kernel.org/r/20241106030728.553238-7-zhao1.liu@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Only save/load kvmclock MSRs when kvmclock enabledZhao Liu2025-01-101-4/+8
| | | | | | | | | | | | | MSR_KVM_SYSTEM_TIME and MSR_KVM_WALL_CLOCK are attached with the (old) kvmclock feature (KVM_FEATURE_CLOCKSOURCE). So, just save/load them only when kvmclock (KVM_FEATURE_CLOCKSOURCE) is enabled. Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Zide Chen <zide.chen@intel.com> Link: https://lore.kernel.org/r/20241106030728.553238-5-zhao1.liu@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Remove local MSR_KVM_WALL_CLOCK and MSR_KVM_SYSTEM_TIME ↵Zhao Liu2025-01-101-3/+0
| | | | | | | | | | | | | | | definitions These 2 MSRs have been already defined in kvm_para.h (standard-headers/ asm-x86/kvm_para.h). Remove QEMU local definitions to avoid duplication. Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Zide Chen <zide.chen@intel.com> Link: https://lore.kernel.org/r/20241106030728.553238-4-zhao1.liu@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Add feature bit definitions for KVM CPUIDZhao Liu2025-01-101-14/+14
| | | | | | | | | | | Add feature definitions for KVM_CPUID_FEATURES in CPUID ( CPUID[4000_0001].EAX and CPUID[4000_0001].EDX), to get rid of lots of offset calculations. Signed-off-by: Zhao Liu <zhao1.liu@intel.com> Reviewed-by: Zide Chen <zide.chen@intel.com> Link: https://lore.kernel.org/r/20241106030728.553238-3-zhao1.liu@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge tag 'exec-20241220' of https://github.com/philmd/qemu into stagingStefan Hajnoczi2024-12-211-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Accel & Exec patch queue - Ignore writes to CNTP_CTL_EL0 on HVF ARM (Alexander) - Add '-d invalid_mem' logging option (Zoltan) - Create QOM containers explicitly (Peter) - Rename sysemu/ -> system/ (Philippe) - Re-orderning of include/exec/ headers (Philippe) Move a lot of declarations from these legacy mixed bag headers: . "exec/cpu-all.h" . "exec/cpu-common.h" . "exec/cpu-defs.h" . "exec/exec-all.h" . "exec/translate-all" to these more specific ones: . "exec/page-protection.h" . "exec/translation-block.h" . "user/cpu_loop.h" . "user/guest-host.h" . "user/page-protection.h" # -----BEGIN PGP SIGNATURE----- # # iQIzBAABCAAdFiEE+qvnXhKRciHc/Wuy4+MsLN6twN4FAmdlnyAACgkQ4+MsLN6t # wN6mBw//QFWi7CrU+bb8KMM53kOU9C507tjn99LLGFb5or73/umDsw6eo/b8DHBt # KIwGLgATel42oojKfNKavtAzLK5rOrywpboPDpa3SNeF1onW+99NGJ52LQUqIX6K # A6bS0fPdGG9ZzEuPpbjDXlp++0yhDcdSgZsS42fEsT7Dyj5gzJYlqpqhiXGqpsn8 # 4Y0UMxSL21K3HEexlzw2hsoOBFA3tUm2ujNDhNkt8QASr85yQVLCypABJnuoe/// # 5Ojl5wTBeDwhANET0rhwHK8eIYaNboiM9fHopJYhvyw1bz6yAu9jQwzF/MrL3s/r # xa4OBHBy5mq2hQV9Shcl3UfCQdk/vDaYaWpgzJGX8stgMGYfnfej1SIl8haJIfcl # VMX8/jEFdYbjhO4AeGRYcBzWjEJymkDJZoiSWp2NuEDi6jqIW+7yW1q0Rnlg9lay # ShAqLK5Pv4zUw3t0Jy3qv9KSW8sbs6PQxtzXjk8p97rTf76BJ2pF8sv1tVzmsidP # 9L92Hv5O34IqzBu2oATOUZYJk89YGmTIUSLkpT7asJZpBLwNM2qLp5jO00WVU0Sd # +kAn324guYPkko/TVnjC/AY7CMu55EOtD9NU35k3mUAnxXT9oDUeL4NlYtfgrJx6 # x1Nzr2FkS68+wlPAFKNSSU5lTjsjNaFM0bIJ4LCNtenJVP+SnRo= # =cjz8 # -----END PGP SIGNATURE----- # gpg: Signature made Fri 20 Dec 2024 11:45:20 EST # gpg: using RSA key FAABE75E12917221DCFD6BB2E3E32C2CDEADC0DE # gpg: Good signature from "Philippe Mathieu-Daudé (F4BUG) <f4bug@amsat.org>" [unknown] # gpg: WARNING: This key is not certified with a trusted signature! # gpg: There is no indication that the signature belongs to the owner. # Primary key fingerprint: FAAB E75E 1291 7221 DCFD 6BB2 E3E3 2C2C DEAD C0DE * tag 'exec-20241220' of https://github.com/philmd/qemu: (59 commits) util/qemu-timer: fix indentation meson: Do not define CONFIG_DEVICES on user emulation system/accel-ops: Remove unnecessary 'exec/cpu-common.h' header system/numa: Remove unnecessary 'exec/cpu-common.h' header hw/xen: Remove unnecessary 'exec/cpu-common.h' header target/mips: Drop left-over comment about Jazz machine target/mips: Remove tswap() calls in semihosting uhi_fstat_cb() target/xtensa: Remove tswap() calls in semihosting simcall() helper accel/tcg: Un-inline translator_is_same_page() accel/tcg: Include missing 'exec/translation-block.h' header accel/tcg: Move tcg_cflags_has/set() to 'exec/translation-block.h' accel/tcg: Restrict curr_cflags() declaration to 'internal-common.h' qemu/coroutine: Include missing 'qemu/atomic.h' header exec/translation-block: Include missing 'qemu/atomic.h' header accel/tcg: Declare cpu_loop_exit_requested() in 'exec/cpu-common.h' exec/cpu-all: Include 'cpu.h' earlier so MMU_USER_IDX is always defined target/sparc: Move sparc_restore_state_to_opc() to cpu.c target/sparc: Uninline cpu_get_tb_cpu_state() target/loongarch: Declare loongarch_cpu_dump_state() locally user: Move various declarations out of 'exec/exec-all.h' ... Conflicts: hw/char/riscv_htif.c hw/intc/riscv_aplic.c target/s390x/cpu.c Apply sysemu header path changes to not in the pull request. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
| * include: Rename sysemu/ -> system/Philippe Mathieu-Daudé2024-12-201-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Headers in include/sysemu/ are not only related to system *emulation*, they are also used by virtualization. Rename as system/ which is clearer. Files renamed manually then mechanical change using sed tool. Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Tested-by: Lei Yang <leiyang@redhat.com> Message-Id: <20241203172445.28576-1-philmd@linaro.org>
* | target/i386: Reset TSCs of parked vCPUs too on VM resetMaciej S. Szmigiero2024-12-191-0/+15
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 5286c3662294 ("target/i386: properly reset TSC on reset") QEMU writes the special value of "1" to each online vCPU TSC on VM reset to reset it. However parked vCPUs don't get that handling and due to that their TSCs get desynchronized when the VM gets reset. This in turn causes KVM to turn off PVCLOCK_TSC_STABLE_BIT in its exported PV clock. Note that KVM has no understanding of vCPU being currently parked. Without PVCLOCK_TSC_STABLE_BIT the sched clock is marked unstable in the guest's kvm_sched_clock_init(). This causes a performance regressions to show in some tests. Fix this issue by writing the special value of "1" also to TSCs of parked vCPUs on VM reset. Reproducing the issue: 1) Boot a VM with "-smp 2,maxcpus=3" or similar 2) device_add host-x86_64-cpu,id=vcpu,node-id=0,socket-id=0,core-id=2,thread-id=0 3) Wait a few seconds 4) device_del vcpu 5) Inside the VM run: # echo "t" >/proc/sysrq-trigger; dmesg | grep sched_clock_stable Observe the sched_clock_stable() value is 1. 6) Reboot the VM 7) Once the VM boots once again run inside it: # echo "t" >/proc/sysrq-trigger; dmesg | grep sched_clock_stable Observe the sched_clock_stable() value is now 0. Fixes: 5286c3662294 ("target/i386: properly reset TSC on reset") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Link: https://lore.kernel.org/r/5a605a88e9a231386dc803c60f5fed9b48108139.1734014926.git.maciej.szmigiero@oracle.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: add AVX10 feature and AVX10 version propertyTao Su2024-10-311-1/+2
| | | | | | | | | | | | | | | | | | | | When AVX10 enable bit is set, the 0x24 leaf will be present as "AVX10 Converged Vector ISA leaf" containing fields for the version number and the supported vector bit lengths. Introduce avx10-version property so that avx10 version can be controlled by user and cpu model. Per spec, avx10 version can never be 0, the default value of avx10-version is set to 0 to determine whether it is specified by user. The default can come from the device model or, for the max model, from KVM's reported value. Signed-off-by: Tao Su <tao1.su@linux.intel.com> Link: https://lore.kernel.org/r/20241028024512.156724-3-tao1.su@linux.intel.com Link: https://lore.kernel.org/r/20241028024512.156724-4-tao1.su@linux.intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Xuelian Guo <xuelian.guo@intel.com> Link: https://lore.kernel.org/r/20241031085233.425388-5-tao1.su@linux.intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: Exclude 'hv-syndbg' from 'hv-passthrough'Vitaly Kuznetsov2024-10-171-2/+5
| | | | | | | | | | | | | | | | | | | | | | Windows with Hyper-V role enabled doesn't boot with 'hv-passthrough' when no debugger is configured, this significantly limits the usefulness of the feature as there's no support for subtracting Hyper-V features from CPU flags at this moment (e.g. "-cpu host,hv-passthrough,-hv-syndbg" does not work). While this is also theoretically fixable, 'hv-syndbg' is likely very special and unneeded in the default set. Genuine Hyper-V doesn't seem to enable it either. Introduce 'skip_passthrough' flag to 'kvm_hyperv_properties' and use it as one-off to skip 'hv-syndbg' when enabling features in 'hv-passthrough' mode. Note, "-cpu host,hv-passthrough,hv-syndbg" can still be used if needed. As both 'hv-passthrough' and 'hv-syndbg' are debug features, the change should not have any effect on production environments. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240917160051.2637594-3-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: Fix conditional CONFIG_SYNDBG enablementVitaly Kuznetsov2024-10-171-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | Putting HYPERV_FEAT_SYNDBG entry under "#ifdef CONFIG_SYNDBG" in 'kvm_hyperv_properties' array is wrong: as HYPERV_FEAT_SYNDBG is not the highest feature number, the result is an empty (zeroed) entry in the array (and not a skipped entry!). hyperv_feature_supported() is designed to check that all CPUID bits are set but for a zeroed feature in 'kvm_hyperv_properties' it returns 'true' so QEMU considers HYPERV_FEAT_SYNDBG as always supported, regardless of whether KVM host actually supports it. To fix the issue, leave HYPERV_FEAT_SYNDBG's definition in 'kvm_hyperv_properties' array, there's nothing wrong in having it defined even when 'CONFIG_SYNDBG' is not set. Instead, put "hv-syndbg" CPU property under '#ifdef CONFIG_SYNDBG' to alter the existing behavior when the flag is silently skipped in !CONFIG_SYNDBG builds. Leave an 'assert' sentinel in hyperv_feature_supported() making sure there are no 'holes' or improperly defined features in 'kvm_hyperv_properties'. Fixes: d8701185f40c ("hw: hyperv: Initial commit for Synthetic Debugging device") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240917160051.2637594-2-vkuznets@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: Add support save/load HWCR MSRGao Shiyuan2024-10-171-0/+12
| | | | | | | | | | | | KVM commit 191c8137a939 ("x86/kvm: Implement HWCR support") introduced support for emulating HWCR MSR. Add support for QEMU to save/load this MSR for migration purposes. Signed-off-by: Gao Shiyuan <gaoshiyuan@baidu.com> Signed-off-by: Wang Liang <wangliang44@baidu.com> Link: https://lore.kernel.org/r/20241009095109.66843-1-gaoshiyuan@baidu.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: Construct CPUID 2 as stateful iff times > 1Xiaoyao Li2024-10-171-2/+4
| | | | | | | | When times == 1, the CPUID leaf 2 is not stateful. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240814075431.339209-6-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386: Don't construct a all-zero entry for CPUID[0xD 0x3f]Xiaoyao Li2024-10-171-5/+6
| | | | | | | | | | | | | Currently, QEMU always constructs a all-zero CPUID entry for CPUID[0xD 0x3f]. It's meaningless to construct such a leaf as the end of leaf 0xD. Rework the logic of how subleaves of 0xD are constructed to get rid of such all-zero value of subleaf 0x3f. Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20240814075431.339209-2-xiaoyao.li@intel.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* target/i386/kvm: Report which action failed in kvm_arch_put/get_registersJulia Suvorova2024-10-031-0/+23
| | | | | | | | | | | | | | To help debug and triage future failure reports (akin to [1,2]) that may occur during kvm_arch_put/get_registers, the error path of each action is accompanied by unique error message. [1] https://issues.redhat.com/browse/RHEL-7558 [2] https://issues.redhat.com/browse/RHEL-21761 Signed-off-by: Julia Suvorova <jusual@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20240927104743.218468-3-jusual@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm: Allow kvm_arch_get/put_registers to accept Error**Julia Suvorova2024-10-031-2/+2
| | | | | | | | | This is necessary to provide discernible error messages to the caller. Signed-off-by: Julia Suvorova <jusual@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20240927104743.218468-2-jusual@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/i386: replace identity_base variable with a constantPaolo Bonzini2024-10-031-18/+18
| | | | | | | | | | identity_base variable is first initialzied to address 0xfffbc000 and then kvm_vm_set_identity_map_addr() overrides this value to address 0xfeffc000. The initial address to which the variable was initialized was never used. Clean everything up, placing 0xfeffc000 in a preprocessor constant. Reported-by: Ani Sinha <anisinha@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/i386: refactor kvm_arch_init and split it into smaller functionsAni Sinha2024-10-031-126/+201
| | | | | | | | | | kvm_arch_init() enables a lot of vm capabilities. Refactor them into separate smaller functions. Energy MSR related operations also moved to its own function. There should be no functional impact. Signed-off-by: Ani Sinha <anisinha@redhat.com> Link: https://lore.kernel.org/r/20240903124143.39345-2-anisinha@redhat.com Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvm/i386: fix return values of is_host_cpu_intel()Ani Sinha2024-10-021-3/+3
| | | | | | | | | | | | | | is_host_cpu_intel() should return TRUE if the host cpu in Intel based, otherwise it should return FALSE. Currently, it returns zero (FALSE) when the host CPU is INTEL and non-zero otherwise. Fix the function so that it agrees more with the semantics. Adjust the calling logic accordingly. RAPL needs Intel host cpus. If the host CPU is not Intel baseed, we should report error. Signed-off-by: Ani Sinha <anisinha@redhat.com> Link: https://lore.kernel.org/r/20240903080004.33746-1-anisinha@redhat.com [While touching the code remove too many spaces from the second part of the error. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>