If the Tiny Code Generator (TCG) of QEMU makes an error in the binary translation while emulating an instruction set architecture, we call it a semantic mistranslation bug. Here are three GitLab issues in TOML format. Tell me whether each describes a semantic mistranslation bug. Answer only with yes or no.

1. issue:
id = 1371
title = "x86 BLSMSK semantic bug"
state = "closed"
created_at = "2022-12-16T06:43:29.794Z"
closed_at = "2023-03-01T01:08:38.844Z"
labels = ["Closed::Fixed", "accel: TCG", "target: i386"]
url = "https://gitlab.com/qemu-project/qemu/-/issues/1371"
host-os = "Windows 10 20H2"
host-arch = "x86"
qemu-version = "7.1.90 (v7.2.0-rc0)"
guest-os = "None"
guest-arch = "x86"
description = """The result of instruction BLSMSK is different with from the CPU. The value of CF is different."""
reproduce = """1. Compile this code
void main() {
    asm("mov rax, 0x65b2e276ad27c67");
    asm("mov rbx, 0x62f34955226b2b5d");
    asm("blsmsk eax, ebx");
}
2. Execute and compare the result with the CPU.
    - CPU
        - CF = 0
    - QEMU
        - CF = 1"""
additional = """This bug is discovered by research conducted by KAIST SoftSec."""

2. issue:
id = 1057
title = "AArch64: ISV is set to 1 in ESR_EL2 when taking a data abort with post-indexed instructions"
state = "closed"
created_at = "2022-06-02T20:50:04.695Z"
closed_at = "2022-07-18T16:47:38.098Z"
labels = ["accel: TCG", "target: arm", "workflow::In Progress"]
url = "https://gitlab.com/qemu-project/qemu/-/issues/1057"
host-os = "MacOS 12.4"
host-arch = "ARM64"
qemu-version = "QEMU emulator version 7.0.0"
guest-os = "Custom - BedRock Hypervisor running with the NOVA microkernel"
guest-arch = "ARMv8"
description = """I think that I have a Qemu bug in my hands, but, I could still be missing something. Consider the following instruction:
0x0000000000000000: C3 44 00 B8 str w3, [x6], #4

notice the last #4, I think this is what we would call a post-indexed instruction (falls into the category of instructions with writeback). As I understand it, those instructions should not have ISV=1 in ESR_EL2 when faulting.

Here is the relevant part of the manual:

For other faults reported in ESR_EL2, ISV is 0 except for the following stage 2 aborts:
• AArch64 loads and stores of a single general-purpose register (including the register specified with 0b11111, including those with Acquire/Release semantics, but excluding Load Exclusive or Store Exclusive and excluding those with writeback).

However, I can see that Qemu sets ISV to 1 here. The ARM hardware that I tested gave me a value of ISV=0 for similar instructions.

Another example of instruction:
0x00000000000002f8: 01 1C 40 38 ldrb w1, [x0, #1]!"""
reproduce = """1. Run some hypervisor in EL2
2. Create a guest running at EL1 that executes one of the mentioned instructions (and make the instruction fault by writing to some unmapped page in SLP)
3. Observe the value of ESR_EL2 on data abort

Unfortunately, I cannot provide an image to reproduce this (the software is not open-source). But, I would be happy to help test a patch."""
additional = "n/a"

-----------------------------------------------------------------------------------------
yes
no
-----------------------------------------------------------------------------------------

Now I will give you mailing-list threads. Transform them into the TOML format I provided, which should be the only output. If a variable is not given in the mails, write "n/a". The ids should be 10000 and upwards. The state is closed when the bug is fixed and open when it is not. Additionally, add a 'mistranslation' variable, set to 'yes' if the bug is a mistranslation bug and 'no' if not.

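For reference, the CF value reported in example issue 1 can be checked against a small model of the instruction. The sketch below assumes the documented BLSMSK behaviour (result = (src - 1) XOR src, with CF set only when the source operand is zero). The function name `blsmsk` is hypothetical and is not taken from QEMU.

```python
def blsmsk(src: int, width: int = 32) -> tuple[int, int]:
    """Return (result, carry_flag) for BLSMSK on a `width`-bit source."""
    mask = (1 << width) - 1
    src &= mask                        # only the low `width` bits are read
    result = ((src - 1) ^ src) & mask  # ones up to and including the lowest set bit
    carry = 1 if src == 0 else 0       # CF is set only when the source is zero
    return result, carry


# Operands from the reproducer: `blsmsk eax, ebx` reads the low 32 bits of rbx.
rbx = 0x62F34955226B2B5D
result, carry = blsmsk(rbx, width=32)
print(f"eax = {result:#010x}, CF = {carry}")
```

Under this model the source operand is nonzero, so CF = 0, which matches the hardware value quoted in the issue. A CF of 1 therefore points at the translated instruction semantics rather than at a device or configuration problem.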
First Thread:

On Tue, Apr 29, 2025 at 01:55:59PM +0800, Xiaoyao Li wrote:
> Date: Tue, 29 Apr 2025 13:55:59 +0800
> From: Xiaoyao Li
> Subject: Re: [Bug] QEMU TCG warnings after commit c6bd2dd63420 - HTT /
> CMP_LEG bits
>
> On 4/29/2025 11:02 AM, Ewan Hai wrote:
> > Hi Community,
> >
> > This email contains 3 bugs appear to share the same root cause.
> >
> > [1] We ran into the following warnings when running QEMU v10.0.0 in TCG
> > mode:
> >
> > qemu-system-x86_64 \
> > -machine q35 \
> > -m 4G -smp 4 \
> > -kernel ./arch/x86/boot/bzImage \
> > -bios /usr/share/ovmf/OVMF.fd \
> > -drive file=~/kernel/rootfs.ext4,index=0,format=raw,media=disk \
> > -drive file=~/kernel/swap.img,index=1,format=raw,media=disk \
> > -nographic \
> > -append 'root=/dev/sda rw resume=/dev/sdb console=ttyS0 nokaslr'
> >
> > qemu-system-x86_64: warning: TCG doesn't support requested feature:
> > CPUID.01H:EDX.ht [bit 28]
> > qemu-system-x86_64: warning: TCG doesn't support requested feature:
> > CPUID.80000001H:ECX.cmp-legacy [bit 1]
> > (repeats 4 times, once per vCPU)
> >
> > Tracing the history shows that commit c6bd2dd63420 "i386/cpu: Set up
> > CPUID_HT in x86_cpu_expand_features() instead of cpu_x86_cpuid()" is
> > what introduced the warnings.
> >
> > Since that commit, TCG unconditionally advertises HTT (CPUID 1 EDX[28])
> > and CMP_LEG (CPUID 8000_0001 ECX[1]). Because TCG itself has no SMT
> > support, these bits trigger the warnings above.
> >
> > [2] Also, Zhao pointed me to a similar report on GitLab:
> > https://gitlab.com/qemu-project/qemu/-/issues/2894
> > The symptoms there look identical to what we're seeing.
> >
> > By convention we file one issue per email, but these two appear to share
> > the same root cause, so I'm describing them together here.
>
> It was caused by my two patches. I think the fix can be as follow.
> If no objection from the community, I can submit the formal patch.
>
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 1f970aa4daa6..fb95aadd6161 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -776,11 +776,12 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t
> vendor1,
> CPUID_PAE | CPUID_MCE | CPUID_CX8 | CPUID_APIC | CPUID_SEP | \
> CPUID_MTRR | CPUID_PGE | CPUID_MCA | CPUID_CMOV | CPUID_PAT | \
> CPUID_PSE36 | CPUID_CLFLUSH | CPUID_ACPI | CPUID_MMX | \
> - CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | CPUID_DE)
> + CPUID_FXSR | CPUID_SSE | CPUID_SSE2 | CPUID_SS | CPUID_DE | \
> + CPUID_HT)
> /* partly implemented:
> CPUID_MTRR, CPUID_MCA, CPUID_CLFLUSH (needed for Win64) */
> /* missing:
> - CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_HT, CPUID_TM, CPUID_PBE */
> + CPUID_VME, CPUID_DTS, CPUID_SS, CPUID_TM, CPUID_PBE */
>
> /*
> * Kernel-only features that can be shown to usermode programs even if
> @@ -848,7 +849,8 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t
> vendor1,
>
> #define TCG_EXT3_FEATURES (CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM | \
> CPUID_EXT3_CR8LEG | CPUID_EXT3_ABM | CPUID_EXT3_SSE4A | \
> - CPUID_EXT3_3DNOWPREFETCH | CPUID_EXT3_KERNEL_FEATURES)
> + CPUID_EXT3_3DNOWPREFETCH | CPUID_EXT3_KERNEL_FEATURES | \
> + CPUID_EXT3_CMP_LEG)
>
> #define TCG_EXT4_FEATURES 0

This fix is fine for me...at least from SDM, HTT depends on topology and it should exist when user sets "-smp 4".

> > [3] My colleague Alan noticed what appears to be a related problem: if
> > we launch a guest with '-cpu ,-ht --enable-kvm', which means
> > explicitly removing the ht flag, but the guest still reports HT(cat
> > /proc/cpuinfo in linux guest) enabled. In other words, under KVM the ht
> > bit seems to be forced on even when the user tries to disable it.
>
> XiaoYao reminded me that issue [3] stems from a different patch. Please
> ignore it for now—I'll start a separate thread to discuss that one
> independently.

I haven't found any other thread :-).

By the way, just curious, in what cases do you need to disbale the HT flag? "-smp 4" means 4 cores with 1 thread per core, and is it not enough?

As for the “-ht” behavior, I'm also unsure whether this should be fixed or not - one possible consideration is whether “-ht” would be useful.

-----------------------------------------------------------------------------------------
toml
[[issue]]
id = 10000
title = "QEMU TCG warnings after commit c6bd2dd63420 - HTT / CMP_LEG bits"
state = "closed"
created_at = "2025-04-29T11:02:00.000Z"
closed_at = "n/a"
labels = ["accel: TCG", "target: i386"]
url = "n/a"
host-os = "n/a"
host-arch = "x86"
qemu-version = "10.0.0"
guest-os = "Linux"
guest-arch = "x86"
description = """TCG unconditionally advertises HTT (CPUID 1 EDX[28]) and CMP_LEG (CPUID 8000_0001 ECX[1]) despite lacking SMT support, causing warnings."""
reproduce = """Run QEMU in TCG mode with a Linux guest and observe CPUID-related warnings."""
additional = """Related to GitLab issue #2894. Fix involves adjusting CPU feature flags in target/i386/cpu.c."""
mistranslation = "no"

[[issue]]
id = 10001
title = "KVM forces HT flag even when explicitly disabled"
state = "open"
created_at = "2025-04-29T11:02:00.000Z"
closed_at = "n/a"
labels = ["accel: KVM", "target: i386"]
url = "n/a"
host-os = "n/a"
host-arch = "x86"
qemu-version = "10.0.0"
guest-os = "Linux"
guest-arch = "x86"
description = """When launching a guest with '-cpu ,-ht --enable-kvm', the HT flag remains enabled in /proc/cpuinfo."""
reproduce = """Launch a KVM guest with HT explicitly disabled and check CPU flags."""
additional = """Separate issue from the TCG warnings.
Discussion pending in another thread."""
mistranslation = "no"

Second Thread:

This bug was fixed in the package qemu - 1:9.0.2+ds-4ubuntu5.3
-----------------------------------------------------------------------------------------
qemu (1:9.0.2+ds-4ubuntu5.3) oracular; urgency=medium

* d/p/u/lp2049698/*: Add full boot order support on s390x (LP: #2049698)
* Cherry-pick prerequisite for above backport (to avoid FTBFS):
  - d/p/u/lp2049698/0-hw-s390x-sclp.c-include-s390-virtio-ccw.h-to-make.patch
* d/qemu-system-data.links: symlink s390-netboot.img -> s390-ccw.img for backwards compatibility, as the code is now combined.
* Fix qemu-aarch64-static segfaults running ldconfig.real (LP: #2072564)
  - lp-2072564-01-linux-user-Honor-elf-alignment-when-placing-images.patch
  - lp-2072564-02-elfload-Fix-alignment-when-unmapping-excess-reservat.patch
  Thanks to Dimitry Andric for identifying the fix.

-- Lukas Märdian Thu, 13 Mar 2025 17:18:50 +0100

** Changed in: qemu (Ubuntu Oracular)
Status: Fix Committed => Fix Released

--
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/2072564

Title:
qemu-aarch64-static segfaults running ldconfig.real (amd64 host)

Status in QEMU: Fix Released
Status in qemu package in Ubuntu: Fix Released
Status in qemu source package in Noble: Fix Committed
Status in qemu source package in Oracular: Fix Released

Bug description:

[ Impact ]
* QEMU crashes when running (emulating) ldconfig in a Ubuntu 22.04 arm64 guest
* This affects the qemu-user-static 1:8.2.2+ds-0ubuntu1 package on Ubuntu 24.04+, running on a amd64 host.
* When running docker containers with Ubuntu 22.04 in them, emulating arm64 with qemu-aarch64-static, invocations of ldconfig (actually ldconfig.real) segfault, leading to problems when loading shared libraries.

[ Test Plan ]
* Reproducer is very easy:

$ sudo snap install docker
docker 27.5.1 from Canonical** installed
$ docker run -ti --platform linux/arm64/v8 ubuntu:22.04
Unable to find image 'ubuntu:22.04' locally
22.04: Pulling from library/ubuntu
0d1c17d4e593: Pull complete
Digest: sha256:ed1544e454989078f5dec1bfdabd8c5cc9c48e0705d07b678ab6ae3fb61952d2
Status: Downloaded newer image for ubuntu:22.04

# Execute ldconfig.real inside the arm64 guest.
# This should not crash after the fix!
root@ad80af5378dc:/# /sbin/ldconfig.real
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault (core dumped)

[ Where problems could occur ]
* This changes the alignment of sections in the ELF binary via QEMUs elfloader, if something goes wrong with this change, it could lead to all kind of crashes (segfault) of any emulated binaries.

[ Other Info ]
* Upstream bug: https://gitlab.com/qemu-project/qemu/-/issues/1913
* Upstream fix: https://gitlab.com/qemu-project/qemu/-/commit/4b7b20a3
  - Fix dependency (needed for QEMU < 9.20): https://gitlab.com/qemu-project/qemu/-/commit/c81d1faf

--- original bug report ---

This affects the qemu-user-static 1:8.2.2+ds-0ubuntu1 package on Ubuntu 24.04, running on a amd64 host.

When running docker containers with Ubuntu 22.04 in them, emulating arm64 with qemu-aarch64-static, invocations of ldconfig (actually ldconfig.real) segfault. For example:

$ docker run -ti --platform linux/arm64/v8 ubuntu:22.04
root@8861ff640a1c:/# /sbin/ldconfig.real
Segmentation fault

If you copy the ldconfig.real binary to the host, and run it directly via qemu-aarch64-static:

$ gdb --args qemu-aarch64-static ./ldconfig.real
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see: .
Find the GDB manual and other documentation resources online at: .

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from qemu-aarch64-static...
Reading symbols from /home/dim/.cache/debuginfod_client/86579812b213be0964189499f62f176bea817bf2/debuginfo...
(gdb) r
Starting program: /usr/bin/qemu-aarch64-static ./ldconfig.real
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff76006c0 (LWP 28378)]

Thread 1 "qemu-aarch64-st" received signal SIGSEGV, Segmentation fault.
0x00007fffe801645b in ?? ()
(gdb) disassemble
No function contains program counter for selected frame.

It looks like this is a known qemu regression after v8.1.1:
https://gitlab.com/qemu-project/qemu/-/issues/1913

Downgrading the package to qemu-user-static_8.0.4+dfsg-1ubuntu3_amd64.deb fixes the segfault.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/2072564/+subscriptions
-----------------------------------------------------------------------------------------
toml
[[issue]]
id = 10002
title = "qemu-aarch64-static segfaults running ldconfig.real (amd64 host)"
state = "closed"
created_at = "n/a"
closed_at = "2025-03-13T17:18:50+01:00"
labels = ["accel: TCG", "target: arm", "ubuntu"]
url = "https://bugs.launchpad.net/bugs/2072564"
host-os = "Ubuntu 24.04+"
host-arch = "amd64"
qemu-version = "1:9.0.2+ds-4ubuntu5.3"
guest-os = "Ubuntu 22.04 (arm64)"
guest-arch = "arm64"
description = """QEMU crashes when running ldconfig in an emulated arm64 guest, affecting qemu-user-static on amd64 hosts."""
reproduce = """Run 'ldconfig.real' in an emulated arm64 container or via qemu-aarch64-static."""
additional = """Upstream bug: https://gitlab.com/qemu-project/qemu/-/issues/1913. Fixed by ELF alignment patches."""
mistranslation = "no"

Third Thread:

Hi, all

When I did the cxl memory hot-plug test on QEMU, I accidentally connected two memdev to the same downstream port, the command like below:

> -object memory-backend-ram,size=262144k,share=on,id=vmem0 \
> -object memory-backend-ram,size=262144k,share=on,id=vmem1 \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
> -device cxl-upstream,bus=root_port0,id=us0 \
> -device cxl-downstream,port=0,bus=us0,id=swport00,chassis=0,slot=5 \
> -device cxl-downstream,port=0,bus=us0,id=swport01,chassis=0,slot=7 \
same downstream port but has different slot!
> -device cxl-type3,bus=swport00,volatile-memdev=vmem0,id=cxl-vmem0 \
> -device cxl-type3,bus=swport01,volatile-memdev=vmem1,id=cxl-vmem1 \
> -M
> cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=64G,cxl-fmw.0.interleave-granularity=4k
> \

There is no error occurred when vm start, but when I executed the “cxl list” command to view the CXL objects info, the process can not end properly.
Then I used strace to trace the process, I found that the process is in infinity loop:

# strace cxl list
......
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, NULL) = 0
openat(AT_FDCWD, "/sys/bus/cxl/flush", O_WRONLY|O_CLOEXEC) = 3
write(3, "1\n\0", 3) = 3
close(3) = 0
access("/run/udev/queue", F_OK) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, NULL) = 0
openat(AT_FDCWD, "/sys/bus/cxl/flush", O_WRONLY|O_CLOEXEC) = 3
write(3, "1\n\0", 3) = 3
close(3) = 0
access("/run/udev/queue", F_OK) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, NULL) = 0
openat(AT_FDCWD, "/sys/bus/cxl/flush", O_WRONLY|O_CLOEXEC) = 3
write(3, "1\n\0", 3) = 3
close(3) = 0
access("/run/udev/queue", F_OK) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, NULL) = 0
openat(AT_FDCWD, "/sys/bus/cxl/flush", O_WRONLY|O_CLOEXEC) = 3
write(3, "1\n\0", 3) = 3
close(3) = 0
access("/run/udev/queue", F_OK) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, NULL) = 0
openat(AT_FDCWD, "/sys/bus/cxl/flush", O_WRONLY|O_CLOEXEC) = 3
write(3, "1\n\0", 3) = 3
close(3) = 0
access("/run/udev/queue", F_OK) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, NULL) = 0
openat(AT_FDCWD, "/sys/bus/cxl/flush", O_WRONLY|O_CLOEXEC) = 3
write(3, "1\n\0", 3) = 3
close(3) = 0
access("/run/udev/queue", F_OK) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, NULL) = 0
openat(AT_FDCWD, "/sys/bus/cxl/flush", O_WRONLY|O_CLOEXEC) = 3
write(3, "1\n\0", 3) = 3
close(3) = 0
access("/run/udev/queue", F_OK) = 0
clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000000}, NULL) = 0
openat(AT_FDCWD, "/sys/bus/cxl/flush", O_WRONLY|O_CLOEXEC) = 3
write(3, "1\n\0", 3) = 3
close(3) = 0
access("/run/udev/queue", F_OK) = 0

[Environment]:
linux: V6.10-rc3
QEMU: V9.0.0
ndctl: v79

I know this is because of the wrong use of the QEMU command, but I think we should be aware of this error in one of the QEMU, OS or ndctl side at least.

Thanks
Xingtao
-----------------------------------------------------------------------------------------
toml
[[issue]]
id = 10003
title = "CXL memory hot-plug with incorrect configuration causes infinite loop in 'cxl list'"
state = "open"
created_at = "n/a"
closed_at = "n/a"
labels = ["device: cxl", "bug"]
url = "n/a"
host-os = "n/a"
host-arch = "n/a"
qemu-version = "9.0.0"
guest-os = "Linux (v6.10-rc3)"
guest-arch = "n/a"
description = """When incorrectly connecting two memdevs to the same CXL downstream port, 'cxl list' command enters infinite loop."""
reproduce = """1. Configure CXL with two memdevs on same downstream port (different slots)
2. Start VM
3. Run 'cxl list' command"""
additional = """Environment: QEMU v9.0.0, Linux v6.10-rc3, ndctl v79. Issue occurs despite incorrect configuration - should be caught by either QEMU, OS or ndctl."""
mistranslation = "no"

Fourth Thread:

[SRU] migration was active, but no RAM info was set

[Impact]
* While live-migrating many instances concurrently, libvirt sometimes return `internal error: migration was active, but no RAM info was set:`
* Effects of this bug are mostly observed in large scale clusters with a lot of live migration activity.
* Has second order effects for consumers of migration monitor such as libvirt and openstack.

[Test Case]
Synthetic reproducer with GDB in comment #21.

Steps to Reproduce:
1. live evacuate a compute
2. live migration of one or more instances fails with the above error

N.B Due to the nature of this bug it is difficult consistently reproduce. In an environment where it has been observed it is estimated to occur approximately 1/1000 migrations.

[Where problems could occur]
* In the event of a regression the migration monitor may report an inconsistent state.

[Original Bug Description]

While live-migrating many instances concurrently, libvirt sometimes return internal error: migration was active, but no RAM info was set:

~~~
2022-03-30 06:08:37.197 7 WARNING nova.virt.libvirt.driver [req-5c3296cf-88ee-4af6-ae6a-ddba99935e23 - - - - -] [instance: af339c99-1182-4489-b15c-21e52f50f724] Error monitoring migration: internal error: migration was active, but no RAM info was set: libvirt.libvirtError: internal error: migration was active, but no RAM info was set

toml
[[issue]]
id = 10004
title = "Migration active but no RAM info set during concurrent live migrations"
state = "open"
created_at = "n/a"
closed_at = "n/a"
labels = ["migration", "libvirt", "openstack"]
url = "n/a"
host-os = "n/a"
host-arch = "n/a"
qemu-version = "n/a"
guest-os = "n/a"
guest-arch = "n/a"
description = """During concurrent live migrations, libvirt reports 'migration was active, but no RAM info was set' error intermittently"""
reproduce = """1. Perform live evacuation of compute node
2. Observe failure in approximately 1/1000 migrations with error message
3. Synthetic reproducer available via GDB in comments"""
additional = """Affects large scale clusters with heavy migration activity. Impacts libvirt and OpenStack migration monitoring. Difficult to reproduce consistently."""
mistranslation = "no"