4 files changed, 6483 insertions, 0 deletions
diff --git a/results/classifier/zero-shot/007/debug/36568044 b/results/classifier/zero-shot/007/debug/36568044
new file mode 100644
index 000000000..cdc1d6312
--- /dev/null
+++ b/results/classifier/zero-shot/007/debug/36568044
@@ -0,0 +1,4591 @@
+debug: 0.939
+device: 0.931
+graphic: 0.931
+other: 0.930
+permissions: 0.927
+PID: 0.926
+semantic: 0.923
+performance: 0.920
+KVM: 0.914
+socket: 0.907
+vnc: 0.905
+network: 0.904
+boot: 0.895
+files: 0.884
+
+[BUG, RFC] cpr-transfer: qxl guest driver crashes after migration
+
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+>
+EMULATOR=/path/to/emulator
+>
+ROOTFS=/path/to/image
+>
+QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>
+$EMULATOR -enable-kvm \
+>
+-machine q35 \
+>
+-cpu host -smp 2 -m 2G \
+>
+-object
+>
+memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
+>
+-machine memory-backend=ram0 \
+>
+-machine aux-ram-share=on \
+>
+-drive file=$ROOTFS,media=disk,if=virtio \
+>
+-qmp unix:$QMPSOCK,server=on,wait=off \
+>
+-nographic \
+>
+-device qxl-vga
+Run migration target:
+>
+EMULATOR=/path/to/emulator
+>
+ROOTFS=/path/to/image
+>
+QMPSOCK=/var/run/alma8qmp-dst.sock
+>
+>
+>
+>
+$EMULATOR -enable-kvm \
+>
+-machine q35 \
+>
+-cpu host -smp 2 -m 2G \
+>
+-object
+>
+memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
+>
+-machine memory-backend=ram0 \
+>
+-machine aux-ram-share=on \
+>
+-drive file=$ROOTFS,media=disk,if=virtio \
+>
+-qmp unix:$QMPSOCK,server=on,wait=off \
+>
+-nographic \
+>
+-device qxl-vga \
+>
+-incoming tcp:0:44444 \
+>
+-incoming '{"channel-type": "cpr", "addr": { "transport": "socket",
+>
+"type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+>
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+>
+QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>
+$QMPSHELL -p $QMPSOCK <<EOF
+>
+migrate-set-parameters mode=cpr-transfer
+>
+migrate
+>
+channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
+>
+EOF
+Then, after a while, QXL guest driver on target crashes spewing the
+following messages:
+>
+[   73.962002] [TTM] Buffer eviction failed
+>
+[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
+>
+[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate
+>
+VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which speeds up
+the crash in the guest):
+>
+#!/bin/bash
+>
+>
+chvt 3
+>
+>
+for j in $(seq 80); do
+>
+echo "$(date) starting round $j"
+>
+if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != ""
+>
+]; then
+>
+echo "bug was reproduced after $j tries"
+>
+exit 1
+>
+fi
+>
+for i in $(seq 100); do
+>
+dmesg > /dev/tty3
+>
+done
+>
+done
+>
+>
+echo "bug could not be reproduced"
+>
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the crash -
+without the cpr-transfer migration the above reproduce doesn't lead to
+crash on the source VM.
+
+I suspect that, as cpr-transfer doesn't migrate the guest memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.  However, I wasn't able to trace the
+corruption so far.
+
+Could somebody help the investigation and take a look into this?  Any
+suggestions would be appreciated.  Thanks!
+
+Andrey
+
+On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$EMULATOR -enable-kvm \
+     -machine q35 \
+     -cpu host -smp 2 -m 2G \
+     -object 
+memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
+     -machine memory-backend=ram0 \
+     -machine aux-ram-share=on \
+     -drive file=$ROOTFS,media=disk,if=virtio \
+     -qmp unix:$QMPSOCK,server=on,wait=off \
+     -nographic \
+     -device qxl-vga
+Run migration target:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-dst.sock
+$EMULATOR -enable-kvm \
+-machine q35 \
+     -cpu host -smp 2 -m 2G \
+     -object 
+memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
+     -machine memory-backend=ram0 \
+     -machine aux-ram-share=on \
+     -drive file=$ROOTFS,media=disk,if=virtio \
+     -qmp unix:$QMPSOCK,server=on,wait=off \
+     -nographic \
+     -device qxl-vga \
+     -incoming tcp:0:44444 \
+     -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix", 
+"path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$QMPSHELL -p $QMPSOCK <<EOF
+     migrate-set-parameters mode=cpr-transfer
+     migrate 
+channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
+EOF
+Then, after a while, QXL guest driver on target crashes spewing the
+following messages:
+[   73.962002] [TTM] Buffer eviction failed
+[   73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
+[   73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate 
+VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which speeds up
+the crash in the guest):
+#!/bin/bash
+
+chvt 3
+
+for j in $(seq 80); do
+         echo "$(date) starting round $j"
+         if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" 
+]; then
+                 echo "bug was reproduced after $j tries"
+                 exit 1
+         fi
+         for i in $(seq 100); do
+                 dmesg > /dev/tty3
+         done
+done
+
+echo "bug could not be reproduced"
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the crash -
+without the cpr-transfer migration the above reproduce doesn't lead to
+crash on the source VM.
+
+I suspect that, as cpr-transfer doesn't migrate the guest memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.  However, I wasn't able to trace the
+corruption so far.
+
+Could somebody help the investigation and take a look into this?  Any
+suggestions would be appreciated.  Thanks!
+Possibly some memory region created by qxl is not being preserved.
+Try adding these traces to see what is preserved:
+
+-trace enable='*cpr*'
+-trace enable='*ram_alloc*'
+
+- Steve
+
+On 2/28/2025 1:13 PM, Steven Sistare wrote:
+On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$EMULATOR -enable-kvm \
+Â Â Â Â  -machine q35 \
+Â Â Â Â  -cpu host -smp 2 -m 2G \
+Â Â Â Â  -object 
+memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
+Â Â Â Â  -machine memory-backend=ram0 \
+Â Â Â Â  -machine aux-ram-share=on \
+Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+Â Â Â Â  -nographic \
+Â Â Â Â  -device qxl-vga
+Run migration target:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-dst.sock
+$EMULATOR -enable-kvm \
+Â Â Â Â  -machine q35 \
+Â Â Â Â  -cpu host -smp 2 -m 2G \
+Â Â Â Â  -object 
+memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/ram0,share=on\
+Â Â Â Â  -machine memory-backend=ram0 \
+Â Â Â Â  -machine aux-ram-share=on \
+Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+Â Â Â Â  -nographic \
+Â Â Â Â  -device qxl-vga \
+Â Â Â Â  -incoming tcp:0:44444 \
+Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport": "socket", "type": "unix", 
+"path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$QMPSHELL -p $QMPSOCK <<EOF
+Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+Â Â Â Â  migrate 
+channels=[{"channel-type":"main","addr":{"transport":"socket","type":"inet","host":"0","port":"44444"}},{"channel-type":"cpr","addr":{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-dst.sock"}}]
+EOF
+Then, after a while, QXL guest driver on target crashes spewing the
+following messages:
+[Â Â  73.962002] [TTM] Buffer eviction failed
+[Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824, 0x00000001)
+[Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to allocate 
+VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which speeds up
+the crash in the guest):
+#!/bin/bash
+
+chvt 3
+
+for j in $(seq 80); do
+Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" 
+]; then
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+Â Â Â Â Â Â Â Â  fi
+Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+Â Â Â Â Â Â Â Â  done
+done
+
+echo "bug could not be reproduced"
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the crash -
+without the cpr-transfer migration the above reproduce doesn't lead to
+crash on the source VM.
+
+I suspect that, as cpr-transfer doesn't migrate the guest memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+corruption so far.
+
+Could somebody help the investigation and take a look into this?Â  Any
+suggestions would be appreciated.Â  Thanks!
+Possibly some memory region created by qxl is not being preserved.
+Try adding these traces to see what is preserved:
+
+-trace enable='*cpr*'
+-trace enable='*ram_alloc*'
+Also try adding this patch to see if it flags any ram blocks as not
+compatible with cpr.  A message is printed at migration start time.
+1740667681-257312-1-git-send-email-steven.sistare@oracle.com
+/">https://lore.kernel.org/qemu-devel/
+1740667681-257312-1-git-send-email-steven.sistare@oracle.com
+/
+- Steve
+
+On 2/28/25 8:20 PM, Steven Sistare wrote:
+>
+On 2/28/2025 1:13 PM, Steven Sistare wrote:
+>
+> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+>
+>> Hi all,
+>
+>>
+>
+>> We've been experimenting with cpr-transfer migration mode recently and
+>
+>> have discovered the following issue with the guest QXL driver:
+>
+>>
+>
+>> Run migration source:
+>
+>>> EMULATOR=/path/to/emulator
+>
+>>> ROOTFS=/path/to/image
+>
+>>> QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>>>
+>
+>>> $EMULATOR -enable-kvm \
+>
+>>> Â Â Â Â  -machine q35 \
+>
+>>> Â Â Â Â  -cpu host -smp 2 -m 2G \
+>
+>>> Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+>
+>>> ram0,share=on\
+>
+>>> Â Â Â Â  -machine memory-backend=ram0 \
+>
+>>> Â Â Â Â  -machine aux-ram-share=on \
+>
+>>> Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+>
+>>> Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+>
+>>> Â Â Â Â  -nographic \
+>
+>>> Â Â Â Â  -device qxl-vga
+>
+>>
+>
+>> Run migration target:
+>
+>>> EMULATOR=/path/to/emulator
+>
+>>> ROOTFS=/path/to/image
+>
+>>> QMPSOCK=/var/run/alma8qmp-dst.sock
+>
+>>> $EMULATOR -enable-kvm \
+>
+>>> Â Â Â Â  -machine q35 \
+>
+>>> Â Â Â Â  -cpu host -smp 2 -m 2G \
+>
+>>> Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+>
+>>> ram0,share=on\
+>
+>>> Â Â Â Â  -machine memory-backend=ram0 \
+>
+>>> Â Â Â Â  -machine aux-ram-share=on \
+>
+>>> Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+>
+>>> Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+>
+>>> Â Â Â Â  -nographic \
+>
+>>> Â Â Â Â  -device qxl-vga \
+>
+>>> Â Â Â Â  -incoming tcp:0:44444 \
+>
+>>> Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+>
+>>> "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+>
+>>
+>
+>>
+>
+>> Launch the migration:
+>
+>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+>
+>>> QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>>>
+>
+>>> $QMPSHELL -p $QMPSOCK <<EOF
+>
+>>> Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+>
+>>> Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+>
+>>> {"transport":"socket","type":"inet","host":"0","port":"44444"}},
+>
+>>> {"channel-type":"cpr","addr":
+>
+>>> {"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+>
+>>> dst.sock"}}]
+>
+>>> EOF
+>
+>>
+>
+>> Then, after a while, QXL guest driver on target crashes spewing the
+>
+>> following messages:
+>
+>>> [Â Â  73.962002] [TTM] Buffer eviction failed
+>
+>>> [Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+>
+>>> 0x00000001)
+>
+>>> [Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+>
+>>> allocate VRAM BO
+>
+>>
+>
+>> That seems to be a known kernel QXL driver bug:
+>
+>>
+>
+>>
+https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
+>
+>>
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+>
+>>
+>
+>> (the latter discussion contains that reproduce script which speeds up
+>
+>> the crash in the guest):
+>
+>>> #!/bin/bash
+>
+>>>
+>
+>>> chvt 3
+>
+>>>
+>
+>>> for j in $(seq 80); do
+>
+>>> Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+>
+>>> Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate VRAM
+>
+>>> BO")" != "" ]; then
+>
+>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+>
+>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+>
+>>> Â Â Â Â Â Â Â Â  fi
+>
+>>> Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+>
+>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+>
+>>> Â Â Â Â Â Â Â Â  done
+>
+>>> done
+>
+>>>
+>
+>>> echo "bug could not be reproduced"
+>
+>>> exit 0
+>
+>>
+>
+>> The bug itself seems to remain unfixed, as I was able to reproduce that
+>
+>> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+>
+>> cpr-transfer code also seems to be buggy as it triggers the crash -
+>
+>> without the cpr-transfer migration the above reproduce doesn't lead to
+>
+>> crash on the source VM.
+>
+>>
+>
+>> I suspect that, as cpr-transfer doesn't migrate the guest memory, but
+>
+>> rather passes it through the memory backend object, our code might
+>
+>> somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+>
+>> corruption so far.
+>
+>>
+>
+>> Could somebody help the investigation and take a look into this?Â  Any
+>
+>> suggestions would be appreciated.Â  Thanks!
+>
+>
+>
+> Possibly some memory region created by qxl is not being preserved.
+>
+> Try adding these traces to see what is preserved:
+>
+>
+>
+> -trace enable='*cpr*'
+>
+> -trace enable='*ram_alloc*'
+>
+>
+Also try adding this patch to see if it flags any ram blocks as not
+>
+compatible with cpr.Â  A message is printed at migration start time.
+>
+Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-
+>
+steven.sistare@oracle.com/
+>
+>
+- Steve
+>
+With the traces enabled + the "migration: ram block cpr blockers" patch
+applied:
+
+Source:
+>
+cpr_find_fd pc.bios, id 0 returns -1
+>
+cpr_save_fd pc.bios, id 0, fd 22
+>
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+>
+0x7fec18e00000
+>
+cpr_find_fd pc.rom, id 0 returns -1
+>
+cpr_save_fd pc.rom, id 0, fd 23
+>
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+>
+0x7fec18c00000
+>
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+>
+cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+>
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd
+>
+24 host 0x7fec18a00000
+>
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+>
+cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+>
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864
+>
+fd 25 host 0x7feb77e00000
+>
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+>
+cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+>
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27
+>
+host 0x7fec18800000
+>
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+>
+cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+>
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864
+>
+fd 28 host 0x7feb73c00000
+>
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+>
+cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+>
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34
+>
+host 0x7fec18600000
+>
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+>
+cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+>
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 35
+>
+host 0x7fec18200000
+>
+cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+>
+cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+>
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36
+>
+host 0x7feb8b600000
+>
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+>
+cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+>
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host
+>
+0x7feb8b400000
+>
+>
+cpr_state_save cpr-transfer mode
+>
+cpr_transfer_output /var/run/alma8cpr-dst.sock
+Target:
+>
+cpr_transfer_input /var/run/alma8cpr-dst.sock
+>
+cpr_state_load cpr-transfer mode
+>
+cpr_find_fd pc.bios, id 0 returns 20
+>
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+>
+0x7fcdc9800000
+>
+cpr_find_fd pc.rom, id 0 returns 19
+>
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+>
+0x7fcdc9600000
+>
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+>
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd
+>
+18 host 0x7fcdc9400000
+>
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+>
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864
+>
+fd 17 host 0x7fcd27e00000
+>
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+>
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16
+>
+host 0x7fcdc9200000
+>
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+>
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864
+>
+fd 15 host 0x7fcd23c00000
+>
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+>
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14
+>
+host 0x7fcdc8800000
+>
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+>
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 13
+>
+host 0x7fcdc8400000
+>
+cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+>
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11
+>
+host 0x7fcdc8200000
+>
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+>
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host
+>
+0x7fcd3be00000
+Looks like both vga.vram and qxl.vram are being preserved (with the same
+addresses), and no incompatible ram blocks are found during migration.
+
+Andrey
+
+On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+>
+On 2/28/25 8:20 PM, Steven Sistare wrote:
+>
+> On 2/28/2025 1:13 PM, Steven Sistare wrote:
+>
+>> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+>
+>>> Hi all,
+>
+>>>
+>
+>>> We've been experimenting with cpr-transfer migration mode recently and
+>
+>>> have discovered the following issue with the guest QXL driver:
+>
+>>>
+>
+>>> Run migration source:
+>
+>>>> EMULATOR=/path/to/emulator
+>
+>>>> ROOTFS=/path/to/image
+>
+>>>> QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>>>>
+>
+>>>> $EMULATOR -enable-kvm \
+>
+>>>> Â Â Â Â  -machine q35 \
+>
+>>>> Â Â Â Â  -cpu host -smp 2 -m 2G \
+>
+>>>> Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+>
+>>>> ram0,share=on\
+>
+>>>> Â Â Â Â  -machine memory-backend=ram0 \
+>
+>>>> Â Â Â Â  -machine aux-ram-share=on \
+>
+>>>> Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+>
+>>>> Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+>
+>>>> Â Â Â Â  -nographic \
+>
+>>>> Â Â Â Â  -device qxl-vga
+>
+>>>
+>
+>>> Run migration target:
+>
+>>>> EMULATOR=/path/to/emulator
+>
+>>>> ROOTFS=/path/to/image
+>
+>>>> QMPSOCK=/var/run/alma8qmp-dst.sock
+>
+>>>> $EMULATOR -enable-kvm \
+>
+>>>> Â Â Â Â  -machine q35 \
+>
+>>>> Â Â Â Â  -cpu host -smp 2 -m 2G \
+>
+>>>> Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+>
+>>>> ram0,share=on\
+>
+>>>> Â Â Â Â  -machine memory-backend=ram0 \
+>
+>>>> Â Â Â Â  -machine aux-ram-share=on \
+>
+>>>> Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+>
+>>>> Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+>
+>>>> Â Â Â Â  -nographic \
+>
+>>>> Â Â Â Â  -device qxl-vga \
+>
+>>>> Â Â Â Â  -incoming tcp:0:44444 \
+>
+>>>> Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+>
+>>>> "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+>
+>>>
+>
+>>>
+>
+>>> Launch the migration:
+>
+>>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+>
+>>>> QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>>>>
+>
+>>>> $QMPSHELL -p $QMPSOCK <<EOF
+>
+>>>> Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+>
+>>>> Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+>
+>>>> {"transport":"socket","type":"inet","host":"0","port":"44444"}},
+>
+>>>> {"channel-type":"cpr","addr":
+>
+>>>> {"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+>
+>>>> dst.sock"}}]
+>
+>>>> EOF
+>
+>>>
+>
+>>> Then, after a while, QXL guest driver on target crashes spewing the
+>
+>>> following messages:
+>
+>>>> [Â Â  73.962002] [TTM] Buffer eviction failed
+>
+>>>> [Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+>
+>>>> 0x00000001)
+>
+>>>> [Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+>
+>>>> allocate VRAM BO
+>
+>>>
+>
+>>> That seems to be a known kernel QXL driver bug:
+>
+>>>
+>
+>>>
+https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
+>
+>>>
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+>
+>>>
+>
+>>> (the latter discussion contains that reproduce script which speeds up
+>
+>>> the crash in the guest):
+>
+>>>> #!/bin/bash
+>
+>>>>
+>
+>>>> chvt 3
+>
+>>>>
+>
+>>>> for j in $(seq 80); do
+>
+>>>> Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+>
+>>>> Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate VRAM
+>
+>>>> BO")" != "" ]; then
+>
+>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+>
+>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+>
+>>>> Â Â Â Â Â Â Â Â  fi
+>
+>>>> Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+>
+>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+>
+>>>> Â Â Â Â Â Â Â Â  done
+>
+>>>> done
+>
+>>>>
+>
+>>>> echo "bug could not be reproduced"
+>
+>>>> exit 0
+>
+>>>
+>
+>>> The bug itself seems to remain unfixed, as I was able to reproduce that
+>
+>>> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+>
+>>> cpr-transfer code also seems to be buggy as it triggers the crash -
+>
+>>> without the cpr-transfer migration the above reproduce doesn't lead to
+>
+>>> crash on the source VM.
+>
+>>>
+>
+>>> I suspect that, as cpr-transfer doesn't migrate the guest memory, but
+>
+>>> rather passes it through the memory backend object, our code might
+>
+>>> somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+>
+>>> corruption so far.
+>
+>>>
+>
+>>> Could somebody help the investigation and take a look into this?Â  Any
+>
+>>> suggestions would be appreciated.Â  Thanks!
+>
+>>
+>
+>> Possibly some memory region created by qxl is not being preserved.
+>
+>> Try adding these traces to see what is preserved:
+>
+>>
+>
+>> -trace enable='*cpr*'
+>
+>> -trace enable='*ram_alloc*'
+>
+>
+>
+> Also try adding this patch to see if it flags any ram blocks as not
+>
+> compatible with cpr.Â  A message is printed at migration start time.
+>
+> Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-
+>
+> steven.sistare@oracle.com/
+>
+>
+>
+> - Steve
+>
+>
+>
+>
+With the traces enabled + the "migration: ram block cpr blockers" patch
+>
+applied:
+>
+>
+Source:
+>
+> cpr_find_fd pc.bios, id 0 returns -1
+>
+> cpr_save_fd pc.bios, id 0, fd 22
+>
+> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+>
+> 0x7fec18e00000
+>
+> cpr_find_fd pc.rom, id 0 returns -1
+>
+> cpr_save_fd pc.rom, id 0, fd 23
+>
+> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+>
+> 0x7fec18c00000
+>
+> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+>
+> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+>
+> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd
+>
+> 24 host 0x7fec18a00000
+>
+> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+>
+> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+>
+> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864
+>
+> fd 25 host 0x7feb77e00000
+>
+> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+>
+> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+>
+> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27
+>
+> host 0x7fec18800000
+>
+> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+>
+> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+>
+> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864
+>
+> fd 28 host 0x7feb73c00000
+>
+> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+>
+> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+>
+> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34
+>
+> host 0x7fec18600000
+>
+> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+>
+> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+>
+> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd
+>
+> 35 host 0x7fec18200000
+>
+> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+>
+> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+>
+> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36
+>
+> host 0x7feb8b600000
+>
+> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+>
+> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+>
+> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host
+>
+> 0x7feb8b400000
+>
+>
+>
+> cpr_state_save cpr-transfer mode
+>
+> cpr_transfer_output /var/run/alma8cpr-dst.sock
+>
+>
+Target:
+>
+> cpr_transfer_input /var/run/alma8cpr-dst.sock
+>
+> cpr_state_load cpr-transfer mode
+>
+> cpr_find_fd pc.bios, id 0 returns 20
+>
+> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+>
+> 0x7fcdc9800000
+>
+> cpr_find_fd pc.rom, id 0 returns 19
+>
+> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+>
+> 0x7fcdc9600000
+>
+> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+>
+> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd
+>
+> 18 host 0x7fcdc9400000
+>
+> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+>
+> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864
+>
+> fd 17 host 0x7fcd27e00000
+>
+> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+>
+> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16
+>
+> host 0x7fcdc9200000
+>
+> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+>
+> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864
+>
+> fd 15 host 0x7fcd23c00000
+>
+> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+>
+> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14
+>
+> host 0x7fcdc8800000
+>
+> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+>
+> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd
+>
+> 13 host 0x7fcdc8400000
+>
+> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+>
+> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11
+>
+> host 0x7fcdc8200000
+>
+> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+>
+> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host
+>
+> 0x7fcd3be00000
+>
+>
+Looks like both vga.vram and qxl.vram are being preserved (with the same
+>
+addresses), and no incompatible ram blocks are found during migration.
+>
+Sorry, addressed are not the same, of course.  However corresponding ram
+blocks do seem to be preserved and initialized.
+
+On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:20 PM, Steven Sistare wrote:
+On 2/28/2025 1:13 PM, Steven Sistare wrote:
+On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$EMULATOR -enable-kvm \
+ Â Â Â Â  -machine q35 \
+ Â Â Â Â  -cpu host -smp 2 -m 2G \
+ Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+ram0,share=on\
+ Â Â Â Â  -machine memory-backend=ram0 \
+ Â Â Â Â  -machine aux-ram-share=on \
+ Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+ Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+ Â Â Â Â  -nographic \
+ Â Â Â Â  -device qxl-vga
+Run migration target:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-dst.sock
+$EMULATOR -enable-kvm \
+ Â Â Â Â  -machine q35 \
+ Â Â Â Â  -cpu host -smp 2 -m 2G \
+ Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+ram0,share=on\
+ Â Â Â Â  -machine memory-backend=ram0 \
+ Â Â Â Â  -machine aux-ram-share=on \
+ Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+ Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+ Â Â Â Â  -nographic \
+ Â Â Â Â  -device qxl-vga \
+ Â Â Â Â  -incoming tcp:0:44444 \
+ Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$QMPSHELL -p $QMPSOCK <<EOF
+ Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+ Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+{"transport":"socket","type":"inet","host":"0","port":"44444"}},
+{"channel-type":"cpr","addr":
+{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+dst.sock"}}]
+EOF
+Then, after a while, QXL guest driver on target crashes spewing the
+following messages:
+[Â Â  73.962002] [TTM] Buffer eviction failed
+[Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+0x00000001)
+[Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+allocate VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which speeds up
+the crash in the guest):
+#!/bin/bash
+
+chvt 3
+
+for j in $(seq 80); do
+ Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+ Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate VRAM
+BO")" != "" ]; then
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+ Â Â Â Â Â Â Â Â  fi
+ Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+ Â Â Â Â Â Â Â Â  done
+done
+
+echo "bug could not be reproduced"
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the crash -
+without the cpr-transfer migration the above reproduce doesn't lead to
+crash on the source VM.
+
+I suspect that, as cpr-transfer doesn't migrate the guest memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+corruption so far.
+
+Could somebody help the investigation and take a look into this?Â  Any
+suggestions would be appreciated.Â  Thanks!
+Possibly some memory region created by qxl is not being preserved.
+Try adding these traces to see what is preserved:
+
+-trace enable='*cpr*'
+-trace enable='*ram_alloc*'
+Also try adding this patch to see if it flags any ram blocks as not
+compatible with cpr.Â  A message is printed at migration start time.
+ Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-email-
+steven.sistare@oracle.com/
+
+- Steve
+With the traces enabled + the "migration: ram block cpr blockers" patch
+applied:
+
+Source:
+cpr_find_fd pc.bios, id 0 returns -1
+cpr_save_fd pc.bios, id 0, fd 22
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host 
+0x7fec18e00000
+cpr_find_fd pc.rom, id 0 returns -1
+cpr_save_fd pc.rom, id 0, fd 23
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host 
+0x7fec18c00000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 24 
+host 0x7fec18a00000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 
+25 host 0x7feb77e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 27 host 
+0x7fec18800000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 
+28 host 0x7feb73c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 34 host 
+0x7fec18600000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 35 
+host 0x7fec18200000
+cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 36 host 
+0x7feb8b600000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 37 host 
+0x7feb8b400000
+
+cpr_state_save cpr-transfer mode
+cpr_transfer_output /var/run/alma8cpr-dst.sock
+Target:
+cpr_transfer_input /var/run/alma8cpr-dst.sock
+cpr_state_load cpr-transfer mode
+cpr_find_fd pc.bios, id 0 returns 20
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host 
+0x7fcdc9800000
+cpr_find_fd pc.rom, id 0 returns 19
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host 
+0x7fcdc9600000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size 262144 fd 18 
+host 0x7fcdc9400000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size 67108864 fd 
+17 host 0x7fcd27e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192 fd 16 host 
+0x7fcdc9200000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size 67108864 fd 
+15 host 0x7fcd23c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536 fd 14 host 
+0x7fcdc8800000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size 2097152 fd 13 
+host 0x7fcdc8400000
+cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536 fd 11 host 
+0x7fcdc8200000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd 10 host 
+0x7fcd3be00000
+Looks like both vga.vram and qxl.vram are being preserved (with the same
+addresses), and no incompatible ram blocks are found during migration.
+Sorry, addressed are not the same, of course.  However corresponding ram
+blocks do seem to be preserved and initialized.
+So far, I have not reproduced the guest driver failure.
+
+However, I have isolated places where new QEMU improperly writes to
+the qxl memory regions prior to starting the guest, by mmap'ing them
+readonly after cpr:
+
+  qemu_ram_alloc_internal()
+    if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
+        ram_flags |= RAM_READONLY;
+    new_block = qemu_ram_alloc_from_fd(...)
+
+I have attached a draft fix; try it and let me know.
+My console window looks fine before and after cpr, using
+-vnc $hostip:0 -vga qxl
+
+- Steve
+0001-hw-qxl-cpr-support-preliminary.patch
+Description:
+Text document
+
+On 3/4/25 9:05 PM, Steven Sistare wrote:
+>
+On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
+>
+> On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+>
+>> On 2/28/25 8:20 PM, Steven Sistare wrote:
+>
+>>> On 2/28/2025 1:13 PM, Steven Sistare wrote:
+>
+>>>> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+>
+>>>>> Hi all,
+>
+>>>>>
+>
+>>>>> We've been experimenting with cpr-transfer migration mode recently
+>
+>>>>> and
+>
+>>>>> have discovered the following issue with the guest QXL driver:
+>
+>>>>>
+>
+>>>>> Run migration source:
+>
+>>>>>> EMULATOR=/path/to/emulator
+>
+>>>>>> ROOTFS=/path/to/image
+>
+>>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>>>>>>
+>
+>>>>>> $EMULATOR -enable-kvm \
+>
+>>>>>> Â Â Â Â Â  -machine q35 \
+>
+>>>>>> Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+>
+>>>>>> Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+>
+>>>>>> ram0,share=on\
+>
+>>>>>> Â Â Â Â Â  -machine memory-backend=ram0 \
+>
+>>>>>> Â Â Â Â Â  -machine aux-ram-share=on \
+>
+>>>>>> Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+>
+>>>>>> Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+>
+>>>>>> Â Â Â Â Â  -nographic \
+>
+>>>>>> Â Â Â Â Â  -device qxl-vga
+>
+>>>>>
+>
+>>>>> Run migration target:
+>
+>>>>>> EMULATOR=/path/to/emulator
+>
+>>>>>> ROOTFS=/path/to/image
+>
+>>>>>> QMPSOCK=/var/run/alma8qmp-dst.sock
+>
+>>>>>> $EMULATOR -enable-kvm \
+>
+>>>>>> Â Â Â Â Â  -machine q35 \
+>
+>>>>>> Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+>
+>>>>>> Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+>
+>>>>>> ram0,share=on\
+>
+>>>>>> Â Â Â Â Â  -machine memory-backend=ram0 \
+>
+>>>>>> Â Â Â Â Â  -machine aux-ram-share=on \
+>
+>>>>>> Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+>
+>>>>>> Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+>
+>>>>>> Â Â Â Â Â  -nographic \
+>
+>>>>>> Â Â Â Â Â  -device qxl-vga \
+>
+>>>>>> Â Â Â Â Â  -incoming tcp:0:44444 \
+>
+>>>>>> Â Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+>
+>>>>>> "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+>
+>>>>>
+>
+>>>>>
+>
+>>>>> Launch the migration:
+>
+>>>>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+>
+>>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>>>>>>
+>
+>>>>>> $QMPSHELL -p $QMPSOCK <<EOF
+>
+>>>>>> Â Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+>
+>>>>>> Â Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+>
+>>>>>> {"transport":"socket","type":"inet","host":"0","port":"44444"}},
+>
+>>>>>> {"channel-type":"cpr","addr":
+>
+>>>>>> {"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+>
+>>>>>> dst.sock"}}]
+>
+>>>>>> EOF
+>
+>>>>>
+>
+>>>>> Then, after a while, QXL guest driver on target crashes spewing the
+>
+>>>>> following messages:
+>
+>>>>>> [Â Â  73.962002] [TTM] Buffer eviction failed
+>
+>>>>>> [Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+>
+>>>>>> 0x00000001)
+>
+>>>>>> [Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+>
+>>>>>> allocate VRAM BO
+>
+>>>>>
+>
+>>>>> That seems to be a known kernel QXL driver bug:
+>
+>>>>>
+>
+>>>>>
+https://lore.kernel.org/all/20220907094423.93581-1-
+>
+>>>>> min_halo@163.com/T/
+>
+>>>>>
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+>
+>>>>>
+>
+>>>>> (the latter discussion contains that reproduce script which speeds up
+>
+>>>>> the crash in the guest):
+>
+>>>>>> #!/bin/bash
+>
+>>>>>>
+>
+>>>>>> chvt 3
+>
+>>>>>>
+>
+>>>>>> for j in $(seq 80); do
+>
+>>>>>> Â Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+>
+>>>>>> Â Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate VRAM
+>
+>>>>>> BO")" != "" ]; then
+>
+>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+>
+>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+>
+>>>>>> Â Â Â Â Â Â Â Â Â  fi
+>
+>>>>>> Â Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+>
+>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+>
+>>>>>> Â Â Â Â Â Â Â Â Â  done
+>
+>>>>>> done
+>
+>>>>>>
+>
+>>>>>> echo "bug could not be reproduced"
+>
+>>>>>> exit 0
+>
+>>>>>
+>
+>>>>> The bug itself seems to remain unfixed, as I was able to reproduce
+>
+>>>>> that
+>
+>>>>> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+>
+>>>>> cpr-transfer code also seems to be buggy as it triggers the crash -
+>
+>>>>> without the cpr-transfer migration the above reproduce doesn't
+>
+>>>>> lead to
+>
+>>>>> crash on the source VM.
+>
+>>>>>
+>
+>>>>> I suspect that, as cpr-transfer doesn't migrate the guest memory, but
+>
+>>>>> rather passes it through the memory backend object, our code might
+>
+>>>>> somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+>
+>>>>> corruption so far.
+>
+>>>>>
+>
+>>>>> Could somebody help the investigation and take a look into this?Â  Any
+>
+>>>>> suggestions would be appreciated.Â  Thanks!
+>
+>>>>
+>
+>>>> Possibly some memory region created by qxl is not being preserved.
+>
+>>>> Try adding these traces to see what is preserved:
+>
+>>>>
+>
+>>>> -trace enable='*cpr*'
+>
+>>>> -trace enable='*ram_alloc*'
+>
+>>>
+>
+>>> Also try adding this patch to see if it flags any ram blocks as not
+>
+>>> compatible with cpr.Â  A message is printed at migration start time.
+>
+>>> Â Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
+>
+>>> email-
+>
+>>> steven.sistare@oracle.com/
+>
+>>>
+>
+>>> - Steve
+>
+>>>
+>
+>>
+>
+>> With the traces enabled + the "migration: ram block cpr blockers" patch
+>
+>> applied:
+>
+>>
+>
+>> Source:
+>
+>>> cpr_find_fd pc.bios, id 0 returns -1
+>
+>>> cpr_save_fd pc.bios, id 0, fd 22
+>
+>>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+>
+>>> 0x7fec18e00000
+>
+>>> cpr_find_fd pc.rom, id 0 returns -1
+>
+>>> cpr_save_fd pc.rom, id 0, fd 23
+>
+>>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+>
+>>> 0x7fec18c00000
+>
+>>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+>
+>>> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+>
+>>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+>
+>>> 262144 fd 24 host 0x7fec18a00000
+>
+>>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+>
+>>> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+>
+>>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+>
+>>> 67108864 fd 25 host 0x7feb77e00000
+>
+>>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+>
+>>> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+>
+>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+>
+>>> fd 27 host 0x7fec18800000
+>
+>>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+>
+>>> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+>
+>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+>
+>>> 67108864 fd 28 host 0x7feb73c00000
+>
+>>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+>
+>>> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+>
+>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+>
+>>> fd 34 host 0x7fec18600000
+>
+>>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+>
+>>> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+>
+>>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+>
+>>> 2097152 fd 35 host 0x7fec18200000
+>
+>>> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+>
+>>> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+>
+>>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+>
+>>> fd 36 host 0x7feb8b600000
+>
+>>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+>
+>>> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+>
+>>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+>
+>>> 37 host 0x7feb8b400000
+>
+>>>
+>
+>>> cpr_state_save cpr-transfer mode
+>
+>>> cpr_transfer_output /var/run/alma8cpr-dst.sock
+>
+>>
+>
+>> Target:
+>
+>>> cpr_transfer_input /var/run/alma8cpr-dst.sock
+>
+>>> cpr_state_load cpr-transfer mode
+>
+>>> cpr_find_fd pc.bios, id 0 returns 20
+>
+>>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+>
+>>> 0x7fcdc9800000
+>
+>>> cpr_find_fd pc.rom, id 0 returns 19
+>
+>>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+>
+>>> 0x7fcdc9600000
+>
+>>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+>
+>>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+>
+>>> 262144 fd 18 host 0x7fcdc9400000
+>
+>>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+>
+>>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+>
+>>> 67108864 fd 17 host 0x7fcd27e00000
+>
+>>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+>
+>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+>
+>>> fd 16 host 0x7fcdc9200000
+>
+>>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+>
+>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+>
+>>> 67108864 fd 15 host 0x7fcd23c00000
+>
+>>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+>
+>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+>
+>>> fd 14 host 0x7fcdc8800000
+>
+>>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+>
+>>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+>
+>>> 2097152 fd 13 host 0x7fcdc8400000
+>
+>>> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+>
+>>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+>
+>>> fd 11 host 0x7fcdc8200000
+>
+>>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+>
+>>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+>
+>>> 10 host 0x7fcd3be00000
+>
+>>
+>
+>> Looks like both vga.vram and qxl.vram are being preserved (with the same
+>
+>> addresses), and no incompatible ram blocks are found during migration.
+>
+>
+>
+> Sorry, addressed are not the same, of course.Â  However corresponding ram
+>
+> blocks do seem to be preserved and initialized.
+>
+>
+So far, I have not reproduced the guest driver failure.
+>
+>
+However, I have isolated places where new QEMU improperly writes to
+>
+the qxl memory regions prior to starting the guest, by mmap'ing them
+>
+readonly after cpr:
+>
+>
+Â  qemu_ram_alloc_internal()
+>
+Â Â Â  if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
+>
+Â Â Â Â Â Â Â  ram_flags |= RAM_READONLY;
+>
+Â Â Â  new_block = qemu_ram_alloc_from_fd(...)
+>
+>
+I have attached a draft fix; try it and let me know.
+>
+My console window looks fine before and after cpr, using
+>
+-vnc $hostip:0 -vga qxl
+>
+>
+- Steve
+Regarding the reproduce: when I launch the buggy version with the same
+options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
+my VNC client silently hangs on the target after a while.  Could it
+happen on your stand as well?  Could you try launching VM with
+"-nographic -device qxl-vga"?  That way VM's serial console is given you
+directly in the shell, so when qxl driver crashes you're still able to
+inspect the kernel messages.
+
+As for your patch, I can report that it doesn't resolve the issue as it
+is.  But I was able to track down another possible memory corruption
+using your approach with readonly mmap'ing:
+
+>
+Program terminated with signal SIGSEGV, Segmentation fault.
+>
+#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+>
+412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
+>
+[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
+>
+(gdb) bt
+>
+#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+>
+#1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
+>
+errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
+>
+#2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
+>
+errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
+>
+#3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
+>
+errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
+>
+#4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
+>
+value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
+>
+#5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
+>
+v=0x5638996f3770, name=0x56389759b141 "realized", opaque=0x5638987893d0,
+>
+errp=0x7ffd3c2b84e0)
+>
+at ../qom/object.c:2374
+>
+#6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
+>
+name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
+>
+at ../qom/object.c:1449
+>
+#7  0x00005638970f8586 in object_property_set_qobject (obj=0x5638996e0e70,
+>
+name=0x56389759b141 "realized", value=0x5638996df900, errp=0x7ffd3c2b84e0)
+>
+at ../qom/qom-qobject.c:28
+>
+#8  0x00005638970f3d8d in object_property_set_bool (obj=0x5638996e0e70,
+>
+name=0x56389759b141 "realized", value=true, errp=0x7ffd3c2b84e0)
+>
+at ../qom/object.c:1519
+>
+#9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
+>
+bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
+>
+#10 0x0000563896dba675 in qdev_device_add_from_qdict (opts=0x5638996dfe50,
+>
+from_json=false, errp=0x7ffd3c2b84e0) at ../system/qdev-monitor.c:714
+>
+#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
+>
+errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
+>
+#12 0x0000563896dc48f1 in device_init_func (opaque=0x0, opts=0x563898786150,
+>
+errp=0x56389855dc40 <error_fatal>) at ../system/vl.c:1207
+>
+#13 0x000056389737a6cc in qemu_opts_foreach
+>
+(list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
+>
+<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
+>
+at ../util/qemu-option.c:1135
+>
+#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/vl.c:2745
+>
+#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
+>
+<error_fatal>) at ../system/vl.c:2806
+>
+#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948) at
+>
+../system/vl.c:3838
+>
+#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at
+>
+../system/main.c:72
+So the attached adjusted version of your patch does seem to help.  At
+least I can't reproduce the crash on my stand.
+
+I'm wondering, could it be useful to explicitly mark all the reused
+memory regions readonly upon cpr-transfer, and then make them writable
+back again after the migration is done?  That way we will be segfaulting
+early on instead of debugging tricky memory corruptions.
+
+Andrey
+0001-hw-qxl-cpr-support-preliminary.patch
+Description:
+Text Data
+
+On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
+On 3/4/25 9:05 PM, Steven Sistare wrote:
+On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:20 PM, Steven Sistare wrote:
+On 2/28/2025 1:13 PM, Steven Sistare wrote:
+On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently
+and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$EMULATOR -enable-kvm \
+ Â Â Â Â Â  -machine q35 \
+ Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+ Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+ram0,share=on\
+ Â Â Â Â Â  -machine memory-backend=ram0 \
+ Â Â Â Â Â  -machine aux-ram-share=on \
+ Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+ Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+ Â Â Â Â Â  -nographic \
+ Â Â Â Â Â  -device qxl-vga
+Run migration target:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-dst.sock
+$EMULATOR -enable-kvm \
+ Â Â Â Â Â  -machine q35 \
+ Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+ Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+ram0,share=on\
+ Â Â Â Â Â  -machine memory-backend=ram0 \
+ Â Â Â Â Â  -machine aux-ram-share=on \
+ Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+ Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+ Â Â Â Â Â  -nographic \
+ Â Â Â Â Â  -device qxl-vga \
+ Â Â Â Â Â  -incoming tcp:0:44444 \
+ Â Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$QMPSHELL -p $QMPSOCK <<EOF
+ Â Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+ Â Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+{"transport":"socket","type":"inet","host":"0","port":"44444"}},
+{"channel-type":"cpr","addr":
+{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+dst.sock"}}]
+EOF
+Then, after a while, QXL guest driver on target crashes spewing the
+following messages:
+[Â Â  73.962002] [TTM] Buffer eviction failed
+[Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+0x00000001)
+[Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+allocate VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-
+min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which speeds up
+the crash in the guest):
+#!/bin/bash
+
+chvt 3
+
+for j in $(seq 80); do
+ Â Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+ Â Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate VRAM
+BO")" != "" ]; then
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+ Â Â Â Â Â Â Â Â Â  fi
+ Â Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+ Â Â Â Â Â Â Â Â Â  done
+done
+
+echo "bug could not be reproduced"
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce
+that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the crash -
+without the cpr-transfer migration the above reproduce doesn't
+lead to
+crash on the source VM.
+
+I suspect that, as cpr-transfer doesn't migrate the guest memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+corruption so far.
+
+Could somebody help the investigation and take a look into this?Â  Any
+suggestions would be appreciated.Â  Thanks!
+Possibly some memory region created by qxl is not being preserved.
+Try adding these traces to see what is preserved:
+
+-trace enable='*cpr*'
+-trace enable='*ram_alloc*'
+Also try adding this patch to see if it flags any ram blocks as not
+compatible with cpr.Â  A message is printed at migration start time.
+ Â Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
+email-
+steven.sistare@oracle.com/
+
+- Steve
+With the traces enabled + the "migration: ram block cpr blockers" patch
+applied:
+
+Source:
+cpr_find_fd pc.bios, id 0 returns -1
+cpr_save_fd pc.bios, id 0, fd 22
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+0x7fec18e00000
+cpr_find_fd pc.rom, id 0 returns -1
+cpr_save_fd pc.rom, id 0, fd 23
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+0x7fec18c00000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 24 host 0x7fec18a00000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 25 host 0x7feb77e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 27 host 0x7fec18800000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 28 host 0x7feb73c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 34 host 0x7fec18600000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 35 host 0x7fec18200000
+cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 36 host 0x7feb8b600000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+37 host 0x7feb8b400000
+
+cpr_state_save cpr-transfer mode
+cpr_transfer_output /var/run/alma8cpr-dst.sock
+Target:
+cpr_transfer_input /var/run/alma8cpr-dst.sock
+cpr_state_load cpr-transfer mode
+cpr_find_fd pc.bios, id 0 returns 20
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+0x7fcdc9800000
+cpr_find_fd pc.rom, id 0 returns 19
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+0x7fcdc9600000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 18 host 0x7fcdc9400000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 17 host 0x7fcd27e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 16 host 0x7fcdc9200000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 15 host 0x7fcd23c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 14 host 0x7fcdc8800000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 13 host 0x7fcdc8400000
+cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 11 host 0x7fcdc8200000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+10 host 0x7fcd3be00000
+Looks like both vga.vram and qxl.vram are being preserved (with the same
+addresses), and no incompatible ram blocks are found during migration.
+Sorry, addressed are not the same, of course.Â  However corresponding ram
+blocks do seem to be preserved and initialized.
+So far, I have not reproduced the guest driver failure.
+
+However, I have isolated places where new QEMU improperly writes to
+the qxl memory regions prior to starting the guest, by mmap'ing them
+readonly after cpr:
+
+ Â  qemu_ram_alloc_internal()
+ Â Â Â  if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
+ Â Â Â Â Â Â Â  ram_flags |= RAM_READONLY;
+ Â Â Â  new_block = qemu_ram_alloc_from_fd(...)
+
+I have attached a draft fix; try it and let me know.
+My console window looks fine before and after cpr, using
+-vnc $hostip:0 -vga qxl
+
+- Steve
+Regarding the reproduce: when I launch the buggy version with the same
+options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
+my VNC client silently hangs on the target after a while.  Could it
+happen on your stand as well?
+cpr does not preserve the vnc connection and session.  To test, I specify
+port 0 for the source VM and port 1 for the dest.  When the src vnc goes
+dormant the dest vnc becomes active.
+Could you try launching VM with
+"-nographic -device qxl-vga"?  That way VM's serial console is given you
+directly in the shell, so when qxl driver crashes you're still able to
+inspect the kernel messages.
+I have been running like that, but have not reproduced the qxl driver crash,
+and I suspect my guest image+kernel is too old.  However, once I realized the
+issue was post-cpr modification of qxl memory, I switched my attention to the
+fix.
+As for your patch, I can report that it doesn't resolve the issue as it
+is.  But I was able to track down another possible memory corruption
+using your approach with readonly mmap'ing:
+Program terminated with signal SIGSEGV, Segmentation fault.
+#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+412         d->ram->magic       = cpu_to_le32(QXL_RAM_MAGIC);
+[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
+(gdb) bt
+#0  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+#1  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70, 
+errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
+#2  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70, 
+errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
+#3  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70, 
+errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
+#4  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70, value=true, 
+errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
+#5  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70, v=0x5638996f3770, 
+name=0x56389759b141 "realized", opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
+     at ../qom/object.c:2374
+#6  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70, name=0x56389759b141 
+"realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
+     at ../qom/object.c:1449
+#7  0x00005638970f8586 in object_property_set_qobject (obj=0x5638996e0e70, 
+name=0x56389759b141 "realized", value=0x5638996df900, errp=0x7ffd3c2b84e0)
+     at ../qom/qom-qobject.c:28
+#8  0x00005638970f3d8d in object_property_set_bool (obj=0x5638996e0e70, 
+name=0x56389759b141 "realized", value=true, errp=0x7ffd3c2b84e0)
+     at ../qom/object.c:1519
+#9  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70, bus=0x563898cf3c20, 
+errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
+#10 0x0000563896dba675 in qdev_device_add_from_qdict (opts=0x5638996dfe50, 
+from_json=false, errp=0x7ffd3c2b84e0) at ../system/qdev-monitor.c:714
+#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150, errp=0x56389855dc40 
+<error_fatal>) at ../system/qdev-monitor.c:733
+#12 0x0000563896dc48f1 in device_init_func (opaque=0x0, opts=0x563898786150, 
+errp=0x56389855dc40 <error_fatal>) at ../system/vl.c:1207
+#13 0x000056389737a6cc in qemu_opts_foreach
+     (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca <device_init_func>, 
+opaque=0x0, errp=0x56389855dc40 <error_fatal>)
+     at ../util/qemu-option.c:1135
+#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/vl.c:2745
+#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40 
+<error_fatal>) at ../system/vl.c:2806
+#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948) at 
+../system/vl.c:3838
+#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at 
+../system/main.c:72
+So the attached adjusted version of your patch does seem to help.  At
+least I can't reproduce the crash on my stand.
+Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram are
+definitely harmful.  Try V2 of the patch, attached, which skips the lines
+of init_qxl_ram that modify guest memory.
+I'm wondering, could it be useful to explicitly mark all the reused
+memory regions readonly upon cpr-transfer, and then make them writable
+back again after the migration is done?  That way we will be segfaulting
+early on instead of debugging tricky memory corruptions.
+It's a useful debugging technique, but changing protection on a large memory 
+region
+can be too expensive for production due to TLB shootdowns.
+
+Also, there are cases where writes are performed but the value is guaranteed to
+be the same:
+  qxl_post_load()
+    qxl_set_mode()
+      d->rom->mode = cpu_to_le32(modenr);
+The value is the same because mode and shadow_rom.mode were passed in vmstate
+from old qemu.
+
+- Steve
+0001-hw-qxl-cpr-support-preliminary-V2.patch
+Description:
+Text document
+
+On 3/5/25 22:19, Steven Sistare wrote:
+On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
+On 3/4/25 9:05 PM, Steven Sistare wrote:
+On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:20 PM, Steven Sistare wrote:
+On 2/28/2025 1:13 PM, Steven Sistare wrote:
+On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently
+and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$EMULATOR -enable-kvm \
+Â Â Â Â Â Â  -machine q35 \
+Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+Â Â Â Â Â Â  -object
+memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+ram0,share=on\
+Â Â Â Â Â Â  -machine memory-backend=ram0 \
+Â Â Â Â Â Â  -machine aux-ram-share=on \
+Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+Â Â Â Â Â Â  -nographic \
+Â Â Â Â Â Â  -device qxl-vga
+Run migration target:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-dst.sock
+$EMULATOR -enable-kvm \
+Â Â Â Â Â Â  -machine q35 \
+Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+Â Â Â Â Â Â  -object
+memory-backend-file,id=ram0,size=2G,mem-path=/dev/shm/
+ram0,share=on\
+Â Â Â Â Â Â  -machine memory-backend=ram0 \
+Â Â Â Â Â Â  -machine aux-ram-share=on \
+Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+Â Â Â Â Â Â  -nographic \
+Â Â Â Â Â Â  -device qxl-vga \
+Â Â Â Â Â Â  -incoming tcp:0:44444 \
+Â Â Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$QMPSHELL -p $QMPSOCK <<EOF
+Â Â Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+Â Â Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+{"transport":"socket","type":"inet","host":"0","port":"44444"}},
+{"channel-type":"cpr","addr":
+{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+dst.sock"}}]
+EOF
+Then, after a while, QXL guest driver on target crashes spewing
+the
+following messages:
+[Â Â  73.962002] [TTM] Buffer eviction failed
+[Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+0x00000001)
+[Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR*
+failed to
+allocate VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-
+min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which
+speeds up
+the crash in the guest):
+#!/bin/bash
+
+chvt 3
+
+for j in $(seq 80); do
+Â Â Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+Â Â Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to
+allocate VRAM
+BO")" != "" ]; then
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+Â Â Â Â Â Â Â Â Â Â  fi
+Â Â Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+Â Â Â Â Â Â Â Â Â Â  done
+done
+
+echo "bug could not be reproduced"
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce
+that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the
+crash -
+without the cpr-transfer migration the above reproduce doesn't
+lead to
+crash on the source VM.
+I suspect that, as cpr-transfer doesn't migrate the guest
+memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+corruption so far.
+Could somebody help the investigation and take a look into
+this?Â  Any
+suggestions would be appreciated.Â  Thanks!
+Possibly some memory region created by qxl is not being preserved.
+Try adding these traces to see what is preserved:
+
+-trace enable='*cpr*'
+-trace enable='*ram_alloc*'
+Also try adding this patch to see if it flags any ram blocks as not
+compatible with cpr.Â  A message is printed at migration start time.
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
+email-
+steven.sistare@oracle.com/
+
+- Steve
+With the traces enabled + the "migration: ram block cpr blockers"
+patch
+applied:
+
+Source:
+cpr_find_fd pc.bios, id 0 returns -1
+cpr_save_fd pc.bios, id 0, fd 22
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+0x7fec18e00000
+cpr_find_fd pc.rom, id 0 returns -1
+cpr_save_fd pc.rom, id 0, fd 23
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+0x7fec18c00000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 24 host 0x7fec18a00000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 25 host 0x7feb77e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 27 host 0x7fec18800000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 28 host 0x7feb73c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 34 host 0x7fec18600000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 35 host 0x7fec18200000
+cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 36 host 0x7feb8b600000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+37 host 0x7feb8b400000
+
+cpr_state_save cpr-transfer mode
+cpr_transfer_output /var/run/alma8cpr-dst.sock
+Target:
+cpr_transfer_input /var/run/alma8cpr-dst.sock
+cpr_state_load cpr-transfer mode
+cpr_find_fd pc.bios, id 0 returns 20
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+0x7fcdc9800000
+cpr_find_fd pc.rom, id 0 returns 19
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+0x7fcdc9600000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 18 host 0x7fcdc9400000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 17 host 0x7fcd27e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 16 host 0x7fcdc9200000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 15 host 0x7fcd23c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 14 host 0x7fcdc8800000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 13 host 0x7fcdc8400000
+cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 11 host 0x7fcdc8200000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+10 host 0x7fcd3be00000
+Looks like both vga.vram and qxl.vram are being preserved (with
+the same
+addresses), and no incompatible ram blocks are found during
+migration.
+Sorry, addressed are not the same, of course.Â  However
+corresponding ram
+blocks do seem to be preserved and initialized.
+So far, I have not reproduced the guest driver failure.
+
+However, I have isolated places where new QEMU improperly writes to
+the qxl memory regions prior to starting the guest, by mmap'ing them
+readonly after cpr:
+
+Â Â  qemu_ram_alloc_internal()
+Â Â Â Â  if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
+Â Â Â Â Â Â Â Â  ram_flags |= RAM_READONLY;
+Â Â Â Â  new_block = qemu_ram_alloc_from_fd(...)
+
+I have attached a draft fix; try it and let me know.
+My console window looks fine before and after cpr, using
+-vnc $hostip:0 -vga qxl
+
+- Steve
+Regarding the reproduce: when I launch the buggy version with the same
+options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
+my VNC client silently hangs on the target after a while.Â  Could it
+happen on your stand as well?
+cpr does not preserve the vnc connection and session.Â  To test, I specify
+port 0 for the source VM and port 1 for the dest.Â  When the src vnc goes
+dormant the dest vnc becomes active.
+Could you try launching VM with
+"-nographic -device qxl-vga"?Â  That way VM's serial console is given you
+directly in the shell, so when qxl driver crashes you're still able to
+inspect the kernel messages.
+I have been running like that, but have not reproduced the qxl driver
+crash,
+and I suspect my guest image+kernel is too old.Â  However, once I
+realized the
+issue was post-cpr modification of qxl memory, I switched my attention
+to the
+fix.
+As for your patch, I can report that it doesn't resolve the issue as it
+is.Â  But I was able to track down another possible memory corruption
+using your approach with readonly mmap'ing:
+Program terminated with signal SIGSEGV, Segmentation fault.
+#0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+412Â Â Â Â Â Â Â Â  d->ram->magicÂ Â Â Â Â Â  = cpu_to_le32(QXL_RAM_MAGIC);
+[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
+(gdb) bt
+#0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+#1Â  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
+errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
+#2Â  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
+errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
+#3Â  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
+errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
+#4Â  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
+value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
+#5Â  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
+v=0x5638996f3770, name=0x56389759b141 "realized",
+opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
+Â Â Â Â  at ../qom/object.c:2374
+#6Â  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
+name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
+Â Â Â Â  at ../qom/object.c:1449
+#7Â  0x00005638970f8586 in object_property_set_qobject
+(obj=0x5638996e0e70, name=0x56389759b141 "realized",
+value=0x5638996df900, errp=0x7ffd3c2b84e0)
+Â Â Â Â  at ../qom/qom-qobject.c:28
+#8Â  0x00005638970f3d8d in object_property_set_bool
+(obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
+errp=0x7ffd3c2b84e0)
+Â Â Â Â  at ../qom/object.c:1519
+#9Â  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
+bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
+#10 0x0000563896dba675 in qdev_device_add_from_qdict
+(opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at
+../system/qdev-monitor.c:714
+#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
+errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
+#12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
+opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at
+../system/vl.c:1207
+#13 0x000056389737a6cc in qemu_opts_foreach
+Â Â Â Â  (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
+<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
+Â Â Â Â  at ../util/qemu-option.c:1135
+#14 0x0000563896dc89b5 in qemu_create_cli_devices () at
+../system/vl.c:2745
+#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
+<error_fatal>) at ../system/vl.c:2806
+#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
+at ../system/vl.c:3838
+#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at
+../system/main.c:72
+So the attached adjusted version of your patch does seem to help.Â  At
+least I can't reproduce the crash on my stand.
+Thanks for the stack trace; the calls to SPICE_RING_INIT in
+init_qxl_ram are
+definitely harmful.Â  Try V2 of the patch, attached, which skips the lines
+of init_qxl_ram that modify guest memory.
+I'm wondering, could it be useful to explicitly mark all the reused
+memory regions readonly upon cpr-transfer, and then make them writable
+back again after the migration is done?Â  That way we will be segfaulting
+early on instead of debugging tricky memory corruptions.
+It's a useful debugging technique, but changing protection on a large
+memory region
+can be too expensive for production due to TLB shootdowns.
+Good point. Though we could move this code under non-default option to
+avoid re-writing.
+
+Den
+
+On 3/5/25 11:19 PM, Steven Sistare wrote:
+>
+On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
+>
+> On 3/4/25 9:05 PM, Steven Sistare wrote:
+>
+>> On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
+>
+>>> On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+>
+>>>> On 2/28/25 8:20 PM, Steven Sistare wrote:
+>
+>>>>> On 2/28/2025 1:13 PM, Steven Sistare wrote:
+>
+>>>>>> On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+>
+>>>>>>> Hi all,
+>
+>>>>>>>
+>
+>>>>>>> We've been experimenting with cpr-transfer migration mode recently
+>
+>>>>>>> and
+>
+>>>>>>> have discovered the following issue with the guest QXL driver:
+>
+>>>>>>>
+>
+>>>>>>> Run migration source:
+>
+>>>>>>>> EMULATOR=/path/to/emulator
+>
+>>>>>>>> ROOTFS=/path/to/image
+>
+>>>>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>>>>>>>>
+>
+>>>>>>>> $EMULATOR -enable-kvm \
+>
+>>>>>>>> Â Â Â Â Â Â  -machine q35 \
+>
+>>>>>>>> Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+>
+>>>>>>>> Â Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/
+>
+>>>>>>>> dev/shm/
+>
+>>>>>>>> ram0,share=on\
+>
+>>>>>>>> Â Â Â Â Â Â  -machine memory-backend=ram0 \
+>
+>>>>>>>> Â Â Â Â Â Â  -machine aux-ram-share=on \
+>
+>>>>>>>> Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+>
+>>>>>>>> Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+>
+>>>>>>>> Â Â Â Â Â Â  -nographic \
+>
+>>>>>>>> Â Â Â Â Â Â  -device qxl-vga
+>
+>>>>>>>
+>
+>>>>>>> Run migration target:
+>
+>>>>>>>> EMULATOR=/path/to/emulator
+>
+>>>>>>>> ROOTFS=/path/to/image
+>
+>>>>>>>> QMPSOCK=/var/run/alma8qmp-dst.sock
+>
+>>>>>>>> $EMULATOR -enable-kvm \
+>
+>>>>>>>> Â Â Â Â Â Â  -machine q35 \
+>
+>>>>>>>> Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+>
+>>>>>>>> Â Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/
+>
+>>>>>>>> dev/shm/
+>
+>>>>>>>> ram0,share=on\
+>
+>>>>>>>> Â Â Â Â Â Â  -machine memory-backend=ram0 \
+>
+>>>>>>>> Â Â Â Â Â Â  -machine aux-ram-share=on \
+>
+>>>>>>>> Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+>
+>>>>>>>> Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+>
+>>>>>>>> Â Â Â Â Â Â  -nographic \
+>
+>>>>>>>> Â Â Â Â Â Â  -device qxl-vga \
+>
+>>>>>>>> Â Â Â Â Â Â  -incoming tcp:0:44444 \
+>
+>>>>>>>> Â Â Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+>
+>>>>>>>> "socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+>
+>>>>>>>
+>
+>>>>>>>
+>
+>>>>>>> Launch the migration:
+>
+>>>>>>>> QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+>
+>>>>>>>> QMPSOCK=/var/run/alma8qmp-src.sock
+>
+>>>>>>>>
+>
+>>>>>>>> $QMPSHELL -p $QMPSOCK <<EOF
+>
+>>>>>>>> Â Â Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+>
+>>>>>>>> Â Â Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+>
+>>>>>>>> {"transport":"socket","type":"inet","host":"0","port":"44444"}},
+>
+>>>>>>>> {"channel-type":"cpr","addr":
+>
+>>>>>>>> {"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+>
+>>>>>>>> dst.sock"}}]
+>
+>>>>>>>> EOF
+>
+>>>>>>>
+>
+>>>>>>> Then, after a while, QXL guest driver on target crashes spewing the
+>
+>>>>>>> following messages:
+>
+>>>>>>>> [Â Â  73.962002] [TTM] Buffer eviction failed
+>
+>>>>>>>> [Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+>
+>>>>>>>> 0x00000001)
+>
+>>>>>>>> [Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+>
+>>>>>>>> allocate VRAM BO
+>
+>>>>>>>
+>
+>>>>>>> That seems to be a known kernel QXL driver bug:
+>
+>>>>>>>
+>
+>>>>>>>
+https://lore.kernel.org/all/20220907094423.93581-1-
+>
+>>>>>>> min_halo@163.com/T/
+>
+>>>>>>>
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+>
+>>>>>>>
+>
+>>>>>>> (the latter discussion contains that reproduce script which
+>
+>>>>>>> speeds up
+>
+>>>>>>> the crash in the guest):
+>
+>>>>>>>> #!/bin/bash
+>
+>>>>>>>>
+>
+>>>>>>>> chvt 3
+>
+>>>>>>>>
+>
+>>>>>>>> for j in $(seq 80); do
+>
+>>>>>>>> Â Â Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+>
+>>>>>>>> Â Â Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate
+>
+>>>>>>>> VRAM
+>
+>>>>>>>> BO")" != "" ]; then
+>
+>>>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+>
+>>>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+>
+>>>>>>>> Â Â Â Â Â Â Â Â Â Â  fi
+>
+>>>>>>>> Â Â Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+>
+>>>>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+>
+>>>>>>>> Â Â Â Â Â Â Â Â Â Â  done
+>
+>>>>>>>> done
+>
+>>>>>>>>
+>
+>>>>>>>> echo "bug could not be reproduced"
+>
+>>>>>>>> exit 0
+>
+>>>>>>>
+>
+>>>>>>> The bug itself seems to remain unfixed, as I was able to reproduce
+>
+>>>>>>> that
+>
+>>>>>>> with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+>
+>>>>>>> cpr-transfer code also seems to be buggy as it triggers the crash -
+>
+>>>>>>> without the cpr-transfer migration the above reproduce doesn't
+>
+>>>>>>> lead to
+>
+>>>>>>> crash on the source VM.
+>
+>>>>>>>
+>
+>>>>>>> I suspect that, as cpr-transfer doesn't migrate the guest
+>
+>>>>>>> memory, but
+>
+>>>>>>> rather passes it through the memory backend object, our code might
+>
+>>>>>>> somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+>
+>>>>>>> corruption so far.
+>
+>>>>>>>
+>
+>>>>>>> Could somebody help the investigation and take a look into
+>
+>>>>>>> this?Â  Any
+>
+>>>>>>> suggestions would be appreciated.Â  Thanks!
+>
+>>>>>>
+>
+>>>>>> Possibly some memory region created by qxl is not being preserved.
+>
+>>>>>> Try adding these traces to see what is preserved:
+>
+>>>>>>
+>
+>>>>>> -trace enable='*cpr*'
+>
+>>>>>> -trace enable='*ram_alloc*'
+>
+>>>>>
+>
+>>>>> Also try adding this patch to see if it flags any ram blocks as not
+>
+>>>>> compatible with cpr.Â  A message is printed at migration start time.
+>
+>>>>> Â Â Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
+>
+>>>>> email-
+>
+>>>>> steven.sistare@oracle.com/
+>
+>>>>>
+>
+>>>>> - Steve
+>
+>>>>>
+>
+>>>>
+>
+>>>> With the traces enabled + the "migration: ram block cpr blockers"
+>
+>>>> patch
+>
+>>>> applied:
+>
+>>>>
+>
+>>>> Source:
+>
+>>>>> cpr_find_fd pc.bios, id 0 returns -1
+>
+>>>>> cpr_save_fd pc.bios, id 0, fd 22
+>
+>>>>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+>
+>>>>> 0x7fec18e00000
+>
+>>>>> cpr_find_fd pc.rom, id 0 returns -1
+>
+>>>>> cpr_save_fd pc.rom, id 0, fd 23
+>
+>>>>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+>
+>>>>> 0x7fec18c00000
+>
+>>>>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+>
+>>>>> cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+>
+>>>>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+>
+>>>>> 262144 fd 24 host 0x7fec18a00000
+>
+>>>>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+>
+>>>>> cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+>
+>>>>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+>
+>>>>> 67108864 fd 25 host 0x7feb77e00000
+>
+>>>>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+>
+>>>>> cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+>
+>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+>
+>>>>> fd 27 host 0x7fec18800000
+>
+>>>>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+>
+>>>>> cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+>
+>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+>
+>>>>> 67108864 fd 28 host 0x7feb73c00000
+>
+>>>>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+>
+>>>>> cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+>
+>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+>
+>>>>> fd 34 host 0x7fec18600000
+>
+>>>>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+>
+>>>>> cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+>
+>>>>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+>
+>>>>> 2097152 fd 35 host 0x7fec18200000
+>
+>>>>> cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+>
+>>>>> cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+>
+>>>>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+>
+>>>>> fd 36 host 0x7feb8b600000
+>
+>>>>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+>
+>>>>> cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+>
+>>>>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+>
+>>>>> 37 host 0x7feb8b400000
+>
+>>>>>
+>
+>>>>> cpr_state_save cpr-transfer mode
+>
+>>>>> cpr_transfer_output /var/run/alma8cpr-dst.sock
+>
+>>>>
+>
+>>>> Target:
+>
+>>>>> cpr_transfer_input /var/run/alma8cpr-dst.sock
+>
+>>>>> cpr_state_load cpr-transfer mode
+>
+>>>>> cpr_find_fd pc.bios, id 0 returns 20
+>
+>>>>> qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+>
+>>>>> 0x7fcdc9800000
+>
+>>>>> cpr_find_fd pc.rom, id 0 returns 19
+>
+>>>>> qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+>
+>>>>> 0x7fcdc9600000
+>
+>>>>> cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+>
+>>>>> qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+>
+>>>>> 262144 fd 18 host 0x7fcdc9400000
+>
+>>>>> cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+>
+>>>>> qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+>
+>>>>> 67108864 fd 17 host 0x7fcd27e00000
+>
+>>>>> cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+>
+>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+>
+>>>>> fd 16 host 0x7fcdc9200000
+>
+>>>>> cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+>
+>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+>
+>>>>> 67108864 fd 15 host 0x7fcd23c00000
+>
+>>>>> cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+>
+>>>>> qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+>
+>>>>> fd 14 host 0x7fcdc8800000
+>
+>>>>> cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+>
+>>>>> qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+>
+>>>>> 2097152 fd 13 host 0x7fcdc8400000
+>
+>>>>> cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+>
+>>>>> qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+>
+>>>>> fd 11 host 0x7fcdc8200000
+>
+>>>>> cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+>
+>>>>> qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+>
+>>>>> 10 host 0x7fcd3be00000
+>
+>>>>
+>
+>>>> Looks like both vga.vram and qxl.vram are being preserved (with the
+>
+>>>> same
+>
+>>>> addresses), and no incompatible ram blocks are found during migration.
+>
+>>>
+>
+>>> Sorry, addressed are not the same, of course.Â  However corresponding
+>
+>>> ram
+>
+>>> blocks do seem to be preserved and initialized.
+>
+>>
+>
+>> So far, I have not reproduced the guest driver failure.
+>
+>>
+>
+>> However, I have isolated places where new QEMU improperly writes to
+>
+>> the qxl memory regions prior to starting the guest, by mmap'ing them
+>
+>> readonly after cpr:
+>
+>>
+>
+>> Â Â  qemu_ram_alloc_internal()
+>
+>> Â Â Â Â  if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
+>
+>> Â Â Â Â Â Â Â Â  ram_flags |= RAM_READONLY;
+>
+>> Â Â Â Â  new_block = qemu_ram_alloc_from_fd(...)
+>
+>>
+>
+>> I have attached a draft fix; try it and let me know.
+>
+>> My console window looks fine before and after cpr, using
+>
+>> -vnc $hostip:0 -vga qxl
+>
+>>
+>
+>> - Steve
+>
+>
+>
+> Regarding the reproduce: when I launch the buggy version with the same
+>
+> options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
+>
+> my VNC client silently hangs on the target after a while.Â  Could it
+>
+> happen on your stand as well?Â
+>
+>
+cpr does not preserve the vnc connection and session.Â  To test, I specify
+>
+port 0 for the source VM and port 1 for the dest.Â  When the src vnc goes
+>
+dormant the dest vnc becomes active.
+>
+Sure, I meant that VNC on the dest (on the port 1) works for a while
+after the migration and then hangs, apparently after the guest QXL crash.
+
+>
+> Could you try launching VM with
+>
+> "-nographic -device qxl-vga"?Â  That way VM's serial console is given you
+>
+> directly in the shell, so when qxl driver crashes you're still able to
+>
+> inspect the kernel messages.
+>
+>
+I have been running like that, but have not reproduced the qxl driver
+>
+crash,
+>
+and I suspect my guest image+kernel is too old.
+Yes, that's probably the case.  But the crash occurs on my Fedora 41
+guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
+be buggy.
+
+
+>
+However, once I realized the
+>
+issue was post-cpr modification of qxl memory, I switched my attention
+>
+to the
+>
+fix.
+>
+>
+> As for your patch, I can report that it doesn't resolve the issue as it
+>
+> is.Â  But I was able to track down another possible memory corruption
+>
+> using your approach with readonly mmap'ing:
+>
+>
+>
+>> Program terminated with signal SIGSEGV, Segmentation fault.
+>
+>> #0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+>
+>> 412Â Â Â Â Â Â Â Â  d->ram->magicÂ Â Â Â Â Â  = cpu_to_le32(QXL_RAM_MAGIC);
+>
+>> [Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
+>
+>> (gdb) bt
+>
+>> #0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+>
+>> #1Â  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
+>
+>> errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
+>
+>> #2Â  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
+>
+>> errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
+>
+>> #3Â  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
+>
+>> errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
+>
+>> #4Â  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
+>
+>> value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
+>
+>> #5Â  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
+>
+>> v=0x5638996f3770, name=0x56389759b141 "realized",
+>
+>> opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
+>
+>> Â Â Â Â  at ../qom/object.c:2374
+>
+>> #6Â  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
+>
+>> name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
+>
+>> Â Â Â Â  at ../qom/object.c:1449
+>
+>> #7Â  0x00005638970f8586 in object_property_set_qobject
+>
+>> (obj=0x5638996e0e70, name=0x56389759b141 "realized",
+>
+>> value=0x5638996df900, errp=0x7ffd3c2b84e0)
+>
+>> Â Â Â Â  at ../qom/qom-qobject.c:28
+>
+>> #8Â  0x00005638970f3d8d in object_property_set_bool
+>
+>> (obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
+>
+>> errp=0x7ffd3c2b84e0)
+>
+>> Â Â Â Â  at ../qom/object.c:1519
+>
+>> #9Â  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
+>
+>> bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
+>
+>> #10 0x0000563896dba675 in qdev_device_add_from_qdict
+>
+>> (opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../
+>
+>> system/qdev-monitor.c:714
+>
+>> #11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
+>
+>> errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
+>
+>> #12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
+>
+>> opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/
+>
+>> vl.c:1207
+>
+>> #13 0x000056389737a6cc in qemu_opts_foreach
+>
+>> Â Â Â Â  (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
+>
+>> <device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
+>
+>> Â Â Â Â  at ../util/qemu-option.c:1135
+>
+>> #14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/
+>
+>> vl.c:2745
+>
+>> #15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
+>
+>> <error_fatal>) at ../system/vl.c:2806
+>
+>> #16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
+>
+>> at ../system/vl.c:3838
+>
+>> #17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../
+>
+>> system/main.c:72
+>
+>
+>
+> So the attached adjusted version of your patch does seem to help.Â  At
+>
+> least I can't reproduce the crash on my stand.
+>
+>
+Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
+>
+are
+>
+definitely harmful.Â  Try V2 of the patch, attached, which skips the lines
+>
+of init_qxl_ram that modify guest memory.
+>
+Thanks, your v2 patch does seem to prevent the crash.  Would you re-send
+it to the list as a proper fix?
+
+>
+> I'm wondering, could it be useful to explicitly mark all the reused
+>
+> memory regions readonly upon cpr-transfer, and then make them writable
+>
+> back again after the migration is done?Â  That way we will be segfaulting
+>
+> early on instead of debugging tricky memory corruptions.
+>
+>
+It's a useful debugging technique, but changing protection on a large
+>
+memory region
+>
+can be too expensive for production due to TLB shootdowns.
+>
+>
+Also, there are cases where writes are performed but the value is
+>
+guaranteed to
+>
+be the same:
+>
+Â  qxl_post_load()
+>
+Â Â Â  qxl_set_mode()
+>
+Â Â Â Â Â  d->rom->mode = cpu_to_le32(modenr);
+>
+The value is the same because mode and shadow_rom.mode were passed in
+>
+vmstate
+>
+from old qemu.
+>
+There're also cases where devices' ROM might be re-initialized.  E.g.
+this segfault occures upon further exploration of RO mapped RAM blocks:
+
+>
+Program terminated with signal SIGSEGV, Segmentation fault.
+>
+#0  __memmove_avx_unaligned_erms () at
+>
+../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
+>
+664             rep     movsb
+>
+[Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
+>
+(gdb) bt
+>
+#0  __memmove_avx_unaligned_erms () at
+>
+../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
+>
+#1  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380,
+>
+owner=0x55aa2019ac10, name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true)
+>
+at ../hw/core/loader.c:1032
+>
+#2  0x000055aa1d031577 in rom_add_blob
+>
+(name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072,
+>
+max_len=2097152, addr=18446744073709551615, fw_file_name=0x55aa1da51f13
+>
+"etc/acpi/tables", fw_callback=0x55aa1d441f59 <acpi_build_update>,
+>
+callback_opaque=0x55aa20ff0010, as=0x0, read_only=true) at
+>
+../hw/core/loader.c:1147
+>
+#3  0x000055aa1cfd788d in acpi_add_rom_blob
+>
+(update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010,
+>
+blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at
+>
+../hw/acpi/utils.c:46
+>
+#4  0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
+>
+#5  0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0)
+>
+at ../hw/i386/pc.c:638
+>
+#6  0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10
+>
+<machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
+>
+#7  0x000055aa1d039ee5 in qdev_machine_creation_done () at
+>
+../hw/core/machine.c:1749
+>
+#8  0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40
+>
+<error_fatal>) at ../system/vl.c:2779
+>
+#9  0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40
+>
+<error_fatal>) at ../system/vl.c:2807
+>
+#10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at
+>
+../system/vl.c:3838
+>
+#11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at
+>
+../system/main.c:72
+I'm not sure whether ACPI tables ROM in particular is rewritten with the
+same content, but there might be cases where ROM can be read from file
+system upon initialization.  That is undesirable as guest kernel
+certainly won't be too happy about sudden change of the device's ROM
+content.
+
+So the issue we're dealing with here is any unwanted memory related
+device initialization upon cpr.
+
+For now the only thing that comes to my mind is to make a test where we
+put as many devices as we can into a VM, make ram blocks RO upon cpr
+(and remap them as RW later after migration is done, if needed), and
+catch any unwanted memory violations.  As Den suggested, we might
+consider adding that behaviour as a separate non-default option (or
+"migrate" command flag specific to cpr-transfer), which would only be
+used in the testing.
+
+Andrey
+
+On 3/6/25 16:16, Andrey Drobyshev wrote:
+On 3/5/25 11:19 PM, Steven Sistare wrote:
+On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
+On 3/4/25 9:05 PM, Steven Sistare wrote:
+On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:20 PM, Steven Sistare wrote:
+On 2/28/2025 1:13 PM, Steven Sistare wrote:
+On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently
+and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$EMULATOR -enable-kvm \
+ Â Â Â Â Â Â  -machine q35 \
+ Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+ Â Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/
+dev/shm/
+ram0,share=on\
+ Â Â Â Â Â Â  -machine memory-backend=ram0 \
+ Â Â Â Â Â Â  -machine aux-ram-share=on \
+ Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+ Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+ Â Â Â Â Â Â  -nographic \
+ Â Â Â Â Â Â  -device qxl-vga
+Run migration target:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-dst.sock
+$EMULATOR -enable-kvm \
+ Â Â Â Â Â Â  -machine q35 \
+ Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+ Â Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/
+dev/shm/
+ram0,share=on\
+ Â Â Â Â Â Â  -machine memory-backend=ram0 \
+ Â Â Â Â Â Â  -machine aux-ram-share=on \
+ Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+ Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+ Â Â Â Â Â Â  -nographic \
+ Â Â Â Â Â Â  -device qxl-vga \
+ Â Â Â Â Â Â  -incoming tcp:0:44444 \
+ Â Â Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$QMPSHELL -p $QMPSOCK <<EOF
+ Â Â Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+ Â Â Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+{"transport":"socket","type":"inet","host":"0","port":"44444"}},
+{"channel-type":"cpr","addr":
+{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+dst.sock"}}]
+EOF
+Then, after a while, QXL guest driver on target crashes spewing the
+following messages:
+[Â Â  73.962002] [TTM] Buffer eviction failed
+[Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+0x00000001)
+[Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+allocate VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-
+min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which
+speeds up
+the crash in the guest):
+#!/bin/bash
+
+chvt 3
+
+for j in $(seq 80); do
+ Â Â Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+ Â Â Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate
+VRAM
+BO")" != "" ]; then
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+ Â Â Â Â Â Â Â Â Â Â  fi
+ Â Â Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+ Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+ Â Â Â Â Â Â Â Â Â Â  done
+done
+
+echo "bug could not be reproduced"
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce
+that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the crash -
+without the cpr-transfer migration the above reproduce doesn't
+lead to
+crash on the source VM.
+
+I suspect that, as cpr-transfer doesn't migrate the guest
+memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+corruption so far.
+
+Could somebody help the investigation and take a look into
+this?Â  Any
+suggestions would be appreciated.Â  Thanks!
+Possibly some memory region created by qxl is not being preserved.
+Try adding these traces to see what is preserved:
+
+-trace enable='*cpr*'
+-trace enable='*ram_alloc*'
+Also try adding this patch to see if it flags any ram blocks as not
+compatible with cpr.Â  A message is printed at migration start time.
+ Â Â Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
+email-
+steven.sistare@oracle.com/
+
+- Steve
+With the traces enabled + the "migration: ram block cpr blockers"
+patch
+applied:
+
+Source:
+cpr_find_fd pc.bios, id 0 returns -1
+cpr_save_fd pc.bios, id 0, fd 22
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+0x7fec18e00000
+cpr_find_fd pc.rom, id 0 returns -1
+cpr_save_fd pc.rom, id 0, fd 23
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+0x7fec18c00000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 24 host 0x7fec18a00000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 25 host 0x7feb77e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 27 host 0x7fec18800000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 28 host 0x7feb73c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 34 host 0x7fec18600000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 35 host 0x7fec18200000
+cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 36 host 0x7feb8b600000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+37 host 0x7feb8b400000
+
+cpr_state_save cpr-transfer mode
+cpr_transfer_output /var/run/alma8cpr-dst.sock
+Target:
+cpr_transfer_input /var/run/alma8cpr-dst.sock
+cpr_state_load cpr-transfer mode
+cpr_find_fd pc.bios, id 0 returns 20
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+0x7fcdc9800000
+cpr_find_fd pc.rom, id 0 returns 19
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+0x7fcdc9600000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 18 host 0x7fcdc9400000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 17 host 0x7fcd27e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 16 host 0x7fcdc9200000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 15 host 0x7fcd23c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 14 host 0x7fcdc8800000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 13 host 0x7fcdc8400000
+cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 11 host 0x7fcdc8200000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+10 host 0x7fcd3be00000
+Looks like both vga.vram and qxl.vram are being preserved (with the
+same
+addresses), and no incompatible ram blocks are found during migration.
+Sorry, addressed are not the same, of course.Â  However corresponding
+ram
+blocks do seem to be preserved and initialized.
+So far, I have not reproduced the guest driver failure.
+
+However, I have isolated places where new QEMU improperly writes to
+the qxl memory regions prior to starting the guest, by mmap'ing them
+readonly after cpr:
+
+ Â Â  qemu_ram_alloc_internal()
+ Â Â Â Â  if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
+ Â Â Â Â Â Â Â Â  ram_flags |= RAM_READONLY;
+ Â Â Â Â  new_block = qemu_ram_alloc_from_fd(...)
+
+I have attached a draft fix; try it and let me know.
+My console window looks fine before and after cpr, using
+-vnc $hostip:0 -vga qxl
+
+- Steve
+Regarding the reproduce: when I launch the buggy version with the same
+options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
+my VNC client silently hangs on the target after a while.Â  Could it
+happen on your stand as well?
+cpr does not preserve the vnc connection and session.Â  To test, I specify
+port 0 for the source VM and port 1 for the dest.Â  When the src vnc goes
+dormant the dest vnc becomes active.
+Sure, I meant that VNC on the dest (on the port 1) works for a while
+after the migration and then hangs, apparently after the guest QXL crash.
+Could you try launching VM with
+"-nographic -device qxl-vga"?Â  That way VM's serial console is given you
+directly in the shell, so when qxl driver crashes you're still able to
+inspect the kernel messages.
+I have been running like that, but have not reproduced the qxl driver
+crash,
+and I suspect my guest image+kernel is too old.
+Yes, that's probably the case.  But the crash occurs on my Fedora 41
+guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
+be buggy.
+However, once I realized the
+issue was post-cpr modification of qxl memory, I switched my attention
+to the
+fix.
+As for your patch, I can report that it doesn't resolve the issue as it
+is.Â  But I was able to track down another possible memory corruption
+using your approach with readonly mmap'ing:
+Program terminated with signal SIGSEGV, Segmentation fault.
+#0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+412Â Â Â Â Â Â Â Â  d->ram->magicÂ Â Â Â Â Â  = cpu_to_le32(QXL_RAM_MAGIC);
+[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
+(gdb) bt
+#0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+#1Â  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
+errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
+#2Â  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
+errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
+#3Â  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
+errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
+#4Â  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
+value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
+#5Â  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
+v=0x5638996f3770, name=0x56389759b141 "realized",
+opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
+ Â Â Â Â  at ../qom/object.c:2374
+#6Â  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
+name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
+ Â Â Â Â  at ../qom/object.c:1449
+#7Â  0x00005638970f8586 in object_property_set_qobject
+(obj=0x5638996e0e70, name=0x56389759b141 "realized",
+value=0x5638996df900, errp=0x7ffd3c2b84e0)
+ Â Â Â Â  at ../qom/qom-qobject.c:28
+#8Â  0x00005638970f3d8d in object_property_set_bool
+(obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
+errp=0x7ffd3c2b84e0)
+ Â Â Â Â  at ../qom/object.c:1519
+#9Â  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
+bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
+#10 0x0000563896dba675 in qdev_device_add_from_qdict
+(opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../
+system/qdev-monitor.c:714
+#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
+errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
+#12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
+opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/
+vl.c:1207
+#13 0x000056389737a6cc in qemu_opts_foreach
+ Â Â Â Â  (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
+<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
+ Â Â Â Â  at ../util/qemu-option.c:1135
+#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/
+vl.c:2745
+#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
+<error_fatal>) at ../system/vl.c:2806
+#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
+at ../system/vl.c:3838
+#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../
+system/main.c:72
+So the attached adjusted version of your patch does seem to help.Â  At
+least I can't reproduce the crash on my stand.
+Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
+are
+definitely harmful.Â  Try V2 of the patch, attached, which skips the lines
+of init_qxl_ram that modify guest memory.
+Thanks, your v2 patch does seem to prevent the crash.  Would you re-send
+it to the list as a proper fix?
+I'm wondering, could it be useful to explicitly mark all the reused
+memory regions readonly upon cpr-transfer, and then make them writable
+back again after the migration is done?Â  That way we will be segfaulting
+early on instead of debugging tricky memory corruptions.
+It's a useful debugging technique, but changing protection on a large
+memory region
+can be too expensive for production due to TLB shootdowns.
+
+Also, there are cases where writes are performed but the value is
+guaranteed to
+be the same:
+ Â  qxl_post_load()
+ Â Â Â  qxl_set_mode()
+ Â Â Â Â Â  d->rom->mode = cpu_to_le32(modenr);
+The value is the same because mode and shadow_rom.mode were passed in
+vmstate
+from old qemu.
+There're also cases where devices' ROM might be re-initialized.  E.g.
+this segfault occures upon further exploration of RO mapped RAM blocks:
+Program terminated with signal SIGSEGV, Segmentation fault.
+#0  __memmove_avx_unaligned_erms () at 
+../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
+664             rep     movsb
+[Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
+(gdb) bt
+#0  __memmove_avx_unaligned_erms () at 
+../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
+#1  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380, owner=0x55aa2019ac10, 
+name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true)
+     at ../hw/core/loader.c:1032
+#2  0x000055aa1d031577 in rom_add_blob
+     (name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072, max_len=2097152, 
+addr=18446744073709551615, fw_file_name=0x55aa1da51f13 "etc/acpi/tables", 
+fw_callback=0x55aa1d441f59 <acpi_build_update>, callback_opaque=0x55aa20ff0010, as=0x0, 
+read_only=true) at ../hw/core/loader.c:1147
+#3  0x000055aa1cfd788d in acpi_add_rom_blob
+     (update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010, 
+blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at ../hw/acpi/utils.c:46
+#4  0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
+#5  0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0) 
+at ../hw/i386/pc.c:638
+#6  0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10 
+<machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
+#7  0x000055aa1d039ee5 in qdev_machine_creation_done () at 
+../hw/core/machine.c:1749
+#8  0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40 
+<error_fatal>) at ../system/vl.c:2779
+#9  0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40 
+<error_fatal>) at ../system/vl.c:2807
+#10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at 
+../system/vl.c:3838
+#11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at 
+../system/main.c:72
+I'm not sure whether ACPI tables ROM in particular is rewritten with the
+same content, but there might be cases where ROM can be read from file
+system upon initialization.  That is undesirable as guest kernel
+certainly won't be too happy about sudden change of the device's ROM
+content.
+
+So the issue we're dealing with here is any unwanted memory related
+device initialization upon cpr.
+
+For now the only thing that comes to my mind is to make a test where we
+put as many devices as we can into a VM, make ram blocks RO upon cpr
+(and remap them as RW later after migration is done, if needed), and
+catch any unwanted memory violations.  As Den suggested, we might
+consider adding that behaviour as a separate non-default option (or
+"migrate" command flag specific to cpr-transfer), which would only be
+used in the testing.
+
+Andrey
+No way. ACPI with the source must be used in the same way as BIOSes
+and optional ROMs.
+
+Den
+
+On 3/6/2025 10:52 AM, Denis V. Lunev wrote:
+On 3/6/25 16:16, Andrey Drobyshev wrote:
+On 3/5/25 11:19 PM, Steven Sistare wrote:
+On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
+On 3/4/25 9:05 PM, Steven Sistare wrote:
+On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:20 PM, Steven Sistare wrote:
+On 2/28/2025 1:13 PM, Steven Sistare wrote:
+On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently
+and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$EMULATOR -enable-kvm \
+Â Â Â Â Â Â Â  -machine q35 \
+Â Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+Â Â Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/
+dev/shm/
+ram0,share=on\
+Â Â Â Â Â Â Â  -machine memory-backend=ram0 \
+Â Â Â Â Â Â Â  -machine aux-ram-share=on \
+Â Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+Â Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+Â Â Â Â Â Â Â  -nographic \
+Â Â Â Â Â Â Â  -device qxl-vga
+Run migration target:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-dst.sock
+$EMULATOR -enable-kvm \
+Â Â Â Â Â Â Â  -machine q35 \
+Â Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+Â Â Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/
+dev/shm/
+ram0,share=on\
+Â Â Â Â Â Â Â  -machine memory-backend=ram0 \
+Â Â Â Â Â Â Â  -machine aux-ram-share=on \
+Â Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+Â Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+Â Â Â Â Â Â Â  -nographic \
+Â Â Â Â Â Â Â  -device qxl-vga \
+Â Â Â Â Â Â Â  -incoming tcp:0:44444 \
+Â Â Â Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$QMPSHELL -p $QMPSOCK <<EOF
+Â Â Â Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+Â Â Â Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+{"transport":"socket","type":"inet","host":"0","port":"44444"}},
+{"channel-type":"cpr","addr":
+{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+dst.sock"}}]
+EOF
+Then, after a while, QXL guest driver on target crashes spewing the
+following messages:
+[Â Â  73.962002] [TTM] Buffer eviction failed
+[Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+0x00000001)
+[Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+allocate VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-
+min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which
+speeds up
+the crash in the guest):
+#!/bin/bash
+
+chvt 3
+
+for j in $(seq 80); do
+Â Â Â Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+Â Â Â Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate
+VRAM
+BO")" != "" ]; then
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+Â Â Â Â Â Â Â Â Â Â Â  fi
+Â Â Â Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+Â Â Â Â Â Â Â Â Â Â Â  done
+done
+
+echo "bug could not be reproduced"
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce
+that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the crash -
+without the cpr-transfer migration the above reproduce doesn't
+lead to
+crash on the source VM.
+
+I suspect that, as cpr-transfer doesn't migrate the guest
+memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+corruption so far.
+
+Could somebody help the investigation and take a look into
+this?Â  Any
+suggestions would be appreciated.Â  Thanks!
+Possibly some memory region created by qxl is not being preserved.
+Try adding these traces to see what is preserved:
+
+-trace enable='*cpr*'
+-trace enable='*ram_alloc*'
+Also try adding this patch to see if it flags any ram blocks as not
+compatible with cpr.Â  A message is printed at migration start time.
+Â Â Â Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
+email-
+steven.sistare@oracle.com/
+
+- Steve
+With the traces enabled + the "migration: ram block cpr blockers"
+patch
+applied:
+
+Source:
+cpr_find_fd pc.bios, id 0 returns -1
+cpr_save_fd pc.bios, id 0, fd 22
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+0x7fec18e00000
+cpr_find_fd pc.rom, id 0 returns -1
+cpr_save_fd pc.rom, id 0, fd 23
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+0x7fec18c00000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 24 host 0x7fec18a00000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 25 host 0x7feb77e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 27 host 0x7fec18800000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 28 host 0x7feb73c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 34 host 0x7fec18600000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 35 host 0x7fec18200000
+cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 36 host 0x7feb8b600000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+37 host 0x7feb8b400000
+
+cpr_state_save cpr-transfer mode
+cpr_transfer_output /var/run/alma8cpr-dst.sock
+Target:
+cpr_transfer_input /var/run/alma8cpr-dst.sock
+cpr_state_load cpr-transfer mode
+cpr_find_fd pc.bios, id 0 returns 20
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+0x7fcdc9800000
+cpr_find_fd pc.rom, id 0 returns 19
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+0x7fcdc9600000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 18 host 0x7fcdc9400000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 17 host 0x7fcd27e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 16 host 0x7fcdc9200000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 15 host 0x7fcd23c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 14 host 0x7fcdc8800000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 13 host 0x7fcdc8400000
+cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 11 host 0x7fcdc8200000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+10 host 0x7fcd3be00000
+Looks like both vga.vram and qxl.vram are being preserved (with the
+same
+addresses), and no incompatible ram blocks are found during migration.
+Sorry, addressed are not the same, of course.Â  However corresponding
+ram
+blocks do seem to be preserved and initialized.
+So far, I have not reproduced the guest driver failure.
+
+However, I have isolated places where new QEMU improperly writes to
+the qxl memory regions prior to starting the guest, by mmap'ing them
+readonly after cpr:
+
+Â Â Â  qemu_ram_alloc_internal()
+Â Â Â Â Â  if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
+Â Â Â Â Â Â Â Â Â  ram_flags |= RAM_READONLY;
+Â Â Â Â Â  new_block = qemu_ram_alloc_from_fd(...)
+
+I have attached a draft fix; try it and let me know.
+My console window looks fine before and after cpr, using
+-vnc $hostip:0 -vga qxl
+
+- Steve
+Regarding the reproduce: when I launch the buggy version with the same
+options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
+my VNC client silently hangs on the target after a while.Â  Could it
+happen on your stand as well?
+cpr does not preserve the vnc connection and session.Â  To test, I specify
+port 0 for the source VM and port 1 for the dest.Â  When the src vnc goes
+dormant the dest vnc becomes active.
+Sure, I meant that VNC on the dest (on the port 1) works for a while
+after the migration and then hangs, apparently after the guest QXL crash.
+Could you try launching VM with
+"-nographic -device qxl-vga"?Â  That way VM's serial console is given you
+directly in the shell, so when qxl driver crashes you're still able to
+inspect the kernel messages.
+I have been running like that, but have not reproduced the qxl driver
+crash,
+and I suspect my guest image+kernel is too old.
+Yes, that's probably the case.Â  But the crash occurs on my Fedora 41
+guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
+be buggy.
+However, once I realized the
+issue was post-cpr modification of qxl memory, I switched my attention
+to the
+fix.
+As for your patch, I can report that it doesn't resolve the issue as it
+is.Â  But I was able to track down another possible memory corruption
+using your approach with readonly mmap'ing:
+Program terminated with signal SIGSEGV, Segmentation fault.
+#0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+412Â Â Â Â Â Â Â Â  d->ram->magicÂ Â Â Â Â Â  = cpu_to_le32(QXL_RAM_MAGIC);
+[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
+(gdb) bt
+#0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+#1Â  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
+errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
+#2Â  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
+errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
+#3Â  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
+errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
+#4Â  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
+value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
+#5Â  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
+v=0x5638996f3770, name=0x56389759b141 "realized",
+opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
+Â Â Â Â Â  at ../qom/object.c:2374
+#6Â  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
+name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
+Â Â Â Â Â  at ../qom/object.c:1449
+#7Â  0x00005638970f8586 in object_property_set_qobject
+(obj=0x5638996e0e70, name=0x56389759b141 "realized",
+value=0x5638996df900, errp=0x7ffd3c2b84e0)
+Â Â Â Â Â  at ../qom/qom-qobject.c:28
+#8Â  0x00005638970f3d8d in object_property_set_bool
+(obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
+errp=0x7ffd3c2b84e0)
+Â Â Â Â Â  at ../qom/object.c:1519
+#9Â  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
+bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
+#10 0x0000563896dba675 in qdev_device_add_from_qdict
+(opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../
+system/qdev-monitor.c:714
+#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
+errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
+#12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
+opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/
+vl.c:1207
+#13 0x000056389737a6cc in qemu_opts_foreach
+Â Â Â Â Â  (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
+<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
+Â Â Â Â Â  at ../util/qemu-option.c:1135
+#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/
+vl.c:2745
+#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
+<error_fatal>) at ../system/vl.c:2806
+#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
+at ../system/vl.c:3838
+#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../
+system/main.c:72
+So the attached adjusted version of your patch does seem to help.Â  At
+least I can't reproduce the crash on my stand.
+Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
+are
+definitely harmful.Â  Try V2 of the patch, attached, which skips the lines
+of init_qxl_ram that modify guest memory.
+Thanks, your v2 patch does seem to prevent the crash.Â  Would you re-send
+it to the list as a proper fix?
+Yes.  Was waiting for your confirmation.
+I'm wondering, could it be useful to explicitly mark all the reused
+memory regions readonly upon cpr-transfer, and then make them writable
+back again after the migration is done?Â  That way we will be segfaulting
+early on instead of debugging tricky memory corruptions.
+It's a useful debugging technique, but changing protection on a large
+memory region
+can be too expensive for production due to TLB shootdowns.
+
+Also, there are cases where writes are performed but the value is
+guaranteed to
+be the same:
+Â Â  qxl_post_load()
+Â Â Â Â  qxl_set_mode()
+Â Â Â Â Â Â  d->rom->mode = cpu_to_le32(modenr);
+The value is the same because mode and shadow_rom.mode were passed in
+vmstate
+from old qemu.
+There're also cases where devices' ROM might be re-initialized.Â  E.g.
+this segfault occures upon further exploration of RO mapped RAM blocks:
+Program terminated with signal SIGSEGV, Segmentation fault.
+#0Â  __memmove_avx_unaligned_erms () at 
+../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
+664Â Â Â Â Â Â Â Â Â Â Â Â  repÂ Â Â Â  movsb
+[Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
+(gdb) bt
+#0Â  __memmove_avx_unaligned_erms () at 
+../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
+#1Â  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380, owner=0x55aa2019ac10, 
+name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true)
+Â Â Â Â  at ../hw/core/loader.c:1032
+#2Â  0x000055aa1d031577 in rom_add_blob
+Â Â Â Â  (name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072, max_len=2097152, 
+addr=18446744073709551615, fw_file_name=0x55aa1da51f13 "etc/acpi/tables", 
+fw_callback=0x55aa1d441f59 <acpi_build_update>, callback_opaque=0x55aa20ff0010, as=0x0, 
+read_only=true) at ../hw/core/loader.c:1147
+#3Â  0x000055aa1cfd788d in acpi_add_rom_blob
+Â Â Â Â  (update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010, 
+blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at ../hw/acpi/utils.c:46
+#4Â  0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
+#5Â  0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0) 
+at ../hw/i386/pc.c:638
+#6Â  0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10 
+<machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
+#7Â  0x000055aa1d039ee5 in qdev_machine_creation_done () at 
+../hw/core/machine.c:1749
+#8Â  0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40 
+<error_fatal>) at ../system/vl.c:2779
+#9Â  0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40 
+<error_fatal>) at ../system/vl.c:2807
+#10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at 
+../system/vl.c:3838
+#11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at 
+../system/main.c:72
+I'm not sure whether ACPI tables ROM in particular is rewritten with the
+same content, but there might be cases where ROM can be read from file
+system upon initialization.Â  That is undesirable as guest kernel
+certainly won't be too happy about sudden change of the device's ROM
+content.
+
+So the issue we're dealing with here is any unwanted memory related
+device initialization upon cpr.
+
+For now the only thing that comes to my mind is to make a test where we
+put as many devices as we can into a VM, make ram blocks RO upon cpr
+(and remap them as RW later after migration is done, if needed), and
+catch any unwanted memory violations.Â  As Den suggested, we might
+consider adding that behaviour as a separate non-default option (or
+"migrate" command flag specific to cpr-transfer), which would only be
+used in the testing.
+I'll look into adding an option, but there may be too many false positives,
+such as the qxl_set_mode case above.  And the maintainers may object to me
+eliminating the false positives by adding more CPR_IN tests, due to gratuitous
+(from their POV) ugliness.
+
+But I will use the technique to look for more write violations.
+Andrey
+No way. ACPI with the source must be used in the same way as BIOSes
+and optional ROMs.
+Yup, its a bug.  Will fix.
+
+- Steve
+
+see
+1741380954-341079-1-git-send-email-steven.sistare@oracle.com
+/">https://lore.kernel.org/qemu-devel/
+1741380954-341079-1-git-send-email-steven.sistare@oracle.com
+/
+- Steve
+
+On 3/6/2025 11:13 AM, Steven Sistare wrote:
+On 3/6/2025 10:52 AM, Denis V. Lunev wrote:
+On 3/6/25 16:16, Andrey Drobyshev wrote:
+On 3/5/25 11:19 PM, Steven Sistare wrote:
+On 3/5/2025 11:50 AM, Andrey Drobyshev wrote:
+On 3/4/25 9:05 PM, Steven Sistare wrote:
+On 2/28/2025 1:37 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:35 PM, Andrey Drobyshev wrote:
+On 2/28/25 8:20 PM, Steven Sistare wrote:
+On 2/28/2025 1:13 PM, Steven Sistare wrote:
+On 2/28/2025 12:39 PM, Andrey Drobyshev wrote:
+Hi all,
+
+We've been experimenting with cpr-transfer migration mode recently
+and
+have discovered the following issue with the guest QXL driver:
+
+Run migration source:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$EMULATOR -enable-kvm \
+Â Â Â Â Â Â Â  -machine q35 \
+Â Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+Â Â Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/
+dev/shm/
+ram0,share=on\
+Â Â Â Â Â Â Â  -machine memory-backend=ram0 \
+Â Â Â Â Â Â Â  -machine aux-ram-share=on \
+Â Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+Â Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+Â Â Â Â Â Â Â  -nographic \
+Â Â Â Â Â Â Â  -device qxl-vga
+Run migration target:
+EMULATOR=/path/to/emulator
+ROOTFS=/path/to/image
+QMPSOCK=/var/run/alma8qmp-dst.sock
+$EMULATOR -enable-kvm \
+Â Â Â Â Â Â Â  -machine q35 \
+Â Â Â Â Â Â Â  -cpu host -smp 2 -m 2G \
+Â Â Â Â Â Â Â  -object memory-backend-file,id=ram0,size=2G,mem-path=/
+dev/shm/
+ram0,share=on\
+Â Â Â Â Â Â Â  -machine memory-backend=ram0 \
+Â Â Â Â Â Â Â  -machine aux-ram-share=on \
+Â Â Â Â Â Â Â  -drive file=$ROOTFS,media=disk,if=virtio \
+Â Â Â Â Â Â Â  -qmp unix:$QMPSOCK,server=on,wait=off \
+Â Â Â Â Â Â Â  -nographic \
+Â Â Â Â Â Â Â  -device qxl-vga \
+Â Â Â Â Â Â Â  -incoming tcp:0:44444 \
+Â Â Â Â Â Â Â  -incoming '{"channel-type": "cpr", "addr": { "transport":
+"socket", "type": "unix", "path": "/var/run/alma8cpr-dst.sock"}}'
+Launch the migration:
+QMPSHELL=/root/src/qemu/master/scripts/qmp/qmp-shell
+QMPSOCK=/var/run/alma8qmp-src.sock
+
+$QMPSHELL -p $QMPSOCK <<EOF
+Â Â Â Â Â Â Â  migrate-set-parameters mode=cpr-transfer
+Â Â Â Â Â Â Â  migrate channels=[{"channel-type":"main","addr":
+{"transport":"socket","type":"inet","host":"0","port":"44444"}},
+{"channel-type":"cpr","addr":
+{"transport":"socket","type":"unix","path":"/var/run/alma8cpr-
+dst.sock"}}]
+EOF
+Then, after a while, QXL guest driver on target crashes spewing the
+following messages:
+[Â Â  73.962002] [TTM] Buffer eviction failed
+[Â Â  73.962072] qxl 0000:00:02.0: object_init failed for (3149824,
+0x00000001)
+[Â Â  73.962081] [drm:qxl_alloc_bo_reserved [qxl]] *ERROR* failed to
+allocate VRAM BO
+That seems to be a known kernel QXL driver bug:
+https://lore.kernel.org/all/20220907094423.93581-1-
+min_halo@163.com/T/
+https://lore.kernel.org/lkml/ZTgydqRlK6WX_b29@eldamar.lan/
+(the latter discussion contains that reproduce script which
+speeds up
+the crash in the guest):
+#!/bin/bash
+
+chvt 3
+
+for j in $(seq 80); do
+Â Â Â Â Â Â Â Â Â Â Â  echo "$(date) starting round $j"
+Â Â Â Â Â Â Â Â Â Â Â  if [ "$(journalctl --boot | grep "failed to allocate
+VRAM
+BO")" != "" ]; then
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  echo "bug was reproduced after $j tries"
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  exit 1
+Â Â Â Â Â Â Â Â Â Â Â  fi
+Â Â Â Â Â Â Â Â Â Â Â  for i in $(seq 100); do
+Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  dmesg > /dev/tty3
+Â Â Â Â Â Â Â Â Â Â Â  done
+done
+
+echo "bug could not be reproduced"
+exit 0
+The bug itself seems to remain unfixed, as I was able to reproduce
+that
+with Fedora 41 guest, as well as AlmaLinux 8 guest. However our
+cpr-transfer code also seems to be buggy as it triggers the crash -
+without the cpr-transfer migration the above reproduce doesn't
+lead to
+crash on the source VM.
+
+I suspect that, as cpr-transfer doesn't migrate the guest
+memory, but
+rather passes it through the memory backend object, our code might
+somehow corrupt the VRAM.Â  However, I wasn't able to trace the
+corruption so far.
+
+Could somebody help the investigation and take a look into
+this?Â  Any
+suggestions would be appreciated.Â  Thanks!
+Possibly some memory region created by qxl is not being preserved.
+Try adding these traces to see what is preserved:
+
+-trace enable='*cpr*'
+-trace enable='*ram_alloc*'
+Also try adding this patch to see if it flags any ram blocks as not
+compatible with cpr.Â  A message is printed at migration start time.
+Â Â Â Â
+https://lore.kernel.org/qemu-devel/1740667681-257312-1-git-send-
+email-
+steven.sistare@oracle.com/
+
+- Steve
+With the traces enabled + the "migration: ram block cpr blockers"
+patch
+applied:
+
+Source:
+cpr_find_fd pc.bios, id 0 returns -1
+cpr_save_fd pc.bios, id 0, fd 22
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 22 host
+0x7fec18e00000
+cpr_find_fd pc.rom, id 0 returns -1
+cpr_save_fd pc.rom, id 0, fd 23
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 23 host
+0x7fec18c00000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns -1
+cpr_save_fd 0000:00:01.0/e1000e.rom, id 0, fd 24
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 24 host 0x7fec18a00000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/vga.vram, id 0, fd 25
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 25 host 0x7feb77e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vrom, id 0, fd 27
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 27 host 0x7fec18800000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.vram, id 0, fd 28
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 28 host 0x7feb73c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns -1
+cpr_save_fd 0000:00:02.0/qxl.rom, id 0, fd 34
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 34 host 0x7fec18600000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/tables, id 0, fd 35
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 35 host 0x7fec18200000
+cpr_find_fd /rom@etc/table-loader, id 0 returns -1
+cpr_save_fd /rom@etc/table-loader, id 0, fd 36
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 36 host 0x7feb8b600000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns -1
+cpr_save_fd /rom@etc/acpi/rsdp, id 0, fd 37
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+37 host 0x7feb8b400000
+
+cpr_state_save cpr-transfer mode
+cpr_transfer_output /var/run/alma8cpr-dst.sock
+Target:
+cpr_transfer_input /var/run/alma8cpr-dst.sock
+cpr_state_load cpr-transfer mode
+cpr_find_fd pc.bios, id 0 returns 20
+qemu_ram_alloc_shared pc.bios size 262144 max_size 262144 fd 20 host
+0x7fcdc9800000
+cpr_find_fd pc.rom, id 0 returns 19
+qemu_ram_alloc_shared pc.rom size 131072 max_size 131072 fd 19 host
+0x7fcdc9600000
+cpr_find_fd 0000:00:01.0/e1000e.rom, id 0 returns 18
+qemu_ram_alloc_shared 0000:00:01.0/e1000e.rom size 262144 max_size
+262144 fd 18 host 0x7fcdc9400000
+cpr_find_fd 0000:00:02.0/vga.vram, id 0 returns 17
+qemu_ram_alloc_shared 0000:00:02.0/vga.vram size 67108864 max_size
+67108864 fd 17 host 0x7fcd27e00000
+cpr_find_fd 0000:00:02.0/qxl.vrom, id 0 returns 16
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vrom size 8192 max_size 8192
+fd 16 host 0x7fcdc9200000
+cpr_find_fd 0000:00:02.0/qxl.vram, id 0 returns 15
+qemu_ram_alloc_shared 0000:00:02.0/qxl.vram size 67108864 max_size
+67108864 fd 15 host 0x7fcd23c00000
+cpr_find_fd 0000:00:02.0/qxl.rom, id 0 returns 14
+qemu_ram_alloc_shared 0000:00:02.0/qxl.rom size 65536 max_size 65536
+fd 14 host 0x7fcdc8800000
+cpr_find_fd /rom@etc/acpi/tables, id 0 returns 13
+qemu_ram_alloc_shared /rom@etc/acpi/tables size 131072 max_size
+2097152 fd 13 host 0x7fcdc8400000
+cpr_find_fd /rom@etc/table-loader, id 0 returns 11
+qemu_ram_alloc_shared /rom@etc/table-loader size 4096 max_size 65536
+fd 11 host 0x7fcdc8200000
+cpr_find_fd /rom@etc/acpi/rsdp, id 0 returns 10
+qemu_ram_alloc_shared /rom@etc/acpi/rsdp size 4096 max_size 4096 fd
+10 host 0x7fcd3be00000
+Looks like both vga.vram and qxl.vram are being preserved (with the
+same
+addresses), and no incompatible ram blocks are found during migration.
+Sorry, addressed are not the same, of course.Â  However corresponding
+ram
+blocks do seem to be preserved and initialized.
+So far, I have not reproduced the guest driver failure.
+
+However, I have isolated places where new QEMU improperly writes to
+the qxl memory regions prior to starting the guest, by mmap'ing them
+readonly after cpr:
+
+Â Â Â  qemu_ram_alloc_internal()
+Â Â Â Â Â  if (reused && (strstr(name, "qxl") || strstr("name", "vga")))
+Â Â Â Â Â Â Â Â Â  ram_flags |= RAM_READONLY;
+Â Â Â Â Â  new_block = qemu_ram_alloc_from_fd(...)
+
+I have attached a draft fix; try it and let me know.
+My console window looks fine before and after cpr, using
+-vnc $hostip:0 -vga qxl
+
+- Steve
+Regarding the reproduce: when I launch the buggy version with the same
+options as you, i.e. "-vnc 0.0.0.0:$port -vga qxl", and do cpr-transfer,
+my VNC client silently hangs on the target after a while.Â  Could it
+happen on your stand as well?
+cpr does not preserve the vnc connection and session.Â  To test, I specify
+port 0 for the source VM and port 1 for the dest.Â  When the src vnc goes
+dormant the dest vnc becomes active.
+Sure, I meant that VNC on the dest (on the port 1) works for a while
+after the migration and then hangs, apparently after the guest QXL crash.
+Could you try launching VM with
+"-nographic -device qxl-vga"?Â  That way VM's serial console is given you
+directly in the shell, so when qxl driver crashes you're still able to
+inspect the kernel messages.
+I have been running like that, but have not reproduced the qxl driver
+crash,
+and I suspect my guest image+kernel is too old.
+Yes, that's probably the case.Â  But the crash occurs on my Fedora 41
+guest with the 6.11.5-300.fc41.x86_64 kernel, so newer kernels seem to
+be buggy.
+However, once I realized the
+issue was post-cpr modification of qxl memory, I switched my attention
+to the
+fix.
+As for your patch, I can report that it doesn't resolve the issue as it
+is.Â  But I was able to track down another possible memory corruption
+using your approach with readonly mmap'ing:
+Program terminated with signal SIGSEGV, Segmentation fault.
+#0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+412Â Â Â Â Â Â Â Â  d->ram->magicÂ Â Â Â Â Â  = cpu_to_le32(QXL_RAM_MAGIC);
+[Current thread is 1 (Thread 0x7f1a4f83b480 (LWP 229798))]
+(gdb) bt
+#0Â  init_qxl_ram (d=0x5638996e0e70) at ../hw/display/qxl.c:412
+#1Â  0x0000563896e7f467 in qxl_realize_common (qxl=0x5638996e0e70,
+errp=0x7ffd3c2b8170) at ../hw/display/qxl.c:2142
+#2Â  0x0000563896e7fda1 in qxl_realize_primary (dev=0x5638996e0e70,
+errp=0x7ffd3c2b81d0) at ../hw/display/qxl.c:2257
+#3Â  0x0000563896c7e8f2 in pci_qdev_realize (qdev=0x5638996e0e70,
+errp=0x7ffd3c2b8250) at ../hw/pci/pci.c:2174
+#4Â  0x00005638970eb54b in device_set_realized (obj=0x5638996e0e70,
+value=true, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:494
+#5Â  0x00005638970f5e14 in property_set_bool (obj=0x5638996e0e70,
+v=0x5638996f3770, name=0x56389759b141 "realized",
+opaque=0x5638987893d0, errp=0x7ffd3c2b84e0)
+Â Â Â Â Â  at ../qom/object.c:2374
+#6Â  0x00005638970f39f8 in object_property_set (obj=0x5638996e0e70,
+name=0x56389759b141 "realized", v=0x5638996f3770, errp=0x7ffd3c2b84e0)
+Â Â Â Â Â  at ../qom/object.c:1449
+#7Â  0x00005638970f8586 in object_property_set_qobject
+(obj=0x5638996e0e70, name=0x56389759b141 "realized",
+value=0x5638996df900, errp=0x7ffd3c2b84e0)
+Â Â Â Â Â  at ../qom/qom-qobject.c:28
+#8Â  0x00005638970f3d8d in object_property_set_bool
+(obj=0x5638996e0e70, name=0x56389759b141 "realized", value=true,
+errp=0x7ffd3c2b84e0)
+Â Â Â Â Â  at ../qom/object.c:1519
+#9Â  0x00005638970eacb0 in qdev_realize (dev=0x5638996e0e70,
+bus=0x563898cf3c20, errp=0x7ffd3c2b84e0) at ../hw/core/qdev.c:276
+#10 0x0000563896dba675 in qdev_device_add_from_qdict
+(opts=0x5638996dfe50, from_json=false, errp=0x7ffd3c2b84e0) at ../
+system/qdev-monitor.c:714
+#11 0x0000563896dba721 in qdev_device_add (opts=0x563898786150,
+errp=0x56389855dc40 <error_fatal>) at ../system/qdev-monitor.c:733
+#12 0x0000563896dc48f1 in device_init_func (opaque=0x0,
+opts=0x563898786150, errp=0x56389855dc40 <error_fatal>) at ../system/
+vl.c:1207
+#13 0x000056389737a6cc in qemu_opts_foreach
+Â Â Â Â Â  (list=0x563898427b60 <qemu_device_opts>, func=0x563896dc48ca
+<device_init_func>, opaque=0x0, errp=0x56389855dc40 <error_fatal>)
+Â Â Â Â Â  at ../util/qemu-option.c:1135
+#14 0x0000563896dc89b5 in qemu_create_cli_devices () at ../system/
+vl.c:2745
+#15 0x0000563896dc8c00 in qmp_x_exit_preconfig (errp=0x56389855dc40
+<error_fatal>) at ../system/vl.c:2806
+#16 0x0000563896dcb5de in qemu_init (argc=33, argv=0x7ffd3c2b8948)
+at ../system/vl.c:3838
+#17 0x0000563897297323 in main (argc=33, argv=0x7ffd3c2b8948) at ../
+system/main.c:72
+So the attached adjusted version of your patch does seem to help.Â  At
+least I can't reproduce the crash on my stand.
+Thanks for the stack trace; the calls to SPICE_RING_INIT in init_qxl_ram
+are
+definitely harmful.Â  Try V2 of the patch, attached, which skips the lines
+of init_qxl_ram that modify guest memory.
+Thanks, your v2 patch does seem to prevent the crash.Â  Would you re-send
+it to the list as a proper fix?
+Yes.Â  Was waiting for your confirmation.
+I'm wondering, could it be useful to explicitly mark all the reused
+memory regions readonly upon cpr-transfer, and then make them writable
+back again after the migration is done?Â  That way we will be segfaulting
+early on instead of debugging tricky memory corruptions.
+It's a useful debugging technique, but changing protection on a large
+memory region
+can be too expensive for production due to TLB shootdowns.
+
+Also, there are cases where writes are performed but the value is
+guaranteed to
+be the same:
+Â Â  qxl_post_load()
+Â Â Â Â  qxl_set_mode()
+Â Â Â Â Â Â  d->rom->mode = cpu_to_le32(modenr);
+The value is the same because mode and shadow_rom.mode were passed in
+vmstate
+from old qemu.
+There're also cases where devices' ROM might be re-initialized.Â  E.g.
+this segfault occures upon further exploration of RO mapped RAM blocks:
+Program terminated with signal SIGSEGV, Segmentation fault.
+#0Â  __memmove_avx_unaligned_erms () at 
+../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
+664Â Â Â Â Â Â Â Â Â Â Â Â  repÂ Â Â Â  movsb
+[Current thread is 1 (Thread 0x7f6e7d08b480 (LWP 310379))]
+(gdb) bt
+#0Â  __memmove_avx_unaligned_erms () at 
+../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
+#1Â  0x000055aa1d030ecd in rom_set_mr (rom=0x55aa200ba380, owner=0x55aa2019ac10, 
+name=0x7fffb8272bc0 "/rom@etc/acpi/tables", ro=true)
+Â Â Â Â  at ../hw/core/loader.c:1032
+#2Â  0x000055aa1d031577 in rom_add_blob
+Â Â Â Â  (name=0x55aa1da51f13 "etc/acpi/tables", blob=0x55aa208a1070, len=131072, max_len=2097152, 
+addr=18446744073709551615, fw_file_name=0x55aa1da51f13 "etc/acpi/tables", 
+fw_callback=0x55aa1d441f59 <acpi_build_update>, callback_opaque=0x55aa20ff0010, as=0x0, 
+read_only=true) at ../hw/core/loader.c:1147
+#3Â  0x000055aa1cfd788d in acpi_add_rom_blob
+Â Â Â Â  (update=0x55aa1d441f59 <acpi_build_update>, opaque=0x55aa20ff0010, 
+blob=0x55aa1fc9aa00, name=0x55aa1da51f13 "etc/acpi/tables") at ../hw/acpi/utils.c:46
+#4Â  0x000055aa1d44213f in acpi_setup () at ../hw/i386/acpi-build.c:2720
+#5Â  0x000055aa1d434199 in pc_machine_done (notifier=0x55aa1ff15050, data=0x0) 
+at ../hw/i386/pc.c:638
+#6Â  0x000055aa1d876845 in notifier_list_notify (list=0x55aa1ea25c10 
+<machine_init_done_notifiers>, data=0x0) at ../util/notify.c:39
+#7Â  0x000055aa1d039ee5 in qdev_machine_creation_done () at 
+../hw/core/machine.c:1749
+#8Â  0x000055aa1d2c7b3e in qemu_machine_creation_done (errp=0x55aa1ea5cc40 
+<error_fatal>) at ../system/vl.c:2779
+#9Â  0x000055aa1d2c7c7d in qmp_x_exit_preconfig (errp=0x55aa1ea5cc40 
+<error_fatal>) at ../system/vl.c:2807
+#10 0x000055aa1d2ca64f in qemu_init (argc=35, argv=0x7fffb82730e8) at 
+../system/vl.c:3838
+#11 0x000055aa1d79638c in main (argc=35, argv=0x7fffb82730e8) at 
+../system/main.c:72
+I'm not sure whether ACPI tables ROM in particular is rewritten with the
+same content, but there might be cases where ROM can be read from file
+system upon initialization.Â  That is undesirable as guest kernel
+certainly won't be too happy about sudden change of the device's ROM
+content.
+
+So the issue we're dealing with here is any unwanted memory related
+device initialization upon cpr.
+
+For now the only thing that comes to my mind is to make a test where we
+put as many devices as we can into a VM, make ram blocks RO upon cpr
+(and remap them as RW later after migration is done, if needed), and
+catch any unwanted memory violations.Â  As Den suggested, we might
+consider adding that behaviour as a separate non-default option (or
+"migrate" command flag specific to cpr-transfer), which would only be
+used in the testing.
+I'll look into adding an option, but there may be too many false positives,
+such as the qxl_set_mode case above.Â  And the maintainers may object to me
+eliminating the false positives by adding more CPR_IN tests, due to gratuitous
+(from their POV) ugliness.
+
+But I will use the technique to look for more write violations.
+Andrey
+No way. ACPI with the source must be used in the same way as BIOSes
+and optional ROMs.
+Yup, its a bug.Â  Will fix.
+
+- Steve
+
diff --git a/results/classifier/zero-shot/007/debug/53568181 b/results/classifier/zero-shot/007/debug/53568181
new file mode 100644
index 000000000..9bfb773aa
--- /dev/null
+++ b/results/classifier/zero-shot/007/debug/53568181
@@ -0,0 +1,88 @@
+debug: 0.968
+permissions: 0.965
+performance: 0.948
+semantic: 0.943
+graphic: 0.940
+PID: 0.938
+device: 0.936
+vnc: 0.935
+network: 0.925
+other: 0.921
+KVM: 0.917
+files: 0.890
+boot: 0.876
+socket: 0.875
+
+[BUG] x86/PAT handling severely crippled AMD-V SVM KVM performance
+
+Hi, I maintain an out-of-tree 3D APIs pass-through QEMU device models at
+https://github.com/kjliew/qemu-3dfx
+that provide 3D acceleration for legacy
+32-bit Windows guests (Win98SE, WinME, Win2k and WinXP) with the focus on
+playing old legacy games from 1996-2003. It currently supports the now-defunct
+3Dfx propriety API called Glide and an alternative OpenGL pass-through based on
+MESA implementation.
+
+The basic concept of both implementations create memory-mapped virtual
+interfaces consist of host/guest shared memory with guest-push model instead of
+a more common host-pull model for typical QEMU device model implementation.
+Guest uses shared memory as FIFOs for drawing commands and data to bulk up the
+operations until serialization event that flushes the FIFOs into host. This
+achieves extremely good performance since virtual CPUs are fast with hardware
+acceleration (Intel VT/AMD-V) and reduces the overhead of frequent VMEXITs to
+service the device emulation. Both implementations work on Windows 10 with WHPX
+and HAXM accelerators as well as KVM in Linux.
+
+On Windows 10, QEMU WHPX implementation does not sync MSR_IA32_PAT during
+host/guest states sync. There is no visibility into the closed-source WHPX on
+how things are managed behind the scene, but from measuring performance figures
+I can conclude that it didn't handle the MSR_IA32_PAT correctly for both Intel
+and AMD. Call this fair enough, if you will, it didn't flag any concerns, in
+fact games such as Quake2 and Quake3 were still within playable frame rate of
+40~60FPS on Win2k/XP guest. Until the same games were run on Win98/ME guest and
+the frame rate blew off the roof (300~500FPS) on the same CPU and GPU. In fact,
+the later seemed to be more inlined with runnng the games bare-metal with vsync
+off.
+
+On Linux (at the time of writing kernel 5.6.7/Mesa 20.0), the difference
+prevailed. Intel CPUs (and it so happened that I was on laptop with Intel GPU),
+the VMX-based kvm_intel got it right while SVM-based kvm_amd did not.
+To put this in simple exaggeration, an aging Core i3-4010U/HD Graphics 4400
+(Haswell GT2) exhibited an insane performance in Quake2/Quake3 timedemos that
+totally crushed more recent AMD Ryzen 2500U APU/Vega 8 Graphics and AMD
+FX8300/NVIDIA GT730 on desktop. Simply unbelievable!
+
+It turned out that there was something to do with AMD-V NPT. By loading kvm_amd
+with npt=0, AMD Ryzen APU and FX8300 regained a huge performance leap. However,
+AMD NPT issue with KVM was supposedly fixed in 2017 kernel commits. NPT=0 would
+actually incur performance loss for VM due to intervention required by
+hypervisors to maintain the shadow page tables.  Finally, I was able to find the
+pointer that pointed to MSR_IA32_PAT register. By updating the MSR_IA32_PAT to
+0x0606xxxx0606xxxxULL, AMD CPUs now regain their rightful performance without
+taking the hit of NPT=0 for Linux KVM. Taking the same solution into Windows,
+both Intel and AMD CPUs no longer require Win98/ME guest to unleash the full
+performance potentials and performance figures based on games measured on WHPX
+were not very far behind Linux KVM.
+
+So I guess the problem lies in host/guest shared memory regions mapped as
+uncacheable from virtual CPU perspective. As virtual CPUs now completely execute
+in hardware context with x86 hardware virtualiztion extensions, the cacheability
+of memory types would severely impact the performance on guests. WHPX didn't
+handle it for both Intel EPT and AMD NPT, but KVM seems to do it right for Intel
+EPT. I don't have the correct fix for QEMU. But what I can do for my 3D APIs
+pass-through device models is to implement host-side hooks to reprogram and
+restore MSR_IA32_PAT upon activation/deactivation of the 3D APIs. Perhaps there
+is also a better solution of having the proper kernel drivers for virtual
+interfaces to manage the memory types of host/guest shared memory in kernel
+space, but to do that and the needs of Microsoft tools/DDKs, I will just forget
+it. The guest stubs uses the same kernel drivers included in 3Dfx drivers for
+memory mapping and the virtual interfaces remain driver-less from Windows OS
+perspective. Considering the current state of halting progress for QEMU native
+virgil3D to support Windows OS, I am just being pragmatic. I understand that
+QEMU virgil3D will eventually bring 3D acceleration for Windows guests, but I do
+not expect anything to support legacy 32-bit Windows OSes which have out-grown
+their commercial usefulness.
+
+Regards,
+KJ Liew
+
diff --git a/results/classifier/zero-shot/007/debug/64571620 b/results/classifier/zero-shot/007/debug/64571620
new file mode 100644
index 000000000..1de1160e2
--- /dev/null
+++ b/results/classifier/zero-shot/007/debug/64571620
@@ -0,0 +1,795 @@
+debug: 0.927
+other: 0.922
+semantic: 0.903
+permissions: 0.902
+device: 0.899
+performance: 0.897
+graphic: 0.897
+PID: 0.887
+boot: 0.879
+KVM: 0.867
+files: 0.855
+socket: 0.855
+network: 0.853
+vnc: 0.819
+
+[BUG] Migration hv_time rollback
+
+Hi,
+
+We are experiencing timestamp rollbacks during live-migration of
+Windows 10 guests with the following qemu configuration (linux 5.4.46
+and qemu master):
+```
+$ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+```
+
+I have tracked the bug to the fact that `kvmclock` is not exposed and
+disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+
+I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+present and add Hyper-V support for the `kvmclock_current_nsec`
+function.
+
+I'm asking for advice because I am unsure this is the _right_ approach
+and how to keep migration compatibility between qemu versions.
+
+Thank you all,
+
+-- 
+Antoine 'xdbob' Damhet
+signature.asc
+Description:
+PGP signature
+
+cc'ing in Vitaly who knows about the hv stuff.
+
+* Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+Hi,
+>
+>
+We are experiencing timestamp rollbacks during live-migration of
+>
+Windows 10 guests with the following qemu configuration (linux 5.4.46
+>
+and qemu master):
+>
+```
+>
+$ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+```
+How big a jump are you seeing, and how did you notice it in the guest?
+
+Dave
+
+>
+I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>
+I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+function.
+>
+>
+I'm asking for advice because I am unsure this is the _right_ approach
+>
+and how to keep migration compatibility between qemu versions.
+>
+>
+Thank you all,
+>
+>
+--
+>
+Antoine 'xdbob' Damhet
+-- 
+Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
+
+"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
+
+>
+cc'ing in Vitaly who knows about the hv stuff.
+>
+cc'ing Marcelo who knows about clocksources :-)
+
+>
+* Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+> Hi,
+>
+>
+>
+> We are experiencing timestamp rollbacks during live-migration of
+>
+> Windows 10 guests
+Are you migrating to the same hardware (with the same TSC frequency)? Is
+TSC used as the clocksource on the host?
+
+>
+>  with the following qemu configuration (linux 5.4.46
+>
+> and qemu master):
+>
+> ```
+>
+> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+> ```
+Out of pure curiosity, what's the purpose of doing 'kvm=off'? Windows is
+not going to check for KVM identification anyway so we pretend we're
+Hyper-V. 
+
+Also, have you tried adding more Hyper-V enlightenments? 
+
+>
+>
+How big a jump are you seeing, and how did you notice it in the guest?
+>
+>
+Dave
+>
+>
+> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>
+>
+> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+> function.
+AFAICT kvmclock_current_nsec() checks whether kvmclock was enabled by
+the guest:
+
+   if (!(env->system_time_msr & 1ULL)) {
+        /* KVM clock not active */
+        return 0;
+    }
+
+and this is (and way) always false for Windows guests.
+
+>
+>
+>
+> I'm asking for advice because I am unsure this is the _right_ approach
+>
+> and how to keep migration compatibility between qemu versions.
+>
+>
+>
+> Thank you all,
+>
+>
+>
+> --
+>
+> Antoine 'xdbob' Damhet
+-- 
+Vitaly
+
+On Wed, Sep 16, 2020 at 01:59:43PM +0200, Vitaly Kuznetsov wrote:
+>
+"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
+>
+>
+> cc'ing in Vitaly who knows about the hv stuff.
+>
+>
+>
+>
+cc'ing Marcelo who knows about clocksources :-)
+>
+>
+> * Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+>> Hi,
+>
+>>
+>
+>> We are experiencing timestamp rollbacks during live-migration of
+>
+>> Windows 10 guests
+>
+>
+Are you migrating to the same hardware (with the same TSC frequency)? Is
+>
+TSC used as the clocksource on the host?
+Yes we are migrating to the exact same hardware. And yes TSC is used as
+a clocksource in the host (but the bug is still happening with `hpet` as
+a clocksource).
+
+>
+>
+>>  with the following qemu configuration (linux 5.4.46
+>
+>> and qemu master):
+>
+>> ```
+>
+>> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+>> ```
+>
+>
+Out of pure curiosity, what's the purpose of doing 'kvm=off'? Windows is
+>
+not going to check for KVM identification anyway so we pretend we're
+>
+Hyper-V.
+Some softwares explicitly checks for the presence of KVM and then crash
+if they find it in CPUID :/
+
+>
+>
+Also, have you tried adding more Hyper-V enlightenments?
+Yes, I published a stripped-down command-line for a minimal reproducer
+but even `hv-frequencies` and `hv-reenlightenment` don't help.
+
+>
+>
+>
+>
+> How big a jump are you seeing, and how did you notice it in the guest?
+>
+>
+>
+> Dave
+>
+>
+>
+>> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+>> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>>
+>
+>> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+>> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+>> function.
+>
+>
+AFAICT kvmclock_current_nsec() checks whether kvmclock was enabled by
+>
+the guest:
+>
+>
+if (!(env->system_time_msr & 1ULL)) {
+>
+/* KVM clock not active */
+>
+return 0;
+>
+}
+>
+>
+and this is (and way) always false for Windows guests.
+Hooo, I missed this piece. When is `clock_is_reliable` expected to be
+false ? Because if it is I still think we should be able to query at
+least `HV_X64_MSR_REFERENCE_TSC`
+
+>
+>
+>>
+>
+>> I'm asking for advice because I am unsure this is the _right_ approach
+>
+>> and how to keep migration compatibility between qemu versions.
+>
+>>
+>
+>> Thank you all,
+>
+>>
+>
+>> --
+>
+>> Antoine 'xdbob' Damhet
+>
+>
+--
+>
+Vitaly
+>
+-- 
+Antoine 'xdbob' Damhet
+signature.asc
+Description:
+PGP signature
+
+On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote:
+>
+cc'ing in Vitaly who knows about the hv stuff.
+Thanks
+
+>
+>
+* Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+> Hi,
+>
+>
+>
+> We are experiencing timestamp rollbacks during live-migration of
+>
+> Windows 10 guests with the following qemu configuration (linux 5.4.46
+>
+> and qemu master):
+>
+> ```
+>
+> $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+> ```
+>
+>
+How big a jump are you seeing, and how did you notice it in the guest?
+I'm seeing jumps of about the guest uptime (indicating a reset of the
+counter). It's expected because we won't call `KVM_SET_CLOCK` to
+restore any value.
+
+We first noticed it because after some migrations `dwm.exe` crashes with
+the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in
+the past." error code.
+
+I can also confirm the following hack makes the behavior disappear:
+
+```
+diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+index 64283358f9..f334bdf35f 100644
+--- a/hw/i386/kvm/clock.c
++++ b/hw/i386/kvm/clock.c
+@@ -332,11 +332,7 @@ void kvmclock_create(void)
+ {
+     X86CPU *cpu = X86_CPU(first_cpu);
+
+-    if (kvm_enabled() &&
+-        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+-                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
+-        sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+-    }
++    sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+ }
+
+ static void kvmclock_register_types(void)
+diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
+index 32b1453e6a..11d980ba85 100644
+--- a/hw/i386/pc_piix.c
++++ b/hw/i386/pc_piix.c
+@@ -158,9 +158,7 @@ static void pc_init1(MachineState *machine,
+
+     x86_cpus_init(x86ms, pcmc->default_cpu_version);
+
+-    if (kvm_enabled() && pcmc->kvmclock_enabled) {
+-        kvmclock_create();
+-    }
++    kvmclock_create();
+
+     if (pcmc->pci_enabled) {
+         pci_memory = g_new(MemoryRegion, 1);
+```
+
+>
+>
+Dave
+>
+>
+> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>
+>
+> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+> function.
+>
+>
+>
+> I'm asking for advice because I am unsure this is the _right_ approach
+>
+> and how to keep migration compatibility between qemu versions.
+>
+>
+>
+> Thank you all,
+>
+>
+>
+> --
+>
+> Antoine 'xdbob' Damhet
+>
+>
+>
+--
+>
+Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
+>
+-- 
+Antoine 'xdbob' Damhet
+signature.asc
+Description:
+PGP signature
+
+Antoine Damhet <antoine.damhet@blade-group.com> writes:
+
+>
+On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote:
+>
+> cc'ing in Vitaly who knows about the hv stuff.
+>
+>
+Thanks
+>
+>
+>
+>
+> * Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+> > Hi,
+>
+> >
+>
+> > We are experiencing timestamp rollbacks during live-migration of
+>
+> > Windows 10 guests with the following qemu configuration (linux 5.4.46
+>
+> > and qemu master):
+>
+> > ```
+>
+> > $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+> > ```
+>
+>
+>
+> How big a jump are you seeing, and how did you notice it in the guest?
+>
+>
+I'm seeing jumps of about the guest uptime (indicating a reset of the
+>
+counter). It's expected because we won't call `KVM_SET_CLOCK` to
+>
+restore any value.
+>
+>
+We first noticed it because after some migrations `dwm.exe` crashes with
+>
+the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in
+>
+the past." error code.
+>
+>
+I can also confirm the following hack makes the behavior disappear:
+>
+>
+```
+>
+diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+>
+index 64283358f9..f334bdf35f 100644
+>
+--- a/hw/i386/kvm/clock.c
+>
++++ b/hw/i386/kvm/clock.c
+>
+@@ -332,11 +332,7 @@ void kvmclock_create(void)
+>
+{
+>
+X86CPU *cpu = X86_CPU(first_cpu);
+>
+>
+-    if (kvm_enabled() &&
+>
+-        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+>
+-                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
+>
+-        sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+-    }
+>
++    sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+}
+>
+Oh, I think I see what's going on. When you add 'kvm=off'
+cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so
+kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on
+migration.
+
+In case we really want to support 'kvm=off' I think we can add Hyper-V
+features check here along with KVM, this should do the job.
+
+-- 
+Vitaly
+
+Vitaly Kuznetsov <vkuznets@redhat.com> writes:
+
+>
+Antoine Damhet <antoine.damhet@blade-group.com> writes:
+>
+>
+> On Wed, Sep 16, 2020 at 12:29:56PM +0100, Dr. David Alan Gilbert wrote:
+>
+>> cc'ing in Vitaly who knows about the hv stuff.
+>
+>
+>
+> Thanks
+>
+>
+>
+>>
+>
+>> * Antoine Damhet (antoine.damhet@blade-group.com) wrote:
+>
+>> > Hi,
+>
+>> >
+>
+>> > We are experiencing timestamp rollbacks during live-migration of
+>
+>> > Windows 10 guests with the following qemu configuration (linux 5.4.46
+>
+>> > and qemu master):
+>
+>> > ```
+>
+>> > $ qemu-system-x86_64 -enable-kvm -cpu host,kvm=off,hv_time [...]
+>
+>> > ```
+>
+>>
+>
+>> How big a jump are you seeing, and how did you notice it in the guest?
+>
+>
+>
+> I'm seeing jumps of about the guest uptime (indicating a reset of the
+>
+> counter). It's expected because we won't call `KVM_SET_CLOCK` to
+>
+> restore any value.
+>
+>
+>
+> We first noticed it because after some migrations `dwm.exe` crashes with
+>
+> the "(NTSTATUS) 0x8898009b - QueryPerformanceCounter returned a time in
+>
+> the past." error code.
+>
+>
+>
+> I can also confirm the following hack makes the behavior disappear:
+>
+>
+>
+> ```
+>
+> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+>
+> index 64283358f9..f334bdf35f 100644
+>
+> --- a/hw/i386/kvm/clock.c
+>
+> +++ b/hw/i386/kvm/clock.c
+>
+> @@ -332,11 +332,7 @@ void kvmclock_create(void)
+>
+>  {
+>
+>      X86CPU *cpu = X86_CPU(first_cpu);
+>
+>
+>
+> -    if (kvm_enabled() &&
+>
+> -        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+>
+> -                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2)))
+>
+> {
+>
+> -        sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+> -    }
+>
+> +    sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+>  }
+>
+>
+>
+>
+>
+Oh, I think I see what's going on. When you add 'kvm=off'
+>
+cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so
+>
+kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on
+>
+migration.
+>
+>
+In case we really want to support 'kvm=off' I think we can add Hyper-V
+>
+features check here along with KVM, this should do the job.
+Does the untested
+
+diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+index 64283358f91d..e03b2ca6d8f6 100644
+--- a/hw/i386/kvm/clock.c
++++ b/hw/i386/kvm/clock.c
+@@ -333,8 +333,9 @@ void kvmclock_create(void)
+     X86CPU *cpu = X86_CPU(first_cpu);
+ 
+     if (kvm_enabled() &&
+-        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+-                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
++        ((cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
++                                         (1ULL << KVM_FEATURE_CLOCKSOURCE2))) 
+||
++         (cpu->env.features[FEAT_HYPERV_EAX] & HV_TIME_REF_COUNT_AVAILABLE))) {
+         sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+     }
+ }
+
+help?
+
+(I don't think we need to remove all 'if (kvm_enabled())' checks from
+machine types as 'kvm=off' should not be related).
+
+-- 
+Vitaly
+
+On Wed, Sep 16, 2020 at 02:50:56PM +0200, Vitaly Kuznetsov wrote:
+[...]
+
+>
+>>
+>
+>
+>
+>
+>
+> Oh, I think I see what's going on. When you add 'kvm=off'
+>
+> cpu->env.features[FEAT_KVM] is reset (see x86_cpu_expand_features()) so
+>
+> kvmclock QEMU device is not created and nobody calls KVM_SET_CLOCK on
+>
+> migration.
+>
+>
+>
+> In case we really want to support 'kvm=off' I think we can add Hyper-V
+>
+> features check here along with KVM, this should do the job.
+>
+>
+Does the untested
+>
+>
+diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
+>
+index 64283358f91d..e03b2ca6d8f6 100644
+>
+--- a/hw/i386/kvm/clock.c
+>
++++ b/hw/i386/kvm/clock.c
+>
+@@ -333,8 +333,9 @@ void kvmclock_create(void)
+>
+X86CPU *cpu = X86_CPU(first_cpu);
+>
+>
+if (kvm_enabled() &&
+>
+-        cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+>
+-                                       (1ULL << KVM_FEATURE_CLOCKSOURCE2))) {
+>
++        ((cpu->env.features[FEAT_KVM] & ((1ULL << KVM_FEATURE_CLOCKSOURCE) |
+>
++                                         (1ULL <<
+>
+KVM_FEATURE_CLOCKSOURCE2))) ||
+>
++         (cpu->env.features[FEAT_HYPERV_EAX] &
+>
+HV_TIME_REF_COUNT_AVAILABLE))) {
+>
+sysbus_create_simple(TYPE_KVM_CLOCK, -1, NULL);
+>
+}
+>
+}
+>
+>
+help?
+It appears to work :)
+
+>
+>
+(I don't think we need to remove all 'if (kvm_enabled())' checks from
+>
+machine types as 'kvm=off' should not be related).
+Indeed (I didn't look at the macro, it was just quick & dirty).
+
+>
+>
+--
+>
+Vitaly
+>
+>
+-- 
+Antoine 'xdbob' Damhet
+signature.asc
+Description:
+PGP signature
+
+On 16/09/20 13:29, Dr. David Alan Gilbert wrote:
+>
+> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>
+>
+> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+> function.
+Yes, this seems correct.  I would have to check but it may even be
+better to _always_ send kvmclock data in the live migration stream.
+
+Paolo
+
+Paolo Bonzini <pbonzini@redhat.com> writes:
+
+>
+On 16/09/20 13:29, Dr. David Alan Gilbert wrote:
+>
+>> I have tracked the bug to the fact that `kvmclock` is not exposed and
+>
+>> disabled from qemu PoV but is in fact used by `hv-time` (in KVM).
+>
+>>
+>
+>> I think we should enable the `kvmclock` (qemu device) if `hv-time` is
+>
+>> present and add Hyper-V support for the `kvmclock_current_nsec`
+>
+>> function.
+>
+>
+Yes, this seems correct.  I would have to check but it may even be
+>
+better to _always_ send kvmclock data in the live migration stream.
+>
+The question I have is: with 'kvm=off', do we actually restore TSC
+reading on migration? (and I guess the answer is 'no' or Hyper-V TSC
+page would 'just work' I guess). So yea, maybe dropping the
+'cpu->env.features[FEAT_KVM]' check is the right fix.
+
+-- 
+Vitaly
+
diff --git a/results/classifier/zero-shot/007/debug/96782458 b/results/classifier/zero-shot/007/debug/96782458
new file mode 100644
index 000000000..6fa03cc39
--- /dev/null
+++ b/results/classifier/zero-shot/007/debug/96782458
@@ -0,0 +1,1009 @@
+debug: 0.989
+permissions: 0.986
+performance: 0.985
+semantic: 0.984
+other: 0.982
+boot: 0.980
+PID: 0.980
+files: 0.978
+socket: 0.976
+vnc: 0.976
+device: 0.974
+graphic: 0.973
+network: 0.967
+KVM: 0.963
+
+[Qemu-devel] [BUG] Migrate failes between boards with different PMC counts
+
+Hi all,
+
+Recently, I found migration failed when enable vPMU.
+
+migrate vPMU state was introduced in linux-3.10 + qemu-1.7.
+
+As long as enable vPMU, qemu will save / load the
+vmstate_msr_architectural_pmu(msr_global_ctrl) register during the migration.
+But global_ctrl generated based on cpuid(0xA), the number of general-purpose 
+performance
+monitoring counters(PMC) can vary according to Intel SDN. The number of PMC 
+presented
+to vm, does not support configuration currently, it depend on host cpuid, and 
+enable all pmc
+defaultly at KVM. It cause migration to fail between boards with different PMC 
+counts.
+
+The return value of cpuid (0xA) is different dur to cpu, according to Intel 
+SDNï¼18-10 Vol. 3B:
+
+Note: The number of general-purpose performance monitoring counters (i.e. N in 
+Figure 18-9)
+can vary across processor generations within a processor family, across 
+processor families, or
+could be different depending on the configuration chosen at boot time in the 
+BIOS regarding
+Intel Hyper Threading Technology, (e.g. N=2 for 45 nm Intel Atom processors; N 
+=4 for processors
+based on the Nehalem microarchitecture; for processors based on the Sandy Bridge
+microarchitecture, N = 4 if Intel Hyper Threading Technology is active and N=8 
+if not active).
+
+Also I found, N=8 if HT is not active based on the broadwellï¼,
+such as CPU E7-8890 v4 @ 2.20GHz   
+
+# ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 -hda
+/data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true -incoming 
+tcp::8888
+Completed 100 %
+qemu-system-x86_64: error: failed to set MSR 0x38f to 0x7000000ff
+qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833: 
+kvm_put_msrs: 
+Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
+Aborted
+
+So make number of pmc configurable to vm ? Any better idea ?
+
+
+Regards,
+-Zhuang Yanying
+
+* Zhuangyanying (address@hidden) wrote:
+>
+Hi all,
+>
+>
+Recently, I found migration failed when enable vPMU.
+>
+>
+migrate vPMU state was introduced in linux-3.10 + qemu-1.7.
+>
+>
+As long as enable vPMU, qemu will save / load the
+>
+vmstate_msr_architectural_pmu(msr_global_ctrl) register during the migration.
+>
+But global_ctrl generated based on cpuid(0xA), the number of general-purpose
+>
+performance
+>
+monitoring counters(PMC) can vary according to Intel SDN. The number of PMC
+>
+presented
+>
+to vm, does not support configuration currently, it depend on host cpuid, and
+>
+enable all pmc
+>
+defaultly at KVM. It cause migration to fail between boards with different
+>
+PMC counts.
+>
+>
+The return value of cpuid (0xA) is different dur to cpu, according to Intel
+>
+SDNï¼18-10 Vol. 3B:
+>
+>
+Note: The number of general-purpose performance monitoring counters (i.e. N
+>
+in Figure 18-9)
+>
+can vary across processor generations within a processor family, across
+>
+processor families, or
+>
+could be different depending on the configuration chosen at boot time in the
+>
+BIOS regarding
+>
+Intel Hyper Threading Technology, (e.g. N=2 for 45 nm Intel Atom processors;
+>
+N =4 for processors
+>
+based on the Nehalem microarchitecture; for processors based on the Sandy
+>
+Bridge
+>
+microarchitecture, N = 4 if Intel Hyper Threading Technology is active and
+>
+N=8 if not active).
+>
+>
+Also I found, N=8 if HT is not active based on the broadwellï¼,
+>
+such as CPU E7-8890 v4 @ 2.20GHz
+>
+>
+# ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 -hda
+>
+/data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true -incoming
+>
+tcp::8888
+>
+Completed 100 %
+>
+qemu-system-x86_64: error: failed to set MSR 0x38f to 0x7000000ff
+>
+qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833:
+>
+kvm_put_msrs:
+>
+Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
+>
+Aborted
+>
+>
+So make number of pmc configurable to vm ? Any better idea ?
+Coincidentally we hit a similar problem a few days ago with -cpu host  - it 
+took me
+quite a while to spot the difference between the machines was the source
+had hyperthreading disabled.
+
+An option to set the number of counters makes sense to me; but I wonder
+how many other options we need as well.  Also, I'm not sure there's any
+easy way for libvirt etc to figure out how many counters a host supports - it's
+not in /proc/cpuinfo.
+
+Dave
+
+>
+>
+Regards,
+>
+-Zhuang Yanying
+--
+Dr. David Alan Gilbert / address@hidden / Manchester, UK
+
+On Mon, Apr 24, 2017 at 10:23:21AM +0100, Dr. David Alan Gilbert wrote:
+>
+* Zhuangyanying (address@hidden) wrote:
+>
+> Hi all,
+>
+>
+>
+> Recently, I found migration failed when enable vPMU.
+>
+>
+>
+> migrate vPMU state was introduced in linux-3.10 + qemu-1.7.
+>
+>
+>
+> As long as enable vPMU, qemu will save / load the
+>
+> vmstate_msr_architectural_pmu(msr_global_ctrl) register during the
+>
+> migration.
+>
+> But global_ctrl generated based on cpuid(0xA), the number of
+>
+> general-purpose performance
+>
+> monitoring counters(PMC) can vary according to Intel SDN. The number of PMC
+>
+> presented
+>
+> to vm, does not support configuration currently, it depend on host cpuid,
+>
+> and enable all pmc
+>
+> defaultly at KVM. It cause migration to fail between boards with different
+>
+> PMC counts.
+>
+>
+>
+> The return value of cpuid (0xA) is different dur to cpu, according to Intel
+>
+> SDNï¼18-10 Vol. 3B:
+>
+>
+>
+> Note: The number of general-purpose performance monitoring counters (i.e. N
+>
+> in Figure 18-9)
+>
+> can vary across processor generations within a processor family, across
+>
+> processor families, or
+>
+> could be different depending on the configuration chosen at boot time in
+>
+> the BIOS regarding
+>
+> Intel Hyper Threading Technology, (e.g. N=2 for 45 nm Intel Atom
+>
+> processors; N =4 for processors
+>
+> based on the Nehalem microarchitecture; for processors based on the Sandy
+>
+> Bridge
+>
+> microarchitecture, N = 4 if Intel Hyper Threading Technology is active and
+>
+> N=8 if not active).
+>
+>
+>
+> Also I found, N=8 if HT is not active based on the broadwellï¼,
+>
+> such as CPU E7-8890 v4 @ 2.20GHz
+>
+>
+>
+> # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 -hda
+>
+> /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true -incoming
+>
+> tcp::8888
+>
+> Completed 100 %
+>
+> qemu-system-x86_64: error: failed to set MSR 0x38f to 0x7000000ff
+>
+> qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833:
+>
+> kvm_put_msrs:
+>
+> Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
+>
+> Aborted
+>
+>
+>
+> So make number of pmc configurable to vm ? Any better idea ?
+>
+>
+Coincidentally we hit a similar problem a few days ago with -cpu host  - it
+>
+took me
+>
+quite a while to spot the difference between the machines was the source
+>
+had hyperthreading disabled.
+>
+>
+An option to set the number of counters makes sense to me; but I wonder
+>
+how many other options we need as well.  Also, I'm not sure there's any
+>
+easy way for libvirt etc to figure out how many counters a host supports -
+>
+it's not in /proc/cpuinfo.
+We actually try to avoid /proc/cpuinfo whereever possible. We do direct
+CPUID asm instructions to identify features, and prefer to use
+/sys/devices/system/cpu if that has suitable data
+
+Where do the PMC counts come from originally ? CPUID or something else ?
+
+Regards,
+Daniel
+-- 
+|:
+https://berrange.com
+-o-
+https://www.flickr.com/photos/dberrange
+:|
+|:
+https://libvirt.org
+-o-
+https://fstop138.berrange.com
+:|
+|:
+https://entangle-photo.org
+-o-
+https://www.instagram.com/dberrange
+:|
+
+* Daniel P. Berrange (address@hidden) wrote:
+>
+On Mon, Apr 24, 2017 at 10:23:21AM +0100, Dr. David Alan Gilbert wrote:
+>
+> * Zhuangyanying (address@hidden) wrote:
+>
+> > Hi all,
+>
+> >
+>
+> > Recently, I found migration failed when enable vPMU.
+>
+> >
+>
+> > migrate vPMU state was introduced in linux-3.10 + qemu-1.7.
+>
+> >
+>
+> > As long as enable vPMU, qemu will save / load the
+>
+> > vmstate_msr_architectural_pmu(msr_global_ctrl) register during the
+>
+> > migration.
+>
+> > But global_ctrl generated based on cpuid(0xA), the number of
+>
+> > general-purpose performance
+>
+> > monitoring counters(PMC) can vary according to Intel SDN. The number of
+>
+> > PMC presented
+>
+> > to vm, does not support configuration currently, it depend on host cpuid,
+>
+> > and enable all pmc
+>
+> > defaultly at KVM. It cause migration to fail between boards with
+>
+> > different PMC counts.
+>
+> >
+>
+> > The return value of cpuid (0xA) is different dur to cpu, according to
+>
+> > Intel SDNï¼18-10 Vol. 3B:
+>
+> >
+>
+> > Note: The number of general-purpose performance monitoring counters (i.e.
+>
+> > N in Figure 18-9)
+>
+> > can vary across processor generations within a processor family, across
+>
+> > processor families, or
+>
+> > could be different depending on the configuration chosen at boot time in
+>
+> > the BIOS regarding
+>
+> > Intel Hyper Threading Technology, (e.g. N=2 for 45 nm Intel Atom
+>
+> > processors; N =4 for processors
+>
+> > based on the Nehalem microarchitecture; for processors based on the Sandy
+>
+> > Bridge
+>
+> > microarchitecture, N = 4 if Intel Hyper Threading Technology is active
+>
+> > and N=8 if not active).
+>
+> >
+>
+> > Also I found, N=8 if HT is not active based on the broadwellï¼,
+>
+> > such as CPU E7-8890 v4 @ 2.20GHz
+>
+> >
+>
+> > # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 -hda
+>
+> > /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true -incoming
+>
+> > tcp::8888
+>
+> > Completed 100 %
+>
+> > qemu-system-x86_64: error: failed to set MSR 0x38f to 0x7000000ff
+>
+> > qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833:
+>
+> > kvm_put_msrs:
+>
+> > Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
+>
+> > Aborted
+>
+> >
+>
+> > So make number of pmc configurable to vm ? Any better idea ?
+>
+>
+>
+> Coincidentally we hit a similar problem a few days ago with -cpu host  - it
+>
+> took me
+>
+> quite a while to spot the difference between the machines was the source
+>
+> had hyperthreading disabled.
+>
+>
+>
+> An option to set the number of counters makes sense to me; but I wonder
+>
+> how many other options we need as well.  Also, I'm not sure there's any
+>
+> easy way for libvirt etc to figure out how many counters a host supports -
+>
+> it's not in /proc/cpuinfo.
+>
+>
+We actually try to avoid /proc/cpuinfo whereever possible. We do direct
+>
+CPUID asm instructions to identify features, and prefer to use
+>
+/sys/devices/system/cpu if that has suitable data
+>
+>
+Where do the PMC counts come from originally ? CPUID or something else ?
+Yes, they're bits 8..15 of CPUID leaf 0xa
+
+Dave
+
+>
+Regards,
+>
+Daniel
+>
+--
+>
+|:
+https://berrange.com
+-o-
+https://www.flickr.com/photos/dberrange
+:|
+>
+|:
+https://libvirt.org
+-o-
+https://fstop138.berrange.com
+:|
+>
+|:
+https://entangle-photo.org
+-o-
+https://www.instagram.com/dberrange
+:|
+--
+Dr. David Alan Gilbert / address@hidden / Manchester, UK
+
+On Mon, Apr 24, 2017 at 11:27:16AM +0100, Dr. David Alan Gilbert wrote:
+>
+* Daniel P. Berrange (address@hidden) wrote:
+>
+> On Mon, Apr 24, 2017 at 10:23:21AM +0100, Dr. David Alan Gilbert wrote:
+>
+> > * Zhuangyanying (address@hidden) wrote:
+>
+> > > Hi all,
+>
+> > >
+>
+> > > Recently, I found migration failed when enable vPMU.
+>
+> > >
+>
+> > > migrate vPMU state was introduced in linux-3.10 + qemu-1.7.
+>
+> > >
+>
+> > > As long as enable vPMU, qemu will save / load the
+>
+> > > vmstate_msr_architectural_pmu(msr_global_ctrl) register during the
+>
+> > > migration.
+>
+> > > But global_ctrl generated based on cpuid(0xA), the number of
+>
+> > > general-purpose performance
+>
+> > > monitoring counters(PMC) can vary according to Intel SDN. The number of
+>
+> > > PMC presented
+>
+> > > to vm, does not support configuration currently, it depend on host
+>
+> > > cpuid, and enable all pmc
+>
+> > > defaultly at KVM. It cause migration to fail between boards with
+>
+> > > different PMC counts.
+>
+> > >
+>
+> > > The return value of cpuid (0xA) is different dur to cpu, according to
+>
+> > > Intel SDNï¼18-10 Vol. 3B:
+>
+> > >
+>
+> > > Note: The number of general-purpose performance monitoring counters
+>
+> > > (i.e. N in Figure 18-9)
+>
+> > > can vary across processor generations within a processor family, across
+>
+> > > processor families, or
+>
+> > > could be different depending on the configuration chosen at boot time
+>
+> > > in the BIOS regarding
+>
+> > > Intel Hyper Threading Technology, (e.g. N=2 for 45 nm Intel Atom
+>
+> > > processors; N =4 for processors
+>
+> > > based on the Nehalem microarchitecture; for processors based on the
+>
+> > > Sandy Bridge
+>
+> > > microarchitecture, N = 4 if Intel Hyper Threading Technology is active
+>
+> > > and N=8 if not active).
+>
+> > >
+>
+> > > Also I found, N=8 if HT is not active based on the broadwellï¼,
+>
+> > > such as CPU E7-8890 v4 @ 2.20GHz
+>
+> > >
+>
+> > > # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 -hda
+>
+> > > /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true
+>
+> > > -incoming tcp::8888
+>
+> > > Completed 100 %
+>
+> > > qemu-system-x86_64: error: failed to set MSR 0x38f to 0x7000000ff
+>
+> > > qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833:
+>
+> > > kvm_put_msrs:
+>
+> > > Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
+>
+> > > Aborted
+>
+> > >
+>
+> > > So make number of pmc configurable to vm ? Any better idea ?
+>
+> >
+>
+> > Coincidentally we hit a similar problem a few days ago with -cpu host  -
+>
+> > it took me
+>
+> > quite a while to spot the difference between the machines was the source
+>
+> > had hyperthreading disabled.
+>
+> >
+>
+> > An option to set the number of counters makes sense to me; but I wonder
+>
+> > how many other options we need as well.  Also, I'm not sure there's any
+>
+> > easy way for libvirt etc to figure out how many counters a host supports -
+>
+> > it's not in /proc/cpuinfo.
+>
+>
+>
+> We actually try to avoid /proc/cpuinfo whereever possible. We do direct
+>
+> CPUID asm instructions to identify features, and prefer to use
+>
+> /sys/devices/system/cpu if that has suitable data
+>
+>
+>
+> Where do the PMC counts come from originally ? CPUID or something else ?
+>
+>
+Yes, they're bits 8..15 of CPUID leaf 0xa
+Ok, that's easy enough for libvirt to detect then. More a question of what
+libvirt should then do this with the info....
+
+Regards,
+Daniel
+-- 
+|:
+https://berrange.com
+-o-
+https://www.flickr.com/photos/dberrange
+:|
+|:
+https://libvirt.org
+-o-
+https://fstop138.berrange.com
+:|
+|:
+https://entangle-photo.org
+-o-
+https://www.instagram.com/dberrange
+:|
+
+>
+-----Original Message-----
+>
+From: Daniel P. Berrange [
+mailto:address@hidden
+>
+Sent: Monday, April 24, 2017 6:34 PM
+>
+To: Dr. David Alan Gilbert
+>
+Cc: Zhuangyanying; Zhanghailiang; wangxin (U); address@hidden;
+>
+Gonglei (Arei); Huangzhichao; address@hidden
+>
+Subject: Re: [Qemu-devel] [BUG] Migrate failes between boards with different
+>
+PMC counts
+>
+>
+On Mon, Apr 24, 2017 at 11:27:16AM +0100, Dr. David Alan Gilbert wrote:
+>
+> * Daniel P. Berrange (address@hidden) wrote:
+>
+> > On Mon, Apr 24, 2017 at 10:23:21AM +0100, Dr. David Alan Gilbert wrote:
+>
+> > > * Zhuangyanying (address@hidden) wrote:
+>
+> > > > Hi all,
+>
+> > > >
+>
+> > > > Recently, I found migration failed when enable vPMU.
+>
+> > > >
+>
+> > > > migrate vPMU state was introduced in linux-3.10 + qemu-1.7.
+>
+> > > >
+>
+> > > > As long as enable vPMU, qemu will save / load the
+>
+> > > > vmstate_msr_architectural_pmu(msr_global_ctrl) register during the
+>
+migration.
+>
+> > > > But global_ctrl generated based on cpuid(0xA), the number of
+>
+> > > > general-purpose performance monitoring counters(PMC) can vary
+>
+> > > > according to Intel SDN. The number of PMC presented to vm, does
+>
+> > > > not support configuration currently, it depend on host cpuid, and
+>
+> > > > enable
+>
+all pmc defaultly at KVM. It cause migration to fail between boards with
+>
+different PMC counts.
+>
+> > > >
+>
+> > > > The return value of cpuid (0xA) is different dur to cpu, according to
+>
+> > > > Intel
+>
+SDNï¼18-10 Vol. 3B:
+>
+> > > >
+>
+> > > > Note: The number of general-purpose performance monitoring
+>
+> > > > counters (i.e. N in Figure 18-9) can vary across processor
+>
+> > > > generations within a processor family, across processor
+>
+> > > > families, or could be different depending on the configuration
+>
+> > > > chosen at boot time in the BIOS regarding Intel Hyper Threading
+>
+> > > > Technology, (e.g. N=2 for 45 nm Intel Atom processors; N =4 for
+>
+processors based on the Nehalem microarchitecture; for processors based on
+>
+the Sandy Bridge microarchitecture, N = 4 if Intel Hyper Threading Technology
+>
+is active and N=8 if not active).
+>
+> > > >
+>
+> > > > Also I found, N=8 if HT is not active based on the broadwellï¼,
+>
+> > > > such as CPU E7-8890 v4 @ 2.20GHz
+>
+> > > >
+>
+> > > > # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m
+>
+> > > > 4096 -hda
+>
+> > > > /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true
+>
+> > > > -incoming tcp::8888 Completed 100 %
+>
+> > > > qemu-system-x86_64: error: failed to set MSR 0x38f to
+>
+> > > > 0x7000000ff
+>
+> > > > qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833:
+>
+kvm_put_msrs:
+>
+> > > > Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
+>
+> > > > Aborted
+>
+> > > >
+>
+> > > > So make number of pmc configurable to vm ? Any better idea ?
+>
+> > >
+>
+> > > Coincidentally we hit a similar problem a few days ago with -cpu
+>
+> > > host  - it took me quite a while to spot the difference between
+>
+> > > the machines was the source had hyperthreading disabled.
+>
+> > >
+>
+> > > An option to set the number of counters makes sense to me; but I
+>
+> > > wonder how many other options we need as well.  Also, I'm not sure
+>
+> > > there's any easy way for libvirt etc to figure out how many
+>
+> > > counters a host supports - it's not in /proc/cpuinfo.
+>
+> >
+>
+> > We actually try to avoid /proc/cpuinfo whereever possible. We do
+>
+> > direct CPUID asm instructions to identify features, and prefer to
+>
+> > use /sys/devices/system/cpu if that has suitable data
+>
+> >
+>
+> > Where do the PMC counts come from originally ? CPUID or something
+>
+else ?
+>
+>
+>
+> Yes, they're bits 8..15 of CPUID leaf 0xa
+>
+>
+Ok, that's easy enough for libvirt to detect then. More a question of what
+>
+libvirt
+>
+should then do this with the info....
+>
+Do you mean to do a validation at the begining of migration? in 
+qemuMigrationBakeCookie() & qemuMigrationEatCookie(), if the PMC numbers are 
+not equal, just quit migration?
+It maybe a good enough first edition.
+But for a further better edition, maybe it's better to support Heterogeneous 
+migration I think, so we might need to make PMC number configrable, then we 
+need to modify KVM/qemu as well.
+
+Regards,
+-Zhuang Yanying
+
+* Zhuangyanying (address@hidden) wrote:
+>
+>
+>
+> -----Original Message-----
+>
+> From: Daniel P. Berrange [
+mailto:address@hidden
+>
+> Sent: Monday, April 24, 2017 6:34 PM
+>
+> To: Dr. David Alan Gilbert
+>
+> Cc: Zhuangyanying; Zhanghailiang; wangxin (U); address@hidden;
+>
+> Gonglei (Arei); Huangzhichao; address@hidden
+>
+> Subject: Re: [Qemu-devel] [BUG] Migrate failes between boards with different
+>
+> PMC counts
+>
+>
+>
+> On Mon, Apr 24, 2017 at 11:27:16AM +0100, Dr. David Alan Gilbert wrote:
+>
+> > * Daniel P. Berrange (address@hidden) wrote:
+>
+> > > On Mon, Apr 24, 2017 at 10:23:21AM +0100, Dr. David Alan Gilbert wrote:
+>
+> > > > * Zhuangyanying (address@hidden) wrote:
+>
+> > > > > Hi all,
+>
+> > > > >
+>
+> > > > > Recently, I found migration failed when enable vPMU.
+>
+> > > > >
+>
+> > > > > migrate vPMU state was introduced in linux-3.10 + qemu-1.7.
+>
+> > > > >
+>
+> > > > > As long as enable vPMU, qemu will save / load the
+>
+> > > > > vmstate_msr_architectural_pmu(msr_global_ctrl) register during the
+>
+> migration.
+>
+> > > > > But global_ctrl generated based on cpuid(0xA), the number of
+>
+> > > > > general-purpose performance monitoring counters(PMC) can vary
+>
+> > > > > according to Intel SDN. The number of PMC presented to vm, does
+>
+> > > > > not support configuration currently, it depend on host cpuid, and
+>
+> > > > > enable
+>
+> all pmc defaultly at KVM. It cause migration to fail between boards with
+>
+> different PMC counts.
+>
+> > > > >
+>
+> > > > > The return value of cpuid (0xA) is different dur to cpu, according
+>
+> > > > > to Intel
+>
+> SDNï¼18-10 Vol. 3B:
+>
+> > > > >
+>
+> > > > > Note: The number of general-purpose performance monitoring
+>
+> > > > > counters (i.e. N in Figure 18-9) can vary across processor
+>
+> > > > > generations within a processor family, across processor
+>
+> > > > > families, or could be different depending on the configuration
+>
+> > > > > chosen at boot time in the BIOS regarding Intel Hyper Threading
+>
+> > > > > Technology, (e.g. N=2 for 45 nm Intel Atom processors; N =4 for
+>
+> processors based on the Nehalem microarchitecture; for processors based on
+>
+> the Sandy Bridge microarchitecture, N = 4 if Intel Hyper Threading
+>
+> Technology
+>
+> is active and N=8 if not active).
+>
+> > > > >
+>
+> > > > > Also I found, N=8 if HT is not active based on the broadwellï¼,
+>
+> > > > > such as CPU E7-8890 v4 @ 2.20GHz
+>
+> > > > >
+>
+> > > > > # ./x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m
+>
+> > > > > 4096 -hda
+>
+> > > > > /data/zyy/test_qemu.img.sles12sp1 -vnc :99 -cpu kvm64,pmu=true
+>
+> > > > > -incoming tcp::8888 Completed 100 %
+>
+> > > > > qemu-system-x86_64: error: failed to set MSR 0x38f to
+>
+> > > > > 0x7000000ff
+>
+> > > > > qemu-system-x86_64: /data/zyy/git/test/qemu/target/i386/kvm.c:1833:
+>
+> kvm_put_msrs:
+>
+> > > > > Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.
+>
+> > > > > Aborted
+>
+> > > > >
+>
+> > > > > So make number of pmc configurable to vm ? Any better idea ?
+>
+> > > >
+>
+> > > > Coincidentally we hit a similar problem a few days ago with -cpu
+>
+> > > > host  - it took me quite a while to spot the difference between
+>
+> > > > the machines was the source had hyperthreading disabled.
+>
+> > > >
+>
+> > > > An option to set the number of counters makes sense to me; but I
+>
+> > > > wonder how many other options we need as well.  Also, I'm not sure
+>
+> > > > there's any easy way for libvirt etc to figure out how many
+>
+> > > > counters a host supports - it's not in /proc/cpuinfo.
+>
+> > >
+>
+> > > We actually try to avoid /proc/cpuinfo whereever possible. We do
+>
+> > > direct CPUID asm instructions to identify features, and prefer to
+>
+> > > use /sys/devices/system/cpu if that has suitable data
+>
+> > >
+>
+> > > Where do the PMC counts come from originally ? CPUID or something
+>
+> else ?
+>
+> >
+>
+> > Yes, they're bits 8..15 of CPUID leaf 0xa
+>
+>
+>
+> Ok, that's easy enough for libvirt to detect then. More a question of what
+>
+> libvirt
+>
+> should then do this with the info....
+>
+>
+>
+>
+Do you mean to do a validation at the begining of migration? in
+>
+qemuMigrationBakeCookie() & qemuMigrationEatCookie(), if the PMC numbers are
+>
+not equal, just quit migration?
+>
+It maybe a good enough first edition.
+>
+But for a further better edition, maybe it's better to support Heterogeneous
+>
+migration I think, so we might need to make PMC number configrable, then we
+>
+need to modify KVM/qemu as well.
+Yes agreed; the only thing I wanted to check was that libvirt would have enough
+information to be able to use any feature we added to QEMU.
+
+Dave
+
+>
+Regards,
+>
+-Zhuang Yanying
+--
+Dr. David Alan Gilbert / address@hidden / Manchester, UK
+