summary refs log tree commit diff stats
path: root/results/classifier/gemma3:12b/network/1545052
diff options
context:
space:
mode:
Diffstat (limited to 'results/classifier/gemma3:12b/network/1545052')
-rw-r--r--results/classifier/gemma3:12b/network/154505259
1 files changed, 59 insertions, 0 deletions
diff --git a/results/classifier/gemma3:12b/network/1545052 b/results/classifier/gemma3:12b/network/1545052
new file mode 100644
index 00000000..1ac4af44
--- /dev/null
+++ b/results/classifier/gemma3:12b/network/1545052
@@ -0,0 +1,59 @@
+
+RDMA migration will hang forever if target QEMU fails to load vmstate
+
+Get a pair of machines with infiniband support. On one host run
+
+$ qemu-system-x86_64 -monitor stdio -incoming rdma:ibme:4444 -vnc :1 -m 1000
+
+To start an incoming migration.
+
+
+Now on the other host, run QEMU with an intentionally different configuration (ie different RAM size)
+
+$ qemu-system-x86_64 -monitor stdio -vnc :1 -m 2000
+
+Now trigger a migration on this source host
+
+(qemu) migrate rdma:ibpair:4444
+
+
+You will see on the target host, that it failed to load migration:
+
+dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (2) Ethernet
+qemu-system-x86_64: Length mismatch: pc.ram: 0x7d000000 in != 0x3e800000: Invalid argument
+qemu-system-x86_64: error while loading state for instance 0x0 of device 'ram'
+
+This is to be expected, however, at this point QEMU has hung and no longer responds to the monitor
+
+GDB shows the target host is stuck in this callpath
+
+#0  0x00007ffff39141cd in write () at ../sysdeps/unix/syscall-template.S:81
+#1  0x00007ffff27fe795 in rdma_get_cm_event.part.15 () from /lib64/librdmacm.so.1
+#2  0x000055555593e445 in qemu_rdma_cleanup (rdma=0x7fff9647e010) at migration/rdma.c:2210
+#3  0x000055555593ea45 in qemu_rdma_close (opaque=0x555557796770) at migration/rdma.c:2652
+#4  0x00005555559397cc in qemu_fclose (f=f@entry=0x5555564b1450) at migration/qemu-file.c:270
+#5  0x0000555555936b88 in process_incoming_migration_co (opaque=0x5555564b1450) at migration/migration.c:361
+#6  0x0000555555a25a1a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79
+#7  0x00007fffef5b3110 in ?? () from /lib64/libc.so.6
+
+
+
+Now, back on the source host again, you would expect to see that the migrate command failed. Instead, this QEMU is hung too. 
+
+GDB shows the source host, migrate thread, is stuck in this callpath:
+
+#0  0x00007ffff391522d in read#1  0x00007ffff00efd93 in ibv_get_cq_event () at /lib64/libibverbs.so.1
+#2  0x00005555559403f2 in qemu_rdma_block_for_wrid (rdma=rdma@entry=0x7fff3d07e010, wrid_requested=wrid_requested@entry=4000, byte_len=byte_len@entry=0x7fff39de370c) at migration/rdma.c:1511
+#3  0x000055555594058a in qemu_rdma_exchange_get_response (rdma=0x7fff3d07e010, head=head@entry=0x7fff39de3780, expecting=expecting@entry=2, idx=idx@entry=0)
+    at migration/rdma.c:1648
+#4  0x0000555555941e71 in qemu_rdma_exchange_send (rdma=0x7fff3d07e010, head=0x7fff39de3840, data=0x0, resp=0x7fff39de3870, resp_idx=0x7fff39de3880, callback=0x0) at migration/rdma.c:1725
+#5  0x00005555559447e4 in qemu_rdma_registration_stop (f=<optimized out>, opaque=<optimized out>, flags=0, data=<optimized out>) at migration/rdma.c:3302
+#6  0x000055555593bc4b in ram_control_after_iterate (f=f@entry=0x5555564c20f0, flags=flags@entry=0) at migration/qemu-file.c:157
+#7  0x0000555555740b59 in ram_save_setup (f=0x5555564c20f0, opaque=<optimized out>) at /home/berrange/src/virt/qemu/migration/ram.c:1959
+#8  0x00005555557451c1 in qemu_savevm_state_begin (f=0x5555564c20f0, params=params@entry=0x555555f6f048 <current_migration.37991+72>)
+    at /home/berrange/src/virt/qemu/migration/savevm.c:919
+#9  0x00005555559381a5 in migration_thread (opaque=0x555555f6f000 <current_migration.37991>) at migration/migration.c:1633
+#10 0x00007ffff390edc5 in start_thread (arg=0x7fff39de4700) at pthread_create.c:308
+
+
+It should have aborted migrate and set the status to failed.
\ No newline at end of file