Diffstat (limited to 'results/classifier/118/all/1842787')
| -rw-r--r-- | results/classifier/118/all/1842787 | 238 |
1 files changed, 238 insertions, 0 deletions
diff --git a/results/classifier/118/all/1842787 b/results/classifier/118/all/1842787
new file mode 100644
index 000000000..2a2ff6144
--- /dev/null
+++ b/results/classifier/118/all/1842787
@@ -0,0 +1,238 @@
+user-level: 0.953
+peripherals: 0.950
+mistranslation: 0.948
+device: 0.944
+x86: 0.943
+register: 0.942
+performance: 0.938
+permissions: 0.936
+architecture: 0.935
+risc-v: 0.935
+virtual: 0.935
+files: 0.933
+TCG: 0.932
+arm: 0.931
+debug: 0.931
+KVM: 0.929
+assembly: 0.928
+PID: 0.928
+kernel: 0.928
+hypervisor: 0.928
+semantic: 0.926
+VMM: 0.926
+ppc: 0.925
+socket: 0.924
+graphic: 0.922
+vnc: 0.921
+boot: 0.919
+network: 0.915
+i386: 0.872
+
+Writes permanently hang with very heavy I/O on virtio-scsi - worse on virtio-blk
+
+Up-to-date Arch Linux on host and guest. linux 5.2.11. QEMU 4.1.0. Full command line at bottom.
+
+Host gives QEMU two thin LVM volumes. The first is the root filesystem, and the second is for heavy I/O, on a Samsung 970 Evo 1TB.
+
+When maxing out the I/O on the second virtual block device using virtio-blk, I often get a "lockup" in about an hour or two. On the advice of iggy in IRC, I switched over to virtio-scsi. It ran perfectly for a few days, but then "locked up" in the same way.
+
+By "lockup", I mean writes to the second virtual block device permanently hang. I can read files from it, but even a "touch foo" never times out, cannot be killed with "kill -9", and is stuck in uninterruptible sleep.
+
+When this happens, writes to the first virtual block device with the root filesystem are fine, so the OS itself remains responsive.
+
+The second virtual block device uses BTRFS. But I have also tried XFS and reproduced the issue.
+
+In the guest, when this starts, the kernel starts logging "task X blocked for more than Y seconds". Below is an example of one of these. At this point, anything that writes, or later tries to write, to this block device gets stuck in uninterruptible sleep.
+
+-----
+
+INFO: task kcompactd:232 blocked for more than 860 seconds.
+      Not tainted 5.2.11-1 #1
+"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
+kcompactd0      D    0   232      2 0x80004000
+Call Trace:
+ ? __schedule+0x27f/0x6d0
+ schedule+0x3d/0xc0
+ io_schedule+0x12/0x40
+ __lock_page+0x14a/0x250
+ ? add_to_page_cache_lru+0xe0/0xe0
+ migrate_pages+0x803/0xb70
+ ? isolate_migratepages_block+0x9f0/0x9f0
+ ? __reset_isolation_suitable+0x110/0x110
+ compact_zone+0x6a2/0xd30
+ kcompactd_do_work+0x134/0x260
+ ? kvm_clock_read+0x14/0x30
+ ? kvm_sched_clock_read+0x5/0x10
+ kcompactd+0xd3/0x220
+ ? wait_woken+0x80/0x80
+ kthread+0xfd/0x130
+ ? kcompactd_do_work+0x260/0x260
+ ? kthread_park+0x80/0x80
+ ret_from_fork+0x35/0x40
+
+-----
+
+In the guest, there are no other dmesg/journalctl entries other than "task...blocked".
+
+On the host, there are no dmesg/journalctl entries whatsoever. Everything else in the host continues to work fine, including other QEMU VMs on the same underlying SSD (but obviously different LVM volumes).
+
+I understand there might not be enough to go on here, and I also understand it's possible this isn't a QEMU bug. Happy to run given commands or patches to help diagnose what's going on here.
+
+I'm now running a custom-compiled QEMU 4.1.0, with debug symbols, so I can get a meaningful backtrace from the host point of view.
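+
+(For context on the "very heavy I/O" described above: the report doesn't specify the exact workload used to max out the second device. A write-heavy fio job of roughly this shape is the kind of load that saturates a block device; the mount point, file name, and job parameters here are illustrative only.)
+
+  # Hypothetical sustained random-write load against the second disk's filesystem.
+  fio --name=heavy-writes --filename=/mnt/nvme/fio.test --rw=randwrite \
+      --bs=4k --ioengine=libaio --iodepth=32 --numjobs=4 --size=8G \
+      --direct=1 --time_based --runtime=7200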
+
+-----
+
+/usr/bin/qemu-system-x86_64
+  -name arch,process=qemu:arch
+  -no-user-config
+  -nodefaults
+  -nographic
+  -uuid 0528162b-2371-41d5-b8da-233fe61b6458
+  -pidfile /tmp/0528162b-2371-41d5-b8da-233fe61b6458.pid
+  -machine q35,accel=kvm,vmport=off,dump-guest-core=off
+  -cpu SandyBridge-IBRS
+  -smp cpus=24,cores=12,threads=1,sockets=2
+  -m 24G
+  -drive if=pflash,format=raw,readonly,file=/usr/share/ovmf/x64/OVMF_CODE.fd
+  -drive if=pflash,format=raw,readonly,file=/var/qemu/0528162b-2371-41d5-b8da-233fe61b6458.fd
+  -monitor telnet:localhost:8000,server,nowait,nodelay
+  -spice unix,addr=/tmp/0528162b-2371-41d5-b8da-233fe61b6458.sock,disable-ticketing
+  -device ioh3420,id=pcie.1,bus=pcie.0,slot=0
+  -device virtio-vga,bus=pcie.1,addr=0
+  -usbdevice tablet
+  -netdev bridge,id=network0,br=br0
+  -device virtio-net-pci,netdev=network0,mac=02:37:de:79:19:09,bus=pcie.0,addr=3
+  -device virtio-scsi-pci,id=scsi1
+  -drive driver=raw,node-name=hd0,file=/dev/lvm/arch_root,if=none,discard=unmap
+  -device scsi-hd,drive=hd0,bootindex=1
+  -drive driver=raw,node-name=hd1,file=/dev/lvm/arch_nvme,if=none,discard=unmap
+  -device scsi-hd,drive=hd1,bootindex=2
+
+-----
+
+On Thu, Sep 05, 2019 at 03:42:03AM -0000, James Harvey wrote:
+> ** Description changed:
+>
+> Up-to-date Arch Linux on host and guest. linux 5.2.11. QEMU 4.1.0.
+> Full command line at bottom.
+>
+> Host gives QEMU two thin LVM volumes. The first is the root filesystem,
+> and the second is for heavy I/O, on a Samsung 970 Evo 1TB.
+>
+> When maxing out the I/O on the second virtual block device using
+> virtio-blk, I often get a "lockup" in about an hour or two. On the
+> advice of iggy in IRC, I switched over to virtio-scsi. It ran perfectly
+> for a few days, but then "locked up" in the same way.
+>
+> By "lockup", I mean writes to the second virtual block device
+> permanently hang. I can read files from it, but even a "touch foo"
+> never times out, cannot be killed with "kill -9", and is stuck in
+> uninterruptible sleep.
+>
+> When this happens, writes to the first virtual block device with the
+> root filesystem are fine, so the OS itself remains responsive.
+>
+> The second virtual block device uses BTRFS. But I have also tried XFS
+> and reproduced the issue.
+>
+> In the guest, when this starts, the kernel starts logging "task X
+> blocked for more than Y seconds". Below is an example of one of these.
+> At this point, anything that writes, or later tries to write, to this
+> block device gets stuck in uninterruptible sleep.
+>
+> -----
+>
+> INFO: task kcompactd:232 blocked for more than 860 seconds.
+>       Not tainted 5.2.11-1 #1
+> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
+> kcompactd0      D    0   232      2 0x80004000
+> Call Trace:
+>  ? __schedule+0x27f/0x6d0
+>  schedule+0x3d/0xc0
+>  io_schedule+0x12/0x40
+>  __lock_page+0x14a/0x250
+>  ? add_to_page_cache_lru+0xe0/0xe0
+>  migrate_pages+0x803/0xb70
+>  ? isolate_migratepages_block+0x9f0/0x9f0
+>  ? __reset_isolation_suitable+0x110/0x110
+>  compact_zone+0x6a2/0xd30
+>  kcompactd_do_work+0x134/0x260
+>  ? kvm_clock_read+0x14/0x30
+>  ? kvm_sched_clock_read+0x5/0x10
+>  kcompactd+0xd3/0x220
+>  ? wait_woken+0x80/0x80
+>  kthread+0xfd/0x130
+>  ? kcompactd_do_work+0x260/0x260
+>  ? kthread_park+0x80/0x80
+>  ret_from_fork+0x35/0x40
+>
+> -----
+>
+> In the guest, there are no other dmesg/journalctl entries other than
+> "task...blocked".
+>
+> On the host, there are no dmesg/journalctl entries whatsoever.
+> Everything else in the host continues to work fine, including other
+> QEMU VMs on the same underlying SSD (but obviously different LVM
+> volumes).
+>
+> I understand there might not be enough to go on here, and I also
+> understand it's possible this isn't a QEMU bug. Happy to run given
+> commands or patches to help diagnose what's going on here.
+>
+> I'm now running a custom-compiled QEMU 4.1.0, with debug symbols, so I
+> can get a meaningful backtrace from the host point of view.
+>
+> I've only recently tried this level of I/O, so can't say if this is a
+> new issue.
+>
+> + When writes are hanging, on the host, I can connect to the monitor.
+> + Running "info block" shows nothing unusual.
+> +
+> -----
+>
+> /usr/bin/qemu-system-x86_64
+>   -name arch,process=qemu:arch
+>   -no-user-config
+>   -nodefaults
+>   -nographic
+>   -uuid 0528162b-2371-41d5-b8da-233fe61b6458
+>   -pidfile /tmp/0528162b-2371-41d5-b8da-233fe61b6458.pid
+>   -machine q35,accel=kvm,vmport=off,dump-guest-core=off
+>   -cpu SandyBridge-IBRS
+>   -smp cpus=24,cores=12,threads=1,sockets=2
+>   -m 24G
+>   -drive if=pflash,format=raw,readonly,file=/usr/share/ovmf/x64/OVMF_CODE.fd
+>   -drive if=pflash,format=raw,readonly,file=/var/qemu/0528162b-2371-41d5-b8da-233fe61b6458.fd
+>   -monitor telnet:localhost:8000,server,nowait,nodelay
+>   -spice unix,addr=/tmp/0528162b-2371-41d5-b8da-233fe61b6458.sock,disable-ticketing
+>   -device ioh3420,id=pcie.1,bus=pcie.0,slot=0
+>   -device virtio-vga,bus=pcie.1,addr=0
+>   -usbdevice tablet
+>   -netdev bridge,id=network0,br=br0
+>   -device virtio-net-pci,netdev=network0,mac=02:37:de:79:19:09,bus=pcie.0,addr=3
+>   -device virtio-scsi-pci,id=scsi1
+>   -drive driver=raw,node-name=hd0,file=/dev/lvm/arch_root,if=none,discard=unmap
+>   -device scsi-hd,drive=hd0,bootindex=1
+>   -drive driver=raw,node-name=hd1,file=/dev/lvm/arch_nvme,if=none,discard=unmap
+>   -device scsi-hd,drive=hd1,bootindex=2
+
+Please post a backtrace of all QEMU threads when I/O is hung. You can use
+"gdb -p $(pidof qemu-system-x86_64)" to connect GDB and "thread apply
+all bt" to produce a backtrace of all threads.
+
+Stefan
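+
+(A minimal, non-interactive way to capture what Stefan asks for, assuming gdb and pidof are installed on the host and only one qemu-system-x86_64 process is running; the output path is arbitrary.)
+
+  # Attach to the running QEMU, dump every thread's stack, then detach.
+  gdb -p "$(pidof qemu-system-x86_64)" -batch -ex "thread apply all bt" > /tmp/qemu-threads.txt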
+
+
+Apologies, it looks like I ran into two separate bugs, one with XFS and one with BTRFS, that had the same symptom, initially making me think this must be a QEMU issue.
+
+Using blktrace, I was able to see within the VM that the virtio block device wasn't getting the writes that were going into uninterruptible sleep.
+
+So, this report can be closed. For some reason, virtio-blk seemed to trigger the bugs more rapidly, but at this point, I can't say there is anything at fault with it or virtio-scsi.
+
+
+The BTRFS issue was discussed and linked to here https://lore.kernel.org<email address hidden>/ and the fix has been released. I've been able to run it for several days without a lockup, so it seems to have fixed the issue for me.
+
+I just emailed the XFS list about the separate problems with it. No idea if it's an issue in kernels more recent than 5.1.15-5.1.16, which is what I was running at the time of the XFS errors. (Like the original report said, I was on 5.2.11 at that point.) See https://www.spinics.net/lists/linux-xfs/msg31927.html
+
+Thanks for updating us on this issue, which turned out not to be a QEMU bug.
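+
+(For reference, the in-guest blktrace check mentioned above, confirming whether writes ever reach the virtio disk, can be done with a pipeline of roughly this shape; /dev/sdb is an assumption for the second virtio-scsi disk and should be adjusted to the guest's actual device name.)
+
+  # Trace requests actually submitted to the block device and decode them live.
+  blktrace -d /dev/sdb -o - | blkparse -i -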