results/classifier/zero-shot/105/semantic/1670377


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112

semantic: 0.751
graphic: 0.735
instruction: 0.719
assembly: 0.707
device: 0.683
network: 0.681
other: 0.665
vnc: 0.624
boot: 0.513
KVM: 0.505
socket: 0.486
mistranslation: 0.437

 VNC: short read for zlre data/RDR EndOfStream

In openQA we have a custom VNC client (https://github.com/os-autoinst/os-autoinst/tree/master/consoles), which connects to QEMU guest and from there performs actions (sends keys, handles pointer, ...). We have several backends (https://github.com/os-autoinst/os-autoinst/tree/master/backend). With qemu backend we start QEMU guest *locally* on openQA worker which connects to it via VNC and sends commands. That works fine.

However, with svirt backend we start QEMU on a KVM or Xen host and then connect to it remotely from openQA worker - the guest and worker are different systems. In this scenario fairly often happens that while system operates in Grub2, QEMU stops sending data via VNC:

...
15:24:15.5341 Debug: /var/lib/openqa/share/tests/sle-12-SP1/tests/installation/bootloader_uefi.pm:50 called testapi::send_key
15:24:15.5342 27074 <<< testapi::send_key(key='c')
15:24:15.7361 Debug: /var/lib/openqa/share/tests/sle-12-SP1/tests/installation/bootloader_uefi.pm:51 called testapi::type_string
15:24:15.7362 27074 <<< testapi::type_string(string='gfxmode=1024x768; terminal_output console; terminal_output gfxterm
', max_interval=250, wait_screen_changes=0)
15:24:22.2243 Debug: /var/lib/openqa/share/tests/sle-12-SP1/tests/installation/bootloader_uefi.pm:53 called testapi::send_key
15:24:22.2244 27074 <<< testapi::send_key(key='esc')
15:24:22.4255 Debug: /var/lib/openqa/share/tests/sle-12-SP1/tests/installation/bootloader_uefi.pm:79 called testapi::send_key
15:24:22.4256 27074 <<< testapi::send_key(key='e')
15:24:22.6264 Debug: /var/lib/openqa/share/tests/sle-12-SP1/tests/installation/bootloader_uefi.pm:81 called testapi::send_key
15:24:22.6265 27074 <<< testapi::send_key(key='down')
15:24:22.8273 Debug: /var/lib/openqa/share/tests/sle-12-SP1/tests/installation/bootloader_uefi.pm:81 called testapi::send_key
15:24:22.8274 27074 <<< testapi::send_key(key='down')
15:24:23.0282 Debug: /var/lib/openqa/share/tests/sle-12-SP1/tests/installation/bootloader_uefi.pm:81 called testapi::send_key
15:24:23.0283 27074 <<< testapi::send_key(key='down')
DIE short read for zlre data 107132 - 995002 at /usr/lib/os-autoinst/consoles/VNC.pm line 978.

 at /usr/lib/os-autoinst/backend/baseclass.pm line 73.
...

My observation is that it happens only while in Grub, when resolution happened a short while ago. See attached video and log.

Prior to QEMU 2.8.0 I was able to reproduce a similar issue with vncviewer. I started QEMU with SLES JeOS image pressed several times a 'down' key in Grub and vncviewer (Tiger VNC 1.6.0 from openSUSE Leap 42.2) crashed with rdr::EndOfStream exception. This does not happen with QEMU 2.8.0, but I am still able to reproduce similar issue via openQA.

/usr/bin/qemu-system-x86_64 -name guest=openQA-SUT-20,debug-threads=on -S -machine pc-i440fx-2.6,accel=kvm,usb=off -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 87535fc1-e693-41b9-813e-834d6fc4cb5a -no-user-config -nodefaults   -rtc base=utc -no-reboot -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/openQA-SUT-20.img,format=qcow2,if=none,id=drive-virtio-disk0,cache=unsafe -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev user,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:12:34:56,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device virtio-tablet-pci,id=input0,bus=pci.0,addr=0x6 -device virtio-keyboard-pci,id=input1,bus=pci.0,addr=0x7 -vnc 0.0.0.0:20,share=force-shared -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on -monitor stdio

Host: openSUSE Leap 42.2 x86_64 KVM or Xen on x86_64 Intel with QEMU 2.6.0.
Guest: Leap 42.2.

I can't reproduce the problem with QEMU 2.5.0, but I can with any QEMU version from 2.6 RC1 on.


It isn't 100% clear from the info provided, but this is almost certainly fixed in 2.9.0 by

commit 537848ee62195fc06c328b1cd64f4218f404a7f1
Author: Michael Tokarev <email address hidden>
Date:   Fri Feb 3 12:52:29 2017 +0300

    vnc: do not disconnect on EAGAIN
    
    When qemu vnc server is trying to send large update to clients,
    there might be a situation when system responds with something
    like EAGAIN, indicating that there's no system memory to send
    that much data (depending on the network speed, client and server
    and what is happening).  In this case, something like this happens
    on qemu side (from strace):
    
    sendmsg(16, {msg_name(0)=NULL,
            msg_iov(1)=[{"\244\"..., 729186}],
            msg_controllen=0, msg_flags=0}, 0) = 103950
    sendmsg(16, {msg_name(0)=NULL,
            msg_iov(1)=[{"lz\346"..., 1559618}],
            msg_controllen=0, msg_flags=0}, 0) = -1 EAGAIN
    sendmsg(-1, {msg_name(0)=NULL,
            msg_iov(1)=[{"lz\346"..., 1559618}],
            msg_controllen=0, msg_flags=0}, 0) = -1 EBADF
    
    qemu closes the socket before the retry, and obviously it gets EBADF
    when trying to send to -1.
    
    This is because there WAS a special handling for EAGAIN, but now it doesn't
    work anymore, after commit 04d2529da27db512dcbd5e99d0e26d333f16efcc, because
    now in all error-like cases we initiate vnc disconnect.
    
    This change were introduced in qemu 2.6, and caused numerous grief for many
    people, resulting in their vnc clients reporting sporadic random disconnects
    from vnc server.
    
    Fix that by doing the disconnect only when necessary, i.e. omitting this
    very case of EAGAIN.
    
    Hopefully the existing condition (comparing with QIO_CHANNEL_ERR_BLOCK)
    is sufficient, as the original code (before the above commit) were
    checking for other errno values too.
    
    Apparently there's another (semi?)bug exist somewhere here, since the
    code tries to write to fd# -1, it probably should check if the connection
    is open before. But this isn't important.
    
    Signed-off-by: Michael Tokarev <email address hidden>
    Reviewed-by: Daniel P. Berrange <email address hidden>
    Message-id: <email address hidden>
    Fixes: 04d2529da27db512dcbd5e99d0e26d333f16efcc
    Cc: Daniel P. Berrange <email address hidden>
    Cc: Gerd Hoffmann <email address hidden>
    Cc: <email address hidden>
    Signed-off-by: Gerd Hoffmann <email address hidden>