SYSRET instruction incorrectly implemented
The Intel architecture manual states that when returning to user mode, the SYSRET instruction reloads the stack selector (%ss) from the IA32_STAR model-specific register using the following logic:
SS.Selector <-- (IA32_STAR[63:48]+8) OR 3; (* RPL forced to 3 *)
Another description of the instruction behavior which shows the same logic in a slightly different form can also be found here:
http://tptp.cc/mirrors/siyobik.info/instruction/SYSRET.html
[...]
SS(SEL) = IA32_STAR[63:48] + 8;
SS(PL) = 0x3;
[...]
In other words, %ss is supposed to be loaded with bits 63:48 of the IA32_STAR model-specific register, incremented by 8 and then ORed with 3. ORing in the 3 sets the requested privilege level (RPL) to 3 (user); this is done because SYSRET returns to user mode after a system call.
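To make the arithmetic concrete, here is a tiny standalone C program (illustration only, not QEMU code; the IA32_STAR value is hypothetical, chosen so the result matches the 0x2b/0x28 selectors mentioned later in this report):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Hypothetical IA32_STAR value; only bits 63:48 (here 0x0020)
         * matter for the %ss computation. */
        uint64_t ia32_star = 0x0020000000000000ULL;
        uint16_t base = (uint16_t)(ia32_star >> 48);

        uint16_t ss_qemu = base + 8;        /* without the OR: 0x28 */
        uint16_t ss_sdm  = (base + 8) | 3;  /* what the SDM specifies: 0x2b */

        printf("without OR 3: %%ss = 0x%02x\n", ss_qemu);
        printf("with OR 3:    %%ss = 0x%02x\n", ss_sdm);
        return 0;
    }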
However, helper_sysret() in target-i386/seg_helper.c does not do the "OR 3" step. The code looks like this:
    cpu_x86_load_seg_cache(env, R_SS, selector + 8,
                           0, 0xffffffff,
                           DESC_G_MASK | DESC_B_MASK | DESC_P_MASK |
                           DESC_S_MASK | (3 << DESC_DPL_SHIFT) |
                           DESC_W_MASK | DESC_A_MASK);
It should look like this:
    cpu_x86_load_seg_cache(env, R_SS, (selector + 8) | 3,
                           0, 0xffffffff,
                           DESC_G_MASK | DESC_B_MASK | DESC_P_MASK |
                           DESC_S_MASK | (3 << DESC_DPL_SHIFT) |
                           DESC_W_MASK | DESC_A_MASK);
The code does correctly set the privilege level bits for the code selector register (%cs) but not for the stack selector (%ss).
The effect of this is that when SYSRET returns control to the user-mode caller, %ss will have the privilege level bits cleared. In my case, it went from 0x2b to 0x28. This caused a crash later: when the user-mode code was preempted by an interrupt and the interrupt handler did an IRET, a general protection fault occurred because the %ss value being loaded from the exception frame was not valid for user mode. (At least, I think that's what happened.)
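For reference, a segment selector packs the requested privilege level into its low two bits (bits 1:0 = RPL, bit 2 = table indicator, bits 15:3 = descriptor index; this decoding is from the Intel SDM, not from QEMU). A small C helper shows that 0x2b and 0x28 name the same descriptor and differ only in RPL:

    #include <stdio.h>
    #include <stdint.h>

    /* Decode a segment selector per the Intel SDM. */
    static void decode_selector(uint16_t sel)
    {
        printf("0x%02x -> index %u, %s, RPL %u\n",
               sel, sel >> 3, (sel & 4) ? "LDT" : "GDT", sel & 3);
    }

    int main(void)
    {
        decode_selector(0x2b);  /* index 5, GDT, RPL 3: valid for user mode */
        decode_selector(0x28);  /* index 5, GDT, RPL 0: rejected by IRET at CPL 3 */
        return 0;
    }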
This behavior seems inconsistent with real hardware, and also appears to be wrong with respect to the Intel documentation, so I'm pretty confident in calling this a bug. :)
Note that this issue seems to have been around for a long time. I discovered it while using QEMU 2.2.0, but I happened to have the sources for QEMU 0.10.5, and the problem is there too (in target-i386/op_helper.c). I am using FreeBSD/amd64 9.1-RELEASE as my host system, without KVM.
The fix is fairly simple. I'm attaching a patch which worked for me. Using this fix, the code that I'm testing now behaves the same on the QEMU virtual machine as on real hardware.
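The attached patch isn't reproduced here, but the change amounts to a one-line diff along these lines (the hunk position is a placeholder, and if any other call sites in helper_sysret() load %ss the same way, they would presumably need the same treatment):

    --- a/target-i386/seg_helper.c
    +++ b/target-i386/seg_helper.c
    @@ -NNN,5 +NNN,5 @@
    -    cpu_x86_load_seg_cache(env, R_SS, selector + 8,
    +    cpu_x86_load_seg_cache(env, R_SS, (selector + 8) | 3,
              0, 0xffffffff,
              DESC_G_MASK | DESC_B_MASK | DESC_P_MASK |
              DESC_S_MASK | (3 << DESC_DPL_SHIFT) |
              DESC_W_MASK | DESC_A_MASK);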
- Bill (<email address hidden>)