[BUG] x86/PAT handling severely crippled AMD-V SVM KVM performance

Hi,

I maintain out-of-tree 3D API pass-through QEMU device models at https://github.com/kjliew/qemu-3dfx that provide 3D acceleration for legacy 32-bit Windows guests (Win98SE, WinME, Win2k and WinXP), with a focus on playing old games from 1996-2003. They currently support the now-defunct proprietary 3Dfx API called Glide, as well as an alternative OpenGL pass-through based on the Mesa implementation.

The basic concept of both implementations is a memory-mapped virtual interface consisting of host/guest shared memory, using a guest-push model rather than the host-pull model of typical QEMU device model implementations. The guest uses the shared memory as FIFOs for drawing commands and data, batching up operations until a serialization event flushes the FIFOs to the host (a minimal sketch appears further below). This achieves very good performance: the virtual CPUs are fast under hardware acceleration (Intel VT/AMD-V), and the batching reduces the overhead of frequent VMEXITs to service the device emulation. Both implementations work on Windows 10 with the WHPX and HAXM accelerators, as well as with KVM on Linux.

On Windows 10, QEMU's WHPX implementation does not sync MSR_IA32_PAT during host/guest state sync. There is no visibility into the closed-source WHPX as to how things are managed behind the scenes, but from the performance figures I measured I can conclude that it does not handle MSR_IA32_PAT correctly on either Intel or AMD. Call this fair enough, if you will: nothing flagged any concern, and games such as Quake2 and Quake3 still ran at a playable 40~60 FPS on Win2k/XP guests. Then the same games were run on a Win98/ME guest, and the frame rate went through the roof (300~500 FPS) on the same CPU and GPU; the latter was actually more in line with running the games bare-metal with vsync off.

On Linux (kernel 5.6.7/Mesa 20.0 at the time of writing), the difference persisted. On Intel CPUs (and it so happened I was on a laptop with an Intel GPU), the VMX-based kvm_intel got it right, while the SVM-based kvm_amd did not. To put it as a simple exaggeration, an aging Core i3-4010U/HD Graphics 4400 (Haswell GT2) exhibited insane performance in Quake2/Quake3 timedemos, totally crushing a more recent AMD Ryzen 2500U APU/Vega 8 Graphics and a desktop AMD FX8300/NVIDIA GT730. Simply unbelievable! It turned out to have something to do with AMD-V NPT: loading kvm_amd with npt=0 gave the Ryzen APU and the FX8300 a huge performance leap back. However, the AMD NPT issue with KVM was supposedly fixed by kernel commits back in 2017, and npt=0 actually costs the VM performance, because the hypervisor has to intervene to maintain shadow page tables.

Finally, I was able to find the pointer that led to the MSR_IA32_PAT register. By updating MSR_IA32_PAT to 0x0606xxxx0606xxxxULL, AMD CPUs regain their rightful performance without taking the npt=0 hit on Linux KVM. Carrying the same solution over to Windows, both Intel and AMD CPUs no longer need a Win98/ME guest to unleash their full performance potential, and the game performance figures measured on WHPX were not far behind Linux KVM.
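To make sense of that magic value: if I read the SDM right, each byte of MSR_IA32_PAT selects the memory type for one of the eight PAT entries, and 0x06 encodes write-back (WB). Against the power-on default of 0x0007040600070406, the 0x0606xxxx0606xxxx pattern overwrites exactly the PAT2/PAT3 (and mirrored PAT6/PAT7) entries, which default to UC-/UC and are what a guest PTE with PCD/PWT set selects. In other words, guest mappings that ask for uncached memory end up write-back instead. A small decoder to illustrate (the xxxx bytes from my value are left at their defaults here):

    /* Decode MSR_IA32_PAT (MSR 0x277): one memory-type byte per PAT entry. */
    #include <stdint.h>
    #include <stdio.h>

    static const char *pat_type(uint8_t t)
    {
        switch (t) {
        case 0x00: return "UC";
        case 0x01: return "WC";
        case 0x04: return "WT";
        case 0x05: return "WP";
        case 0x06: return "WB";
        case 0x07: return "UC-";
        default:   return "rsvd";
        }
    }

    int main(void)
    {
        uint64_t pat_default = 0x0007040600070406ULL; /* power-on default */
        uint64_t pat_patched = 0x0606040606060406ULL; /* 0x0606xxxx0606xxxx,
                                                         xxxx kept at the
                                                         default bytes     */
        for (int i = 0; i < 8; i++)
            printf("PAT%d: %-4s -> %s\n", i,
                   pat_type((uint8_t)(pat_default >> (8 * i))),
                   pat_type((uint8_t)(pat_patched >> (8 * i))));
        return 0;
    }

This prints PAT2: UC- -> WB and PAT3: UC -> WB (likewise PAT6/PAT7), everything else unchanged. It is a blunt instrument, of course; forcing UC requests to WB is presumably only safe here because the shared FIFOs are plain RAM rather than real device registers.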
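For reference, here is a minimal sketch of the guest-push FIFO scheme described at the top of this message. The names and layout are illustrative only, not the actual qemu-3dfx interface; the tight store loop in fifo_push is exactly the kind of code that collapses when the shared pages end up effectively uncacheable:

    /* Illustrative guest-side sketch of the guest-push FIFO. */
    #include <stdint.h>

    #define FIFO_WORDS (1u << 18)        /* 1 MiB of shared command space */

    struct cmd_fifo {
        volatile uint32_t head;          /* guest write index */
        volatile uint32_t tail;          /* host read index   */
        uint32_t data[FIFO_WORDS];       /* drawing commands and data */
    };

    /* Batch commands into shared memory: plain stores, no VMEXIT.  This
     * is only fast if the pages are mapped write-back; with UC mappings
     * every store becomes an uncached bus transaction. */
    static void fifo_push(struct cmd_fifo *f, const uint32_t *w, uint32_t n)
    {
        uint32_t head = f->head;
        for (uint32_t i = 0; i < n; i++)
            f->data[(head + i) % FIFO_WORDS] = w[i];
        f->head = (head + n) % FIFO_WORDS;
    }

    /* Serialization event: a single MMIO doorbell write traps to the
     * device model, which then drains everything batched so far. */
    static void fifo_flush(volatile uint32_t *doorbell)
    {
        *doorbell = 1;
    }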
So I guess the problem lies in the host/guest shared memory regions being mapped uncacheable from the virtual CPU's perspective. Now that virtual CPUs execute entirely in hardware context with the x86 hardware virtualization extensions, the cacheability of memory types severely impacts guest performance. WHPX does not handle it for either Intel EPT or AMD NPT, while KVM seems to get it right for Intel EPT.

I don't have the correct fix for QEMU. What I can do for my 3D API pass-through device models is implement host-side hooks that reprogram MSR_IA32_PAT when the 3D APIs are activated and restore it when they are deactivated. Perhaps a better solution would be proper kernel drivers for the virtual interfaces that manage the memory types of the host/guest shared memory in kernel space, but given what that would take, plus the need for Microsoft tools/DDKs, I will just forget it. The guest stubs use the same kernel drivers included in the 3Dfx drivers for memory mapping, and the virtual interfaces remain driver-less from the Windows OS perspective.

Considering the current state of halting progress on QEMU's native virgil3D support for Windows OS, I am just being pragmatic. I understand that QEMU virgil3D will eventually bring 3D acceleration to Windows guests, but I do not expect anything to support the legacy 32-bit Windows OSes, which have outgrown their commercial usefulness.

Regards,
KJ Liew