x86-64 MTTCG Does not update page table entries atomically

It seems like the qemu tcg code for x86-64 doesn't write the access and dirty flags of the page table entries atomically. Instead, it first reads the entry, checks whether it needs to set the flags, and then overwrites the entry. So if you have two threads running at the same time, one accessing the virtual address over and over again, and the other modifying the page table entry, it is possible that after the second thread modifies the page table entry, qemu overwrites the value with the old page table entry value, with the access/dirty flags set.

Here's a unit test that reproduces this behavior:

https://github.com/mvanotti/kvm-unit-tests/commit/09f9722807271226a714b04f25174776454b19cd

You can run it with:

```
/usr/bin/qemu-system-x86_64 --no-reboot -nodefaults \
  -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 \
  -vnc none -serial stdio -device pci-testdev \
  -smp 4 -machine q35 --accel tcg,thread=multi \
  -kernel x86/mmu-race.flat # -initrd /tmp/tmp.avvPpezMFf
```

Expected output (failure):

```
kvm-unit-tests$ make && /usr/bin/qemu-system-x86_64 --no-reboot -nodefaults -device pc-testdev -device isa-debug-exit,iobase=0xf4,iosize=0x4 -vnc none -serial stdio -device pci-testdev -smp 4 -machine q35 --accel tcg,thread=multi -kernel x86/mmu-race.flat # -initrd /tmp/tmp.avvPpezMFf
enabling apic
enabling apic
enabling apic
enabling apic
paging enabled
cr0 = 80010011
cr3 = 627000
cr4 = 20
found 4 cpus
PASS: Need more than 1 CPU
Detected overwritten PTE:
        want: 0x000000000062e007
        got: 0x000000000062d027
FAIL: PTE not overwritten
PASS: All Reads were zero
SUMMARY: 3 tests, 1 unexpected failures
```

This bug allows user-to-root privilege escalation inside the guest VM: if the user is able to overwrite an
entry that belongs to a second-to-last level page table, and is able to allocate the referenced page, then the user would be in control of a last-level page table, being able to map any memory they want. This is not uncommon in situations where memory is being decommitted.

Yeah, it's a long-standing API deficiency inside QEMU that we don't have a way to do atomic modifications in things like page-table-walk code: mostly you don't notice unless you go looking for it, but we really ought to fix this. Thanks for the unit test.

Not strictly i386 specific: any arch that wants to do read-modify-update to its page tables runs into this. There are some not-yet-implemented Arm architecture extensions that require this, and likely other archs too.

We only tested it on x86-64 and aarch64, but we couldn't repro on arm. It is possible that this affects other platforms as well, but note that this is specifically mentioned in the qemu wiki as one of the cases that should be covered when porting mttcg to a new platform: https://wiki.qemu.org/Features/tcg-multithread

BTW, the RISC-V MMU code _does_ get this right, and the x86 version could follow its model. Something like https://github.com/vsrinivas/qemu/commit/1efa7dc689c4572d8fe0880ddbe44ec22f8f4348 (but with more compiling + working) might solve this problem and more closely model h/w.

On Tue, 2 Feb 2021 at 05:07, Venkatesh Srinivas