static linux-user ARM emulation has several-second startup time

My problem: I'm a Parabola packager, and I'm updating our qemu-user-static package from 2.8 to 2.11. With my new statically-linked 2.11, running `qemu-arm /my/arm-chroot/bin/true` went from taking 0.006s to 3s! This does not happen with the normal dynamically linked 2.11, or the old static 2.8.

What happens is it gets stuck in `linux-user/elfload.c:init_guest_space()`. What `init_guest_space` does is map 2 parts of the address space: `[base, base+guest_size]` and `[base+0xffff0000, base+0xffff0000+page_size]`; where it must find an acceptable `base`. Its strategy is to `mmap(NULL, guest_size, ...)` to decide where the first range is, and then check if that +0xffff0000 is also available. If it isn't, then it starts trying `mmap(base, ...)` for the entire address space from low-address to high-address.

"Normally," it finds an acceptable `base` within the first 2 tries. With a static 2.11, it's taking thousands of tries.

----

Now, from my understanding, there are 2 factors working together to cause that in static 2.11 but not the other builds:

- 2.11 increased the default `guest_size` from 0xf7000000 to 0xffff0000
- PIE (and thus ASLR) is disabled for static builds

For some reason that I don't understand, with the smaller `guest_size` the initial `mmap(NULL, guest_size, ...)` usually returns an acceptable address range; but the larger `guest_size` makes it consistently return a block of memory that butts right up against another already mapped chunk of memory. This isn't just true on the older builds, it's true with the 2.11 builds if I use the `-R` flag to shrink the `guest_size` back down to 0xf7000000. That is with linux-hardened 4.13.13 on x86-64.
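To make the search described above concrete, here is a toy model of it in Python. This is only a sketch, not QEMU's code: the address space is modelled as a list of already-mapped `(start, end)` intervals instead of real `mmap` calls, and the names `is_free`, `base_ok` and `linear_search`, the 4 KiB page size and the 47-bit address-space limit are all assumptions made for illustration. Because a real `mmap` hint behaves differently from this model, it is not expected to reproduce the exact try counts from the report.

```python
# Toy model of the two-range search; NOT QEMU's implementation.
COMMPAGE_OFFSET = 0xFFFF0000  # offset of the second range, taken from the report
PAGE_SIZE = 0x1000            # assumed host page size


def is_free(start, size, mapped):
    """True if [start, start+size) overlaps none of the mapped (start, end) intervals."""
    end = start + size
    return all(end <= m_start or start >= m_end for m_start, m_end in mapped)


def base_ok(base, guest_size, mapped):
    """Both ranges named in the report must be unmapped for `base` to be usable."""
    return (is_free(base, guest_size, mapped)
            and is_free(base + COMMPAGE_OFFSET, PAGE_SIZE, mapped))


def linear_search(mapped, guest_size, step=PAGE_SIZE, limit=1 << 47):
    """Try base = step, 2*step, ... from low to high addresses.

    Returns (base, tries) for the first usable base, or (None, tries)
    if nothing fits below `limit` (an assumed address-space size).
    """
    tries = 0
    base = step
    while base + COMMPAGE_OFFSET + PAGE_SIZE <= limit:
        tries += 1
        if base_ok(base, guest_size, mapped):
            return base, tries
        base += step
    return None, tries
```

Calling `linear_search([(0x60000000, 0x60400000)], guest_size=0xFFFF0000)` with a hypothetical text segment at that interval shows the shape of the problem: every candidate below the end of the text segment is rejected, because a block of that size starting there would overlap it.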
So then, it falls back to crawling the entire address space; so it tries base=0x00001000. With ASLR, that probably succeeds. But with ASLR being disabled on static builds, the text segment is at 0x60000000, which does not leave room for the needed 0xffff1000-size block before it. So then it tries base=0x00002000. And so on, more than 6000 times until it finally gets to and passes the text segment; calling mmap more than 12000 times.

----

I'm not sure what the fix is. Perhaps try to mmap a contiguous chunk of size 0xffff1000, then munmap it and then mmap the 2 chunks that we actually need. The disadvantage to that is that it does not support the sparse address space that the current algorithm supports for `guest_size < 0xffff0000`. If `guest_size < 0xffff0000` *and* the big mmap fails, then it could fall back to a sparse search; though I'm not sure the current algorithm is a good choice for it, as we see in this bug. Perhaps it should inspect /proc/self/maps to try to find a suitable range before ever calling mmap?

Actually, it seems that the `[base+0xffff0000, base+0xffff0000+page_size]` segment is only mapped on 32-bit ARM. So this is 32-bit ARM-specific.

To have a link to it from here, on the 28th I submitted a patchset to fix this: https://lists.nongnu.org/archive/html/qemu-devel/2017-12/msg05237.html

From Alistair Buxton (a-j-buxton) on bug 1756807: I just tested the patch from https://bugs.launchpad.net/qemu/+bug/1740219 and it fixes the problem for me. Specifically I only tried the final patch of the series.

I duped the bugs onto this one since it is older and has a suggested patch on the ML. Added a qemu(Ubuntu) task to further track this, keeping it incomplete there until this is resolved upstream.

Everything except for the final patch (which has the actual fix) is now applied on the master branch.

This is now fixed on master, as of 3be2e41b3323169852dca11ffe6ff772c33e5aaa.

The sha above is the merge, thanks Luke.
The actual change by you is
commit 2a53535af471f4bee9d6cb5b363746b8d5ed21dd
Author: Luke Shumaker
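For reference, the /proc/self/maps idea floated in the original report can be sketched roughly as follows. This is an illustration only and makes no claim about what the merged patches do: `parse_maps` and `find_gap` are hypothetical helper names, and the `floor` and `ceiling` defaults (a minimum usable address and a 47-bit user address space on x86-64) are assumptions. The sketch looks for the lowest hole large enough for the single contiguous 0xffff1000-byte block mentioned in the report.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps text into a sorted list of (start, end) tuples."""
    regions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        addr_range = line.split()[0]  # first field looks like "60000000-60400000"
        start, end = (int(x, 16) for x in addr_range.split("-"))
        regions.append((start, end))
    return sorted(regions)


def find_gap(regions, size, floor=0x10000, ceiling=1 << 47):
    """Return the lowest address >= floor where `size` bytes fit, else None.

    `floor` and `ceiling` are assumed bounds, not values taken from QEMU.
    """
    candidate = floor
    for start, end in regions:
        # The hole before this region ends at its start, capped at the ceiling.
        if min(start, ceiling) - candidate >= size:
            return candidate
        candidate = max(candidate, end)
    # No hole between regions was big enough; check the space after the last one.
    return candidate if ceiling - candidate >= size else None
```

On Linux this could be fed with `parse_maps(open("/proc/self/maps").read())` and then `find_gap(regions, 0xFFFF1000)`. Any address found this way is only a hint: another mapping could appear between reading the file and calling mmap, so the result of the mmap call would still need to be checked.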