Error in user-mode calculation of ELF aux vector's AT_PHDR I have an (admittedly strange) statically-linked ELF binary for Linux that runs just fine on top of the Linux kernel in QEMU full-system emulation, but crashes before main in user-mode emulation. Specifically, it crashes when initializing thread-local storage in glibc's _dl_aux_init, because it reads out a strange value from the AT_PHDR entry of the ELF aux vector. The binary has these program headers: Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align EXIDX 0x065874 0x00075874 0x00075874 0x00570 0x00570 R 0x4 PHDR 0x0a3000 0x00900000 0x00900000 0x00160 0x00160 R 0x1000 LOAD 0x0a3000 0x00900000 0x00900000 0x00160 0x00160 R 0x1000 LOAD 0x000000 0x00010000 0x00010000 0x65de8 0x65de8 R E 0x10000 LOAD 0x066b7c 0x00086b7c 0x00086b7c 0x02384 0x02384 RW 0x10000 NOTE 0x000114 0x00010114 0x00010114 0x00044 0x00044 R 0x4 TLS 0x066b7c 0x00086b7c 0x00086b7c 0x00010 0x00030 R 0x4 GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x8 GNU_RELRO 0x066b7c 0x00086b7c 0x00086b7c 0x00484 0x00484 R 0x1 LOAD 0x07e000 0x00089000 0x00089000 0x03f44 0x03f44 R E 0x1000 LOAD 0x098000 0x00030000 0x00030000 0x01000 0x01000 RW 0x1000 If I build the Linux kernel with the following patch to the very end of create_elf_tables in fs/binfmt_elf.c /* Put the elf_info on the stack in the right place. */ elf_addr_t *my_auxv = (elf_addr_t *) mm->saved_auxv; int i; for (i = 0; i < 15; i++) { printk("0x%x = 0x%x", my_auxv[2*i], my_auxv[(2*i)+ 1]); } if (copy_to_user(sp, mm->saved_auxv, ei_index * sizeof(elf_addr_t))) return -EFAULT; return 0; and run it like this: qemu-system-arm \ -M versatilepb \ -nographic \ -dtb ./dts/versatile-pb.dtb \ -kernel zImage \ -M versatilepb \ -m 128M \ -append "earlyprintk=vga,keep" \ -initrd initramfs after I've built the kernel initramfs like this (where "init" is the binary in question): make ARCH=arm versatile_defconfig make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- all -j10 cp "$1" arch/arm/boot/init cd arch/arm/boot echo init | cpio -o --format=newc > initramfs then I get the following output. This is the kernel's view of the aux vector for this binary: 0x10 = 0x1d7 0x6 = 0x1000 0x11 = 0x64 0x3 = 0x900000 0x4 = 0x20 0x5 = 0xb 0x7 = 0x0 0x8 = 0x0 0x9 = 0x101b8 0xb = 0x0 0xc = 0x0 0xd = 0x0 0xe = 0x0 0x17 = 0x0 0x19 = 0xbec62fb5 However, if I run "qemu-arm -g 12345 binary" and use GDB to peek at the aux vector at the beginning of __libc_start_init (for example, using this Python GDB API script: https://gist.github.com/langston-barrett/5573d64ae0c9953e2fa0fe26847a5e1e), then I see the following values: AT_PHDR = 0xae000 AT_PHENT = 0x20 AT_PHNUM = 0xb AT_PAGESZ = 0x1000 AT_BASE = 0x0 AT_FLAGS = 0x0 AT_ENTRY = 0x10230 AT_UID = 0x3e9 AT_EUID = 0x3e9 AT_GID = 0x3e9 AT_EGID = 0x3e9 AT_HWCAP = 0x1fb8d7 AT_CLKTCK = 0x64 AT_RANDOM = -0x103c0 AT_HWCAP2 = 0x1f AT_NULL = 0x0 The crucial difference is in AT_PHDR (0x3), which is indeed the virtual address of the PHDR segment when the kernel calculates it, but is not when QEMU calculates it. qemu-arm --version qemu-arm version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.26) I just confirmed that this is still a problem on git tag v5.0.0, where I applied the following: diff --git a/linux-user/elfload.c b/linux-user/elfload.c index 619c054cc4..093656d059 100644 --- a/linux-user/elfload.c +++ b/linux-user/elfload.c @@ -2016,6 +2016,7 @@ static abi_ulong create_elf_tables(abi_ulong p, int argc, int envc, /* There must be exactly DLINFO_ITEMS entries here, or the assert * on info->auxv_len will trigger. */ + printf("PHDR: %x\n", (abi_ulong)(info->load_addr + exec->e_phoff)); NEW_AUX_ENT(AT_PHDR, (abi_ulong)(info->load_addr + exec->e_phoff)); NEW_AUX_ENT(AT_PHENT, (abi_ulong)(sizeof (struct elf_phdr))); NEW_AUX_ENT(AT_PHNUM, (abi_ulong)(exec->e_phnum)); and saw: PHDR: ae000 Taking a peek at how Linux and QEMU calculate AT_PHDR for static binaries reveals the following. Both involve the program headers' offset (e_phoff) added to a value I'll call load_addr (as in the kernel). In the kernel, load_addr is elf_ppnt->p_vaddr - elf_ppnt->p_offset where elf_ppnt is the program header entry of the first segment with type LOAD: https://github.com/torvalds/linux/blob/242b23319809e05170b3cc0d44d3b4bd202bb073/fs/binfmt_elf.c#L1120 In QEMU, load_addr is set to an earlier value loaddr, which is set to min_i(phdr[i].p_vaddr - phdr[i].p_offset) where min_i is the minimum over indices "i" of LOAD segments. https://github.com/qemu/qemu/blob/9e7f1469b9994d910fc1b185c657778bde51639c/linux-user/elfload.c#L2407. If you perform this calculation by hand for the program headers posted at the beginning of this thread, you'll get ae000, as expected. The problem here is that QEMU takes a minimum where Linux just takes the first value. Presumably, changing QEMU's behavior to match that of the kernel wouldn't break anything that wouldn't be broken if it really ran on Linux. Unfortunately, Linux's ELF loader is much more picky than the ELF standard, but that's a whole other story... @langston0 Thanks for detailed explanation, got the same problem for qemu-s390. The way to reproduce (linux kernel >= 4.8, for example: Ubuntu 18.04): # Register qemu binfmt_misc handlers $ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes $ cat Dockerfile.s390x FROM s390x/ubuntu RUN apt-get update && \ apt-get install -y \ gcc make libpcre3-dev libreadline-dev RUN cd /home && git clone https://github.com/nginx/njs RUN cd /home/njs && ./configure --cc-opt='-O0 -static -lm -lrt -pthread -Wl,--whole-archive -lpthread -ltinfo -Wl,--no-whole-archive' && make njs $ docker build -t njs/390x -f Dockerfile.s390x . # check the binary (WORKS!) # inside docker s390 binaries are executed using qemu-s390-static from the host $ docker run -t njs/390x /home/njs/build/njs -c 'console.log("hello")' hello # copy binary to host $ docker run -v `pwd`:/m -ti njs/390x cp /home/njs/build/njs /m/njs-s390 # deregister binfmt handler $ sudo bash -c "echo -1 > /proc/sys/fs/binfmt_misc/qemu-s390x" # run qemu gdb $ qemu-s390x -g 12345 ./njs-s390 # in a separate terminal $ gdb-multiarch ./njs-s390 -ex 'target remote localhost:12345' 0x0000000001000520 in _start () (gdb) si 0x0000000001000524 in _start () (gdb) si 0x000000000100052a in _start () (gdb) c Continuing. Program received signal SIGILL, Illegal instruction. 0x00000000011a418c in _dl_aux_init () (gdb) bt #0 0x00000000011a418c in _dl_aux_init () #1 0x00000000011663f0 in __libc_start_main () #2 0x0000000001000564 in _start () qemu-s390x --version qemu-s390x version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.28) BTW, before "sudo bash -c "echo -1 > /proc/sys/fs/binfmt_misc/qemu-s390x" njs-s390 also works on the host: $ ./njs-s390 -c 'console.log("hello")' hello $ file njs-s390 njs-s390: ELF 64-bit MSB executable, IBM S/390, version 1 (GNU/Linux), statically linked, BuildID[sha1]=e37618578fb0a8c60f426826167a800e4f314ef3, for GNU/Linux 3.2.0, with debug_info, not stripped > runs just fine on top of the Linux kernel in QEMU full-system emulation, but crashes before main in user-mode emulation So it seems system vs user-mode is not the issue here, probably it is related to gdb mode in user-mode qemu. @Dimitry To confirm that this is really the same issue (and not an unrelated crash in the same function), could you post: 1. the ELF headers ("readelf -h"), 2. the program headers ("readelf -l"), and 3. the output (the AUX VECTOR section) from this GDB script (suitably modified for your program), when connecting to QEMU's GDB server? https://gist.github.com/langston-barrett/5573d64ae0c9953e2fa0fe26847a5e1e @Langston will do tomorrow. s390x ABI requires heavy changes to the python script. When I switch to armv7 the issue goes away $ cat Dockerfile.armv7 FROM arm32v7/ubuntu RUN apt-get update && \ apt-get install -y \ gcc make libpcre3-dev libreadline-dev git RUN cd /home && git clone https://github.com/nginx/njs RUN cd /home/njs && ./configure --cc-opt='-O0 -static -lm -lrt -pthread -Wl,--whole-archive -lpthread -ltinfo -Wl,--no-whole-archive' && make njs $ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes $ docker build -t njs/armv7 -f Dockerfile.armv7 . $ docker run -v `pwd`:/m -ti njs/armv7 cp /home/njs/build/njs /m/njs-armv7 $ readelf -l ./njs-armv7 Elf file type is EXEC (Executable file) Entry point 0x12fb9 There are 7 program headers, starting at offset 52 Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align EXIDX 0x1be338 0x001ce338 0x001ce338 0x009b8 0x009b8 R 0x4 LOAD 0x000000 0x00010000 0x00010000 0x1becf4 0x1becf4 R E 0x10000 LOAD 0x1bedfc 0x001dedfc 0x001dedfc 0x17674 0x1c2cc RW 0x10000 NOTE 0x000114 0x00010114 0x00010114 0x00044 0x00044 R 0x4 TLS 0x1bedfc 0x001dedfc 0x001dedfc 0x00038 0x00060 R 0x4 GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x10 GNU_RELRO 0x1bedfc 0x001dedfc 0x001dedfc 0x0e204 0x0e204 R 0x1 Section to Segment mapping: Segment Sections... 00 .ARM.exidx 01 .note.ABI-tag .note.gnu.build-id .rel.dyn .init .iplt .text __libc_freeres_fn __libc_thread_freeres_fn .fini .rodata .stapsdt.base __libc_subfreeres __libc_IO_vtables __libc_atexit __libc_thread_subfreeres .ARM.extab .ARM.exidx .eh_frame 02 .tdata .init_array .fini_array .data.rel.ro .got .data .bss __libc_freeres_ptrs 03 .note.ABI-tag .note.gnu.build-id 04 .tdata .tbss 05 06 .tdata .init_array .fini_array .data.rel.ro $ readelf -h ./njs-armv7 ELF Header: Magic: 7f 45 4c 46 01 01 01 03 00 00 00 00 00 00 00 00 Class: ELF32 Data: 2's complement, little endian Version: 1 (current) OS/ABI: UNIX - GNU ABI Version: 0 Type: EXEC (Executable file) Machine: ARM Version: 0x1 Entry point address: 0x12fb9 Start of program headers: 52 (bytes into file) Start of section headers: 5696248 (bytes into file) Flags: 0x5000400, Version5 EABI, hard-float ABI Size of this header: 52 (bytes) Size of program headers: 32 (bytes) Number of program headers: 7 Size of section headers: 40 (bytes) Number of section headers: 42 Section header string table index: 41 $ qemu-arm -g 12345 ./njs-armv7 -c 'console.log("HH")' $ gdb-multiarch ./njs-armv7 -ex 'source showstack.py' ARGUMENTS --------- argc = 3 arg 0 = ./njs-armv7 arg 1 = -c arg 2 = console.log("HH") ... AUX VECTOR ---------- AT_PHDR = 10034 AT_PHENT = 20 AT_PHNUM = 7 AT_PAGESZ = 1000 AT_BASE = 0 AT_FLAGS = 0 AT_ENTRY = 12fb9 AT_UID = 3e9 AT_EUID = 3e9 AT_GID = 3e9 AT_EGID = 3e9 AT_HWCAP = 1fb8d7 AT_CLKTCK = 64 AT_RANDOM = -104a0 AT_HWCAP2 = 1f AT_NULL = 0 $ qemu-arm --version qemu-arm version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.28) Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers Built the latest QEMU, the issue goes away $ bin/debug/native/s390x-linux-user/qemu-s390x --version qemu-s390x version 5.0.50 (v5.0.0-2358-g6c87d9f311-dirty) Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers $ bin/debug/native/s390x-linux-user/qemu-s390x ../njs/njs-s390 -c 'console.log("HI")' HI So my issue seems unrelated, sorry for bothering. The QEMU project is currently moving its bug tracking to another system. For this we need to know which bugs are still valid and which could be closed already. Thus we are setting the bug state to "Incomplete" now. If the bug has already been fixed in the latest upstream version of QEMU, then please close this ticket as "Fix released". If it is not fixed yet and you think that this bug report here is still valid, then you have two options: 1) If you already have an account on gitlab.com, please open a new ticket for this problem in our new tracker here: https://gitlab.com/qemu-project/qemu/-/issues and then close this ticket here on Launchpad (or let it expire auto- matically after 60 days). Please mention the URL of this bug ticket on Launchpad in the new ticket on GitLab. 2) If you don't have an account on gitlab.com and don't intend to get one, but still would like to keep this ticket opened, then please switch the state back to "New" within the next 60 days (otherwise it will get closed as "Expired"). We will then eventually migrate the ticket auto- matically to the new system (but you won't be the reporter of the bug in the new system and thus won't get notified on changes anymore). Thank you and sorry for the inconvenience. This is an automated cleanup. This bug report has been moved to QEMU's new bug tracker on gitlab.com and thus gets marked as 'expired' now. Please continue with the discussion here: https://gitlab.com/qemu-project/qemu/-/issues/275