Qemu-ppc Memory leak creating threads

When creating C++ threads (with C++ std::thread), the resulting binary has memory leaks when running with qemu-ppc. E.g. the following C++ program, when compiled with gcc, consumes more and more memory while running under qemu-ppc. (It does not leak when compiled for Intel; when running the same binary on real PowerPC CPU hardware there are also no memory leaks.)

(Note: I used the function getCurrentRSS to show memory usage, see https://stackoverflow.com/questions/669438/how-to-get-memory-usage-at-runtime-using-c; the calls are commented out here.)

Compiler: powerpc-linux-gnu-g++ (Debian 8.3.0-2) 8.3.0 (but same problem with older g++ compilers, even 4.9)
OS: Debian 10.0 (Buster) (but same problem seen on Debian 9/Stretch)
qemu: qemu-ppc version 3.1.50

---
#include <thread>
#include <chrono>
#include <iostream>

using namespace std::chrono_literals;

// Create/run and join 100 threads.
void Fun100() {
    // auto b4 = getCurrentRSS();
    // std::cout << getCurrentRSS() << std::endl;
    for (int n = 0; n < 100; n++) {
        std::thread t([] { std::this_thread::sleep_for( 10ms ); });
        // std::cout << n << ' ' << getCurrentRSS() << std::endl;
        t.join();
    }
    std::this_thread::sleep_for( 500ms ); // to give the OS some time to wipe memory...
    // auto after = getCurrentRSS();
    // std::cout << b4 << ' ' << after << std::endl;
}

int main(int, char **) {
    Fun100();
    Fun100(); // memory used keeps increasing
}
---

Forgive my ignorance of the C++ threading semantics, but when do these threads end? Inspection shows we do clean up CPU and thread structures on exit. That said, we do have a comment in linux-user that says:

    /* TODO: Free new CPU state if thread creation failed. */

So I wonder if thread creation is actually failing and that is where we start leaking?

The thread creation is not failing.
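(Aside on the measurement helper used in the test program: a minimal getCurrentRSS() along the lines of the Stack Overflow answer linked above might look like the following. This is a Linux-only sketch reading /proc/self/statm; the referenced answer covers other platforms.)

```cpp
// Hypothetical getCurrentRSS(): resident set size in bytes, 0 on failure.
// Linux-only sketch; reads the resident-pages field of /proc/self/statm.
#include <cstddef>
#include <cstdio>
#include <unistd.h>

size_t getCurrentRSS() {
    long rss_pages = 0;
    std::FILE *fp = std::fopen("/proc/self/statm", "r");
    if (!fp)
        return 0;
    // statm fields: size resident shared text lib data dt (all in pages);
    // skip the first field and read the second (resident pages).
    if (std::fscanf(fp, "%*s %ld", &rss_pages) != 1)
        rss_pages = 0;
    std::fclose(fp);
    return static_cast<size_t>(rss_pages) * static_cast<size_t>(sysconf(_SC_PAGESIZE));
}
```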
The thread is just running the function with the line 'std::this_thread::sleep_for( 10ms );', thus waiting for 10ms. Once finished, the thread function ends, which should also end and clean up the thread. (When putting some std::cout console output before the sleep, it does show up.) The main thread waits for that in the join() call.

By running:

valgrind --leak-check=yes ./qemu-ppc tests/testthread

I can replicate a leak compared to qemu-arm with the same test....

==25789==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==25789==    by 0x4D7F8D0: g_malloc (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5800.3)
==25789==    by 0x1FC65D: create_new_table (translate_init.inc.c:9252)
==25789==    by 0x1FC65D: register_ind_in_table (translate_init.inc.c:9291)
==25789==    by 0x1FC971: register_ind_insn (translate_init.inc.c:9325)
==25789==    by 0x1FC971: register_insn (translate_init.inc.c:9390)
==25789==    by 0x1FC971: create_ppc_opcodes (translate_init.inc.c:9450)
==25789==    by 0x1FC971: ppc_cpu_realize (translate_init.inc.c:9819)
==25789==    by 0x277263: device_set_realized (qdev.c:834)
==25789==    by 0x27BBC6: property_set_bool (object.c:2076)
==25789==    by 0x28019E: object_property_set_qobject (qom-qobject.c:26)
==25789==    by 0x27DAF4: object_property_set_bool (object.c:1334)
==25789==    by 0x27AE4B: cpu_create (cpu.c:62)
==25789==    by 0x1C89B8: cpu_copy (main.c:188)
==25789==    by 0x1CA44F: do_fork (syscall.c:5604)
==25789==    by 0x1D665A: do_syscall1.isra.43 (syscall.c:9160)
==25789==
==25789== 6,656 bytes in 26 blocks are possibly lost in loss record 216 of 238
==25789==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==25789==    by 0x4D7F8D0: g_malloc (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5800.3)
==25789==    by 0x1FC65D: create_new_table (translate_init.inc.c:9252)
==25789==    by 0x1FC65D: register_ind_in_table (translate_init.inc.c:9291)
==25789==    by 0x1FC9BA: register_dblind_insn (translate_init.inc.c:9337)
==25789==    by 0x1FC9BA: register_insn (translate_init.inc.c:9384)
==25789==    by 0x1FC9BA: create_ppc_opcodes (translate_init.inc.c:9450)
==25789==    by 0x1FC9BA: ppc_cpu_realize (translate_init.inc.c:9819)
==25789==    by 0x277263: device_set_realized (qdev.c:834)
==25789==    by 0x27BBC6: property_set_bool (object.c:2076)
==25789==    by 0x28019E: object_property_set_qobject (qom-qobject.c:26)
==25789==    by 0x27DAF4: object_property_set_bool (object.c:1334)
==25789==    by 0x27AE4B: cpu_create (cpu.c:62)
==25789==    by 0x17304D: main (main.c:681)
==25789==
==25789== 10,752 (1,024 direct, 9,728 indirect) bytes in 4 blocks are definitely lost in loss record 223 of 238
==25789==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==25789==    by 0x4D7F8D0: g_malloc (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5800.3)
==25789==    by 0x1FC65D: create_new_table (translate_init.inc.c:9252)
==25789==    by 0x1FC65D: register_ind_in_table (translate_init.inc.c:9291)
==25789==    by 0x1FC998: register_dblind_insn (translate_init.inc.c:9332)
==25789==    by 0x1FC998: register_insn (translate_init.inc.c:9384)
==25789==    by 0x1FC998: create_ppc_opcodes (translate_init.inc.c:9450)
==25789==    by 0x1FC998: ppc_cpu_realize (translate_init.inc.c:9819)
==25789==    by 0x277263: device_set_realized (qdev.c:834)
==25789==    by 0x27BBC6: property_set_bool (object.c:2076)
==25789==    by 0x28019E: object_property_set_qobject (qom-qobject.c:26)
==25789==    by 0x27DAF4: object_property_set_bool (object.c:1334)
==25789==    by 0x27AE4B: cpu_create (cpu.c:62)
==25789==    by 0x1C89B8: cpu_copy (main.c:188)
==25789==    by 0x1CA44F: do_fork (syscall.c:5604)
==25789==    by 0x1D665A: do_syscall1.isra.43 (syscall.c:9160)

So something funky happens in the PPC translator for each new thread....

Could you try an experiment and put a final 30 second sleep before the program exits? I suspect the RCU cleanup of the per-thread data never gets a chance to run.

Nope, we think we have identified the leak. On CPU realize (ppc_cpu_realize) the translator sets up its tables (create_ppc_opcodes).
This will happen for each thread created. That would be fine, but the linux-user cpu_copy function then does:

    memcpy(new_env, env, sizeof(CPUArchState));

which blindly overwrites the tables in CPUArchState (CPUPPCState), causing the leak. The suggestion is that this data should be moved into PowerPCCPU (as it is internal to the translator) so it avoids being smashed by the memcpy. Longer term, however, we should replace the memcpy with an arch-aware smart copy.

The opcode decode tables aren't really part of CPUPPCState but an internal implementation detail of the translator. This causes problems with the memcpy in cpu_copy, as any table created during ppc_cpu_realize gets written over, causing a memory leak. To avoid this, move the tables into PowerPCCPU, which is better suited to hold internal implementation details.

Attempts to fix: https://bugs.launchpad.net/qemu/+bug/1836558

Cc: