Question/Request: About Box64 x87 reduced precison info and comparison vs new "rosettax87" project hack? Hi, very new to Box32/64 and finding that Box64 supports some kinds of x87 reduced precision.. from docs: https://github.com/ptitSeb/box64/blob/main/docs/USAGE.md https://github.com/ptitSeb/box64/blob/main/docs/box64.pod we have: BOX64_X87_NO80BITS BOX64_DYNAREC_X87DOUBLE I see with BOX64_DYNAREC_X87DOUBLE that you even allow/default to single precision using this option.. 0: Try to use float when possible for x87 emulation. [Default] 1: Only use Double for x87 emulation. 2: Check Precision Control low precision on x87 emulation. so question is if you can share what perf can we gain vs non reduced x87 precision on targeted microtests? I say this because for Rosetta since recently we have project: https://github.com/Lifeisawful/rosettax87 which acceleretes x87 computation in some cases 10X at least on M4: https://github.com/Lifeisawful/rosettax87/issues/2 at least on the simple sample benchmark shared on this project (using x87 for calculating fsqrt seems): clang -v -arch x86_64 -mno-sse -mfpmath=387 ./sample/math.c -o ./build/math at least Rosetta M4: Average time: 123040 ticks Rosettax87 M4: Average time: 12222 ticks part of code: ``` #define TIMES 1000000 #define RUNS 10 #define METHOD run_fsqrt clock_t run_fsqrt() { float sixteen = 16.0f; clock_t start = clock(); // Run fsqrt many times to get measurable time float four; for(int i = 0; i < TIMES; i++) { four = __builtin_sqrtf(sixteen); } clock_t end = clock(); clock_t time_spent = (end - start); printf("Result: %x\n", *(uint32_t*)&four); return time_spent; } int main() { clock_t times[RUNS]; clock_t sum = 0; printf("benchmark %s\n", STRINGIFY(METHOD)); // Perform multiple runs for(int i = 0; i < RUNS; i++) { times[i] = METHOD(); sum += times[i]; printf("Run %d time: %lu ticks\n", i+1, times[i]); } // Calculate average using integer math clock_t avg = sum / RUNS; printf("\nAverage time: %lu ticks\n", avg); return 0; } ```