summary refs log tree commit diff stats
path: root/results/scraper/box64/522
blob: 7660f769427af6a259d60a6903dc23419f4a7276 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
getAlignedMutexWithInit is extremely slow
On a workload I'm testing, Box64 (with #521) is slow. Profiling, an inordinate amount of time is spent within getAlignedMutexWithInit, seemingly due to the application constantly creating and destroying pthread mutexes. That's a silly thing for the application to do, of course, but if I had the application source code to fix, I would just recompile for arm64 :wink: 

The following patch massively improves performance -- 10x fps in some cases -- taking the application from slideshow to a smooth 60fps..

```
 pthread_mutex_t* getAlignedMutex(pthread_mutex_t* m)
 {
-       return getAlignedMutexWithInit(m, 1);
+      return m;
```

Of course, this patch is probably wrong, since now we're potentially doing operations on unaligned mutexes, which I assume could blow up at any moment. Though it does seem to work fine for this application, somehow.

I see potentially two issues here:

1. box64 might be  wrapping mutexes to ensure ABI compatibility in places where it might not actually be necessary (else, I would expect the above patch to blow up spectacularly, maybe I'm just getting lucky)
2. box64's wrapped mutex code itself is inefficient. I don't really understand why the code is doing what it is, so I can't suggest good fixes. I see that NewMutex open-codes an arena allocator based on a global unrolled linked list, I don't know why it does that, if that's supposed to be an optimization over a plain calloc or whether there's some subtle correctness issue here. If this really is the way to go, then `mutexes` should be in **thread-local storage** to eliminate the `mutex_mutexes` locking, which is the heaviest weight operation here, and `taken` should be replaced with a **bitset** to allocate mutexes within the block in constant-time instead of the linear-time search. Neither of these changes should be difficult but I don't want to attempt them before I understand the bigger picture of what this file is for.

---

Bug report aside, it was pretty magical to see this old application come to life on my M1 Linux, with full GPU acceleration and everything. If this is where we are now, I can't wait to see where we'll be once we have 4K pages + TSO sorted in the kernel :smile: