Using STNT1B to implement MOVNTDQ
As the title says. ASIMD only has STNP, which doesn't match the semantics of MOVNTDQ, so this uses SVE's STNT1B instead.
Execution latency for this instruction is not great on A715, though (it gets a bit better on Neoverse-V2).

Found in a hot loop in d3dcore.dll under Proton that does a non-temporal memcpy and consumes about 1.5% of CPU time.
``` 
1caf3a020  movdqa  xmm4, xmmword [r10+rax]
1caf3a026  movntdq xmmword [rcx+rax], xmm4
1caf3a02b  add     rax, 0x10
1caf3a02f  cmp     rdx, rax
1caf3a032  ja      0x1caf3a020
```
rdx is 0x1500 in the hot loop that I found, i.e. 5376 bytes copied in 336 iterations of 16 bytes each.
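For reference, the x86 loop above corresponds roughly to the following SSE2 intrinsics sketch. This is only an illustration of the access pattern, not code taken from d3dcore.dll: the names `nt_copy` and `nt_copy_selfcheck` are hypothetical, the trailing `_mm_sfence` is an assumption about what a well-behaved caller would do (the disassembled loop does not show one), and the buffer contents are made up. `_mm_load_si128` maps to `movdqa` and `_mm_stream_si128` maps to `movntdq`.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies len bytes in 16-byte steps using
 * non-temporal stores. Both pointers must be 16-byte aligned and
 * len must be a multiple of 16. */
void nt_copy(void *dst, const void *src, size_t len) {
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert(len % 16 == 0);

    for (size_t off = 0; off < len; off += 16) {
        /* movdqa xmm, [src + off] */
        __m128i v = _mm_load_si128((const __m128i *)((const char *)src + off));
        /* movntdq [dst + off], xmm */
        _mm_stream_si128((__m128i *)((char *)dst + off), v);
    }

    /* Assumption: order the weakly ordered streaming stores before
     * any later stores. Not visible in the original loop. */
    _mm_sfence();
}

/* Copies 0x1500 bytes (the rdx value from the report) between two
 * aligned buffers and returns 1 if the destination matches. */
int nt_copy_selfcheck(void) {
    enum { LEN = 0x1500 };
    static _Alignas(16) unsigned char src[LEN];
    static _Alignas(16) unsigned char dst[LEN];

    for (size_t i = 0; i < LEN; i++) {
        src[i] = (unsigned char)(i * 7u + 3u); /* arbitrary test pattern */
    }
    memset(dst, 0, LEN);

    nt_copy(dst, src, LEN);
    return memcmp(dst, src, LEN) == 0;
}
```

Calling `nt_copy_selfcheck()` on an x86-64 host exercises the same load/stream pair as the hot loop. Under emulation, each `_mm_stream_si128` is the store that has to be lowered to an ARM64 instruction.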

```
   0x00007ffac21cc384:  adr     x0, 0x7ffac21cc380
   0x00007ffac21cc388:  str     x0, [x28, #184]
   0x00007ffac21cc38c:  ldr     q20, [x4, x14, sxtx]
   0x00007ffac21cc390:  str     q20, [x4, x5, sxtx]
=> 0x00007ffac21cc394:  add     x4, x4, #0x10
   0x00007ffac21cc398:  sub     x26, x6, x4
   0x00007ffac21cc39c:  eor     x27, x6, x4
   0x00007ffac21cc3a0:  cmp     x6, x4
   0x00007ffac21cc3a4:  cfinv
   0x00007ffac21cc3a8:  cset    x20, cc  // cc = lo, ul, last
   0x00007ffac21cc3ac:  csel    x20, x20, xzr, ne  // ne = any
   0x00007ffac21cc3b0:  cbnz    x20, 0x7ffac21cc38c
   0x00007ffac21cc3b4:  b       0x7ffac21cc3f0
   0x00007ffac21cc3b8:  blr     x0
```