diff options
Diffstat (limited to 'results/classifier/accel-gemma3:12b/tcg/1742')
| -rw-r--r-- | results/classifier/accel-gemma3:12b/tcg/1742 | 96 |
1 files changed, 96 insertions, 0 deletions
diff --git a/results/classifier/accel-gemma3:12b/tcg/1742 b/results/classifier/accel-gemma3:12b/tcg/1742 new file mode 100644 index 000000000..dc6c23839 --- /dev/null +++ b/results/classifier/accel-gemma3:12b/tcg/1742 @@ -0,0 +1,96 @@ + +Arm64 kernel run with qemu-system-aarch64 crashes handling program using SVE and Streaming SVE modes +Description of problem: +The userspace program shown, which switches between SVE/SME states, crashes the kernel on task switch when running under qemu-system-aarch64. This does not reproduce on an Arm Fast Model, but I can't be sure that that is not a timing difference. + +The kernel appears to have no space allocated to save SVE state for this process, but also believes that it should save the state, where it then faults. +Steps to reproduce: +1. Compile the following program: +``` +#include <sys/prctl.h> + +int main() { + asm volatile("msr s0_3_c4_c7_3, xzr" /*smstart*/); + prctl(PR_SVE_SET_VL, 8 * 4); + asm volatile("msr s0_3_c4_c7_3, xzr" /*smstart*/); + while (1) {} // Wait to be preempted? + return 0; +} +``` +With: +``` +$ aarch64-unknown-linux-gnu-gcc main.c -o main.o -g -O3 -march=armv8.6-a+sve +``` +Compiler version does not matter I don't think, but in case: +``` +$ aarch64-unknown-linux-gnu-gcc --version +aarch64-unknown-linux-gnu-gcc (crosstool-NG 1.25.0.85_61c4cca) 10.4.0 +``` +It is a 10.4.0 built with CrossToolNG. + +2. Boot Linux and run the program in the emulated environment. I've found looping it to be more consistent: +``` +$ while true; do ./main.o; done +``` +Though sometimes it will crash after only one run. +Additional information: +Here is the output from the kernel: +``` +$ /mnt/virt_root/sme_crash/main.o +[ 190.813392] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 +[ 190.818912] Mem abort info: +[ 190.819255] ESR = 0x0000000096000046 +[ 190.819727] EC = 0x25: DABT (current EL), IL = 32 bits +[ 190.820391] SET = 0, FnV = 0 +[ 190.820757] EA = 0, S1PTW = 0 +[ 190.821145] FSC = 0x06: level 2 translation fault +[ 190.821635] Data abort info: +[ 190.821978] ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000 +[ 190.822490] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 +[ 190.822991] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 +[ 190.823645] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000475f1000 +[ 190.824269] [0000000000000000] pgd=0800000047645003, p4d=0800000047645003, pud=0800000047641003, pmd=0000000000000000 +[ 190.826225] Internal error: Oops: 0000000096000046 [#1] PREEMPT SMP +[ 190.826996] Modules linked in: +[ 190.827748] CPU: 0 PID: 198 Comm: main.o Not tainted 6.4.0-01761-g6aeadf7896bf #1 +[ 190.828638] Hardware name: linux,dummy-virt (DT) +[ 190.829304] pstate: 234000c5 (nzCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--) +[ 190.830115] pc : sve_save_state+0x4/0xf0 +[ 190.831378] lr : fpsimd_save+0x184/0x1f0 +[ 190.831848] sp : ffff80008047bc70 +[ 190.832223] x29: ffff80008047bc70 x28: ffff0000036c49c0 x27: 0000000000000000 +[ 190.833182] x26: ffff0000036c4f58 x25: ffff0000036c49c0 x24: ffff0000036c5868 +[ 190.834045] x23: 0000000000000020 x22: ffff24441ea31000 x21: 0000000000000001 +[ 190.834894] x20: ffff00003fdc50b0 x19: ffffdbbc213940b0 x18: 0000000000000000 +[ 190.835759] x17: ffff24441ea31000 x16: ffff800080000000 x15: 0000000000000000 +[ 190.836593] x14: 000000000000026c x13: 0000000000000001 x12: 0000000000000020 +[ 190.837436] x11: 0000000000000000 x10: 0000000000000001 x9 : 0000000000000800 +[ 190.838323] x8 : ffff00003fdcffc0 x7 : ffff00003fdcff40 x6 : 0000000002da9c8c +[ 190.839149] x5 : 0000000000000001 x4 : 0000000000000000 x3 : 0000000000000000 +[ 190.839976] x2 : 0000000000000001 x1 : ffff0000036c56a0 x0 : 0000000000000440 +[ 190.840936] Call trace: +[ 190.841406] sve_save_state+0x4/0xf0 +[ 190.841993] fpsimd_thread_switch+0x24/0xd4 +[ 190.842572] __switch_to+0x20/0x1d4 +[ 190.843043] __schedule+0x2a0/0xa7c +[ 190.843488] schedule+0x5c/0xc4 +[ 190.843912] do_notify_resume+0x1a4/0x474 +[ 190.844410] el0_interrupt+0xc4/0xd4 +[ 190.844855] __el0_irq_handler_common+0x18/0x24 +[ 190.845350] el0t_64_irq_handler+0x10/0x1c +[ 190.845824] el0t_64_irq+0x190/0x194 +[ 190.846661] Code: 54000040 d51b4408 d65f03c0 d503245f (e5bb5800) +[ 190.847545] ---[ end trace 0000000000000000 ]--- +[ 190.848125] note: main.o[198] exited with irqs disabled +``` + +I have looked the kernel functions in the backtrace and it seems to be loading memory fine, so it's not obviously a code generation problem. The pointer loaded prior to the crash is definitely a nullptr. + +Removing any of the lines (`while (1) {}` aside) from the example seems to avoid the issue but again, could be timing. + +An important point here is that the kernel syscall ABI states that streaming mode will be exited on +a syscall. I have observed that this does happen as expected. This is why the test case does a syscall, then immediately goes back to streaming mode. And it is perhaps where the confusion starts. + +I have confirmed that SME is supported by the emulated CPU and other SME programs do run correctly. + +I initially thought this was to do with having many cores, but it reproduces on a single core also. |