I’m working on a system built around a custom board design, running a Yocto-built Linux with kernel v4.1.27 on a relatively new quad-core Intel SoC in x86-64 mode. We continuously run a fairly intensive application that moves a lot of data and does heavy number crunching.
I want to start with an apology. Sorry for the excessive detail. Feel free to skip over this post if you want to, or to skip straight to my questions at the end if it helps.
I think I'm seeing a hardware issue, but I might be misunderstanding something. I've laid out all the detail
- so you can check whether my understanding is incorrect
- because it might help someone else learn something about debugging
So let's launch into it...
Now, at unpredictable intervals, roughly every few days to once a week, my system crashes with a “BUG: unable to handle kernel paging request”. So, I’ve built the kernel with KGDB and KDB enabled and set oops to panic. I connected GDB over the console and broke in at panic() when the issue occurred.
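For anyone wanting to set up something similar, this is roughly the configuration I mean (a sketch; the serial device and baud rate are assumptions for my board):

```shell
# Kernel boot parameters (device and baud rate are assumptions):
#   kgdboc=ttyS0,115200   - attach KGDB/KDB to the serial console
#   kgdbwait              - optionally stop at boot and wait for the debugger

# Make every oops escalate to a panic so the debugger traps it:
echo 1 > /proc/sys/kernel/panic_on_oops
```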
The messages on the console are as follows:
Code:
[91774.009789] BUG: unable to handle kernel paging request at 0000000040916000
[91774.017661] IP: [<ffffffff8119bfad>] do_brk+0x10d/0x2f0
[91774.023547] PGD 74de1067 PUD 7629b067 PMD 74cfa067 PTE 0
[91774.029541] Oops: 0002 [#1] PREEMPT SMP
[91774.033965] Modules linked in: uio iwlwifi cfg80211 hid_sensor_custom hid_sensor_hub intel_ishtp_hid intel_ishtp_clients intel_spi_platform sbi_apl intel_spi mei_dal efivars i2c_i801 pcspkr lpc_ich spi_pxa2xx_platform portmux_intel_drcfg i915 intel_ish_ipc video mei_me intel_ishtp mei
[91774.062459] CPU: 3 PID: 25678 Comm: crunch.x64 Tainted: G U 4.1.27-yocto-standard #1
[91774.072432] Hardware name: Intel Leafhill/Leafhill, BIOS Tianocore-g93533866 10/02/2017
[91774.081620] task: ffff880076bac890 ti: ffff880073264000 task.ti: ffff880073264000
[91774.090029] RIP: 0010:[<ffffffff8119bfad>] [<ffffffff8119bfad>] do_brk+0x10d/0x2f0
[91774.098640] RSP: 0018:ffff880073267e78 EFLAGS: 00010246
[91774.104596] RAX: 0000000040916000 RBX: ffff880073245508 RCX: 0000000000000000
[91774.112611] RDX: 0000000000000000 RSI: ffff880076bac890 RDI: 0000000040916000
[91774.120626] RBP: ffff880073267ef8 R08: 0000000000000010 R09: ffffffff81008a50
[91774.128643] R10: 0000000000000000 R11: 0000000040937000 R12: 0000000040916000
[91774.136660] R13: 0000000000021000 R14: ffff880073245500 R15: 0000000040937000
[91774.144679] FS: 00007fe6a1e60740(0000) GS:ffff880079380000(0000) knlGS:0000000000000000
[91774.153770] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[91774.160223] CR2: 0000000040916000 CR3: 0000000074eab000 CR4: 00000000003406e0
[91774.168237] Stack:
[91774.172287] 0000000000000000 00000000000408f5 0000000000000000 0000000000000d6f
[91774.182388] ffff880076152300 0000000031030034 00000000000408f5 0000000000000000
[91774.192461] 0000000000100073 ffff880073245508 ffff880073267f58 0000000040937000
[91774.202530] Call Trace:
[91774.207017] [<ffffffff8119c391>] SyS_brk+0x171/0x1b0
[91774.214414] [<ffffffff81958817>] system_call_fastpath+0x12/0x6a
[91774.222862] Code: 89 7d c8 4d 89 df 0f 1f 84 00 00 00 00 00 48 8b 5d c8 31 d2 45 31 d2 eb 12 0f 1f 44 00 00 4c 39 78 e0 72 22 48 8d 58 10 49 89 c2 <48> 8b 03 48 85 c0 74 3b 4c 3b 60 e8 72 e5 48 8d 58 08 48 89 c2
[91774.248059] RIP [<ffffffff8119bfad>] do_brk+0x10d/0x2f0
[91774.255862] RSP <ffff880073267e78>
[91774.261610] CR2: 0000000040916000
[91774.273758] ---[ end trace 80e7510b39b4ce85 ]---
Using GDB's backtrace command, I get:
Code:
#0 panic (fmt=0xffffffff81c8b642 "Fatal exception")
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/kernel/panic.c:73
#1 0xffffffff810066a1 in oops_end (flags=70, regs=<optimised out>, signr=9)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/arch/x86/kernel/dumpstack.c:249
#2 0xffffffff8104a8eb in no_context (regs=0xffff880073267dc8, error_code=<optimised out>, address=1083269120, signal=<optimised out>,
si_code=<optimised out>)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/arch/x86/mm/fault.c:731
#3 0xffffffff8104abc0 in __bad_area_nosemaphore (regs=0xffff880073267dc8, error_code=2, address=1083269120, si_code=196609)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/arch/x86/mm/fault.c:811
#4 0xffffffff8104ad63 in bad_area_nosemaphore (regs=<optimised out>, error_code=<optimised out>, address=<optimised out>)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/arch/x86/mm/fault.c:818
#5 0xffffffff8104b02e in __do_page_fault (regs=0xffffffff81c8b642, error_code=18446744071592260301, address=1083269120)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/arch/x86/mm/fault.c:1183
#6 0xffffffff8104b44c in do_page_fault (regs=<optimised out>, error_code=<optimised out>)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/arch/x86/mm/fault.c:1304
#7 <signal handler called>
#8 find_vma_links (rb_parent=<optimised out>, rb_link=<optimised out>, pprev=<optimised out>, end=<optimised out>, addr=<optimised out>,
mm=<optimised out>) at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/mm/mmap.c:562
#9 do_brk (addr=1083269120, len=135168)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/mm/mmap.c:2760
#10 0xffffffff8119c391 in SYSC_brk (brk=<optimised out>)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/mm/mmap.c:338
#11 SyS_brk (brk=1083404288)
at /media/rapt/f1cd4c7f-26cf-4fe7-8cc9-8a658a248389/yocto/yocto_build/build/tmp/work-shared/boarda/kernel-source/mm/mmap.c:285
#12 <signal handler called>
#13 0x0000003375ce1079 in ?? ()
#14 0xffff880073268300 in ?? ()
#15 0xffff880073640000 in ?? ()
#16 0x0000000000000000 in ?? ()
If I’m understanding correctly, it shows that the application invoked the brk() syscall to request more memory. This called do_brk(), which in turn uses find_vma_links() to walk the existing VMA tree. Finally, a page fault occurred inside find_vma_links(), and this page fault could not be resolved.
Looking deeper by disassembling around find_vma_links() (which is inlined into do_brk()) in frame 8, I see the instruction pointer at the following point:
Code:
0xffffffff8119bfa6 <+262>: lea 0x10(%rax),%rbx
573 } else {
574 rb_prev = __rb_parent;
575 __rb_link = &__rb_parent->rb_right;
0xffffffff8119bfaa <+266>: mov %rax,%r10
562 while (*__rb_link) {
=> 0xffffffff8119bfad <+269>: mov (%rbx),%rax
0xffffffff8119bfb0 <+272>: test %rax,%rax
0xffffffff8119bfb3 <+275>: je 0xffffffff8119bff0 <do_brk+336>
Now, this is the point at which I am uncertain, due to lack of familiarity with both GDB and Intel processors (my background is bare-metal ARM). However, the way it looks to me is that the instruction at 0xffffffff8119bfad, mov (%rbx),%rax, caused the page fault. That instruction reads the value at the address stored in rbx and stores it in rax. The address in rbx is 0xffff880073245508 (seen in the console backtrace and confirmed in GDB), which I have confirmed is a valid virtual address. The address that caused the page fault is 0x0000000040916000 = 1083269120 (in the CR2 register and in the page-fault call arguments). This value exists in rax and r12, but I don't see any instruction in the disassembly that uses an address from those registers.
So, I'm now left with the following questions to ponder...
- It looks to me like the instruction at 0xffffffff8119bfad caused the page fault and would be re-executed if the page fault was resolved. Is my understanding correct?
- The address in rbx is not the address that caused the page fault, but that address is present in other registers. What could cause this page fault other than a hardware error?
- Am I missing something or misunderstanding something?
So there you have it. I'll post a conclusion if I do actually come to one.