OOPS debugging questions - Unable to handle kernel paging request at virtual address

tothphu · 07-11-2016, 08:14 PM

I had a kernel OOPS the other day while running speaker-test on my Freescale i.MX233. It presumably happened after I tried to send SIGTERM to speaker-test, although it could have been at any other time. After the OOPS, all otherwise idle CPU time was reported as I/O wait. The process that had invoked speaker-test couldn't be terminated either, not even with SIGKILL, and "ps ax" hung as soon as I ran it.

Luckily I've managed to extract the OOPS from the messages. I've searched all over the internet but couldn't really explain everything that I'm seeing in this OOPS.

What I really can't figure out is what can actually cause this and how I can trace it back to a specific driver. The mxs audio drivers are built into the kernel, so they won't show up in the module list. The driver itself has been heavily modified; I can share parts of it on request. Is the call backtrace simply limited in depth?

So the kernel addresses start at 0xc0000000, but why is the process stack inside the kernel address region? Isn't the stack supposed to grow downwards from just below the kernel addresses?
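One way to look at the stack question is to decode the "Exception stack" words from the OOPS below as a saved register frame. The sketch assumes the usual 32-bit ARM Linux `struct pt_regs` layout (r0 to r10, fp, ip, sp, lr, pc, cpsr, orig_r0), a `PAGE_OFFSET` of 0xc0000000 and a `THREAD_SIZE` of 8 KiB; the helper names are made up for illustration. Under those assumptions, the sp printed in the register dump would be the task's kernel-mode stack, while the user-mode sp is the one saved in the exception frame.

```python
# Register order of struct pt_regs on 32-bit ARM Linux (18 words, assumed layout).
PT_REGS_NAMES = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9",
                 "r10", "fp", "ip", "sp", "lr", "pc", "cpsr", "orig_r0"]

# Words copied from the "Exception stack(0xc39c5fb0 to 0xc39c5ff8)" lines.
EXCEPTION_STACK = """
00000000 00000000 0000000c 00000000
0000000c 00000006 be9329e8 be9329e8 0000000c 0000000b 403004d0 be93295c
0000c718 be9328f8 401fa7bc 401fadb8 20000010 ffffffff
"""

PAGE_OFFSET = 0xC0000000  # assumption: 3G/1G user/kernel split
THREAD_SIZE = 8192        # assumption: 8 KiB kernel stack per task


def parse_pt_regs(text):
    """Map the 18 dumped words onto register names."""
    words = [int(w, 16) for w in text.split()]
    if len(words) != len(PT_REGS_NAMES):
        raise ValueError("expected %d words, got %d" % (len(PT_REGS_NAMES), len(words)))
    return dict(zip(PT_REGS_NAMES, words))


def kernel_stack_bounds(sp):
    """Return (low, high) of the THREAD_SIZE-aligned block containing sp."""
    low = sp & ~(THREAD_SIZE - 1)
    return low, low + THREAD_SIZE


regs = parse_pt_regs(EXCEPTION_STACK)
from_user_mode = (regs["cpsr"] & 0x1F) == 0x10   # mode bits 0x10 = USR
low, high = kernel_stack_bounds(0xC39C5DE0)      # sp from the register dump

print("user sp   = 0x%08x (below PAGE_OFFSET: %s)" % (regs["sp"], regs["sp"] < PAGE_OFFSET))
print("user pc   = 0x%08x, taken from user mode: %s" % (regs["pc"], from_user_mode))
print("kernel stack block = 0x%08x .. 0x%08x" % (low, high))
```

With the values from this OOPS, the saved user sp is 0xbe9328f8, which is below 0xc0000000, and the aligned block around the dumped sp is 0xc39c4000 to 0xc39c6000, which matches the upper bound of the "Stack:" line.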

The speaker-test in use is 1.0.11rc2, but I presume that even if the program ended abruptly, the sound architecture would close everything properly. This version of speaker-test doesn't handle signals and doesn't attempt to close gracefully; it just gives up (though I know that the pcm driver is closed even in this scenario).

What region would 0xe1a0a024 be? Is that an ARM instruction perhaps? Meaning this will be a stack overflow somewhere? I know the memory mapped registers reside in 0x80000000. What region is "pgd = c39dc000" in? Is that kernel stack?
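For the pgd question, the OOPS itself offers a cross-check: it prints both "pgd = c39dc000" and "Table: 439dc000". A minimal sketch, assuming the kernel's linear mapping starts at 0xc0000000 and that RAM starts at physical 0x40000000 (an assumption inferred from the "Table:" value, not confirmed anywhere in this thread), shows that the two numbers would then be the virtual and physical address of the same page table. The function names are hypothetical.

```python
PAGE_OFFSET = 0xC0000000  # start of kernel virtual addresses (from the post)
PHYS_OFFSET = 0x40000000  # assumption: start of RAM, inferred from "Table: 439dc000"


def virt_to_phys(vaddr):
    """Translate a linearly mapped kernel virtual address to a physical one."""
    if vaddr < PAGE_OFFSET:
        raise ValueError("0x%08x is not a kernel virtual address" % vaddr)
    return vaddr - PAGE_OFFSET + PHYS_OFFSET


def region(addr):
    """Coarse classification by the user/kernel split only."""
    return "kernel space" if addr >= PAGE_OFFSET else "user space"


pgd_virt = 0xC39DC000    # "pgd = c39dc000"
table_phys = 0x439DC000  # "Table: 439dc000"

print("pgd phys = 0x%08x, matches Table: %s"
      % (virt_to_phys(pgd_virt), virt_to_phys(pgd_virt) == table_phys))
for addr in (0xE1A0A024, 0xC39C5DE0, 0xBE9328F8):
    print("0x%08x -> %s" % (addr, region(addr)))
```

If the assumption holds, pgd is the process's first-level page table rather than a stack. 0xe1a0a024 is numerically above 0xc0000000, and the line "[e1a0a024] *pgd=00000000" shows that its first-level entry is empty, so nothing is mapped there.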

Is it possible to get a longer stack dump the next time there is an OOPS, so that I can possibly go further? I can change the kernel if that's necessary (I guess I should just go to the OOPS printer to get more), but is there a configuration option for this?

Any ideas? Any help is greatly appreciated; I've been looking at this for several days now.

Code:

<1>[268811.560000] Unable to handle kernel paging request at virtual address e1a0a024
<1>[268811.560000] pgd = c39dc000
<1>[268811.560000] [e1a0a024] *pgd=00000000
<4>[268811.560000] Internal error: Oops: 5 [#1] PREEMPT
<4>[268811.560000] Modules linked in: 
<4>[268811.560000] CPU: 0    Tainted: P            (2.6.31-private #153)
<4>[268811.560000] PC is at vma_prio_tree_next+0x3c/0x6c
<4>[268811.560000] LR is at update_mmu_cache+0x120/0x1c4
<4>[268811.560000] pc : [<c00b98d8>]    lr : [<c00611d4>]    psr: a0000093
<4>[268811.560000] sp : c39c5de0  ip : c5cfa8f8  fp : c7ce7d80
<4>[268811.560000] r10: c7ce7d80  r9 : 401fb000  r8 : 401fb000
<4>[268811.560000] r7 : 00000021  r6 : c5c9b478  r5 : 00000000  r4 : 401fad94
<4>[268811.560000] r3 : e08f7007  r2 : ea00014d  r1 : c39c5dec  r0 : e1a0a000
<4>[268811.560000] Flags: NzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
<4>[268811.560000] Control: 0005317f  Table: 439dc000  DAC: 00000015
<4>[268811.560000] Process speaker-test (pid: 1823, stack limit = 0xc39c4270)
<4>[268811.560000] Stack: (0xc39c5de0 to 0xc39c6000)
<4>[268811.560000] 5de0: 401fad94 c00611d4 c5f351c0 c7d34c84 00000080 00000000 00000000 c747d3c0 
<4>[268811.560000] 5e00: 00000021 00000021 00000000 00000000 4507630f c04a4ec0 00000000 c5c9b478 
<4>[268811.560000] 5e20: 00000000 c00bbc68 c7802060 00000000 00000200 c3a22fec 00000000 00000021 
<4>[268811.560000] 5e40: 401fb000 c04a4ec0 c5c9b108 c3a22800 c39dd000 c5c9b478 c5c9b478 401fb000 
<4>[268811.560000] 5e60: 00000000 00000000 c7ce7d80 c00bc69c 00000021 00000000 00000000 00000000 
<4>[268811.560000] 5e80: 000001fb c39dc000 00000200 000007ec c3a22fec c5c0612c 00000010 00000000 
<4>[268811.560000] 5ea0: 00000000 c749b0b0 0000000a c03b030c c5de0c00 c5c9b478 c7ce7db4 401fb290 
<4>[268811.560000] 5ec0: c39c5fb0 c7ce7d80 00000017 c0060a30 c7d34cb8 00000000 00000200 00000000 
<4>[268811.560000] 5ee0: 00000000 c03b030c 00000006 c03b037c 00000017 c39c5fb0 0000000b 401fb290 
<4>[268811.560000] 5f00: be93295c c005a228 00000000 00000000 c7ce7d80 c00bc69c 0000000a 00000000 
<4>[268811.560000] 5f20: 00000000 00000000 000001b0 c39dc000 00000200 000006c0 c3a22ec0 401c3000 
<4>[268811.560000] 5f40: 00000001 00000000 40025050 c00858f8 00000021 ffffffff c5de0c00 c5c9b948 
<4>[268811.560000] 5f60: 00000000 c01697e8 00000200 c5de0c00 c5c9b948 c0060ac4 c005af84 be932c38 
<4>[268811.560000] 5f80: 00000008 00000000 c39c4000 ffffffff 00000006 ffffffff 00000006 be9329e8 
<4>[268811.560000] 5fa0: be9329e8 0000000c 403004d0 c005ad9c 00000000 00000000 0000000c 00000000 
<4>[268811.560000] 5fc0: 0000000c 00000006 be9329e8 be9329e8 0000000c 0000000b 403004d0 be93295c 
<4>[268811.560000] 5fe0: 0000c718 be9328f8 401fa7bc 401fadb8 20000010 ffffffff 00000000 00000000 
<4>[268811.560000] [<c00b98d8>] (vma_prio_tree_next+0x3c/0x6c) from [<c00611d4>] (update_mmu_cache+0x120/0x1c4)
<4>[268811.560000] [<c00611d4>] (update_mmu_cache+0x120/0x1c4) from [<c00bbc68>] (__do_fault+0x308/0x3ec)
<4>[268811.560000] [<c00bbc68>] (__do_fault+0x308/0x3ec) from [<c00bc69c>] (handle_mm_fault+0x298/0xc14)
<4>[268811.560000] [<c00bc69c>] (handle_mm_fault+0x298/0xc14) from [<c0060a30>] (do_page_fault+0xec/0x234)
<4>[268811.560000] [<c0060a30>] (do_page_fault+0xec/0x234) from [<c005a228>] (do_DataAbort+0x30/0x90)
<4>[268811.560000] [<c005a228>] (do_DataAbort+0x30/0x90) from [<c005ad9c>] (ret_from_exception+0x0/0x10)
<4>[268811.560000] Exception stack(0xc39c5fb0 to 0xc39c5ff8)
<4>[268811.560000] 5fa0:                                     00000000 00000000 0000000c 00000000 
<4>[268811.560000] 5fc0: 0000000c 00000006 be9329e8 be9329e8 0000000c 0000000b 403004d0 be93295c 
<4>[268811.560000] 5fe0: 0000c718 be9328f8 401fa7bc 401fadb8 20000010 ffffffff                   
<4>[268811.560000] Code: e2430024 e5903030 e3530000 1a000001 (e5903024) 
<4>[268811.560000] ---[ end trace c70c22c7b9cf390d ]---
<6>[268811.560000] note: speaker-test[1823] exited with preempt_count 2

Mara · 07-18-2016, 03:17 PM

It seems to me that you have a page fault because something is accessing address 0xe1a0a024, which is not in the page tables. It also seems to me (but I don't have the 2.6.31 kernel code at hand to check) that the trap happens under update_mmu_cache, a function that should update the MMU cache; the PC itself is in vma_prio_tree_next, which was called from it. That could mean that you have some wrong values in the MMU structures. I would start by looking at the structures vma_prio_tree_next is accessing for any sign of corruption. You can also enable some kernel debugging options, I'd say mostly those for the MMU and VMA structures.
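The faulting access can be cross-checked from the dump itself. The word in parentheses on the "Code:" line, e5903024, is the instruction at the PC. The sketch below decodes it as an ARM load/store with an immediate offset and recomputes the effective address from the r0 value in the register dump. It handles only this one encoding class, and the function name is made up for illustration.

```python
def decode_ldr_str_imm(insn):
    """Decode an ARM (A32) LDR/STR with immediate offset; other encodings are rejected."""
    if (insn >> 25) & 0x7 != 0b010:
        raise ValueError("0x%08x is not an immediate-offset LDR/STR" % insn)
    return {
        "cond": (insn >> 28) & 0xF,      # 0xE means "always"
        "add": bool((insn >> 23) & 1),   # U bit: add or subtract the offset
        "byte": bool((insn >> 22) & 1),  # B bit: byte or word access
        "load": bool((insn >> 20) & 1),  # L bit: load or store
        "rn": (insn >> 16) & 0xF,        # base register
        "rd": (insn >> 12) & 0xF,        # destination/source register
        "offset": insn & 0xFFF,          # 12-bit immediate
    }


FAULTING_INSN = 0xE5903024  # "(e5903024)" on the Code: line
R0 = 0xE1A0A000             # "r0 : e1a0a000" in the register dump

d = decode_ldr_str_imm(FAULTING_INSN)
base = R0 if d["rn"] == 0 else None  # only r0 is needed for this instruction
effective = base + d["offset"] if d["add"] else base - d["offset"]

print("%s r%d, [r%d, #%s0x%x]" % ("ldr" if d["load"] else "str",
                                  d["rd"], d["rn"], "" if d["add"] else "-", d["offset"]))
print("effective address = 0x%08x" % effective)
```

The instruction decodes to a word load from r0 + 0x24, and 0xe1a0a000 + 0x24 is exactly the reported fault address 0xe1a0a024, so the bad value is the pointer in r0. That value is also a valid encoding of the ARM instruction `mov r10, r0`, which may hint that the structure holds overwritten or stale data rather than a pointer; that is only a hint, not proof.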

tothphu · 07-24-2016, 11:10 PM

Thanks for your help.

I think I'm dealing with some kind of physical memory corruption, which might explain the failure in the kernel's memory management. In parallel, I have started to observe doubly linked list corruption in user space. So it looks like the same issue, just with different memory regions being corrupted.