LinuxQuestions.org
Old 09-16-2023, 04:33 AM   #1
lol-
LQ Newbie
 
Registered: Sep 2023
Posts: 9

Rep: Reputation: 0
GPU (driver?) issue on AMD integrated graphics and GNOME: freezes and severe glitches


Hi!
I have been experiencing GPU issues on my laptop lately (severe graphical glitches and full GPU resets).
Laptop: Thinkpad E15 gen 4, R5 5625U + Vega 7 integrated graphics. Running latest amdgpu driver. EndeavourOS.

The first time I experienced these issues, I noticed severe graphical glitches every time something moved on the screen: flickering, and polygon rendering just broke. I have a video, but unfortunately it seems I can't insert attachments yet. (link: logs)

The second time was yesterday: this time, no glitches, just a complete freeze with flickering every 5 seconds. I have logs indicating a page fault and then the GPU constantly resetting but failing. (link: crash occurring and GPU failing to reset)
Both crashes happened when I was running graphically intensive operations (Minecraft for the first and a browser game for the second).

Considering the nature of the crash (page fault), I immediately tested my hardware using UEFI utilities. All tests passed without any problem detected. Thus, I believe (and strongly hope) this is a driver issue or bug. For some reason, during this second crash I couldn't even access a TTY, but my music kept playing in the background.
Thank you in advance!

[EDIT] - the second crash is reproducible with a 100% success rate by loading this page. Works for a few seconds then freezes.

Last edited by lol-; 09-16-2023 at 05:06 AM. Reason: added details about when crashes happen
 
Old 09-16-2023, 12:58 PM   #2
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,453

Rep: Reputation: 2342
Hello, lol- & welcome to LQ.

I've an external AMD card and no issues. I wouldn't have recommended integrated graphics for gaming, but it should work, or fail predictably.

I loaded your "100% fault" page and it's unspectacular but fine.

You say 'page fault.' Do you mean a 'Segmentation fault?' If so, that's a memory error, brought about by dodgy memory or software. So you'd try with memtest86, fsck the disks, and then get suspicious of software.

We need to see the exact fault and output. The big "#" icon gives you CODE tags; paste the vital log bits bang in the middle. The button to the left (of "#") gives a quote pair. The quotes have word wrap, the CODE tags don't, so lines are preserved. Look at the tutorials to find these things out.
 
Old 09-16-2023, 02:02 PM   #3
jayjwa
Member
 
Registered: Jul 2003
Location: NY
Distribution: Slackware, Termux
Posts: 798

Rep: Reputation: 256
There are a lot of problems with amdgpu and/or Mesa on Lenovo. I've had these problems for years and can't find a solution. The older 'radeon' kernel module was more stable, but recent Mesa won't work with it. Like you, I get random crashes and screen freezing. I can log in remotely, so only the display is shot. If you search, you'll find dozens of people with this problem.

Code:
2023-09-10T22:50:05.056329-04:00 atr2 kernel: [515043.720780] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=17999849, emitted seq=17999851
2023-09-10T22:50:05.056343-04:00 atr2 kernel: [515043.721129] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process librewolf pid 19485 thread librewolf:cs0 pid 19603
2023-09-10T22:50:05.056344-04:00 atr2 kernel: [515043.721357] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
2023-09-10T22:50:05.675340-04:00 atr2 kernel: [515044.340512] amdgpu 0000:09:00.0: amdgpu: PCI CONFIG reset
2023-09-10T22:50:05.679337-04:00 atr2 kernel: [515044.344393] amdgpu 0000:09:00.0: amdgpu: GPU reset succeeded, trying to resume
2023-09-10T22:50:05.679346-04:00 atr2 kernel: [515044.344504] [drm] PCIE gen 3 link speeds already enabled
2023-09-10T22:50:05.681336-04:00 atr2 kernel: [515044.346390] amdgpu 0000:09:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x000000F400800000).
2023-09-10T22:50:06.314327-04:00 atr2 kernel: [515044.979618] [drm] UVD initialized successfully.
2023-09-10T22:50:06.708323-04:00 atr2 kernel: [515045.372882] amdgpu 0000:09:00.0: amdgpu: recover vram bo from shadow start
2023-09-10T22:50:06.709337-04:00 atr2 kernel: [515045.373978] amdgpu 0000:09:00.0: amdgpu: recover vram bo from shadow done
2023-09-10T22:50:06.709344-04:00 atr2 kernel: [515045.373998] [drm] Skip scheduling IBs!
2023-09-10T22:50:06.709347-04:00 atr2 kernel: [515045.374005] [drm] Skip scheduling IBs!
2023-09-10T22:50:06.709348-04:00 atr2 kernel: [515045.374010] [drm] Skip scheduling IBs!
2023-09-10T22:50:06.709349-04:00 atr2 kernel: [515045.374459] [drm] Skip scheduling IBs!
2023-09-10T22:50:06.709351-04:00 atr2 kernel: [515045.374464] [drm] Skip scheduling IBs!
2023-09-10T22:50:06.709352-04:00 atr2 kernel: [515045.374467] [drm] Skip scheduling IBs!
2023-09-10T22:50:06.753328-04:00 atr2 kernel: [515045.418016] amdgpu 0000:09:00.0: amdgpu: GPU reset(1) succeeded!
The reason I'm telling you about this is so that you don't go crazy trying solutions that are unrelated to the problem. I'm guessing this is something that has to be fixed by the kernel/amdgpu/Mesa people. The bug reports I've looked at just get dismissed or closed after some time with no fixes.
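
For anyone comparing their own logs against the one above, here is a minimal Python sketch that counts the amdgpu events seen in that excerpt (ring timeouts, reset attempts, completed resets). The patterns are copied from the quoted log lines only; the function name `summarize_amdgpu_events` and the event labels are made up for illustration, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt quoted in this post.
# Other kernel versions may phrase these messages differently (assumption).
PATTERNS = {
    "ring_timeout": re.compile(r"\*ERROR\* ring (\w+) timeout"),
    "reset_begin": re.compile(r"amdgpu: GPU reset begin!"),
    "reset_succeeded": re.compile(r"amdgpu: GPU reset\(\d+\) succeeded!"),
}


def summarize_amdgpu_events(log_text: str) -> Counter:
    """Count how many lines of kernel log text match each known pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: pipe saved kernel log text into the script on stdin.
    for event, count in summarize_amdgpu_events(sys.stdin.read()).items():
        print(f"{event}: {count}")
```

A reset that begins but never reaches the final "succeeded" line would show up as `reset_begin` being larger than `reset_succeeded`, which matches the "GPU constantly resetting but failing" symptom described in the first post.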
 
Old 09-16-2023, 04:15 PM   #4
lol-
LQ Newbie
 
Registered: Sep 2023
Posts: 9

Original Poster
Rep: Reputation: 0
@business_kid I only play Minecraft, integrated graphics have not failed me and performance is completely satisfactory.
No, I didn't mean segmentation fault. The error is `Sep 16 00:43:38 lol kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:6 pasid:32776, for process brave pid 1325117 thread brave:cs0 pid 1325143)`, as per the logs I linked to in my original post. A page fault is indeed memory-access related as well, and I have run memory diagnostic tools; no errors were detected. The specific page I linked to makes my system freeze after about 10 seconds on it. I have already given logs of the event; could you please be more specific as to what system logs would be useful? Would it be preferable to insert logs directly in my posts instead of paste.rs links? I have done the latter in this thread because some of the log snippets are lengthy. Thanks for the reply. https://stackoverflow.com/questions/...-vs-page-fault indicates that a page fault shouldn't raise an exception. In my case it does, though.
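
To make the fault line above easier to compare across crashes, a small Python sketch can pull out its fields (hub, ring, vmid, process, pids). The regular expression is written against this one line only, so treat the format as an assumption; the name `parse_page_fault` is hypothetical, and process names containing spaces would not match.

```python
import re

# Built from the single "no-retry page fault" line quoted above.
# The exact field layout on other kernels is an assumption.
FAULT_RE = re.compile(
    r"\[(?P<hub>\w+)\] no-retry page fault "
    r"\(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) vmid:(?P<vmid>\d+) "
    r"pasid:(?P<pasid>\d+), for process (?P<process>\S+) pid (?P<pid>\d+) "
    r"thread (?P<thread>\S+) pid (?P<thread_pid>\d+)\)"
)

# Fields that are plain decimal numbers in the log line.
INT_FIELDS = ("src_id", "ring", "vmid", "pasid", "pid", "thread_pid")


def parse_page_fault(line: str):
    """Return the fault fields as a dict, or None if the line does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    line = (
        "Sep 16 00:43:38 lol kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] "
        "no-retry page fault (src_id:0 ring:24 vmid:6 pasid:32776, for process "
        "brave pid 1325117 thread brave:cs0 pid 1325143)"
    )
    print(parse_page_fault(line))
```

Collecting the `process` field over several crashes would show whether the faults always come from the same application (here the browser) or from anything that uses the GPU.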

Last edited by lol-; 09-16-2023 at 04:27 PM.
 
Old 09-16-2023, 04:17 PM   #5
lol-
LQ Newbie
 
Registered: Sep 2023
Posts: 9

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by jayjwa View Post
There are a lot of problems with amdgpu and/or Mesa on Lenovo. I've had these problems for years and can't find a solution. The older 'radeon' kernel module was more stable, but recent Mesa won't work with it. Like you, I get random crashes and screen freezing. I can log in remotely, so only the display is shot. If you search, you'll find dozens of people with this problem.

The reason I'm telling you about this is so that you don't go crazy trying solutions that are unrelated to the problem. I'm guessing this is something that has to be fixed by the kernel/amdgpu/Mesa people. The bug reports I've looked at just get dismissed or closed after some time with no fixes.
If I understood correctly, the AMD drivers are buggy, and my system isn't at fault here? Well, if this happens to be the case, I guess I'll just live with some crashes...
 
Old 09-16-2023, 04:25 PM   #6
lol-
LQ Newbie
 
Registered: Sep 2023
Posts: 9

Original Poster
Rep: Reputation: 0
And yep, googling "non-retry page fault" returns hundreds of forum posts about random GPU related freezes
 
Old 09-16-2023, 06:52 PM   #7
jayjwa
Member
 
Registered: Jul 2003
Location: NY
Distribution: Slackware, Termux
Posts: 798

Rep: Reputation: 256
I'm not sure if it's amdgpu or in Mesa, but yes, that's what I'm implying if indeed you do have the same issue as I. Sometimes I get weeks of uptime, sometimes it crashes in a day or two. I'm not sure what new GPU I could get that would fit and not be a problem so I just live with it. Save work often!
 
Old 09-17-2023, 01:34 AM   #8
lol-
LQ Newbie
 
Registered: Sep 2023
Posts: 9

Original Poster
Rep: Reputation: 0
Well then, I guess I can live with it; I got to 8 days of uptime lately. I can't know for sure, but I have the feeling a newer GPU would make the system even less stable @jayjwa
 
Old 09-17-2023, 04:15 AM   #9
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,453

Rep: Reputation: 2342
Well, I can see a problem right here
Code:
kernel: [drm] PCIE GART of 1024M enabled.
It's grabbing 1 Gig of memory. Phoronix speaks very poorly of cards that "only" have 8G of memory. By comparison, my 10 year old laptop had Intel HD4000 graphics, and that only grabbed 512MB of memory. When they describe those integrated GPUs there's much fewer of this and much fewer of that, so no wonder it's slower.

I now have my first half-decent graphics card, and don't really want more than HDMI out of it. Once you're in the slow lane, be conservative in what you expect from it with games. Modern games are inclined to tell the box on startup: "Whatever you've got, I'll have all of it!" So, setting a lower video mode (e.g. 1280x720) might ease a lot of issues. You'd do that with a "PreferredMode" or "Virtual" setting (or preferably both) in your xorg.conf.d/ settings.

EDIT: Don't be afraid of a new cpu, but don't be bullied either. It's entirely a choice of what you want to do with your hard-earned money. The fancy graphics were never worth the price for me, but I'm not a gamer. I wanted decent 3D when I bought my RX6600XT, but Google Earth still goes 2D on me in street view sometimes.

Last edited by business_kid; 09-17-2023 at 04:23 AM.
 
Old 09-17-2023, 12:43 PM   #10
lol-
LQ Newbie
 
Registered: Sep 2023
Posts: 9

Original Poster
Rep: Reputation: 0
@business_kid
1. If all you say is true, and considering most laptops don't have dedicated graphics, how can 75% of people get away without any glitches?
2. iGPUs use shared system memory after their VRAM (which is also system RAM, but in a reserved location) is full, as far as I know.
3. Why are you talking about poor performance when I have already said I am getting several times the frame rate I need in the games I play? FYI, system monitor indicates 2 GB of VRAM, which is what I have set in the BIOS, and usage never exceeds 1.7 GB.
4. In your edit you mention a new CPU. I absolutely do not need this; my R5 5625U is fast enough for tasks like compiling or hosting Minecraft servers.
5. A slow system, low VRAM, or anything performance related cannot possibly explain this.
EDIT - I didn't realize this file sharing site was one-time download, my apologies

Last edited by lol-; 09-17-2023 at 12:56 PM.
 
Old 09-17-2023, 01:48 PM   #11
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,453

Rep: Reputation: 2342
Sorry about typos. My edit meant to refer to a gpu, not a cpu.

The video looks like a broken driver, but I guess not, as it's a kernel issue. I would keep the frame rate down, as that lightens the workload on the GPU. I get the impression something's running out of time or space. I have a RazPi 4 which has poor gpu support. Halving the frame rate improves it x2. Halving the dots per inch improves things x4.
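
The x2 and x4 figures follow from simple pixel arithmetic, sketched below in Python. This is only a rough model (pixels drawn per second), not a measurement of real GPU load; the 1920x1080 at 60 fps baseline and the helper name `pixels_per_second` are assumptions for illustration.

```python
def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Rough fill workload: pixels the GPU must produce each second."""
    return width * height * fps


# Hypothetical baseline: a 1920x1080 panel at 60 fps.
full = pixels_per_second(1920, 1080, 60)

# Halving the frame rate halves the pixels per second.
half_rate = pixels_per_second(1920, 1080, 30)

# Halving the resolution in both directions quarters the pixel count.
half_res = pixels_per_second(960, 540, 60)

print(full / half_rate)  # 2.0
print(full / half_res)   # 4.0
```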

Some (usually manufacturers) count megs & Gigs as even numbers 1,000,000 = 1Meg, and 1000 of those = 1 Gig. Others count them as powers of 1024, which is what is digitally addressed. So 2GB not being 2GB actually is a fact of life.
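
The two counting conventions can be compared with a few lines of Python. The constant names are illustrative only; the numbers are plain arithmetic.

```python
GB = 1000 ** 3   # decimal gigabyte (powers of 1000)
GIB = 1024 ** 3  # binary gibibyte (powers of 1024)
MIB = 1024 ** 2  # binary mebibyte

two_gb = 2 * GB
two_gib = 2 * GIB

# How many bytes larger the binary "2 GB" is than the decimal one.
print(two_gib - two_gb)          # 147483648

# The binary 2 GiB is exactly 2048 MiB.
print(two_gib // MIB)            # 2048

# The decimal 2 GB, expressed in MiB, falls short of 2048.
print(round(two_gb / MIB, 1))    # 1907.3
```

So a "2 GB" figure reported as 2048 megabytes is using the binary convention, while one that shows up as roughly 1907 is using the decimal one.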

Lastly, I'd say that driving standards is actually an extremely complicated business and years of development go into the software. You absolutely should not be seeing that, but I've no sure fire cure for you.
 
Old 09-17-2023, 03:44 PM   #12
lol-
LQ Newbie
 
Registered: Sep 2023
Posts: 9

Original Poster
Rep: Reputation: 0
@business_kid ah, that makes more sense. If I wanted serious GPU performance I'd definitely buy a desktop, but my budget doesn't allow for that right now. I have verified my 2G framebuffer is 2048 megabytes, so I guess the line you copied in your second-to-last message is indeed something going wrong. I am, however, extremely curious as to what sort of corruption would cause polygon rendering errors like that. Looks like math errors in the rendering pipeline to me.
 
Old 09-18-2023, 09:19 AM   #13
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,453

Rep: Reputation: 2342
Quote:
Originally Posted by lol-
Looks like math errors in the rendering pipeline to me.
Could easily be. Have you tried running on the minimum acceptable frame rate?
 
Old 09-19-2023, 07:25 AM   #14
lol-
LQ Newbie
 
Registered: Sep 2023
Posts: 9

Original Poster
Rep: Reputation: 0
The minimum GNOME Settings allows me to select is 60 fps. If my external monitor is plugged in, I have 60 and 75. I have noticed no difference between the two frame rates. If you were talking about game FPS, I haven't tested, but I'm almost sure it'll make little difference. Capped at 80 FPS, my GPU usage is 35% and CPU load is 8%, so definitely not maxed out.
I have noticed a correlation between high system load and bugs of that nature, though (running a VM dramatically increases the probability, especially if I play MC Bedrock Edition in the VM and Java on the host simultaneously).
Glitches of this nature have also occurred without any game running, but with a lower severity (one or two widgets not drawn correctly).
 
Old 09-19-2023, 10:34 AM   #15
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,453

Rep: Reputation: 2342
Looks like stresses all right if you were able to effect some improvement. Sometimes there's a bottleneck slowing things down, like a pcie-4.x bus being held at pcie-3.x. I don't have your specs but if there's anything not new gone into your fairly new laptop, make sure it's up to speed. Even check that access timing for your ram is right. I think this should be a walk in the park for your box, but it's obviously not. I'd run htop in a terminal to see if it throws up any clues.

I've had these kind of faults before and you get an education while sorting them. You'll solve it, or somebody online will. I haven't much chance. The downside is that it's one class you can't drop out of :-P.
 
  

