Slackware: This forum is for the discussion of Slackware Linux.
The issue that the rcu_nocbs option fixes is an idle crash, meaning if the system is left at idle, eventually it will lock up.
Since yours seems tied to system activity (and you're running a 3xxx series when the problem seems to be relegated to just the 1xxx series), I'd imagine there's something else going on with your processor and an RMA might be the best option.
Quote:
Originally Posted by garpu
Yeah, with me and the lock-on-idle issue, if it idled for any length of time, it would hang. If I kept vlc streaming my local NPR station overnight, it wouldn't hang.
Are the mitigations still needed in the 5.10 kernel? I haven't tried without, because if it ain't broke, as the saying goes...
I see, I didn't read all the thread and from what you say it may not be the same issue. I'll see about my options to get a new CPU.
I did manage to compile a kernel with SMT disabled and rcu_nocbs=0-5, but during that time Firefox tabs crashed several times.
Regarding mitigations, my understanding is that they are the kernel's strategies for working around CPU vulnerabilities. So as long as the CPU doesn't change, they have no reason to become obsolete.
This being a desktop machine and not an internet-facing server, I'm not too concerned about mitigating those vulnerabilities, preferring the extra performance instead.
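For anyone who wants to see which mitigations the running kernel reports, here is a minimal sketch in Python. It assumes the kernel exposes the usual sysfs directory /sys/devices/system/cpu/vulnerabilities; the function name read_mitigation_status is just made up for illustration.

```python
from pathlib import Path

# Standard sysfs location on reasonably recent kernels; older kernels may lack it.
VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"


def read_mitigation_status(base=VULN_DIR):
    """Return {vulnerability_name: status_line} as reported by the kernel.

    Returns an empty dict if the directory does not exist.
    """
    root = Path(base)
    if not root.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(root.iterdir())
        if entry.is_file()
    }


if __name__ == "__main__":
    for name, status in read_mitigation_status().items():
        print(f"{name:28} {status}")
```

Each line of output is the kernel's own wording for that vulnerability, so it is a quick way to compare a boot with mitigations against one without.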
EDIT: not sure what to do about it. Could it be a faulty RAM module? I'll have to investigate more...
Yeah, that doesn't sound like the same problem. Have you done a memtest?
Not yet. I still have the previous hardware around (i5-7500 & 32GB RAM + motherboard and all to make it run) so I'll swap the RAM between the 2 machines. This way I'll check if the problems persist on the Ryzen 3600 with the old (known good) RAM and I'll memtest the new RAM in the old machine. But that'll have to wait until I get back home after some days.
Thank you all.
And, sure enough, the memtest didn't even finish, as it complained about finding too many errors... Compiled 5.11-rc1 successfully and it's generally running fine using other RAM.
Time to RMA the new RAM...
My Ryzen 3700X has arrived now and I've been running it for a while in an attempt to see if it was the 1700X that was the problem. I continued to get the lockups.
So the only thing on my system left to change (excluding the case!) was the Corsair PSU. Having changed that to a SilverStone, I haven't seen a lock-up since. I want to leave this a good few weeks to check, but I was previously locking up once a night, and it's been three days now without issue, so I think I've found the culprit.
Interestingly, with the Gigabyte mobo (patched) my lock-ups were every week. With the ASRock mobo they changed to every night (same 1700X CPU). Either the Corsair deteriorated over time, or the motherboards have different patterns of power draw (with the same CPU), or the Gigabyte manages better regulation of a slightly suspect power rail. A different graphics card didn't seem to make any difference to the pattern.
The other thing about this is that I am not a gamer. I switched to a single stick of 16GB RAM (I suspected the RAM initially), and only have a single SSD. I wouldn't expect this arrangement to tax a 500W PSU. Things seemed to be worse with more RAM, and as a result I thought my RAM was faulty and spent a long time swapping the four sticks around (2 x 8GB, 2 x 16GB) to see what made a difference.
My lilo.conf has this:
append="idle=nomwait rcu_nocbs=0-15"
(I have a Ryzen 1700, so 8 cores / 16 threads)
I also had to set my BIOS power idle setting to something non-default. Whatever the "use most power" was. Not auto, but typical, or something. MSI B350 Gaming Plus. I think it only appeared on one of the more recent BIOS settings. I may have also disabled some power states, can't remember.
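As a rough sketch of how that range is derived: rcu_nocbs takes a list of logical CPU numbers starting at 0, so with 16 logical CPUs the range is 0-15. The helper names below (rcu_nocbs_arg, lilo_append_line) are hypothetical, and idle=nomwait is simply carried over from the line above.

```python
import os


def rcu_nocbs_arg(logical_cpus):
    """Build an rcu_nocbs= parameter covering logical CPUs 0..N-1."""
    if logical_cpus < 1:
        raise ValueError("need at least one logical CPU")
    if logical_cpus == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{logical_cpus - 1}"


def lilo_append_line(logical_cpus, extra=("idle=nomwait",)):
    """Format a lilo.conf append= line with the extra parameters first."""
    params = list(extra) + [rcu_nocbs_arg(logical_cpus)]
    return 'append="' + " ".join(params) + '"'


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs (threads), not physical cores.
    count = os.cpu_count() or 1
    print(lilo_append_line(count))
```

Note that the count is threads, not physical cores, so an 8-core CPU with SMT enabled gives 0-15, and the range shrinks if SMT is turned off. After editing lilo.conf, lilo still has to be re-run for the change to take effect.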
I think I also recompiled the kernel (5.4.53) in July, probably with some combination of the rcu/nocbs settings.
It now seems 100% stable. I haven't re-run the GCC 16-core test (I got an RMA after failing this initially in 2019). I now suspect my motherboard / settings / kernel were more to blame.
In any case, I now have 12 cores constantly running FaH with no issues. Have compiled with 16 cores, stressed the system, etc, no issues.
I'm still a little grumpy it took ~2 years before I was able to get to this level, but now I'm reasonably confident / not unhappy with it. 16 cores is overkill, but hey, it's not like I spent that much more compared to like a 4 or 6 core Intel..
The lock-ups are back with the new PSU. So the situation I have now is the following:
PC1:
3200G continues to run flawlessly with an ASRock mobo and a platinum fanless PSU. Just as well, as it's my router and DNS, I suppose. Funny that the cheapest AMD CPU in my house now is the only reliable one.
PC2:
R7 1700 is now in the GIGABYTE AB350 Gaming 3 mobo (patched) and still locking from time to time. I don't know if RAM has anything to do with it, but that's got 16GB. I already tried the rcu option there, made no difference. I couldn't find the option about the power supply in the BIOS on that motherboard.
PC3:
3700X, 32GB RAM, ASRock B450M Pro4, locking quite frequently (3 times per day). I've set the 'use most power' option in BIOS, set the "idle=nomwait rcu_nocbs=0-15" as well. That's got a better new PSU now, and it seemed to make a difference at first, then went back to its old ways.
I've run overnight memory checks on both these systems, and there were no errors. I'm just wondering, should I purchase the pro version of memtest?
I'm getting fed up with this. Two different CPUs, two different motherboards from different manufacturers, two different cases. In a good 10 years of using AMD CPUs and building PCs I've never had anything like this, and I'm a couple of weeks away from never buying AMD again. Sick of this bullshit. Sorry for the rant, but seriously WTF? Has the world just turned to sh*t? Do I have to buy a flipping Apple these days to get something that works?
@slackerDude, you are lucky it's just 2 years. 3 years for me and still not solved!
I'm guessing your issue is not related to the issues I, and many Ryzen users, have had. I've never heard of the issues not being solved with the rcu_nocbs option, and it seemed to only affect the 1st gen Ryzens (the 1x00 series). The only machine that should be affected by the idle lockups is PC2, since it has the Ryzen 1700x, but if the rcu_nocbs option didn't affect it, then it is likely not the same issue many of us saw. PC3's issues are likely something completely different. My 2200G runs without issue and without requiring any kernel parameters, and AFAIK, my brother's 3700X runs fine on his Linux Mint install and I don't believe he's using any kernel parameters.
It may be worth trying different mobos, RAM, PSUs, and CPUs (I know it isn't that easy if you don't have spare components).
When I was running a 1600x as a stop gap before I built my current machine, I was also having lockups. The rcu_nocbs option fixed most of it, but I would still get very rare, occasional lockups. For me it was my 2666 MHz RAM, even though it passed memtest and gave no apparent errors. I had to drop it all the way down to 2133 MHz to get the system stable.
I have a B350-based mobo for the 1700, and I could not run a high load until I did:
- append="idle=nomwait rcu_nocbs=0-15"
- disable C and/or P wait / sleep states (I'd have to reboot to look at them)
- set idle current to normal/high instead of auto
I've only tried 12 CPUs busy and haven't repeated the gcc-compile-kernel stress test.
Maybe your ASRock mobo has the options? Is it feasible to swap CPUs?
3700X is 12 cores, right? Then rcu_nocbs=0-11, correct?
Also, with my kernel, there was some weird thing about disabling the rcu_nocbs command-line option. I think I had to re-enable that option during compilation so that the boot option would actually do something - it was disabled by default, maybe? Or maybe I read that it was, but worked ok, can't remember 100%.
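One way to check this without guessing is to look at the kernel config and the running command line. The sketch below assumes the relevant build option is CONFIG_RCU_NOCB_CPU and that the config is available at /proc/config.gz (which needs CONFIG_IKCONFIG_PROC; otherwise point it at a config file under /boot). The function names are made up for illustration.

```python
import gzip
from pathlib import Path


def nocb_configured(config_text):
    """True if the kernel config text has CONFIG_RCU_NOCB_CPU=y."""
    return any(
        line.strip() == "CONFIG_RCU_NOCB_CPU=y"
        for line in config_text.splitlines()
    )


def nocbs_from_cmdline(cmdline):
    """Return the CPU list given to rcu_nocbs= on the command line, or None."""
    for token in cmdline.split():
        if token.startswith("rcu_nocbs="):
            return token.split("=", 1)[1]
    return None


def load_kernel_config(path="/proc/config.gz"):
    """Read a kernel config file (gzip or plain); None if it is missing."""
    cfg = Path(path)
    if not cfg.exists():
        return None
    if cfg.suffix == ".gz":
        with gzip.open(cfg, "rt") as fh:
            return fh.read()
    return cfg.read_text()


if __name__ == "__main__":
    config = load_kernel_config()
    if config is None:
        print("kernel config not found; try a config file under /boot")
    else:
        print("CONFIG_RCU_NOCB_CPU=y:", nocb_configured(config))
    cmdline = Path("/proc/cmdline").read_text()
    print("rcu_nocbs on command line:", nocbs_from_cmdline(cmdline))
```

If the option is not built in, the boot parameter is most likely ignored, which would match the "had to re-enable it during compilation" memory above.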
@bassmadrigal
That's the option I'm considering now. Pull the PSU out of the only 'working' machine and use it on one of the ones that locks up. The problem with this kind of mix-and-match is that these days it seems nobody can sell anything for any length of time. The 3200G that works isn't available any more. Neither is the fanless PSU. I wonder if it's a coincidence that the PSU in the system that works is the most expensive, and the only platinum one.
I might also try Mint on the 1700 as it's now the media center so a reinstall won't be too painful. I can't try it on the 3700X unfortunately.
When I was having problems with the lock-on-idle, I could run a CPU stress test, it would pass with flying colors, and then I'd surf, and it would lock. Is this what's happening with your box?
Have you tried the zenstates.py script to turn off C6 states? (Willy and I have to do that, otherwise I get a lock-up every couple of months. I haven't gotten one that wasn't related to some sort of video card issue since.) https://github.com/r4m0n/ZenStates-Linux (Willy's got directions further up in the thread to add it so it runs as part of the boot process.)
If you keep VLC running, streaming something in the background, is that enough to keep it from locking? (I'd have VLC streaming my local NPR station, and it would be enough to keep the lock-ups from happening.)
If your RAM is good, and the PSU is OK...have you tested the voltage of the outlet your computer is plugged into? Or the power strip/surge protector?
Also, have you swapped out the cable to your monitor? I'm serious! A bad monitor cable can look a lot like a failing video card or some other lockup problem.
I think it's probably not the video cable. I do get video freezes, where the mouse continues to work, but I can't click on anything. But I also get random shut-downs as well. That's on the 3700X. Usually in the latter case I see some text on the screen, and then the machine reboots, or just becomes unusable from that point. However, I'd say around 50% of the time I'm just dumped out of the KDE session. It's almost as if someone hit Ctrl-Alt-Backspace while I was working.
On the 1700 it's a different pattern. So far on that system I've only seen the screen freezing (but the mouse can still move). However since it's a media centre I don't spend as much time on that system. Perhaps it would exhibit both patterns eventually.