LinuxQuestions.org
Old 02-03-2020, 08:30 AM   #1
asheshambasta
LQ Newbie
 
Registered: Feb 2020
Posts: 21

Rep: Reputation: Disabled
New Threadripper desktop: random crashes/freezes


Hello,


I recently bought a new Threadripper-based system and have been having trouble that may be caused by the machine's hardware. The system crashes with no apparent pattern:
  1. Sometimes the crash comes right after boot, at the lightdm login screen or just after logging in: the display freezes, I cannot switch to a tty, and I cannot ssh into the machine.
  2. Sometimes the crash comes after a wake from sleep: the mouse cursor freezes, but I can ssh into the machine and issue a poweroff command. The ssh connection then closes but the screen stays on, and further attempts to ssh back in fail.
  3. Sometimes, as in 2, the crash comes after a wake from sleep, but the display is corrupted with random blocks of pink colour. The rest is the same: I can ssh in and issue a poweroff command, which has no effect on the display but closes the ssh connection, and I can no longer ssh in afterwards.

The hardware configuration is:
  • Samsung 970 EVO Plus 500GB - Solid state drive
  • ASRock X399M TAICHI
  • Gigabyte Radeon RX 580 GAMING 4GB
  • AMD Ryzen Threadripper 2950X - Processor
  • Corsair Vengeance LPX (32GB)

And I'm currently running:

Code:
Linux quasar-nixos-tr 5.4.6 #1-NixOS SMP Sat Dec 21 10:05:23 UTC 2019 x86_64 GNU/Linux

I've found no reliable way of reproducing this issue and I've not been able to find anything in the logs.

Once the system has booted up, I never experience crashes for the first session after the boot: the crashes of type 2 and 3 only occur once I've suspended the system at least once.


I've also tried multiple kernel versions, none of which have solved this for me.

I'll appreciate any help.

---

Edit: I recently went into the BIOS settings of this machine and found that the RAM's clock speed was set to 2933MHz, while this is 3000MHz RAM.

I've since set that correctly and rebooted the machine. But I'm not sure if that has any bearing on this issue.
 
Old 02-04-2020, 03:12 AM   #2
mrmazda
LQ Guru
 
Registered: Aug 2016
Location: SE USA
Distribution: openSUSE 24/7; Debian, Knoppix, Mageia, Fedora, others
Posts: 5,852
Blog Entries: 1

Rep: Reputation: 2074
Is Corsair Vengeance LPX a recommended RAM model on the ASRock support page for your motherboard?

Is any BIOS update available?

Which DDX is being used?
Code:
inxi -SGxx
will report it (and more). First priority should be amdgpu, fallback possibly to modesetting, but not radeon for a GPU so recent.
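If inxi leaves the driver field empty, the kernel side can be checked separately. The following is a minimal sketch, assuming `lspci` from pciutils is installed; `gpu_kernel_driver` is a hypothetical helper name, not part of any tool. Note that it reports the kernel driver bound to the GPU, which is not the same thing as the Xorg DDX, although both the amdgpu and modesetting DDX sit on top of the amdgpu kernel driver.

```shell
# Print the kernel driver bound to each VGA/3D/display device.
# Reads `lspci -k` style text on stdin, so it also works on saved output.
gpu_kernel_driver() {
    awk '
        # A line starting with a hex digit begins a new PCI device entry.
        /^[0-9a-f]/ { in_gpu = ($0 ~ /VGA compatible controller|3D controller|Display controller/) }
        # Indented detail lines belong to the most recent device entry.
        in_gpu && /Kernel driver in use:/ { sub(/^.*Kernel driver in use: */, ""); print }
    '
}
```

On the affected machine this would be run as `lspci -k | gpu_kernel_driver`. For an RX 580 the hoped-for answer is amdgpu.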
 
Old 02-04-2020, 11:20 AM   #3
greenleaf
Member
 
Registered: Feb 2004
Location: Chester, UK
Distribution: Linux From Scratch. 64 bit. Kernel 5.8.3. Fluxbox.
Posts: 53

Rep: Reputation: 22
When you can use ssh, the processor must still be running, and the RAM must be working. If you do a poweroff, then the daemons will shut down, including sshd, so there will be no chance of talking to the system again without a reboot. I don't know at what stage in your boot sequence the ssh daemon is started, but it may be late enough that ssh won't work in situation number 1, as described in your initial description.

In situation 2, your mouse has frozen, but you can use ssh. So clearly sshd is running. The mouse is very probably controlled by a driver that is part of Xorg.

So I think, at this stage, I would be inclined to check whether the Xorg application (assuming that is what you use for your display) is correctly linked. It might just be that it is trying to link something at run time, and actually the services are not quite what it is expecting, so it crashes. Of course if it does crash, it may not be able to update the Xorg log file, but I would double check that file to see whether there is anything strange. Mine is held at /var/log/Xorg.0.log , but yours may well be somewhere else. So if you have a 'crash', but can log in with ssh, then it would be an idea to have a look at that log file before doing any reboot.

If the log file doesn't provide anything useful, then probably the next step is to start checking the linkage using ldd. If you have any 'not found' messages in the ldd listing then something is definitely wrong on the library side.
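A minimal sketch of that ldd check, assuming the Xorg binary can be found on the PATH (on NixOS the command may be a wrapper script, in which case the real binary has to be located first). `missing_libs` is a hypothetical helper name.

```shell
# Print the names of shared libraries that ldd could not resolve.
# Reads ldd output on stdin; prints nothing if every library was found.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Typical use would be `ldd "$(command -v Xorg)" | missing_libs`. Empty output means the dynamic linker resolved everything, so the fault most likely lies elsewhere.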

Last edited by greenleaf; 02-04-2020 at 11:30 AM. Reason: spelling
 
Old 02-05-2020, 02:01 AM   #4
asheshambasta
LQ Newbie
 
Registered: Feb 2020
Posts: 21

Original Poster
Rep: Reputation: Disabled
@mrmazda

Quote:
Is Corsair Vengeance LPX a recommended RAM model on the ASRock support page for your motherboard?
I'm not sure, this webpage doesn't mention a model, but the RAM specs seem to be compatible.

Quote:
Is any BIOS update available?
I downloaded the latest firmware from ASRock's website and copied it to a flash drive as instructed, but when I went into the BIOS settings to upgrade, the BIOS detected the USB drive but found no upgrade on it that it could apply.
According to their instructions, the upgrade utility only lists upgrades compatible with the motherboard, and in my case it finds none.
I'll retry that and report back.
Edit: My motherboard is running a BIOS version that is behind the current one, but after following the instructions and trying numerous times, the upgrade utility still does not find the upgrade loaded on the USB stick.
The upgrade page downloads a zip archive; I've tried with just the archive as well as with the contents extracted, all to no avail.

There's also a case 4 I forgot to mention that just happened:
4. Wake from sleep and the system undergoes a hard freeze: ssh etc. do not work, and the only option is to power the system down via the power switch.

inxi output:

Code:
System:    Host: quasar-nixos-tr Kernel: 5.4.6 x86_64 bits: 64 compiler: gcc v: 8.3.0 Console: N/A dm: LightDM 
           Distro: NixOS 19.09.1685.e9ef090eb54 (Loris) 
Graphics:  Message: No Device data found. 
           Display: server: X.Org 1.20.5 driver: resolution: 3840x2160~60Hz, 3840x2160~60Hz 
           OpenGL: renderer: Radeon RX 580 Series (POLARIS10 DRM 3.35.0 5.4.6 LLVM 7.1.0) v: 4.5 Mesa 19.1.5 
           direct render: Yes

Last edited by asheshambasta; 02-05-2020 at 02:37 AM.
 
Old 02-07-2020, 09:09 AM   #5
greenleaf
Member
 
Registered: Feb 2004
Location: Chester, UK
Distribution: Linux From Scratch. 64 bit. Kernel 5.8.3. Fluxbox.
Posts: 53

Rep: Reputation: 22
Bios caution

Case 4 is interesting, and does make it sound more like a hardware fault.

Suppose that there is a hardware fault. Suppose also that this hardware fault manifests itself whilst attempting to upgrade the BIOS. Could this result in a corrupted BIOS, and, consequently an unbootable system? Some systems have a backup BIOS, and if the upgrade fails then the backup can be used to rescue the system. In any case, it is probably better to plan for the possibility of failure where such an upgrade is concerned. If a failure does occur, there is just the possibility that the motherboard will be no longer bootable.

If there is a memory fault, then it might be possible to show this by running a dedicated memory tester. However even then it is possible to get what looks like memory errors when in fact there is a processor problem. Some years ago I was caught out by an AMD processor that was altered to look faster than it was. The system running with that processor would occasionally fail the GoldMemory test. When I eventually backed off the speed of the processor via the system configuration utility, the errors stopped and I had a reliable machine.

However it might be an idea to run a memory test utility and just leave it going, to see whether there are any errors popping up. If there are, then there is almost certainly a hardware problem. It might be memory, it might be the processor. At that point the next step is to start swapping components to find what works and what doesn't.
 
Old 02-07-2020, 11:02 AM   #6
asheshambasta
LQ Newbie
 
Registered: Feb 2020
Posts: 21

Original Poster
Rep: Reputation: Disabled
Hi greenleaf, thanks for replying.
I have, since I last posted here, managed to upgrade the BIOS. I've been running the system since the upgrade and I've put the system to sleep twice since then, there have been no issues so far.
But like I said, this issue manifests itself only after a number of sleep-wake cycles, so it's hard to tell if the BIOS upgrade fixed the issue.
It did seem like my BIOS version was quite far behind: the version I was running was 3.50 and the version available on the ASRock website was 3.80, to which I upgraded.

I will report back with more details.

The system is still going to be under warranty until May this year so I hope to locate the hardware fault, if any, before that.
 
Old 02-07-2020, 01:30 PM   #7
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: openSUSE, Raspbian, Slackware. Previous: MacOS, Red Hat, Coherent, Consensys SVR4.2, Tru64, Solaris
Posts: 2,808

Rep: Reputation: 550
Quote:
Originally Posted by mrmazda
Is any BIOS update available?

... First priority should be amdgpu, fallback possibly to modesetting, but not radeon for a GPU so recent.
Data point: I had to go through a BIOS update when I put together a Ryzen-based motherboard with Vega graphics before the video worked reliably (even without Xorg running). In this case the m'board came out only a few months before the CPU but the BIOS wasn't up to snuff. I would not be surprised if the BIOS is not current enough.
 
Old 02-08-2020, 04:34 AM   #8
greenleaf
Member
 
Registered: Feb 2004
Location: Chester, UK
Distribution: Linux From Scratch. 64 bit. Kernel 5.8.3. Fluxbox.
Posts: 53

Rep: Reputation: 22
BIOS and RAM

Hi, asheshambasta - I had a look at your motherboard on the ASRock web site. It's good to see the Flashback button. Also I looked at your link for the ASRock BIOS Upgrade Instruction. It seems as if you have got past one of the most delicate hurdles. Of course there are always other things that can go amiss, but that one deserves special care.

You might find the memory testing utilities interesting. Some of them will give the RAM a very thorough workout. If the machine gets through those then your confidence in its reliability may improve. It might be worth looking at Memtest86+, which can be found at http://www.memtest.org/
 
Old 02-08-2020, 11:12 AM   #9
mrmazda
LQ Guru
 
Registered: Aug 2016
Location: SE USA
Distribution: openSUSE 24/7; Debian, Knoppix, Mageia, Fedora, others
Posts: 5,852
Blog Entries: 1

Rep: Reputation: 2074
IMO, memtest86+ has been inferior to memtest86 since the advent of DDR3 or DDR4. Memtest86+ is FOSS provided by all the Linux distros I use, and probably by all of them. Memtest86, which I find does a better job, has both free and non-free editions. I am actually running the free 7.4 version as I write this, with a week-old Ryzen 3200G, on a second pair of sticks, after one of the first pair of Corsair Vengeance RGB Pro sticks failed miserably.

I do recommend testing RAM first thing after discovering any crashing. In this case, both Firefox and VLC were repeatedly crashing, but I also noted kernel errors in dmesg before the first crash.
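As a rough sketch of how to look for such kernel errors, the filter below picks likely hardware-related lines out of a kernel log. The pattern list is an assumption chosen for illustration, not an exhaustive or authoritative set, and `suspicious_kernel_lines` is a hypothetical helper name.

```shell
# Keep only kernel log lines that hint at hardware, GPU or panic trouble.
# Reads log text on stdin; the pattern list is illustrative, not exhaustive.
suspicious_kernel_lines() {
    grep -Ei 'machine check|hardware error|mce:|amdgpu.*(error|timeout|failed)|kernel panic|oops|call trace'
}
```

After a freeze and reboot, `journalctl -k -b -1 | suspicious_kernel_lines` would scan the previous boot's kernel messages, provided the journal is stored persistently. While the machine is still reachable over ssh, `dmesg | suspicious_kernel_lines` does the same for the running kernel.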
 
Old 03-26-2020, 02:40 AM   #10
asheshambasta
LQ Newbie
 
Registered: Feb 2020
Posts: 21

Original Poster
Rep: Reputation: Disabled
I finally managed to upgrade the BIOS to a more recent version, and the crashes stopped for a while.
Two days ago (after a few reboots since the fix), the system started randomly freezing again, in similar patterns.
During one particular reboot, I also noticed a kernel panic: https://imgur.com/a/Ra2l6oV
But I'm not sure if it's related.

So far, I think this issue stands unresolved for me.

---

PS: the problem with the BIOS upgrade was that I was trying to upgrade with the X399 instead of the X399M firmware. That was quite a realisation.
 
Old 03-28-2020, 09:36 AM   #11
asheshambasta
LQ Newbie
 
Registered: Feb 2020
Posts: 21

Original Poster
Rep: Reputation: Disabled
In fact, it seems to me that these crashes do not even have the suspend pattern I had earlier thought. Since yesterday, I've had a crash that occurred hours after a resume. Again, looking through the logs revealed nothing.
It seems to me that for now and for the foreseeable future, I need to give up on suspend and resume.
I'd appreciate any hints though.
 
Old 03-28-2020, 09:43 AM   #12
asheshambasta
LQ Newbie
 
Registered: Feb 2020
Posts: 21

Original Poster
Rep: Reputation: Disabled
memtest also reports ok:
Quote:
Loop 1/5:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok

Loop 2/5:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok

Loop 3/5:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok

Loop 4/5:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok

Loop 5/5:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok

Done.
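The output above resembles that of the user-space memtester tool. Assuming that is the case, and assuming failed tests are reported on lines containing the word FAILURE, a long run can be condensed with a small filter like this sketch (`memtester_summary` is a hypothetical helper name):

```shell
# Summarise a memtester-style log: echo failure lines, then print totals.
# Reads the log on stdin. Assumes passing tests end in "ok" and failing
# ones are reported on lines containing "FAILURE".
memtester_summary() {
    awk '
        /ok[ \t\r]*$/ { ok++ }
        /FAILURE/     { bad++; print }
        END { printf "passed=%d failed=%d\n", ok, bad }
    '
}
```

For example, `memtester_summary < memtester.log`, where memtester.log is a hypothetical file holding the saved output. A user-space tester can only exercise the memory it manages to allocate, so a clean result here does not rule out a fault that a boot-time tester such as Memtest86 might still find.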
 
Old 04-01-2020, 12:17 AM   #13
andrew.46
Senior Member
 
Registered: Oct 2007
Distribution: Slackware
Posts: 1,365

Rep: Reputation: 493
You have not mentioned cooling of your 2950X? I have mentioned on another thread on these forums that you and I are running the same CPU; my own is cooled by a 360 AIO. I am just wondering if at least some of your issues could be caused by overheating...
 
Old 04-01-2020, 02:18 AM   #14
asheshambasta
LQ Newbie
 
Registered: Feb 2020
Posts: 21

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by andrew.46
You have not mentioned cooling of your 2950X? I have mentioned on another thread on these forums that you and I are running the same CPU; my own is cooled by a 360 AIO. I am just wondering if at least some of your issues could be caused by overheating...
Hi Andrew, thanks for replying. I've posted more details about my cooling on the other thread. The cooling of this system seems acceptable, and my lockups are not only during peak workloads but at random. There seems to be no pattern.
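To put numbers on the cooling question, temperatures can be read from the kernel's hwmon interface, which exposes them in millidegrees Celsius. This is a minimal sketch, assuming the sensor drivers are loaded (on this CPU that would probably be k10temp, which is an assumption); `print_temps` is a hypothetical helper name.

```shell
# Print every hwmon temperature reading as "<chip> <sensor> <degrees>C".
# The optional argument overrides the sysfs root, which eases testing.
print_temps() {
    root=${1:-/sys/class/hwmon}
    for f in "$root"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue            # skip an unmatched glob
        dir=${f%/*}
        name=$(cat "$dir/name" 2>/dev/null || echo unknown)
        milli=$(cat "$f")                  # value is in millidegrees Celsius
        echo "$name ${f##*/} $(( milli / 1000 ))C"
    done
}
```

Left running in an ssh session, for example `while sleep 5; do date; print_temps; done`, this would show whether temperatures climb just before a freeze.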
 
  


All times are GMT -5.
