Machine randomly shuts off/locks up - how to track down?
One of ten machines seems to be locking up an awful lot. The others are perfectly stable, and, in fact, 5 of the others are exact clones of this one machine (granted, they were cloned almost a year ago, but all the software is the same). The only difference between this one and the five clones is that my router forwards ssh requests to this machine.
How can I track down what's happening? Are there logs I should look at, and if so, which ones?
Other suggestions on how to track down the problem?
FYI: This machine is running RH7.3. It has an Intel 2.8 GHz (with HT) proc running the 2.4.20-24.7smp kernel on an MSI mobo with the Intel 865 chipset. It has 1 gig of Corsair RAM. Need any other info?
It is part of a render farm, so it was more than likely running a rendering program when it crashed/locked/froze/shut down (?) - the same program that all the other machines run most of the day as well.
Shutdowns and lockups are always caused by a problem at the kernel level, which means that it could be a faulty device driver or a faulty piece of hardware.
If it's the only one running SSH, then it could also be that the problem exists on all the machines but is only being triggered by some combination of kernel accesses used by SSH and not by the other processors. But this is probably unlikely.
The first thing you can do is to test the RAM. One of the best ways to do this is to download the gcc source-code, and compile it locally on that machine. If the compile crashes part way through with a segmentation fault, but re-issuing the make command causes it not to crash at the same point, then you have some bad RAM. (This test works well because it dereferences a large number of pointers, and pointers to pointers, and takes some time, so it tends to hit any transient faults if there are any).
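If you want to automate the repeated-compile check, here is a minimal Python sketch. It is only an illustration: `run_build`, `classify`, and the `gcc-build` directory are made-up names, and treating the last line of build output as the "failure point" is an assumption rather than anything specific to gcc or make.

```python
import subprocess


def run_build(command, cwd="."):
    """Run one build and return (exit_code, last_line_of_output)."""
    proc = subprocess.run(
        command, cwd=cwd, shell=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = proc.stdout.strip().splitlines()
    return proc.returncode, (lines[-1] if lines else "")


def classify(results):
    """Interpret a list of (exit_code, last_line) pairs from repeated builds."""
    failures = [r for r in results if r[0] != 0]
    if not failures:
        return "no failures seen"
    # Every run failed in exactly the same way: probably not a transient fault.
    if len(failures) == len(results) and len(set(failures)) == 1:
        return "reproducible failure (more likely a software or build problem)"
    # Failures that come and go, or move around, point at hardware.
    return "intermittent failure (consistent with flaky RAM or other hardware)"


# Hypothetical usage, assuming the gcc tree is configured in ./gcc-build:
#   results = [run_build("make", cwd="gcc-build") for _ in range(3)]
#   print(classify(results))
```

A clean result here does not prove the RAM is good. It only means this particular workload did not trip over a fault.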
Another thing you could try would be to copy the vmlinux/vmlinuz file from a cloned machine that works and re-install the bootloader (usually lilo or grub), in case the kernel image has been corrupted (e.g. a bit flipped) when it was copied to the MBR.
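Before overwriting anything, it may be worth checking whether the kernel image really differs from the one on a working clone. A rough Python sketch follows; the function names and the example paths in the comments are assumptions, so adjust them to wherever your kernel image actually lives.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_image(path_a, path_b):
    """True if both files have identical contents (by SHA-256)."""
    return file_sha256(path_a) == file_sha256(path_b)


# Hypothetical usage, after copying the clone's image to /tmp:
#   same_image("/boot/vmlinuz-2.4.20-24.7smp", "/tmp/vmlinuz-from-clone")
```

If the two digests match, the file on disk is not the problem and replacing it will not help.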
If it is a hardware issue, try removing any hardware that isn't needed for the machine to work, including keyboards, mice and graphics cards and seeing if that fixes the problem.
If nothing else works, you might also try updating to the latest stable kernel you can find, and make sure that you have all the appropriate bug-fix/support options turned on when you compile it. If you're lucky, it may work around the bug for you.
Quote:
Originally posted by rjlee
Shutdowns and lockups are always caused by a problem at the kernel level, which means that it could be a faulty device driver or a faulty piece of hardware.
If it's the only one running SSH, then it could also be that the problem exists on all the machines but is only being triggered by some combination of kernel accesses used by SSH and not by the other processors. But this is probably unlikely.
Agreed. All the other machines make pretty extensive use of ssh. The only difference is that this one happens to get requests from the outside world because my router allows it.
Quote:
The first thing you can do is to test the RAM. One of the best ways to do this is to download the gcc source-code, and compile it locally on that machine. If the compile crashes part way through with a segmentation fault, but re-issuing the make command causes it not to crash at the same point, then you have some bad RAM. (This test works well because it dereferences a large number of pointers, and pointers to pointers, and takes some time, so it tends to hit any transient faults if there are any).
Interesting test. I did go ahead and compile the new gcc, and it went all the way through. For sh!ts and giggles, I've done a make clean and am in the process of a remake.
Quote:
Another thing you could try would be to copy the vmlinux/vmlinuz file from a cloned machine that works and re-install the bootloader (usually lilo or grub), in case the kernel image has been corrupted (e.g. a bit flipped) when it was copied to the MBR.
I'll give this one a try, however, I think this is also unlikely. This is the machine from which I cloned the others, not the other way around... though, I suppose there may have been some file corruption at some point in time. That said, there is no one particular event that seemed to trigger this problem - this machine used to run fine.
Quote:
If it is a hardware issue, try removing any hardware that isn't needed for the machine to work, including keyboards, mice and graphics cards and seeing if that fixes the problem.
This, I have done. There's nothing on the machine other than a graphics card. If there's a way to boot a machine without a graphics card, I'd be happy to do it, but I don't know how. It won't POST without having some sort of graphics card hooked up. Maybe it's the graphics card? I'll scrounge around for a new one.
Quote:
If nothing else works, you might also try updating to the latest stable kernel you can find, and make sure that you have all the appropriate bug-fix/support options turned on when you compile it. If you're lucky, it may work around the bug for you.
Another good suggestion. If all else fails, I'll go this route.
Thanks for the info - it has been helpful.
If anyone else has any other ideas, feel free to chime in.
FYI: It appears that the machine is overheating. I got sensors working on one of the machines that was locking up and started logging the CPU (Intel P4 2.6 GHz) temperature. After about 9 hours at 71-72 degrees C, it locked up.
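For anyone wanting to do the same kind of temperature logging, here is a minimal Python sketch. It assumes the `sensors` command prints lines shaped like `CPU Temp: +71.5°C (...)`. Labels and layout vary by chip and lm_sensors version, so treat the pattern, the function names, and the 70 degree warning threshold as assumptions to adapt.

```python
import os
import re
import subprocess
import time

# Matches lines such as "CPU Temp:   +71.5°C  (high = +80.0°C)".
# Voltage ("V") and fan ("RPM") lines do not match.
TEMP_PATTERN = re.compile(
    r"^([^:\n]+):\s*([+-]?\d+(?:\.\d+)?)\s*\u00b0?C", re.MULTILINE
)


def parse_temperatures(sensors_output):
    """Return {label: degrees_celsius} for every temperature line found."""
    return {
        match.group(1).strip(): float(match.group(2))
        for match in TEMP_PATTERN.finditer(sensors_output)
    }


def format_log_line(timestamp, temps, warn_at=70.0):
    """Build one log line, flagged with WARN if any reading is at or above warn_at."""
    readings = " ".join(
        "{}={:.1f}C".format(name, value) for name, value in sorted(temps.items())
    )
    flag = " WARN" if any(value >= warn_at for value in temps.values()) else ""
    return "{} {}{}".format(timestamp, readings, flag)


def log_temperatures(log_path, interval_seconds=60):
    """Append one reading per interval; runs until interrupted or the box dies."""
    while True:
        output = subprocess.check_output(["sensors"], universal_newlines=True)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        line = format_log_line(stamp, parse_temperatures(output))
        with open(log_path, "a") as log:
            log.write(line + "\n")
            log.flush()
            # Force the line to disk so the last reading survives a lockup.
            os.fsync(log.fileno())
        time.sleep(interval_seconds)
```

The fsync after every write is deliberate: if the machine freezes, the last line in the log shows roughly how hot it was just before it went down.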