Determine what freezes a server
Hi all :)
We have an old server that freezes about once a week, forcing us to do a hard restart. So far we have not found the cause. The system runs Debian 9 with kernel 4.9.0-19-amd64. Here is the error from one of the days when we had to reboot the machine:
Code:
root@server:~# journalctl -p err
We wonder whether we should look at something else, as we suppose that disabling will change nothing. To rule out most hardware causes, we moved the disk into another, perfectly working machine of the same model, but we saw no difference. Do you have any advice for determining the exact problem? A method we have not tried? A deeper debugging process? Thank you in advance for your help. With adelphity, lnj |
ACPI errors are mostly harmless. You should concentrate on what's happening immediately before the freeze, not at reboot. What does it write to console? Connect a monitor, if it doesn't have one.
|
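To see what was logged immediately before the freeze rather than at reboot, the journal of the previous boot can be read after the restart. A minimal sketch, assuming systemd-journald keeps a persistent journal (the directory /var/log/journal exists); if the journal is only held in memory, the previous boot is gone after the restart. The function name show_previous_boot_tail and the default of 100 lines are made up for illustration.

```shell
# Sketch: print the last journal lines of the boot that ended in the freeze.
# Assumption: persistent journald storage (/var/log/journal exists).
show_previous_boot_tail() {
    lines="${1:-100}"   # number of trailing lines to show (illustrative default)
    if ! command -v journalctl >/dev/null 2>&1; then
        echo "journalctl not found" >&2
        return 2
    fi
    # -b -1 selects the previous boot, -n limits output to the last N lines.
    journalctl -b -1 -n "$lines" --no-pager
}
```

Running `show_previous_boot_tail 200` as root after the forced reboot shows the last 200 lines written before the machine stopped logging.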
Hi, and thank you very much for your quick reply :) !
Also, the server is physically accessed through a KVM switch connected to a monitor, and when it is in trouble we can neither switch to it nor access any other server. It blocks everything. But as you suggest, we will connect a dedicated monitor to it until the next freeze, so we can photograph what is displayed before forcing the reboot. Is there nothing else to dig into apart from that? System monitoring in addition to the logs? |
Does the server support any services which are still active and responding? It may well be that it is the user interface, whatever it is, which is "not responding." Can you "telnet" or "ssh" into it in command-line mode? (You should prepare to be able to do so ...)
|
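One way to check from another machine on the LAN whether anything on the frozen server still answers is to try opening TCP connections to a few ports. A rough sketch, assuming bash (with /dev/tcp support) and the coreutils `timeout` command on the probing machine; the function names and the 3-second limit are invented for this example. Note that a successful connect only shows that the kernel's network stack still accepts connections, not that the daemon behind the port is healthy.

```shell
# probe_port HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so no extra tools are required.
probe_port() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# check_services HOST PORT...: report each port as open or not answering.
check_services() {
    host="$1"; shift
    for port in "$@"; do
        if probe_port "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port no answer"
        fi
    done
}
```

For example, `check_services server 22 23` probes SSH and telnet on a host named server (a placeholder). If every port is silent and the machine does not answer ping either, the whole system is probably stuck, not just the user interface.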
If your hardware has event logging and error detection, do not forget to check that as well. I have had HP and IBM (and, I believe, Dell) server and enterprise-level equipment with better hardware fault detection than the OS ever had. |
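On servers with a management controller, the hardware event log can often be read from the running OS. A small sketch, assuming the machine has a BMC that speaks IPMI and that the ipmitool package is installed; neither is confirmed in this thread, and vendor-specific tools (HP, Dell, IBM) may expose more detail. The function name is illustrative.

```shell
# Sketch: dump the BMC System Event Log, if ipmitool is available.
# Assumption: the server has an IPMI-capable management controller.
show_hw_event_log() {
    if ! command -v ipmitool >/dev/null 2>&1; then
        echo "ipmitool not installed" >&2
        return 2
    fi
    # 'sel elist' prints the event log with decoded sensor names.
    ipmitool sel elist
}
```

Entries timestamped shortly before a freeze (temperature, voltage, memory errors) would point toward hardware.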
The dedicated monitor is in place now ...
We have not tested telnet yet, since SSH already fails. So we will test it. Is this command sufficient to test what you propose, or do we really need telnet access?
Code:
root@host-in-lan:~# nping -p 23 --tcp tel
|
As for hardware, we concluded that the only plausible suspect was the hard disk. We then tested it with smartctl (SMART) and badblocks, and neither returned an error. This suggests the problem is not hardware related.
Also, as said previously, we already tried moving the disk into a working computer, with the same results.
|
We tried to install netdata, but with no luck so far because of a missing dependency (zlib1g-dev, which gives the error "E: Unable to correct problems, you have held broken packages"). As this server is critical for us, we hesitate to tinker with it further.
At any rate, we will come back with more information the next time the system gets stuck. |
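Until a full monitoring tool can be installed, a few lines of shell can record the system state at short intervals, so the last entry shows what the machine looked like shortly before it froze. This is only a sketch, assuming a Linux /proc filesystem; the function names, the log path, and the interval in the usage line are illustrative choices, not something from this thread.

```shell
# snapshot: print one line with time, load average and available memory.
snapshot() {
    load=$(cut -d' ' -f1-3 /proc/loadavg 2>/dev/null || echo "n/a")
    mem=$(awk '/^MemAvailable:/ {print $2 " kB"}' /proc/meminfo 2>/dev/null)
    echo "$(date '+%F %T') load=${load} mem_available=${mem:-n/a}"
}

# record LOGFILE INTERVAL COUNT: append COUNT snapshots, INTERVAL seconds
# apart, calling sync after each one so the line survives a hard reset.
record() {
    logfile="$1"; interval="$2"; count="$3"
    i=0
    while [ "$i" -lt "$count" ]; do
        snapshot >> "$logfile"
        sync
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Something like `record /var/log/freeze-watch.log 10 8640 &` would log every 10 seconds for about a day. After the next freeze, the tail of that file shows whether load or memory was climbing beforehand.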
There are several factors to look at.
It could still be a faulty hard drive, e.g. with power or temperature problems. There is a slight possibility that the KVM switch, or something else attached to the computer, is causing the hang. The wall power could be at fault: dropouts, brownouts, or spikes. Similar PSUs might exhibit the same symptoms. Laser printers or other devices can cause noise, dropouts, or spikes on the power lines when turned on or in use. It might be time to upgrade the hardware. |
Have you verified the UPS is working?
Is the battery good? |