Determine what freezes a server

lenainjaune · 04-16-2024, 09:40 AM

Hi all

We have an old server that freezes roughly once a week and has to be force-restarted. So far we have not found the cause. The system is Debian 9 with kernel 4.9.0-19-amd64.

Here are the errors logged on one of the days we had to reboot the machine:

Code:

root@server:~# journalctl -p err
...
-- Reboot --
avril 11 11:40:55 server.local kernel: ACPI Error: [CAPB] Namespace lookup failure, AE_ALREADY_EXISTS (20160831/dsfield-211)
avril 11 11:40:55 server.local kernel: ACPI Error: Method parse/execution failed [\_SB.PCI0._OSC] (Node ffff957b499bbaa0), AE_ALREADY_EXISTS (20160831/psparse-543)
avril 11 11:40:55 server.local kernel: platform INT0800:00: failed to claim resource 0
avril 11 11:40:55 server.local kernel: acpi INT0800:00: platform device creation failed: -16
avril 11 11:41:15 server.local systemd[1]: Failed to start Network UPS Tools - power device driver controller.
avril 11 11:41:18 server.local ntpd[710]: inappropriate address 127.0.0.1 for the fudge command, line ignored
...

Based on this, we temporarily disabled ACPI at boot (a page we found indicates that ACPI can be problematic with Linux) and the UPS service, to see whether the problem continues.

We wonder whether we should look at something else, as we suspect that disabling these will change nothing.

To rule out most hardware causes, we moved the disk into another perfectly working machine of the same model, but saw no difference.

Do you have any advice for pinpointing the exact problem? A method we have not tried? A deeper debugging process?

Thank you in advance for your help.
With adelphity,
lnj

lvm_ · 04-16-2024, 09:54 AM

ACPI errors are mostly harmless. You should concentrate on what's happening immediately before the freeze, not at reboot. What does it write to console? Connect a monitor, if it doesn't have one.
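
One way to see what was logged just before the freeze, rather than at the following reboot, is to ask the journal for the tail of the previous boot. This is only a sketch: it assumes journald keeps logs across reboots (the "-- Reboot --" marker in the first post suggests it does), and the helper name last_lines_before_freeze is made up for illustration.

```shell
# Print the last lines journald recorded for an earlier boot.
# $1: boot offset (default -1, i.e. the boot that ended with the freeze)
# $2: number of lines to show (default 200)
last_lines_before_freeze() {
    journalctl -b "${1:--1}" -n "${2:-200}" --no-pager
}
```

Calling last_lines_before_freeze right after the forced restart shows the final 200 entries of the frozen boot; journalctl --list-boots lists the boots the journal knows about.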

lenainjaune · 04-16-2024, 10:21 AM

Hi, and thank you very much for the quick reply!

Quote:

Originally Posted by lvm_

ACPI errors are mostly harmless. You should concentrate on what's happening immediately before the freeze, not at reboot. What does it write to console? Connect a monitor, if it doesn't have one.

As the problem is not periodic, we cannot predict when it will occur, so short of staring at the screen for a week there is nothing we can do.

Also, the server is accessed physically through a KVM switch connected to a monitor, and when it freezes we can neither switch to it nor reach any other server. It blocks everything.

But as you suggest, we will connect a dedicated monitor to it until the next freeze, so we can photograph what is displayed before forcing the reboot.

Is there anything else to dig into apart from that? Some system monitoring in addition to the logs?

sundialsvcs · 04-16-2024, 10:31 AM

Does the server support any services which are still active and responding? It may well be that it is the user interface, whatever it is, which is "not responding." Can you "telnet" or "ssh" into it in command-line mode? (You should prepare to be able to do so ...)

TB0ne · 04-16-2024, 10:42 AM

Quote:

Originally Posted by lenainjaune

Hi, and thank you very much for the quick reply!

As the problem is not periodic, we cannot predict when it will occur, so short of staring at the screen for a week there is nothing we can do. Also, the server is accessed physically through a KVM switch connected to a monitor, and when it freezes we can neither switch to it nor reach any other server. It blocks everything. But as you suggest, we will connect a dedicated monitor to it until the next freeze, so we can photograph what is displayed before forcing the reboot.

Is there anything else to dig into apart from that? Some system monitoring in addition to the logs?

You say it's an old server....how old?? What kind of hardware? Could very well be your server is just OLD, and your hardware is getting flaky. And you posted what happens after the server reboots....how about looking at what's in the logs BEFORE it was rebooted??

wpeckham · 04-16-2024, 10:55 AM

Quote:

Originally Posted by TB0ne

You say it's an old server....how old?? What kind of hardware? Could very well be your server is just OLD, and your hardware is getting flaky. And you posted what happens after the server reboots....how about looking at what's in the logs BEFORE it was rebooted??

I second this. IF the cause or evidence of the cause is logged at all, it will be in the minutes or seconds just BEFORE the freeze, so that is where you need to look. You might also monitor activity levels, temperatures, and other physical and operational statistics from a remote node (or monitoring server: see Nagios etc.) with logging so you can check from a non-frozen device what clues might exist.

IF your hardware has event logging and error detection, do not forget to also check that. I have had HP and IBM hardware (and Dell I believe) in the server and enterprise level equipment that had better hardware fault detection than the OS ever had.
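
Short of a full monitoring server, a small heartbeat script on the machine itself can leave a trail to read after the next freeze. The following is a minimal sketch, not a tested tool: the log path, the function names and the thermal_zone0 sensor path are all assumptions, and any value that cannot be read is logged as n/a.

```shell
#!/bin/sh
# Heartbeat sketch: one timestamped line of basic health data per call.
# LOG and the sensor path below are assumptions; adjust them to the machine.
LOG="${LOG:-/var/log/heartbeat.log}"

heartbeat_line() {
    # 1-minute load average: first field of /proc/loadavg
    load=$(cut -d' ' -f1 /proc/loadavg 2>/dev/null)
    # Available memory in kB, as reported by /proc/meminfo
    mem=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)
    # Temperature in millidegrees Celsius, if the kernel exposes this zone
    temp=$(cat /sys/class/thermal/thermal_zone0/temp 2>/dev/null)
    printf '%s load1=%s mem_avail_kB=%s temp_mC=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "${load:-n/a}" "${mem:-n/a}" "${temp:-n/a}"
}

# Append one line and flush it to disk so it has a chance to survive a hard freeze.
write_heartbeat() {
    heartbeat_line >> "$LOG" && sync
}
```

With a call to write_heartbeat added at the end and the script run from cron every minute, the last line in the log gives the approximate time of the freeze and the load, memory and temperature just before it.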

lenainjaune · 04-16-2024, 10:57 AM

The dedicated monitor is in place now ...

Quote:

Originally Posted by sundialsvcs

Does the server support any services which are still active and responding? It may well be that it is the user interface, whatever it is, which is "not responding." Can you "telnet" or "ssh" into it in command-line mode? (You should prepare to be able to do so ...)

We forgot to say that from the LAN, ping and ssh both fail. The machine appears to be shut down.

Not yet tested with telnet as SSH fails. So we will test it.

Is this command sufficient to test what you propose, or do we really need telnet access?

Code:

root@host-in-lan:~# nping -p 23 --tcp tel

The virtual consoles (VTYs) are not accessible either.

wpeckham · 04-16-2024, 11:07 AM

Quote:

Originally Posted by lenainjaune

The dedicated monitor is in place now ...

Good, it may display something useful at freeze.

Quote:

We forgot to say that from the LAN, ping and ssh both fail. The machine appears to be shut down.

That might indicate that the entire node is shut down, or that the services were stopped, or that the NIC is no longer talking. Good to know.

Quote:

Not yet tested with telnet as SSH fails. So we will test it.

...
The virtual consoles (VTYs) are not accessible either.

If you get that result with ssh and ping, I see no additional value that could be provided by telnet.
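
Since ping already fails during the freeze, a ping loop on another LAN host is enough to timestamp the moment the server goes silent, which narrows down where to look in its logs. A minimal sketch, assuming the Linux iputils ping (where -W is a timeout in seconds); the function names and the log path in the comment are made up for illustration.

```shell
# Remote watch sketch, to be run on another host in the LAN.

# Return 0 if the host answers a single ping within 2 seconds.
check_host() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Print one timestamped status line for the given host.
status_line() {
    if check_host "$1"; then state=up; else state=DOWN; fi
    printf '%s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$state"
}

# Example loop (hypothetical log path), checking every 30 seconds:
#   while true; do status_line server.local >> /tmp/server-watch.log; sleep 30; done
```

The first DOWN line in the log marks the freeze to within the polling interval.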

lenainjaune · 04-16-2024, 11:39 AM

Quote:

Originally Posted by TB0ne

You say it's an old server....how old?? What kind of hardware? Could very well be your server is just OLD, and your hardware is getting flaky.

A home PC, a Lenovo ThinkCentre M58; dmidecode reports a BIOS release date of 27/11/11. BUT as we stated before, we swapped the machine for one that was running perfectly ... and it showed the same symptoms: it froze.

So on the hardware side, we deduced that the only remaining suspect was the hard disk. We then checked it with smartctl (SMART) and badblocks, and neither returned any error. This suggests the problem is not hardware-related.

Quote:

Originally Posted by TB0ne

And you posted what happens after the server reboots....how about looking at what's in the logs BEFORE it was rebooted??

Quote:

Originally Posted by wpeckham

I second this. IF the cause or evidence of the cause is logged at all, it will be in the minutes or seconds just BEFORE the freeze, so that is where you need to look.

We did look at what was in the log at the time the machine crashed (11 April in the OP). What is the difference between BEFORE and AFTER the event? It is logged either way, isn't it?

Quote:

Originally Posted by wpeckham

You might also monitor activity levels, temperatures, and other physical and operational statistics from a remote node (or monitoring server: see Nagios etc.) with logging so you can check from a non-frozen device what clues might exist.

No Nagios for our fleet (it would be like using a tank to kill a fly), but maybe we could install a standalone netdata instance. Is there no simpler solution than deploying such an infrastructure?

Also, as said previously: we already tried moving the disk into a working computer, with the same results.

Quote:

Originally Posted by wpeckham

IF your hardware has event logging and error detection, do not forget to also check that. I have had HP and IBM hardware (and Dell I believe) in the server and enterprise level equipment that had better hardware fault detection than the OS ever had.

This is not enterprise hardware but a home PC, so no luck there.

lenainjaune · 04-16-2024, 12:05 PM

We tried to install netdata, but without success for the moment because of a missing dependency (zlib1g-dev, which fails with the error "E: Unable to correct problems, you have held broken packages"). As this server is critical for us, we are reluctant to tinker with it further.

At any rate, we will come back with more information the next time the system freezes.

michaelk · 04-16-2024, 12:18 PM

There are several factors to look at.
It still could be a faulty hard drive i.e. power or temperature problems.
There is a slight possibility that the KVM switch, or something else attached to the computer, is causing the hang.
The wall power, as in dropouts, brownouts or spikes. Similar PSUs might exhibit the same symptoms. Laser printers or other devices might cause noise, dropouts or spikes on the power lines when turned on or in use.

Might be time to upgrade the hardware.

TB0ne · 04-16-2024, 12:52 PM

Quote:

Originally Posted by lenainjaune

A home PC, a Lenovo ThinkCentre M58; dmidecode reports a BIOS release date of 27/11/11. BUT as we stated before, we swapped the machine for one that was running perfectly ... and it showed the same symptoms: it froze. So on the hardware side, we deduced that the only remaining suspect was the hard disk. We then checked it with smartctl (SMART) and badblocks, and neither returned any error. This suggests the problem is not hardware-related.

How, exactly, did you 'swap' things to a new machine, and what kind of machine did you swap TO? And it 100% could be hardware related: if you're just moving the hard drive, IT could be the problem.

Quote:

We did look at what was in the log at the time the machine crashed (11 April in the OP). What is the difference between BEFORE and AFTER the event? It is logged either way, isn't it?

Errors will obviously show up BEFORE the crash...afterwards, it's showing normal boot/warning messages.

Quote:

No Nagios for our fleet (it would be like using a tank to kill a fly), but maybe we could install a standalone netdata instance. Is there no simpler solution than deploying such an infrastructure?

If you have several servers, why is it a bad idea to put in a monitoring solution that can watch whatever you have now and whatever you ADD??

Quote:

Also, as said previously: we already tried moving the disk into a working computer, with the same results. This is not enterprise hardware but a home PC, so no luck there.

You're omitting a good bit:

What kind of hardware you're moving this hard drive to
What services are running on this server
Has anything changed/been modified/added to this server before this problem started?
How many users?
How much storage?
How much memory? (and have you tested THAT as well??)

There are loads of factors that can cause this, but you've not given us any error messages to work with.

lenainjaune · 04-16-2024, 01:03 PM

Quote:

Originally Posted by michaelk

There are several factors to look at.
It still could be a faulty hard drive i.e. power or temperature problems.

Its power supply is protected by a UPS, but in case the hard drive is damaged, we decided to run the smartctl and badblocks checks again.

Quote:

Originally Posted by michaelk

There is a slight possibility that the KVM switch, or something else attached to the computer, is causing the hang.

Oh! We had overlooked this possibility ... But as the machine is now connected directly to a monitor, we will see whether the problem still occurs.

Quote:

Originally Posted by michaelk

The wall power, as in dropouts, brownouts or spikes. Similar PSUs might exhibit the same symptoms. Laser printers or other devices might cause noise, dropouts or spikes on the power lines when turned on or in use.

As said above, the machine is connected to a UPS.

Quote:

Originally Posted by michaelk

Might be time to upgrade the hardware.

Better: we are migrating the service, but until the new one is operational, the old one must stay in place.

michaelk · 04-16-2024, 01:15 PM

Have you verified the UPS is working?
Is the battery good?

lenainjaune · 04-16-2024, 01:25 PM

Quote:

Originally Posted by michaelk

Have you verified the UPS is working?
Is the battery good?

Yes! Several machines are connected to the UPS and only one has a problem. But I do not know whether the problem could be located on one particular outlet of the UPS.