Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
We have an old server that freezes approximately once a week, and each time we have to force a restart. So far we have not found the cause. The system runs Debian 9 with kernel 4.9.0-19-amd64.
Here are the errors logged on one of the days when we had to reboot the machine:
Code:
root@server:~# journalctl -p err
...
-- Reboot --
avril 11 11:40:55 server.local kernel: ACPI Error: [CAPB] Namespace lookup failure, AE_ALREADY_EXISTS (20160831/dsfield-211)
avril 11 11:40:55 server.local kernel: ACPI Error: Method parse/execution failed [\_SB.PCI0._OSC] (Node ffff957b499bbaa0), AE_ALREADY_EXISTS (20160831/psparse-543)
avril 11 11:40:55 server.local kernel: platform INT0800:00: failed to claim resource 0
avril 11 11:40:55 server.local kernel: acpi INT0800:00: platform device creation failed: -16
avril 11 11:41:15 server.local systemd[1]: Failed to start Network UPS Tools - power device driver controller.
avril 11 11:41:18 server.local ntpd[710]: inappropriate address 127.0.0.1 for the fudge command, line ignored
...
Based on this, we temporarily disabled ACPI at boot (it is reported here that ACPI can be problematic with Linux) and the UPS, to see whether the trouble continues.
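When a boot parameter is changed for a test like this, it helps to confirm that the running kernel actually received it. The sketch below assumes ACPI was disabled with the `acpi=off` kernel parameter (the post does not say which method was used) and that the parameter was added through GRUB. The helper name `cmdline_has` is made up for this example.

```shell
#!/bin/sh
# Report whether a given parameter is present on a kernel command line.
# $1 = parameter to look for, $2 = file to read (default: /proc/cmdline).
cmdline_has() {
    param="$1"
    file="${2:-/proc/cmdline}"
    # Put each word on its own line, then require an exact fixed-string
    # match so that "acpi=off" is not confused with "pci=noacpi".
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

After a reboot, `cmdline_has acpi=off && echo "ACPI disabled for this boot"` shows whether the setting took effect. If the next freeze happens with the parameter confirmed active, ACPI can reasonably be ruled out.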
We wonder whether we should look at something else, as we suspect that disabling these will change nothing.
To rule out most hardware causes, we moved the disk into another, perfectly working machine of the same model, but we saw no difference.
Do you have any advice for determining the exact problem? A method we have not tried? A deeper debugging process?
Thank you in advance for your help.
With adelphity,
lnj
Last edited by lenainjaune; 04-16-2024 at 10:01 AM.
ACPI errors are mostly harmless. You should concentrate on what's happening immediately before the freeze, not at reboot. What does it write to console? Connect a monitor, if it doesn't have one.
As the problem does not occur on a regular schedule, we cannot predict when it will strike, so short of staring at the screen for a week there is nothing we can do.
Also, the server is physically accessed through a KVM switch connected to a monitor, and when it freezes we can neither switch to it nor access any other server. It blocks everything.
But as you suggest, we will connect a dedicated monitor to it until the next freeze, so that we can photograph what is displayed before forcing the reboot.
Is there nothing else to dig into apart from that? Some system monitoring in addition to the logs?
Does the server support any services which are still active and responding? It may well be that it is the user interface, whatever it is, which is "not responding." Can you "telnet" or "ssh" into it in command-line mode? (You should prepare to be able to do so ...)
Hi, and thank you very much for your quick responses!
You say it's an old server....how old?? What kind of hardware? Could very well be your server is just OLD, and your hardware is getting flaky. And you posted what happens after the server reboots....how about looking at what's in the logs BEFORE it was rebooted??
I second this. IF the cause or evidence of the cause is logged at all, it will be in the minutes or seconds just BEFORE the freeze, so that is where you need to look. You might also monitor activity levels, temperatures, and other physical and operational statistics from a remote node (or monitoring server: see Nagios etc.) with logging so you can check from a non-frozen device what clues might exist.
IF your hardware has event logging and error detection, do not forget to also check that. I have had HP and IBM hardware (and Dell I believe) in the server and enterprise level equipment that had better hardware fault detection than the OS ever had.
Quote:
Does the server support any services which are still active and responding? It may well be that it is the user interface, whatever it is, which is "not responding." Can you "telnet" or "ssh" into it in command-line mode? (You should prepare to be able to do so ...)
We forgot to mention that from the LAN, both ping and ssh fail. The machine appears to be shut down.
We have not yet tested with telnet, since SSH fails. We will test it.
Is this command sufficient to test what you propose, or do we really need telnet access?
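One way to tell a fully dead machine from a hung service is to probe it from another host on the LAN at two levels: ICMP (answered by the kernel) and a TCP connection to the SSH port (answered by sshd). The sketch below is only an illustration: the function names are made up, and it assumes the `ping` from iputils and an `nc` binary are installed on the probing machine. A telnet client pointed at port 22 would give the same information as the TCP probe here.

```shell
#!/bin/sh
# Probe a host from another machine and classify what still answers.

# Returns 0 if the host answers one ICMP echo request within 2 seconds.
ping_ok() { ping -c 1 -W 2 "$1" >/dev/null 2>&1; }

# Returns 0 if a TCP connection to host:port succeeds within 3 seconds.
tcp_ok() { nc -z -w 3 "$1" "$2" >/dev/null 2>&1; }

# Turn the two probe results (0 = success) into a diagnosis string.
classify() {
    case "$1:$2" in
        0:0) echo "up: network and sshd respond" ;;
        0:*) echo "kernel alive, sshd silent: userland or service hang" ;;
        *:0) echo "ICMP lost or filtered, but sshd answers" ;;
        *)   echo "no response: kernel hang, power loss or network fault" ;;
    esac
}

# Probe one host (port defaults to 22) and print a timestamped verdict.
probe() {
    p=0; ping_ok "$1" || p=$?
    t=0; tcp_ok "$1" "${2:-22}" || t=$?
    printf '%s %s %s\n' "$(date '+%F %T')" "$1" "$(classify "$p" "$t")"
}
```

Running `probe server.local` in a loop on a second machine, with the output appended to a file, would also record roughly when the server stopped answering. Since both ping and ssh already fail during a freeze, the last case applies here, which points at the kernel, the hardware or the power rather than at a single service.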
Quote:
Originally Posted by TB0ne
You say it's an old server....how old?? What kind of hardware? Could very well be your server is just OLD, and your hardware is getting flaky.
A home PC, a Lenovo ThinkCentre M58; dmidecode indicates that the BIOS was released on 27/11/11. BUT, as we stated before, we tried swapping in another machine that had been running perfectly... and it showed the same symptoms: it froze.
So on the hardware side, we deduced that the only remaining suspect was the hard disk. We then tested it with smartctl (SMART) and badblocks, and neither returned an error. This suggests the problem is not hardware related.
Quote:
Originally Posted by TB0ne
And you posted what happens after the server reboots....how about looking at what's in the logs BEFORE it was rebooted??
Quote:
Originally Posted by wpeckham
I second this. IF the cause or evidence of the cause is logged at all, it will be in the minutes or seconds just BEFORE the freeze, so that is where you need to look.
We did look at what is in the log at the time the machine crashed (11 April, in the OP). What is the difference between BEFORE and AFTER the event? It is logged either way, isn't it?
Quote:
Originally Posted by wpeckham
You might also monitor activity levels, temperatures, and other physical and operational statistics from a remote node (or monitoring server: see Nagios etc.) with logging so you can check from a non-frozen device what clues might exist.
No Nagios for our small fleet (it would be like using a tank to kill a fly), but maybe we could consider installing netdata as a standalone monitor. Is there no simpler solution than deploying such an infrastructure?
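A much lighter option than a monitoring stack is a small script that appends one line of health data to a file every minute, so that the last lines written before a freeze can be read after the reboot. This is a rough sketch under assumptions: the function name and log format are invented for the example, and the thermal zone files under `/sys/class/thermal` may simply not exist on this hardware, in which case the temperature field stays empty.

```shell
#!/bin/sh
# Print one line of basic health data: load average, available memory
# and any temperatures exposed as thermal zones.
# PROC and THERMAL can be overridden; the defaults are the real paths.
PROC="${PROC:-/proc}"
THERMAL="${THERMAL:-/sys/class/thermal}"

snapshot() {
    # First three fields of loadavg: 1, 5 and 15 minute load averages.
    load=$(cut -d' ' -f1-3 "$PROC/loadavg" | tr ' ' ',')
    mem=$(awk '/^MemAvailable:/ { print $2 }' "$PROC/meminfo")
    temps=""
    for f in "$THERMAL"/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        t=$(cat "$f")                    # millidegrees Celsius
        temps="$temps $((t / 1000))C"
    done
    printf '%s load=%s mem_avail_kB=%s temp=%s\n' \
        "$(date '+%F %T')" "$load" "$mem" "${temps# }"
}
```

Called from cron once a minute as `snapshot >> /var/log/health.log; sync`, this leaves a trail showing whether load, memory or temperature was climbing before the machine stopped. The log path is a placeholder, and the `sync` is there because a hard freeze can lose data that was still in the write cache.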
Also, as said previously: we already tried moving the disk into a working computer, with the same results.
Quote:
Originally Posted by wpeckham
IF your hardware has event logging and error detection, do not forget to also check that. I have had HP and IBM hardware (and Dell I believe) in the server and enterprise level equipment that had better hardware fault detection than the OS ever had.
Not enterprise hardware but a home PC, so no luck there.
We tried to install netdata, but without success so far because of a missing dependency (zlib1g-dev, which gives the error "E: Unable to correct problems, you have held broken packages"). As this server is critical for us, we are hesitant to tinker with it any further.
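The "held broken packages" message can be investigated without changing anything on the server. The sketch below only reads package state. It assumes the usual cause for a `-dev` package, namely that the installed library version differs from the version the configured repositories offer for the development package, which has not been confirmed for this machine. The function names are invented for the example.

```shell
#!/bin/sh
# Read-only checks for "you have held broken packages".
# Nothing here installs or removes a package.

# Extract the "Installed" or "Candidate" version from the output of
# `apt-cache policy PKG` supplied on stdin.
policy_field() {
    awk -v key="$1:" '$1 == key { print $2; exit }'
}

diagnose_zlib() {
    apt-mark showhold                  # packages explicitly put on hold
    inst=$(apt-cache policy zlib1g | policy_field Installed)
    cand=$(apt-cache policy zlib1g-dev | policy_field Candidate)
    echo "zlib1g installed:     $inst"
    echo "zlib1g-dev candidate: $cand"
    [ "$inst" = "$cand" ] || echo "version mismatch: review the APT sources"
    apt-get -s install zlib1g-dev      # -s simulates, nothing is changed
}
```

If `diagnose_zlib` reports a mismatch, the repositories listed in the APT sources probably no longer match what is installed, which would not be surprising for a release as old as Debian 9.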
At any rate, we will come back the next time the system gets stuck to give you more information.
There are several factors to look at.
It still could be a faulty hard drive i.e. power or temperature problems.
There is a slight possibility that the KVM switch, or something else attached to the computer, is causing the hang.
The wall power could be at fault: dropouts, brownouts or spikes. Similar PSUs might exhibit the same symptoms. Laser printers or other devices might cause noise, dropouts or spikes on the power lines when turned on or in use.
Quote:
A home PC, a Lenovo ThinkCentre M58; dmidecode indicates that the BIOS was released on 27/11/11. BUT, as we stated before, we tried swapping in another machine that had been running perfectly... and it showed the same symptoms: it froze. So on the hardware side, we deduced that the only remaining suspect was the hard disk. We then tested it with smartctl (SMART) and badblocks, and neither returned an error. This suggests the problem is not hardware related.
How, exactly, did you 'swap' things to a new machine, and what kind of machine did you swap TO? And it 100% could be hardware related, since (if you're just moving the hard drive) it may be that IT is the problem.
Quote:
We did look at what is in the log at the time the machine crashed (11 April, in the OP). What is the difference between BEFORE and AFTER the event? It is logged either way, isn't it?
Errors will obviously show up BEFORE the crash...afterwards, it's showing normal boot/warning messages.
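To make the BEFORE/AFTER distinction concrete: in the journalctl output from the first post, everything below a `-- Reboot --` marker belongs to the new boot, and the lines just above it are the last ones written before the machine went down. The filter below is a small sketch that prints only those preceding lines. The function name is made up, and it assumes the marker text is exactly `-- Reboot --` as shown in the original output.

```shell
#!/bin/sh
# Print the N lines that precede each "-- Reboot --" marker in journalctl
# output read from stdin (default N = 20), followed by the marker itself.
lines_before_reboot() {
    awk -v n="${1:-20}" '
        /^-- Reboot --/ {
            start = NR - n
            if (start < 1) start = 1
            # Replay the ring buffer, oldest line first.
            for (i = start; i < NR; i++) print buf[i % n]
            print
        }
        # Keep the last n lines in a ring buffer indexed by line number.
        { buf[NR % n] = $0 }
    '
}
```

For example, `journalctl --no-pager | lines_before_reboot 30` shows all priorities rather than only `-p err`, since the clue may be a warning or an ordinary message. `journalctl -b -1 -n 50` does something similar for the previous boot only. If the lines before the marker look completely normal, the freeze was probably too abrupt for anything to reach the disk, which is a clue in itself.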
Quote:
No Nagios for our small fleet (it would be like using a tank to kill a fly), but maybe we could consider installing netdata as a standalone monitor. Is there no simpler solution than deploying such an infrastructure?
If you have several servers, why is it a bad idea to put in a monitoring solution that can watch whatever you have now and whatever you ADD??
Quote:
Also, as said previously: we already tried moving the disk into a working computer, with the same results. Not enterprise hardware but a home PC, so no luck there.
You're omitting a good bit:
What kind of hardware you're moving this hard drive to
What services are running on this server
Has anything changed/been modified/added to this server before this problem started?
How many users?
How much storage?
How much memory? (and have you tested THAT as well??)
There are loads of factors that can cause this, but you've not given us any error messages to work with.
Quote:
Originally Posted by michaelk
There are several factors to look at.
It still could be a faulty hard drive i.e. power or temperature problems.
Its power supply is protected by a UPS, but in case the hard drive is damaged, we decided to run the smartctl and badblocks checks again.
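When re-running the SMART checks, the overall "PASSED" verdict is worth supplementing with a look at a few raw counters, because a drive or cable can be degrading while the health summary still passes. The sketch below filters the attribute table printed by `smartctl -A`. The function name is invented, and the attribute names are the common ones; vendors differ, so a drive may report some of them under other names or not at all.

```shell
#!/bin/sh
# List a few SMART attributes whose raw value is non-zero.
# Input: the output of `smartctl -A DEVICE` on stdin.
smart_suspects() {
    awk '
        BEGIN {
            watch["Reallocated_Sector_Ct"] = 1
            watch["Current_Pending_Sector"] = 1
            watch["Offline_Uncorrectable"] = 1
            watch["UDMA_CRC_Error_Count"] = 1
        }
        # Field 2 is the attribute name, field 10 the raw value.
        ($2 in watch) && ($10 + 0 != 0) { print $2 "=" $10 }
    '
}
```

With `/dev/sda` as a placeholder for the real device, `smartctl -A /dev/sda | smart_suspects` prints nothing when all four counters are zero. A non-zero CRC error count usually points at the cable or connector rather than the disk. `smartctl -t long /dev/sda` followed later by `smartctl -l selftest /dev/sda` runs and reports the drive's extended self-test.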
Quote:
Originally Posted by michaelk
There is a slight possibility that the KVM switch, or something else attached to the computer, is causing the hang.
Oh! We had overlooked this possibility... But as the machine is now directly connected to a monitor, we will see whether the problem still arises.
Quote:
Originally Posted by michaelk
The wall power could be at fault: dropouts, brownouts or spikes. Similar PSUs might exhibit the same symptoms. Laser printers or other devices might cause noise, dropouts or spikes on the power lines when turned on or in use.
As said above, the machine is connected to a UPS.
Quote:
Originally Posted by michaelk
Might be time to upgrade the hardware.
Better: we are migrating the service, but until the new server is operational the old one must remain in use.
Have you verified the UPS is working?
Is the battery good?
Yes! Several machines are connected to the UPS and only one of them has a problem. But I do not know whether the problem could be located on one particular outlet.
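Since Network UPS Tools is already installed (its driver unit appears in the first post's log), the UPS state and battery can be read with `upsc` rather than assumed. This is a sketch only: `myups` is a placeholder for the name defined in the NUT configuration, the function names are invented, and the query will fail on this server for as long as the NUT driver does not start.

```shell
#!/bin/sh
# Translate the flags of a NUT "ups.status" value into plain words.
decode_ups_status() {
    for flag in $1; do
        case "$flag" in
            OL)      echo "on line power" ;;
            OB)      echo "on battery" ;;
            LB)      echo "low battery" ;;
            RB)      echo "replace battery" ;;
            CHRG)    echo "charging" ;;
            DISCHRG) echo "discharging" ;;
            *)       echo "other flag: $flag" ;;
        esac
    done
}

# Query status and charge; "myups@localhost" is a placeholder UPS name.
check_ups() {
    ups="${1:-myups@localhost}"
    decode_ups_status "$(upsc "$ups" ups.status)"
    echo "battery.charge: $(upsc "$ups" battery.charge)%"
}
```

Logging the output of `check_ups` at regular intervals would show whether the UPS switched to battery around the time of a freeze. It cannot tell one outlet from another, so moving the server to a different outlet for a week remains the simplest test of that idea.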