Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
There was a freeze on 23/04, but unfortunately someone rebooted before a photo was taken. We have just reminded users to take a photo before restarting. At least, since the problem is still there with the cloned system disk, this demonstrates that the problem is not related to the disk.
Yes we will! We are considering putting 2 systems in redundancy, so that when the first becomes unresponsive the second takes over ... and, more importantly, we will be managing our telephony ourselves with our Asterisk system.
As we managed to keep the screen always on (parameter consoleblank=0 in the GRUB configuration), we noticed that a simulated crash with a kernel panic (we followed this method to achieve it) is also displayed on the login screen.
Is that sufficient to get information from before the freeze?
We also experimented with running a journalctl command at boot (so before login), by modifying /etc/rc.local to run the detached command journalctl --follow &. In this case the screen is flooded continuously with no pause (however, we discovered that Ctrl + s stops it and allows access to another tty with Ctrl + Alt + F2 or similar). This flood is strange, because after logging in the same command does not flood. We suppose this is not the right way to do it.
Bolded a piece for emphasis only...you have been told this several times now, and that is the ONLY information that can help diagnose this issue. Not sure of the thought process in your diagnostic methods, but after the system freezes...it's ALREADY FROZEN. Quite obviously nothing will get logged after that point. You don't tell us what version/distro of Linux you're using, but a 4.x kernel is pretty old. You can either look in /var/log for a file (messages, syslog, etc...usual suspects) and inspect those, or you can look at "journalctl --list-boots", for a list of the log files and look at an older one with "journalctl -b <whatever number>", which will have the info in it.
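For reference, a minimal sketch of the two journalctl commands mentioned above, wrapped in a small helper. The function name and the default of 200 lines are arbitrary choices for illustration, and reading an earlier boot assumes the journal is kept on disk (a /var/log/journal directory exists); otherwise only the current boot is available.

```shell
# Hypothetical helper (name and default are illustrative only):
# print the last N lines logged during the previous boot, i.e. the
# messages written just before the freeze and the forced reboot.
last_lines_before_reboot() {
    lines="${1:-200}"
    journalctl --boot=-1 --lines="$lines" --no-pager
}

# Typical use on the affected machine:
#   journalctl --list-boots          # one line per recorded boot
#   last_lines_before_reboot 500     # tail of the boot that froze
```

The offset -1 selects the boot before the current one; -2 the one before that, and so on, matching the first column printed by --list-boots.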
Have you checked to see if disk space is running out? You mention asterisk as a PBX...a full disk can also cause problems, especially with long/undeleted voicemails. Regardless, it sounds like you need to actually hire a consultant to come take care of this, based on what you're posting.
Information from before the freeze, in particular JUST before the freeze so it is likely to capture the cause, is the ONLY information that might be seriously helpful. AT the freeze logging will stop and you will get no information, and AFTER the freeze is also after the reboot and the cause information may be gone for good.
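One possible way to capture state from just before a freeze, sketched here under assumptions: a small script appends a timestamped snapshot to a file at short intervals and flushes it to disk, so the last line written is likely to survive the reboot. The function names, the chosen fields and the log path in the usage note are all hypothetical; nothing in this thread says which resource is actually involved.

```shell
# Hypothetical helper: print one line with the time, the 1-minute load
# average, the available memory and the usage of /var.
snapshot_line() {
    load=$(cut -d' ' -f1 /proc/loadavg)
    mem_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    var_use=$(df -P /var | awk 'NR==2 { print $5 }')
    printf '%s load=%s mem_available_kb=%s var_used=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" "$load" "$mem_kb" "$var_use"
}

# Append one snapshot to the file given as first argument, then ask the
# kernel to flush pending writes so the line reaches the disk.
record_snapshot() {
    snapshot_line >> "$1" && sync
}
```

It could be called from cron every minute, or in a loop such as `while true; do record_snapshot /var/log/freeze-snapshots.log; sleep 10; done`. After the next freeze, the last lines of that file show what the machine looked like shortly before it stopped.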
Ah! We supposed that what was displayed when it froze would indicate the cause ... So what is the purpose of what is displayed?
How can we track the problem before the freeze?
Are external monitoring tools like Nagios the only solution, or can we do the same locally (change the debug level or audit different targets more deeply, since the traditional logs did not achieve it)?
Quote:
Originally Posted by wpeckham
If I understand correctly:
1. If you move the drive to a different, identical machine, that one also freezes.
That would eliminate the original machine hardware EXCEPT the drive.
2. IF cloned to a new drive, it will still freeze. That eliminates the drive itself.
If those are both true, we have eliminated all of the hardware and only a software issue can be left.
What has changed about the software or configuration in the few weeks just before this started?
Yes, you have understood well. We have long supposed that the problem is not hardware-related. The only thing we are not really sure about is the RAM: did we use the RAM already in place, or the RAM of the computer we moved from? We will test it again to make sure we did not miss this step and to definitively eliminate a hardware cause.
The freezing problem has been here for years, but until now we left it as is, since it occurred only about every 1 or 2 months (before that we had a PABX which crashed really often, so this discomfort turned out to be more acceptable).
But in the last few months the problem has become more frequent, reaching one freeze per week.
---
We also tried to install netdata to monitor what happens in the system, but there is a conflict when installing it. For now we have abandoned this idea and did not dig further, to preserve the system and avoid a bigger problem.
Last edited by lenainjaune; 04-26-2024 at 10:23 AM.
Quote:
Originally Posted by TB0ne
Bolded a piece for emphasis only...you have been told this several times now, and that is the ONLY information that can help diagnose this issue.
We do not understand what you are telling us! You talk about bold ... are you referring to the fact that we bolded some pieces of code to make them more readable, or is it about something else?
Quote:
Originally Posted by TB0ne
Not sure of the thought process in your diagnostic methods, but after the system freezes...it's ALREADY FROZEN. Quite obviously nothing will get logged after that point.
OK! As we said before, we believed that what is displayed when there is a freeze is sufficient to determine the cause.
Quote:
Originally Posted by TB0ne
You don't tell us what version/distro of Linux you're using, but a 4.x kernel is pretty old.
It is indicated in the first post (OP): Debian 9 (kernel: 4.9.0-19-amd64)
Yes, it is old, but it was the kernel shipped with the image we installed.
Quote:
Originally Posted by TB0ne
You can either look in /var/log for a file (messages, syslog, etc...usual suspects) and inspect those, or you can look at "journalctl --list-boots", for a list of the log files and look at an older one with "journalctl -b <whatever number>", which will have the info in it.
Yes, we already looked at the logs with journalctl --err as indicated in the OP. As a result we disabled ACPI, the UPS monitor and, more recently, the inventory agent.
We do not understand why you suggest looking at the boot logs ... since, if there is a freeze, nothing will be logged to the files.
Also, does journalctl centralize all logs, or is it necessary to explore them one by one?
Quote:
Originally Posted by TB0ne
Have you checked to see if disk space is running out? You mention asterisk as a PBX...a full disk can also cause problems, especially with long/undeleted voicemails.
Quote:
Originally Posted by lenainjaune
Ah! We supposed that what was displayed when it froze would indicate the cause ... So what is the purpose of what is displayed?
How can we track the problem before the freeze?
Once the freeze starts NOTHING new will be logged or displayed. So the thing on the monitor will be the VERY LAST THING sent to it before the freeze starts. There is a pretty good chance that message WILL pertain to the cause.
I am glad you will be replacing these systems. I have not seen any mention about what changed to start the problem, but if you have been living with this since the system was new you have far more patience than I.
Troubleshooting these things takes a clear and pretty complete understanding of what the system does (hardware and software) and clear and logical progression of eliminating potential causes until there is only one left. That is not rocket science, but does require training or a deeply analytical mind. Training in higher mathematics seems to help, but you CAN develop a good technique. It may take time.
We have just replaced the RAM with RAM known to be working (the users did not report a problem with it) and we are testing the old RAM with MemTest86 (v5.01) to check its reliability.
We strongly suspect a telephony software bug, but if we want to work around it we must know what the problem is (resources, network, database, etc.).
In parallel we are trying again to install a monitoring tool.
You said that:
Quote:
Originally Posted by wpeckham
IF it is software, and those logs are not giving you useful data, you might need to turn on better logging. Be warned, additional logging may degrade performance, but is the option most likely to give you useful "root cause" information.
Quote:
Originally Posted by lenainjaune
OK! As we said before, we believed that what is displayed when there is a freeze is sufficient to determine the cause.
And as I said before, *THERE ARE MULTIPLE LOGS*. I bolded a piece of text that I wrote...very obviously, when the system freezes it's going to stop logging. You were given two exact commands to not only show you the previous logs, but how to display them. Did you read/understand/use those commands????
Quote:
Originally Posted by lenainjaune
Yes, we already looked at the logs with journalctl --err as indicated in the OP. As a result we disabled ACPI, the UPS monitor and, more recently, the inventory agent. We do not understand why you suggest looking at the boot logs ... since, if there is a freeze, nothing will be logged to the files. Also, does journalctl centralize all logs, or is it necessary to explore them one by one?
No...again, when the system freezes it is *NOT GOING TO LOG ANYTHING*. You need the messages just BEFORE the freeze...can't be more plain than that. And if you need specific instructions to check any/all log files you think may be related, you really should hire someone to do this. These are basic troubleshooting steps.
Quote:
Originally Posted by lenainjaune
We have just replaced the RAM with RAM known to be working (the users did not report a problem with it) and we are testing the old RAM with MemTest86 (v5.01) to check its reliability. We strongly suspect a telephony software bug, but if we want to work around it we must know what the problem is (resources, network, database, etc.).
In parallel we are trying again to install a monitoring tool. You said that:
Quote:
Originally Posted by wpeckham
IF it is software, and those logs are not giving you useful data, you might need to turn on better logging. Be warned, additional logging may degrade performance, but is the option most likely to give you useful "root cause" information.
What should we change to improve the logging?
Did you look in the manuals/documentation for the telephony software???
The /var partition appears to be a bit full...again, if it is FILLED totally with voicemails/etc., it could cause a problem. Those things may be lost/deleted after the crash, which would recover that space.
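A quick way to put a number on "a bit full" is sketched below. The function names and the 90% default threshold are illustrative assumptions, not values from this thread.

```shell
# Hypothetical helper: print the used percentage (digits only) of the
# filesystem that holds the given path, using the portable df -P layout.
used_percent() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Print OK or WARNING depending on whether usage reaches the limit
# (second argument, 90 by default). Returns 1 on WARNING, 2 if df
# produced nothing usable.
check_space() {
    path="$1"
    limit="${2:-90}"
    used=$(used_percent "$path")
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}
```

For example, `check_space /var 85` run from cron would leave a trace of how full /var was in the hours before a freeze, which matters if the space is released again after the reboot.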
Quote:
Originally Posted by lenainjaune
Yes but no, as we will abandon this IPBX and replace it with Asterisk. Furthermore, we want to understand how to diagnose such a breakdown.
We've all been trying to tell you, but it appears you don't have much experience in such things. It would be far better to hire someone local to you and get them to walk through things with you, that are specific to your environment.
Set SystemMaxUse=100M in /etc/systemd/journald.conf
Ensure you have a directory /var/log/journal/ so the log survives a reboot.
Ensure you have some swap configured (1GB is enough).
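The journal settings above could be applied roughly as follows. This is only a sketch: it writes a drop-in file under /etc/systemd/journald.conf.d/ instead of editing /etc/systemd/journald.conf directly, the file name persistent.conf and the function name are made up for illustration, and Storage=persistent is added on the assumption that the journal should always be written to disk.

```shell
# Hypothetical helper: create /var/log/journal and a journald drop-in.
# The optional first argument is a root prefix, useful for a dry run in
# a scratch directory; leave it empty to target the real system (root
# privileges are then required).
setup_persistent_journal() {
    root="${1:-}"
    mkdir -p "$root/var/log/journal" \
             "$root/etc/systemd/journald.conf.d" || return 1
    cat > "$root/etc/systemd/journald.conf.d/persistent.conf" <<'EOF'
[Journal]
Storage=persistent
SystemMaxUse=100M
EOF
}
```

On the real machine this would be followed by `systemctl restart systemd-journald`, after which `journalctl --list-boots` should start to show more than one boot once the machine has been rebooted.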
a)
If I were facing this error, I would perform the following steps:
Code:
$ ulimit -c unlimited
$ command_to_start_server
If I obtained a core dump file, I would share it with the development team to resolve the above error.
b)
If this is faced by only one client out of a crore of clients, and assuming I am the developer of the server:
Replace -O3 with -O0 -g during compilation
Send an automatic startup.sh to the client
Make the server stop accepting new clients
Once no clients are connected, stop the server and keep a backup copy of the original server
Copy my server to the server location
Execute startup.sh
At some point the client will get a core file
If a core file is created, send an email to the client and ask the client to forward that email to my online supervisor (e.g. hill json at UK)
The supervisor will forward it to me to resolve the server issue.
c) If the server comes from an external vendor, have the client/my automation use ulimit -c unlimited and forward the core file to the external server development team.
d) If it is an external server and there is no core file, share the server's full log file with the server development team when the freeze happens.
Last edited by murugesandins; 04-29-2024 at 07:36 PM.
a)
If I were facing this error, I would perform the following steps:
Code:
$ ulimit -c unlimited
$ command_to_start_server
If I obtained a core dump file, I would share it with the development team to resolve the above error.
Do you know the difference between a core dump and a freeze???
Quote:
b) If this is faced by only one client out of a crore of clients, and assuming I am the developer of the server:
Replace -O3 with -O0 -g during compilation
Send an automatic startup.sh to the client
Make the server stop accepting new clients
Once no clients are connected, stop the server and keep a backup copy of the original server
Copy my server to the server location
Execute startup.sh
At some point the client will get a core file
If a core file is created, send an email to the client and ask the client to forward that email to my online supervisor (e.g. hill json at UK)
The supervisor will forward it to me to resolve the server issue.
c) If the server comes from an external vendor, have the client/my automation use ulimit -c unlimited and forward the core file to the external server development team.
d) If it is an external server and there is no core file, share the server's full log file with the server development team when the freeze happens.
So your 'steps' are:
Get a core dump file (which won't be created if a server freezes)
Send it to someone else and have them tell you what to do
And none of this applies to the OP's situation, they don't have a 'development team' or someone else to fix their problem for them. Can you post anything relevant to the OP's original question?