Linux system stalling every few minutes, yet no errors??
I have a Gentoo Linux system that has a VERY annoying and hard-to-track issue: it freezes up for 2-15 secs every minute or so. I have gone through a few ideas, such as watching dmesg/messages and using the performance tools to look for problems.
No log errors of any kind, and I am not sure that this I/O CPU utilization is normal (it is constantly this high).
This system is a net-flow receiver, and our MRTG system so it has a healthy in and out of bursty net traffic. MRTG is run in cron every 5 mins and has about 60 devices it watches. There are about 10 net-flow sources hitting me too.
I would just like some help on learning where I can look next.
You can check to see how your syslog daemon is configured to report errors. You will find that information in the /etc/syslog.conf file. Look for a section like this:
Then you can look into each of these log files. You would be most interested in the log file for the debug level messages because it would report the least critical system errors. If these files don't have enough information then the /etc/syslog.conf file lists the destination files of other system log messages. Check them all. See if there are any errors or something that is being called many times in a small period of time.
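To make the "called many times in a small period of time" check less of an eyeball job, here is a minimal Python sketch that counts log entries per minute and per program. It assumes the classic syslog line layout ("Dec 30 10:22:01 host prog[pid]: message"); a custom syslog-ng template may look different. The function name count_bursts and the threshold of 20 are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout: "Dec 30 10:22:01 host prog[123]: message".
# Captures the timestamp down to the minute and the program name.
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+([^\s:\[]+)")

def count_bursts(lines, threshold=20):
    """Return {(minute, program): count} for pairs seen at least `threshold` times."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # lines that do not fit the assumed layout are skipped
            counts[(m.group(1), m.group(2))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}

# Possible use (the path is only an example; take it from your syslog config):
#   with open("/var/log/messages", errors="replace") as fh:
#       for (minute, prog), n in sorted(count_bursts(fh).items()):
#           print(minute, prog, n)
```

A program that shows up in the result for the same minutes in which the stalls happen would be worth a closer look.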
Even though the CPU and I/O appear to be in normal range you could still check to see what is eating up these resources using the top utility. Run the top utility in a command line window or in a text console.
You can use Ethereal or tcpdump to see what network traffic is coming to your machine. Network traffic would be handled with a high priority. If you are getting a lot of bogus network connect requests then that would show up as degraded interactive response.
It is entirely possible that you have a broken hardware device. It could be a hard disk with a lot of bad blocks, a NIC that is broken but appears to be working, a bad network cable, a bad power supply, or something else.
I hope that I've provided some useful ideas.
Last edited by stress_junkie; 12-30-2006 at 10:22 AM.
I use syslog_ng (my favorite), and I do have debug messages (really everything other than the stuff I want separate) going to /var/log/debug.log_ng, and it is clean. When I watch top, it stalls as well, and when it frees up I am never sure whether I missed what stalled it.
I will look at my network card... that's a thought. The 3550G-12 cisco switch shows no port errors.
But as you say, if I am getting a lot of network traffic, that would degrade me, and the spurts of net-flow and outgoing SNMP from MRTG may be it. I will also try blocking that traffic and shutting off MRTG to see if it frees me up.
Thanks!
Nick
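Since top freezes along with everything else, one way to at least timestamp the stalls is a tiny watcher that sleeps in short steps and notes when a wakeup arrives much later than asked for. This is only a sketch under assumptions: the names find_stalls and watch and the default numbers are invented here, and it only sees stalls that delay this process itself, so a stall that hits only disk I/O might not show up.

```python
import time

def find_stalls(timestamps, interval, threshold):
    """Return (timestamp, extra_delay) for each gap that overran `interval`
    by more than `threshold` seconds."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        late = (cur - prev) - interval
        if late > threshold:
            stalls.append((cur, late))
    return stalls

def watch(interval=0.1, threshold=1.0, duration=60.0):
    """Sleep in short steps for `duration` seconds and print late wakeups."""
    start = last = time.monotonic()
    while last - start < duration:
        time.sleep(interval)
        now = time.monotonic()
        for _, late in find_stalls([last, now], interval, threshold):
            # Wall-clock time is printed so it can be matched against logs.
            print(time.strftime("%H:%M:%S"), f"stalled for about {late:.1f}s")
        last = now

# Possible use: call watch(duration=600.0) in one console while working in another.
```

The printed times can then be lined up against cron runs (MRTG every 5 minutes) and against the log files.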
One thing about Linux is that it has terrible memory management. How long has this machine been running? Can you reboot it to see if that improves the performance? It is possible that your normal workload is simply more than the memory manager can handle. I know that on a workstation, if I start Firefox, run some video streaming content, and run badblocks from a console, then the performance of the machine will eventually degrade. I think it is because the video streaming content puts too much of a strain on the memory manager, and this is made visible by the badblocks utility writing to the disks. In other words, running a memory hog (video streaming) plus a real time utility (badblocks) brings out the weaknesses in the job controller and in the memory manager. Your workload may just be more than Linux can handle. After all, Linux is good but it isn't in the same league as Solaris.
Device-level troubleshooting is fairly simple but very tedious and time consuming. I think that I would start taking hardware devices out of the machine to see if removing one of them fixes the problem. You can start by just unplugging the network cable. See if that helps. If so, then replace the network cable. If things are still looking good, then throw the original network cable in the trash. And so you go on with all of the machine's hardware. If it is possible to swap one device for another, then all the better. If not, then just remove the device if possible. You can always boot a live CD when you disconnect the hard disks, for example.
Even though the technique is simple it is not so simple to find the problem. If device swapping doesn't help then of course you have to move on to looking at software.
Last edited by stress_junkie; 12-30-2006 at 03:48 PM.
Funny you should mention that. I had just done a rebuild of the entire portage tree (600 apps) and rebooted; it's a weekly routine, as this is a work server for just the network operations area. I noticed the slowdown right away after the reboot, as I went straight to editing a config file for syslog_ng, of all things, trying to separate out the bash logger into its own log.
I am going to VPN into work and kill MRTG and Net-Flow for 10 mins, as I normally get stalls every other minute reliably, this ought to tell us something.
Will post in a few.
Nick
Found a possible problem. Your tip on memory had me try "watch free" while I was editing files (that seemed to trip it fastest). My system has 1 Gig, but rarely has over 60 megs free.
I tried turning off a few things, mainly nessusd. It went to 109 megs free, and it was a long time before I saw any sign of even a slight stall.
Memory is cheap, perhaps just adding a bit more will help.
The network card was clean from errors, BTW.
I am going to kill a few more unnecessary items, maybe just Apache first while leaving my collectors running.
Mmmmm - I'm thinking swap is contending with normal I/O, especially with high wait times. Try running vmstat across a time period where you see a slow-down - it'll give you an idea of I/O load and swap load. See if they correlate.
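If you save the vmstat output to a file (for example with "vmstat 5 > vmstat.log"), the correlation can be checked with a few lines of Python. The sketch below reads the column names from the vmstat header line and picks out samples where swap traffic (si/so) and I/O wait (wa) are both non-trivial. The helper names and the 20% wait cut-off are assumptions for illustration, not anything vmstat itself defines.

```python
def parse_vmstat(text):
    """Turn saved vmstat output into a list of {column_name: int} rows."""
    rows, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if "swpd" in fields:
            header = fields  # the line holding the column names
        elif header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, map(int, fields))))
    return rows

def swap_heavy_samples(rows, wa_min=20):
    """Return samples with any swap-in/swap-out activity and I/O wait >= wa_min percent."""
    return [r for r in rows if (r["si"] + r["so"]) > 0 and r.get("wa", 0) >= wa_min]

# Possible use:
#   with open("vmstat.log") as fh:
#       for row in swap_heavy_samples(parse_vmstat(fh.read())):
#           print(row["si"], row["so"], row["bi"], row["bo"], row["wa"])
```

If the stalls line up with samples this picks out, swap contention is a reasonable suspect; if si and so stay at zero during a stall, look elsewhere.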
Short answer, Yes.
Paging (swapping if you will) is just part of a well managed system. When it interferes with the real work, then it's a problem. In a normal environment, minimizing paging is a sensible goal.
Easiest way to do that is to provide more (real) memory - on any recent x86 hardware (with PAE), all the way up to 64 Gig unless I'm mistaken.
Whether it's actually necessary, and the cost/benefit is part of the fun ...
Before I bounced the box, I slowly shut down each running app to see which one would release the most memory. I was down to system essentials and still had only 157 megs free.
How would I go about tracking down what is eating so much RAM? (I assume this must be a leak?)
Ok, I think I found it. I killed every app that I did not NEED and rebooted. 977 megs free. 90 mins later, I had lost 2 megs... Whoo hoo, so it's one of my apps. I started one instance of flow-capture. 15 mins later I had lost over 100 megs, and it drops 100K every 10 secs.
I am going to reboot with all apps back on minus flow-capture. See what I get.
I have rebooted now with everything except flow-capture, and the memory drop is faster. I disabled MRTG and Smokeping (monitoring tools), and the decrease slowed to a crawl and even went the other way a few times. Those three apps are my big network users. Maybe a leak in the NIC driver? (Broadcom Corporation NetXtreme BCM5704 Gigabit)
< > Alteon AceNIC/3Com 3C985/NetGear GA620 Gigabit support
< > D-Link DL2000-based Gigabit Ethernet support
<M> Intel(R) PRO/1000 Gigabit Ethernet support
[ ] Use Rx Polling (NAPI)
[ ] Disable Packet Split for PCI express adapters
< > National Semiconductor DP83820 support
< > Packet Engines Hamachi GNIC-II support
< > Packet Engines Yellowfin Gigabit-NIC support (EXPERIMENTAL)
< > Realtek 8169 gigabit ethernet support
< > SiS190/SiS191 gigabit ethernet support
< > New SysKonnect GigaEthernet support
< > SysKonnect Yukon2 support (EXPERIMENTAL)
< > Marvell Yukon Chipset / SysKonnect SK-98xx Support (DEPRECATED)
< > VIA Velocity support
<*> Broadcom Tigon3 support
< > Broadcom NetXtremeII support
< > QLogic QLA3XXX Network Driver Support
I know I do not have a NetXtreme II, so I grabbed the Tigon3. Perhaps I have the wrong one?
It sounds like you're not quite sure which program has the memory leak. I'd recommend using 'top' to help find this out.
(1) Start top
(2) Hit 'G' then '3' to switch to memory view (that must be an uppercase 'G', not lowercase)
(3) Hit 'x' then 'b' to turn on various highlighting
(4) You will probably find that the "%MEM" column is highlighted (the highlighted column is your sort column). You want to sort on %MEM and look (over time) for the process that is consuming more and more memory.
(5) If %MEM is not your default sort column in step (4), use your '<' and '>' keys to move the highlight to the %MEM column.
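If watching top by eye is awkward because of the stalls, the same idea can be scripted: take two snapshots of per-process resident memory some minutes apart and list who grew. This is a rough sketch that reads VmRSS from /proc/<pid>/status; the function names are made up here. Note that it only sees user-space processes, so memory held by the kernel (page cache, drivers) will not show up in it.

```python
import os
import re

def parse_status(text):
    """Extract (name, rss_kb) from the text of /proc/<pid>/status.
    rss_kb is None when there is no VmRSS line (e.g. kernel threads)."""
    name = re.search(r"^Name:\s+(.*)$", text, re.M)
    rss = re.search(r"^VmRSS:\s+(\d+)\s+kB", text, re.M)
    return (name.group(1) if name else "?", int(rss.group(1)) if rss else None)

def snapshot(proc="/proc"):
    """Return {pid: (name, rss_kb)} for every process with a VmRSS value."""
    snap = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                name, rss = parse_status(fh.read())
        except OSError:
            continue  # the process exited between listdir() and open()
        if rss is not None:
            snap[int(entry)] = (name, rss)
    return snap

def growth(before, after):
    """Return [(delta_kb, pid, name)] for PIDs in both snapshots, largest growth first."""
    deltas = [(after[p][1] - before[p][1], p, after[p][0]) for p in after if p in before]
    return sorted(deltas, reverse=True)

# Possible use:
#   import time
#   first = snapshot(); time.sleep(600); second = snapshot()
#   for delta, pid, name in growth(first, second)[:10]:
#       print(f"{delta:+8d} kB  {pid:6d}  {name}")
```

A process that keeps climbing across several such comparisons is the leak candidate; if nothing climbs while "free" keeps dropping, the memory is probably going to cache rather than to a leaking application.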
That amount of swap usage over 3 days equates to 1 swap movement every 2 seconds.
IMHO you do not have a memory problem - your issue lies elsewhere. Linux attempts to maximize the memory used, for efficiency - after all, what's the point of having it all just lying around idle?
May be a driver issue, might be something else - will take some legwork (like you are doing) to determine.
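To see how much of the "missing" memory is really just buffers and cache, the numbers can be pulled from /proc/meminfo. The sketch below is a simple illustration with invented helper names; it treats MemFree + Buffers + Cached as a rough upper bound on what the kernel could hand back, which is an approximation (not all cache is reclaimable).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field_name: value}, values as reported (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info

def reclaimable_summary(info):
    """Summarise free memory, cache, and swap in use (all in kB)."""
    free = info.get("MemFree", 0)
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "free_kb": free,
        "cache_kb": cache,
        "free_plus_cache_kb": free + cache,  # rough estimate, see note above
        "swap_used_kb": info.get("SwapTotal", 0) - info.get("SwapFree", 0),
    }

# Possible use:
#   with open("/proc/meminfo") as fh:
#       print(reclaimable_summary(parse_meminfo(fh.read())))
```

If free_plus_cache_kb stays large while free_kb shrinks, the drop seen in "watch free" is the cache filling up, which fits the point above that low free memory on its own is not a leak.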