Linux system stalling every few minutes, yet no errors??

nellson · 12-30-2006, 09:46 AM

I have a Gentoo Linux system that has a VERY annoying and hard-to-track issue: it freezes up for 2-15 secs every minute or so. I have gone through a few ideas, such as watching dmesg/messages and using the performance tools to look for problems.

There are no log errors of any kind, and I am not sure that this I/O wait is normal (it is constantly this high).

This system is a net-flow receiver and our MRTG system, so it has a healthy amount of bursty network traffic in and out. MRTG is run from cron every 5 mins and watches about 60 devices. There are about 10 net-flow sources hitting me too.

I would just like some help on learning where I can look next.

Nick

Every 2.0s: iostat Sat Dec 30 07:36:34 2006

Linux 2.6.19-gentoo-r2 (poindexter) 12/30/06

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.41    0.00    4.64   37.75    0.00   46.20

Device:      tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda        89.21        98.59      1398.02     866602   12288704
sdb         2.11        33.94        14.54     298318     127808

Every 2.0s: mpstat Sat Dec 30 07:36:53 2006

Linux 2.6.19-gentoo-r2 (poindexter) 12/30/06

07:36:53  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle   intr/s
07:36:53  all   11.40    0.00    4.28   37.85    0.06    0.30    0.00   46.10   409.99

stress_junkie · 12-30-2006, 10:21 AM

You can check to see how your syslog daemon is configured to report errors. You will find that information in the /etc/syslog.conf file. Look for a section like this:

Code:

# Kernel logging
kern.=debug;kern.=info;kern.=notice     /var/log/kernel/info
kern.=warn                              /var/log/kernel/warnings
kern.err                                /var/log/kernel/errors

Then you can look into each of these log files. You would be most interested in the log file for the debug level messages because it would report the least critical system errors. If these files don't have enough information then the /etc/syslog.conf file lists the destination files of other system log messages. Check them all. See if there are any errors or something that is being called many times in a small period of time.
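If the logs are too large to read by eye, a rough sketch like the following can show whether something is being logged many times within a short period. It assumes the classic syslog line layout ("Dec 30 07:36:34 host program[pid]: message"); the function name `busiest_minutes` is made up for illustration, and a syslog-ng setup with a custom template would need a different pattern.

```python
import re
from collections import Counter

# Assumed layout: "Dec 30 07:36:34 host program[pid]: message"
# Group 1 is the timestamp truncated to the minute, group 2 the program name.
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s\d\d:\d\d):\d\d\s+\S+\s+([^\s:\[]+)")

def busiest_minutes(lines, top=5):
    """Count messages per (minute, program) and return the largest buckets."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # lines that do not match the assumed layout are skipped
            counts[(m.group(1), m.group(2))] += 1
    return counts.most_common(top)
```

To use it, open one of the log files named in your syslog configuration and pass the file object to `busiest_minutes()`; a bucket with hundreds of entries in a single minute is worth a closer look.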

Even though the CPU and I/O appear to be in normal range you could still check to see what is eating up these resources using the top utility. Run the top utility in a command line window or in a text console.

You can use Ethereal or tcpdump to see what network traffic is coming to your machine. Network traffic would be handled with a high priority. If you are getting a lot of bogus network connect requests then that would show up as degraded interactive response.

It is entirely possible that you have a broken hardware device. It could be a hard disk with a lot of bad blocks or a NIC that is broken but appears to be working or a bad network cable or a bad power supply or whatever.

I hope that I've provided some useful ideas.

nellson · 12-30-2006, 01:36 PM

Hey Stress_Junkie,

I use syslog_ng (my favorite) and I do have debugs (or really everything other than the stuff I want separate) going to /var/log/debug.log_ng, and it is clean. When I watch top, it stalls as well, and when it frees up I am never sure whether I missed what stalled it.

I will look at my network card... that's a thought. The 3550G-12 Cisco switch shows no port errors.

But as you say, if I am getting a lot of network traffic, that would degrade me, and the spurts of net-flow and outgoing SNMP from MRTG may be it. I will also try blocking that traffic and shutting off MRTG to see if it frees me up.

Thanks!
Nick

stress_junkie · 12-30-2006, 03:43 PM

One thing about Linux is that it has terrible memory management. How long has this machine been running? Can you reboot it to see if that improves the performance? It is possible that your normal work load is simply more than the memory manager can handle. I know that on a workstation if I start Firefox and run some video streaming content and run badblocks from a console then the performance of the machine will eventually degrade. I think it is because the video streaming content puts too much of a strain on the memory manager and this is made visible by the badblocks utility writing to the disks. In other words running a memory hog (video streaming) plus a real time utility (badblocks) brings out the weaknesses in the job controller and in the memory manager. Your workload may just be more than Linux can handle. After all, Linux is good but it isn't in the same league as Solaris.

Device level troubleshooting is fairly simple but very tedious and time consuming. I think that I would start taking hardware devices out of the machine to see if removing one of them fixes the problem. You can start by just unplugging the network cable. See if that helps. If so then replace the network cable. If things are still looking good then throw the original network cable in the trash. And so you go on with all of the machine's hardware. If it is possible to swap one device for another then all the better. If not then just remove the device if possible. You can always boot a live CD when you disconnect the hard disks for example.

Even though the technique is simple it is not so simple to find the problem. If device swapping doesn't help then of course you have to move on to looking at software.

nellson · 12-30-2006, 07:08 PM

Funny you should mention that. I had just done a rebuild of the entire portage tree (600 apps) and rebooted; it's a weekly routine, as this is a work server for just the network operations area. I noticed the slowdown right away after the reboot, as I went straight to editing a config file for syslog_ng of all things, trying to separate out the bash logger to its own log.

I am going to VPN into work and kill MRTG and Net-Flow for 10 mins, as I normally get stalls every other minute reliably, this ought to tell us something.

Will post in a few.

Nick

nellson · 12-30-2006, 07:39 PM

Found a possible problem. Your tip on memory had me try "watch free" while I was editing files (that seemed to trip it fastest). My system has 1 Gig, but rarely has over 60 megs free.

I tried turning off a few things, mainly nessusd. It went to 109 megs free, and it was a long time before I saw any sign of even a slight stall.

Memory is cheap, perhaps just adding a bit more will help.

The network card was clean from errors, BTW.

I am going to kill a few more unnecessary items, maybe just Apache first while leaving my collectors running.

Nick

syg00 · 12-30-2006, 08:15 PM

Mmmmm - I'm thinking swap is contending with normal I/O, especially with the high wait times. Try running vmstat across a time period where you see a slow-down - it'll give you an idea of I/O load and swap load. See if they correlate.
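If you would rather log the numbers than watch them scroll by, here is a minimal sketch that samples /proc/vmstat directly. The counters `pswpin`, `pswpout`, `pgpgin` and `pgpgout` are, as far as I know, the same ones vmstat builds its swap and block I/O columns from; the function names and the 2-second interval are my own choices, not anything vmstat prescribes.

```python
import time

# Swap-in/out and page-in/out counters exposed by 2.6 kernels in /proc/vmstat.
FIELDS = ("pswpin", "pswpout", "pgpgin", "pgpgout")

def parse_vmstat(text):
    """Turn the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters

def deltas(before, after, fields=FIELDS):
    """Change of each selected counter between two samples."""
    return {f: after.get(f, 0) - before.get(f, 0) for f in fields}

def watch(interval=2.0, samples=30, path="/proc/vmstat"):
    """Print one timestamped line of deltas per interval; run it across a stall."""
    with open(path) as f:
        prev = parse_vmstat(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            cur = parse_vmstat(f.read())
        d = deltas(prev, cur)
        print(time.strftime("%H:%M:%S"), " ".join(f"{k}={v}" for k, v in d.items()))
        prev = cur
```

Call `watch()` in one terminal and provoke a stall in another. If the swap counters stay at zero while the page-out counter jumps during the stall, swap is probably not the culprit and ordinary disk writes are.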

nellson · 01-01-2007, 08:05 PM

OK, I will look at that when I get into work in the morning. Thanks!

If I do see a correlation, any suggestions? Would more RAM lessen the need for swap? (I am assuming so)

Nick

syg00 · 01-02-2007, 01:57 AM

Short answer, Yes.
Paging (swapping if you will) is just part of a well managed system. When it interferes with the real work, then it's a problem. In a normal environment, minimizing paging is a sensible goal.
Easiest way to do that is to provide more (real) memory - on any recent x86 hardware (with PAE), all the way up to 64 Gig unless I'm mistaken.
Whether it's actually necessary, and the cost/benefit is part of the fun ...

nellson · 01-02-2007, 09:18 AM

Ok, took a real good look at memory this morning.

When I came in:

top - 06:42:23 up 3 days, 1:32, 3 users, load average: 3.07, 2.58, 2.69
Tasks: 85 total, 2 running, 83 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.0%us, 4.0%sy, 0.0%ni, 41.8%id, 45.5%wa, 0.2%hi, 0.5%si, 0.0%st
Mem: 1035208k total, 925428k used, 109780k free, 115548k buffers
Swap: 2008116k total, 144k used, 2007972k free, 629988k cached

After a reboot, with all services up for 10 mins (two MRTG sweeps, several complete net-flows recorded):

top - 07:14:44 up 15 min, 2 users, load average: 0.03, 0.09, 0.15
Tasks: 76 total, 1 running, 75 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 98.8%id, 1.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1035208k total, 369316k used, 665892k free, 39976k buffers
Swap: 2008116k total, 0k used, 2008116k free, 233584k cached

Before I bounced the box, I slowly killed each running app to see which one would release the most memory. I was down to system essentials and still had only 157 megs free.

How would I go about tracking down what is eating so much RAM? (I assume this must be a leak?)

Nick

nellson · 01-02-2007, 10:33 AM

Ok, it's been running for almost 90 mins and another 447M has disappeared... Do I have a colander for a memory manager??

Forgive the longer complete TOP posting, but it shows a few of the processes I run.

top - 08:27:42 up 1:27, 2 users, load average: 0.07, 0.05, 0.25
Tasks: 76 total, 1 running, 75 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.0%sy, 0.0%ni, 98.5%id, 1.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1035208k total, 816904k used, 218304k free, 185380k buffers
Swap: 2008116k total, 0k used, 2008116k free, 497544k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8610 root 15 0 6244 4712 480 S 1 0.5 0:23.27 flow-capture
5974 root 15 0 2160 1092 820 R 0 0.1 0:00.34 top
1 root 15 0 1532 520 452 S 0 0.1 0:00.96 init
2 root RT 0 0 0 0 S 0 0.0 0:00.29 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.28 migration/1
5 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/1
6 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/0
7 root 10 -5 0 0 0 S 0 0.0 0:00.00 events/1
8 root 10 -5 0 0 0 S 0 0.0 0:00.00 khelper
9 root 12 -5 0 0 0 S 0 0.0 0:00.00 kthread
60 root 10 -5 0 0 0 S 0 0.0 0:00.22 kblockd/0
61 root 10 -5 0 0 0 S 0 0.0 0:00.23 kblockd/1
62 root 16 -5 0 0 0 S 0 0.0 0:00.00 kacpid
139 root 10 -5 0 0 0 S 0 0.0 0:00.00 kseriod
140 root 16 -5 0 0 0 S 0 0.0 0:00.00 ata/0
141 root 17 -5 0 0 0 S 0 0.0 0:00.00 ata/1
142 root 17 -5 0 0 0 S 0 0.0 0:00.00 ata_aux
143 root 17 -5 0 0 0 S 0 0.0 0:00.00 ksuspend_usbd
146 root 10 -5 0 0 0 S 0 0.0 0:00.00 khubd
159 root 16 -5 0 0 0 S 0 0.0 0:00.00 khpsbpkt
182 root 21 0 0 0 0 S 0 0.0 0:00.00 pdflush
183 root 15 0 0 0 0 S 0 0.0 0:01.29 pdflush
184 root 17 -5 0 0 0 S 0 0.0 0:00.00 kswapd0
185 root 17 -5 0 0 0 S 0 0.0 0:00.00 aio/0
186 root 18 -5 0 0 0 S 0 0.0 0:00.00 aio/1
793 root 15 -5 0 0 0 S 0 0.0 0:00.00 kpsmoused
832 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_0
833 root 10 -5 0 0 0 S 0 0.0 0:00.00 scsi_eh_1
880 root 15 -5 0 0 0 S 0 0.0 0:00.00 reiserfs/0
881 root 10 -5 0 0 0 S 0 0.0 0:00.05 reiserfs/1
1061 root 15 -4 1852 600 344 S 0 0.1 0:00.87 udevd
6567 root 15 0 2024 744 444 S 0 0.1 0:00.04 syslog-ng
6641 named 18 0 14336 11m 1896 S 0 1.2 0:01.69 named
6778 mysql 19 0 138m 26m 3788 S 0 2.6 0:00.17 mysqld
6879 root 15 0 3880 1012 712 S 0 0.1 0:00.00 sshd
6957 root 18 0 17992 6008 3076 S 0 0.6 0:00.16 apache2
6959 apache 20 0 17020 2600 836 S 0 0.3 0:00.00 apache2
7189 apache 18 0 17992 3788 844 S 0 0.4 0:00.00 apache2
7190 apache 19 0 17992 3788 844 S 0 0.4 0:00.00 apache2
7191 apache 19 0 17992 3788 844 S 0 0.4 0:00.00 apache2
7192 apache 19 0 17992 3788 844 S 0 0.4 0:00.00 apache2
7193 apache 19 0 17992 3788 844 S 0 0.4 0:00.00 apache2
8008 messageb 15 0 2092 744 604 S 0 0.1 0:00.00 dbus-daemon
8142 root 15 0 1712 600 500 S 0 0.1 0:00.00 crond
8209 haldaemo 18 0 8824 7304 1608 S 0 0.7 0:00.37 hald
8210 root 18 0 2816 1016 864 S 0 0.1 0:00.00 hald-runner
8216 haldaemo 15 0 1928 788 680 S 0 0.1 0:00.00 hald-addon-acpi
8232 root 18 0 1736 596 524 S 0 0.1 0:00.01 hald-addon-stor
8363 root 19 0 5296 2652 1156 S 0 0.3 0:00.00 nessusd
8475 root 18 0 6252 1756 1336 S 0 0.2 0:00.00 master
8517 postfix 18 0 6288 1748 1340 S 0 0.2 0:00.00 pickup
8518 postfix 15 0 6336 1800 1384 S 0 0.2 0:00.00 qmgr
8547 root 18 0 2188 832 668 S 0 0.1 0:00.01 xinetd
8612 root 15 0 3572 1976 480 S 0 0.2 0:06.21 flow-capture
8614 root 18 0 1668 524 392 S 0 0.1 0:00.00 cdp-send
8618 root 24 0 1644 328 256 S 0 0.0 0:00.00 pamsmbd
8619 root 17 0 3636 1040 576 S 0 0.1 0:00.00 mount.smbfs
8626 root 10 -5 0 0 0 S 0 0.0 0:00.10 smbiod
8639 root 18 0 1568 612 528 S 0 0.1 0:00.00 agetty
8640 root 18 0 1564 608 528 S 0 0.1 0:00.00 agetty
8641 root 18 0 1568 612 528 S 0 0.1 0:00.00 agetty
8642 root 18 0 1568 612 528 S 0 0.1 0:00.00 agetty
8643 root 18 0 1564 608 528 S 0 0.1 0:00.00 agetty
8644 root 18 0 1568 612 528 S 0 0.1 0:00.00 agetty
8656 root 18 0 12948 9944 1356 S 0 1.0 0:00.85 smokeping
8657 root 17 0 6696 2152 1732 S 0 0.2 0:00.01 sshd
8662 e19425 15 0 6836 1452 1000 S 0 0.1 0:04.85 sshd
8663 e19425 16 0 2988 1560 1240 S 0 0.2 0:00.00 bash
8672 root 18 0 2260 1008 776 S 0 0.1 0:00.00 su
8673 root 15 0 2608 1576 1260 S 0 0.2 0:00.03 bash
8694 root 17 0 6700 2144 1732 S 0 0.2 0:00.01 sshd
8699 monitor 15 0 6700 1440 1004 S 0 0.1 0:00.00 sshd
8700 monitor 18 0 2864 1284 1068 S 0 0.1 0:00.00 bash
8708 monitor 15 0 1536 428 356 S 0 0.0 0:00.08 tail
8709 monitor 18 0 3368 1600 1292 S 0 0.2 0:00.03 tacacs-watch

nellson · 01-02-2007, 11:56 AM

Ok, I think I found it. I killed every app that I did not NEED and rebooted. 977 megs free. 90 mins later, I had lost 2 megs... Whoo hoo, so it's one of my apps. I started one instance of flow-capture. 15 mins later I had lost over 100 megs, and it drops 100K every 10 secs.
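A rough sketch for measuring this kind of drop instead of eyeballing "watch free": it reads /proc/meminfo twice and reports where the memory went. The helper names are made up for illustration. If MemFree falls while Buffers and Cached rise by about the same amount, the memory is being used as disk cache rather than leaking.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo lines such as 'MemFree:  109780 kB' into kB ints."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info

def memory_shift(before, after, keys=("MemFree", "Buffers", "Cached")):
    """Change in kB for each key, plus the net change of all keys combined."""
    change = {k: after[k] - before[k] for k in keys}
    change["net"] = sum(change[k] for k in keys)
    return change

def read_meminfo(path="/proc/meminfo"):
    """Take one sample from the running system."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Take one sample with `read_meminfo()`, wait a few minutes with the suspect app running, take another, and pass both to `memory_shift()`. A "net" value close to zero means the free memory merely moved into buffers and cache.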

I am going to reboot with all apps back on minus flow-capture. See what I get.

nellson · 01-02-2007, 12:26 PM

I have now rebooted with everything running except flow-capture, and the memory drop is faster. I disabled MRTG and Smokeping (monitoring tools) and the decrease slowed to a crawl and even went the other way a few times. Those three apps are my big network users. Maybe a leak in the NIC driver? (Broadcom Corporation NetXtreme BCM5704 Gigabit)

< > Alteon AceNIC/3Com 3C985/NetGear GA620 Gigabit support
< > D-Link DL2000-based Gigabit Ethernet support
<M> Intel(R) PRO/1000 Gigabit Ethernet support
[ ] Use Rx Polling (NAPI)
[ ] Disable Packet Split for PCI express adapters
< > National Semiconductor DP83820 support
< > Packet Engines Hamachi GNIC-II support
< > Packet Engines Yellowfin Gigabit-NIC support (EXPERIMENTAL)
< > Realtek 8169 gigabit ethernet support
< > SiS190/SiS191 gigabit ethernet support
< > New SysKonnect GigaEthernet support
< > SysKonnect Yukon2 support (EXPERIMENTAL)
< > Marvell Yukon Chipset / SysKonnect SK-98xx Support (DEPRECATED)
< > VIA Velocity support
<*> Broadcom Tigon3 support
< > Broadcom NetXtremeII support
< > QLogic QLA3XXX Network Driver Support

I know I do not have a NetXtreme II, so I grabbed the Tigon3. Perhaps I have the wrong one?

lspci

00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 09)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 09)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 09)
00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 09)
00:1c.0 PCI bridge: Intel Corporation 6300ESB 64-bit PCI-X Bridge (rev 02)
00:1d.0 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.1 USB Controller: Intel Corporation 6300ESB USB Universal Host Controller (rev 02)
00:1d.4 System peripheral: Intel Corporation 6300ESB Watchdog Timer (rev 02)
00:1d.5 PIC: Intel Corporation 6300ESB I/O Advanced Programmable Interrupt Controller (rev 02)
00:1d.7 USB Controller: Intel Corporation 6300ESB USB2 Enhanced Host Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 0a)
00:1f.0 ISA bridge: Intel Corporation 6300ESB LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 6300ESB PATA Storage Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 6300ESB SATA Storage Controller (rev 02)
01:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
01:04.0 System peripheral: Compaq Computer Corporation Integrated Lights Out Controller (rev 01)
01:04.2 System peripheral: Compaq Computer Corporation Integrated Lights Out Processor (rev 01)
02:02.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:02.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
06:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
06:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)

haertig · 01-02-2007, 01:48 PM

It sounds like you're not quite sure which program has the memory leak. I'd recommend using 'top' to help find this out.

(1) Start top
(2) Hit 'G' then '3' to switch to memory view (that must be an uppercase 'G', not lowercase)
(3) Hit 'x' then 'b' to turn on various highlighting
(4) You will probably find that the "%MEM" column is highlighted (the highlighted column is your sort column). You want to sort on %MEM and look (over time) for the process that is consuming more and more memory.
(5) If %MEM is not your default sort column in step (4), use your '<' and '>' keys to move the highlight to the %MEM column.
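If top itself stalls along with everything else, the same comparison can be made from two snapshots taken some time apart. This is only a sketch, assuming a Linux /proc layout where /proc/<pid>/status carries `Name` and `VmRSS` lines; the function names are invented for illustration.

```python
import os

def rss_snapshot(proc="/proc"):
    """Map pid -> (command name, resident set size in kB) from /proc/<pid>/status."""
    snap = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # process exited between listdir() and open()
        rss = fields.get("VmRSS")  # kernel threads have no VmRSS line
        if rss:
            snap[int(entry)] = (fields["Name"].strip(), int(rss.split()[0]))
    return snap

def growth(before, after, top=10):
    """Processes present in both snapshots, largest RSS growth (kB) first."""
    grown = [
        (after[pid][1] - before[pid][1], pid, after[pid][0])
        for pid in after
        if pid in before
    ]
    return sorted(grown, reverse=True)[:top]
```

Take `rss_snapshot()` once, wait ten minutes or so, take it again and print `growth(first, second)`. A leaking process should stand out at the top of the list; if nothing grows, the shrinking "free" figure is coming from somewhere other than a process.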

[edit]Fixed spelling error[/edit]

syg00 · 01-02-2007, 04:19 PM

Quote:

Originally Posted by nellson

top - 06:42:23 up 3 days, 1:32, 3 users, load average: 3.07, 2.58, 2.69
Tasks: 85 total, 2 running, 83 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.0%us, 4.0%sy, 0.0%ni, 41.8%id, 45.5%wa, 0.2%hi, 0.5%si, 0.0%st
Mem: 1035208k total, 925428k used, 109780k free, 115548k buffers
Swap: 2008116k total, 144k used, 2007972k free, 629988k cached

...

How would I go about tracking down what is eating so much RAM? (I assume this must be a leak?)

That amount of swap usage (144k, or 36 pages at 4k each) over 3 days equates to roughly 1 swap movement every 2 hours.
IMHO you do not have a memory problem - your issue lies elsewhere. Linux attempts to maximize the memory used, for efficiency - and after all, what's the point of having it all just lying around idle?

May be a driver issue, might be something else - will take some legwork (like you are doing) to determine.
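To put numbers on the point about memory not sitting idle, here is a small sketch that adds buffers and cache back onto "free" for the three top headers posted in this thread. Treating buffers plus cached as reclaimable is an approximation (dirty pages have to be written out first), and the function names are made up.

```python
import re

def parse_top_header(text):
    """Parse top's 'Mem:' and 'Swap:' summary lines into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        label, _, rest = line.partition(":")
        for num, name in re.findall(r"(\d+)k (\w+)", rest):
            # top prints "cached" on the Swap: line, but it describes RAM.
            if label.strip() == "Mem" or name == "cached":
                values[name] = int(num)
            else:
                values["swap_" + name] = int(num)
    return values

def available_kb(values):
    """Free RAM plus buffers and page cache, which the kernel can mostly reclaim."""
    return values["free"] + values["buffers"] + values["cached"]

# The three headers nellson posted, keyed by uptime.
SNAPSHOTS = {
    "up 3 days": "Mem: 1035208k total, 925428k used, 109780k free, 115548k buffers\n"
                 "Swap: 2008116k total, 144k used, 2007972k free, 629988k cached",
    "up 15 min": "Mem: 1035208k total, 369316k used, 665892k free, 39976k buffers\n"
                 "Swap: 2008116k total, 0k used, 2008116k free, 233584k cached",
    "up 1:27": "Mem: 1035208k total, 816904k used, 218304k free, 185380k buffers\n"
               "Swap: 2008116k total, 0k used, 2008116k free, 497544k cached",
}

for label, header in SNAPSHOTS.items():
    v = parse_top_header(header)
    print(f"{label:10s} free={v['free']:7d}k  free+buffers+cached={available_kb(v):7d}k")
```

The sums come out to 855316k, 939452k and 901228k out of 1035208k total. So even after 3 days, well over 800 megs was either free or holding cached disk data, which fits the view that the shrinking "free" figure is mostly cache growth.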