LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Server
User Name
Password
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Notices


Reply
  Search this Thread
Old 10-04-2023, 04:32 AM   #1
elgholm
LQ Newbie
 
Registered: Oct 2023
Posts: 24

Rep: Reputation: 0
Question Extremely high load average around 03:30 (AM) each night


As the title says, we're experiencing extremely high load averages at around 03:30 each night. Any idea on how to trace what's causing it?

We've checked our logs, cron-jobs, and even written scripts that outputs a lot of processes-information at around that time - but we can't find anything special that's happening. =(

The general feel is that something is spawning a lot of processes at around that time, and they then fade away quickly.

The CPU-usage doesn't really move at the same time, so this is probably some processes/threads that are in the D state, but we can't seem to see and find them!

The problem is that this really makes it hard for us to monitor our system with graphs - since we have spikes at 40-50 in load average at 03:30 each night, so all the other normal values (<1) during the day is barely noticeable in the graphs.

It's been this way since we installed the server fresh, so I guess this is something that the distribution does per default at that time. We haven't been hacked, or gotten a rootkit installed. This is since the fresh install, before we even connected the server to the Internet.

We're on a RHEL8-clone distribution, with all the general server tools and utilities at our disposals. Process accounting information and monitoring tools ar installed, and running, and I can get information from them about the extreme load average at that exact time - using "sa" etc. - but I don't know how to get what process that is causing it (if that's even possible, but I think it should be)...

Any help would be greatly appreciated. We've given up. =(
 
Old 10-05-2023, 12:52 PM   #2
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,707

Rep: Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720
It would make sense to schedule a recording of the process list each night starting before and continuing through the event so you can see what processes are running. If that shows nothing of interest, then it may be that something that is running ALL the time is simply "getting busy with it" at that time and you would have to follow up by logging top processes to detect which was suddenly loading the queue.
 
Old 10-05-2023, 03:23 PM   #3
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,710

Rep: Reputation: 7972Reputation: 7972Reputation: 7972Reputation: 7972Reputation: 7972Reputation: 7972Reputation: 7972Reputation: 7972Reputation: 7972Reputation: 7972Reputation: 7972
Quote:
Originally Posted by elgholm View Post
As the title says, we're experiencing extremely high load averages at around 03:30 each night. Any idea on how to trace what's causing it?

We've checked our logs, cron-jobs, and even written scripts that outputs a lot of processes-information at around that time - but we can't find anything special that's happening. The general feel is that something is spawning a lot of processes at around that time, and they then fade away quickly.

The CPU-usage doesn't really move at the same time, so this is probably some processes/threads that are in the D state, but we can't seem to see and find them! The problem is that this really makes it hard for us to monitor our system with graphs - since we have spikes at 40-50 in load average at 03:30 each night, so all the other normal values (<1) during the day is barely noticeable in the graphs.

It's been this way since we installed the server fresh, so I guess this is something that the distribution does per default at that time. We haven't been hacked, or gotten a rootkit installed. This is since the fresh install, before we even connected the server to the Internet.

We're on a RHEL8-clone distribution, with all the general server tools and utilities at our disposals. Process accounting information and monitoring tools ar installed, and running, and I can get information from them about the extreme load average at that exact time - using "sa" etc. - but I don't know how to get what process that is causing it (if that's even possible, but I think it should be)...
I'd ask what you mean by "fade away quickly"...that's kind of an open term. That could mean 5-10 minutes, or 30 seconds.

If it's more than a few minutes, I'd start with a low-tech little script, running "top -b -n 100" (change iteration number larger/smaller as needed) and shoving it into a file. You'll have an output list of all the PID's, and can start from there. CRON that to start at 3:30 so you can capture what's going on.
 
Old 10-05-2023, 03:54 PM   #4
metaed
Member
 
Registered: Apr 2022
Location: US
Distribution: Slackware64 15.0
Posts: 371

Rep: Reputation: 172Reputation: 172
There are two cases to consider.

1. The "bad" process or processes are daemons. They run madly for a short time, but they do not terminate.

The tool of choice is one that shows current process CPU time before and after. For example, ps -ef and read the accumulated CPU time in the TIME column.

2. The "bad" process or processes are created, run madly for a short time, and terminate.

The tool of choice is one that shows terminated process CPU time. I think I understood you to say you can see the problem with sa, so probably that's the case you have. But sa only shows you a rollup, unless you run sa -u, in which case you get every terminated process but with very little information. You can get a much more detailed list of terminated processes from dump-acct if you have that. It will include the parent PID. That information may help you a lot. It will let you trace the "bad" process through its ancestor processes to whatever is kicking off the thing at 0330.
 
Old 10-06-2023, 01:29 AM   #5
elgholm
LQ Newbie
 
Registered: Oct 2023
Posts: 24

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by wpeckham View Post
It would make sense to schedule a recording of the process list each night starting before and continuing through the event so you can see what processes are running. If that shows nothing of interest, then it may be that something that is running ALL the time is simply "getting busy with it" at that time and you would have to follow up by logging top processes to detect which was suddenly loading the queue.
That's kind of the utility-script I've already written, and am running, but it doesn't really show anything out of the ordinary (except for the spike in load average). There's not a gazillion new processes started, atleast not at the instant my script fetches that information, and there's not a lot of processes in the D state. That's why I suspect there's some daemon process (on the OS level, enabled by default) that is "getting busy with it" - at exactly 03:30.

There's also not any swap happening, or other high i/o - as far as I can see. I have some Oracle db-processes performing i/o, but those look fairly normal - like they look during the day as well. I suspect that something "wakes up" at 03:30 - spawns a couple of hundred threads, and then within a couple of seconds dies.

This was our load average this night at 03:33:38: 43.61, 11.98, 4.18
This is our load average right now, when users are hammering the server with requests: 0.14, 0.29, 0.33
 
Old 10-06-2023, 01:37 AM   #6
elgholm
LQ Newbie
 
Registered: Oct 2023
Posts: 24

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by TB0ne View Post
I'd ask what you mean by "fade away quickly"...that's kind of an open term. That could mean 5-10 minutes, or 30 seconds.

If it's more than a few minutes, I'd start with a low-tech little script, running "top -b -n 100" (change iteration number larger/smaller as needed) and shoving it into a file. You'll have an output list of all the PID's, and can start from there. CRON that to start at 3:30 so you can capture what's going on.
Within 30 seconds, probably within 10 seconds - since I can't "snapshot" it using my scripts.
The problem is that the instantaneous load average is so extreme so the 5-minute value takes a couple of minutes to "fade away".
This morning the graph peaked at around 9, and then took 15 minutes to fade down to around 0.5.
Sometimes it peaks at 25... =(
Our load average during high load is normally 0.3 to 0.7.

Your idea of a script is the one I'm currently running.
I actually collect this data every 3 seconds at the time of these incidents:
# top -bcn1 -w512
# ps -ax -o pid,ppid,stat,pcpu,psr,pmem,user,comm,wchan:30,args --sort -pcpu
# /usr/sbin/iotop -b -o -t -n 3
# iostat -x 1 4
# vmstat -SM 1 4

Is I wrote above, I get nothing out of the ordinary from the data. No high swaps, no excessive i/o, no extra weird processes, not a gazillion D-states. Just extreme load average. =(
 
Old 10-06-2023, 01:42 AM   #7
elgholm
LQ Newbie
 
Registered: Oct 2023
Posts: 24

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by metaed View Post
There are two cases to consider.

1. The "bad" process or processes are daemons. They run madly for a short time, but they do not terminate.

The tool of choice is one that shows current process CPU time before and after. For example, ps -ef and read the accumulated CPU time in the TIME column.

2. The "bad" process or processes are created, run madly for a short time, and terminate.

The tool of choice is one that shows terminated process CPU time. I think I understood you to say you can see the problem with sa, so probably that's the case you have. But sa only shows you a rollup, unless you run sa -u, in which case you get every terminated process but with very little information. You can get a much more detailed list of terminated processes from dump-acct if you have that. It will include the parent PID. That information may help you a lot. It will let you trace the "bad" process through its ancestor processes to whatever is kicking off the thing at 0330.
I did not know about dump-acct. Thanks for that! I will look into how to use it! And yeah, sa seems like a nice utility but it only shows the summary of stuff - not the actual processes - as far as I understand.

Personally I think it's number 1 that's happening, because I should be able to see all the newly created processes using my scripts. It's probably some kind of default installed daemon that wakes up at exactly 03:30, spawns a couple of hundred threads to do something, which wait for something a short period of time - maybe they lock each other somehow - and then they exit (since there's probably nothing to do). The problem is that this gives me a load average of around 50-100 for a short burst, which pushes the 5-min average to 25-50, which in turn makes my monitoring graphs of the system quite unusable. =(
 
Old 10-06-2023, 01:54 AM   #8
lvm_
Member
 
Registered: Jul 2020
Posts: 961

Rep: Reputation: 340Reputation: 340Reputation: 340Reputation: 340
There are tools for logging system activity e.g. a somewhat clunky standard auditd or a much simpler snoopy.
 
Old 10-06-2023, 02:28 AM   #9
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,976

Rep: Reputation: 7337Reputation: 7337Reputation: 7337Reputation: 7337Reputation: 7337Reputation: 7337Reputation: 7337Reputation: 7337Reputation: 7337Reputation: 7337Reputation: 7337
that can be a disk check tool which will spawn a lot of processes (for example for every partition) and they look around a bit and do nothing. Or a distro update check. Or something similar. I don't know your setup.
 
Old 10-06-2023, 11:38 AM   #10
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,816

Rep: Reputation: 1211Reputation: 1211Reputation: 1211Reputation: 1211Reputation: 1211Reputation: 1211Reputation: 1211Reputation: 1211Reputation: 1211
Check if a cron job runs at 3:30.
Look in /var/cron/ and /etc/cron.*/

Make a cron job that at 3:30 runs
pidstat >pidstat.out
 
Old 10-07-2023, 06:38 AM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,145

Rep: Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124
Why do you care ?. This is a rabbit hole you probably don't want to enter.
Here is a quote from someone that actually understands what loadavg really is.
Quote:
These three numbers are the 1, 5, and 15 minute load averages. Except they aren't really averages, and they aren't 1, 5, and 15 minutes. As can be seen in the source above, 1, 5, and 15 minutes are constants used in an equation, which calculate exponentially-damped moving sums of a five second average. The resulting 1, 5, and 15 minute load averages reflect load well beyond 1, 5, and 15 minutes.
If you only see these abnormal numbers at 03:30 (and not accumulating over say 30 minutes) something is broken - and that something is likely to be kernel code. Push the issue upstream to your distro maintainers.
 
Old 10-07-2023, 01:52 PM   #12
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,707

Rep: Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720Reputation: 2720
Actually those numbers do not reflect CPU load at all as far as I can tell. They appear to be related to queue loading as measured by stack size and/or response time snapshots. Many things can affect those numbers.

I agree that this does not appear any kind of problem and a detailed investigation may not be a good use of your time, but if the machine were one of mine I would be VERY curious!
Checking the I/O, memory, and CPU usage of the processes running during the event window and comparing against results outside the window might prove instructive. IT is certainly something I would try.
 
Old 10-08-2023, 02:20 AM   #13
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,145

Rep: Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124Reputation: 4124
Quote:
Originally Posted by wpeckham View Post
... but if the machine were one of mine I would be VERY curious!
Yeah, me too.

But if the attempts the OP has already tried don't show anything, I'd be looking at some kernel traces. Hard to convince me it's more important than my weekend slacking off tho' ....

loadavg is probably the most misunderstood (and pointless) metric in linux IMHO.
 
Old 10-09-2023, 02:56 AM   #14
elgholm
LQ Newbie
 
Registered: Oct 2023
Posts: 24

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by MadeInGermany View Post
pidstat >pidstat.out
Did not know about pidstat. Thanks!
 
Old 10-09-2023, 02:58 AM   #15
elgholm
LQ Newbie
 
Registered: Oct 2023
Posts: 24

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by syg00 View Post
Why do you care ?. This is a rabbit hole you probably don't want to enter.
Here is a quote from someone that actually understands what loadavg really is.If you only see these abnormal numbers at 03:30 (and not accumulating over say 30 minutes) something is broken - and that something is likely to be kernel code. Push the issue upstream to your distro maintainers.
Well, I disagree. This is a rabbit hole I do want to enter.
I like to be on top of things when it comes to our servers and infrastructure.

And also, I understand very well what load average is and how it's calculated thank you.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Need Help with High Load Average & High Sleeping Task js9028 Linux - Server 3 07-20-2019 08:55 AM
[SOLVED] tesseract-4 (pdfsandwich) and high load average/CPU load kaz2100 Linux - Software 2 08-13-2018 09:02 PM
[SOLVED] Redshift transition from day-night extremely slow Lysander666 Slackware 8 08-05-2018 12:36 PM
Load average stay as high as around 1.00 lawrence_lee_lee Linux - Software 2 09-10-2008 01:22 AM
CPU high load every night invent Linux - Server 2 11-22-2007 10:36 PM

LinuxQuestions.org > Forums > Linux Forums > Linux - Server

All times are GMT -5. The time now is 05:03 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration