Extremely high load average around 03:30 (AM) each night

elgholm · 10-04-2023, 04:32 AM

As the title says, we're experiencing extremely high load averages at around 03:30 each night. Any idea on how to trace what's causing it?

We've checked our logs, cron-jobs, and even written scripts that outputs a lot of processes-information at around that time - but we can't find anything special that's happening. =(

The general feel is that something is spawning a lot of processes at around that time, and they then fade away quickly.

The CPU-usage doesn't really move at the same time, so this is probably some processes/threads that are in the D state, but we can't seem to see and find them!

The problem is that this really makes it hard for us to monitor our system with graphs - since we have spikes at 40-50 in load average at 03:30 each night, so all the other normal values (<1) during the day is barely noticeable in the graphs.

It's been this way since we installed the server fresh, so I guess this is something that the distribution does per default at that time. We haven't been hacked, or gotten a rootkit installed. This is since the fresh install, before we even connected the server to the Internet.

We're on a RHEL8-clone distribution, with all the general server tools and utilities at our disposals. Process accounting information and monitoring tools ar installed, and running, and I can get information from them about the extreme load average at that exact time - using "sa" etc. - but I don't know how to get what process that is causing it (if that's even possible, but I think it should be)...

Any help would be greatly appreciated. We've given up. =(

wpeckham · 10-05-2023, 12:52 PM

It would make sense to schedule a recording of the process list each night starting before and continuing through the event so you can see what processes are running. If that shows nothing of interest, then it may be that something that is running ALL the time is simply "getting busy with it" at that time and you would have to follow up by logging top processes to detect which was suddenly loading the queue.

TB0ne · 10-05-2023, 03:23 PM

Quote:

Originally Posted by elgholm

As the title says, we're experiencing extremely high load averages at around 03:30 each night. Any idea on how to trace what's causing it?

We've checked our logs, cron-jobs, and even written scripts that outputs a lot of processes-information at around that time - but we can't find anything special that's happening. The general feel is that something is spawning a lot of processes at around that time, and they then fade away quickly.

The CPU-usage doesn't really move at the same time, so this is probably some processes/threads that are in the D state, but we can't seem to see and find them! The problem is that this really makes it hard for us to monitor our system with graphs - since we have spikes at 40-50 in load average at 03:30 each night, so all the other normal values (<1) during the day is barely noticeable in the graphs.

It's been this way since we installed the server fresh, so I guess this is something that the distribution does per default at that time. We haven't been hacked, or gotten a rootkit installed. This is since the fresh install, before we even connected the server to the Internet.

We're on a RHEL8-clone distribution, with all the general server tools and utilities at our disposals. Process accounting information and monitoring tools ar installed, and running, and I can get information from them about the extreme load average at that exact time - using "sa" etc. - but I don't know how to get what process that is causing it (if that's even possible, but I think it should be)...

I'd ask what you mean by "fade away quickly"...that's kind of an open term. That could mean 5-10 minutes, or 30 seconds.

If it's more than a few minutes, I'd start with a low-tech little script, running "top -b -n 100" (change iteration number larger/smaller as needed) and shoving it into a file. You'll have an output list of all the PID's, and can start from there. CRON that to start at 3:30 so you can capture what's going on.

metaed · 10-05-2023, 03:54 PM

There are two cases to consider.

1. The "bad" process or processes are daemons. They run madly for a short time, but they do not terminate.

The tool of choice is one that shows current process CPU time before and after. For example, ps -ef and read the accumulated CPU time in the TIME column.

2. The "bad" process or processes are created, run madly for a short time, and terminate.

The tool of choice is one that shows terminated process CPU time. I think I understood you to say you can see the problem with sa, so probably that's the case you have. But sa only shows you a rollup, unless you run sa -u, in which case you get every terminated process but with very little information. You can get a much more detailed list of terminated processes from dump-acct if you have that. It will include the parent PID. That information may help you a lot. It will let you trace the "bad" process through its ancestor processes to whatever is kicking off the thing at 0330.

elgholm · 10-06-2023, 01:29 AM

Quote:

Originally Posted by wpeckham

It would make sense to schedule a recording of the process list each night starting before and continuing through the event so you can see what processes are running. If that shows nothing of interest, then it may be that something that is running ALL the time is simply "getting busy with it" at that time and you would have to follow up by logging top processes to detect which was suddenly loading the queue.

That's kind of the utility-script I've already written, and am running, but it doesn't really show anything out of the ordinary (except for the spike in load average). There's not a gazillion new processes started, atleast not at the instant my script fetches that information, and there's not a lot of processes in the D state. That's why I suspect there's some daemon process (on the OS level, enabled by default) that is "getting busy with it" - at exactly 03:30.

There's also not any swap happening, or other high i/o - as far as I can see. I have some Oracle db-processes performing i/o, but those look fairly normal - like they look during the day as well. I suspect that something "wakes up" at 03:30 - spawns a couple of hundred threads, and then within a couple of seconds dies.

This was our load average this night at 03:33:38: 43.61, 11.98, 4.18
This is our load average right now, when users are hammering the server with requests: 0.14, 0.29, 0.33

elgholm · 10-06-2023, 01:37 AM

Quote:

Originally Posted by TB0ne

I'd ask what you mean by "fade away quickly"...that's kind of an open term. That could mean 5-10 minutes, or 30 seconds.

If it's more than a few minutes, I'd start with a low-tech little script, running "top -b -n 100" (change iteration number larger/smaller as needed) and shoving it into a file. You'll have an output list of all the PID's, and can start from there. CRON that to start at 3:30 so you can capture what's going on.

Within 30 seconds, probably within 10 seconds - since I can't "snapshot" it using my scripts.
The problem is that the instantaneous load average is so extreme so the 5-minute value takes a couple of minutes to "fade away".
This morning the graph peaked at around 9, and then took 15 minutes to fade down to around 0.5.
Sometimes it peaks at 25... =(
Our load average during high load is normally 0.3 to 0.7.

Your idea of a script is the one I'm currently running.
I actually collect this data every 3 seconds at the time of these incidents:
# top -bcn1 -w512
# ps -ax -o pid,ppid,stat,pcpu,psr,pmem,user,comm,wchan:30,args --sort -pcpu
# /usr/sbin/iotop -b -o -t -n 3
# iostat -x 1 4
# vmstat -SM 1 4

Is I wrote above, I get nothing out of the ordinary from the data. No high swaps, no excessive i/o, no extra weird processes, not a gazillion D-states. Just extreme load average. =(

elgholm · 10-06-2023, 01:42 AM

Quote:

Originally Posted by metaed

There are two cases to consider.

1. The "bad" process or processes are daemons. They run madly for a short time, but they do not terminate.

The tool of choice is one that shows current process CPU time before and after. For example, ps -ef and read the accumulated CPU time in the TIME column.

2. The "bad" process or processes are created, run madly for a short time, and terminate.

The tool of choice is one that shows terminated process CPU time. I think I understood you to say you can see the problem with sa, so probably that's the case you have. But sa only shows you a rollup, unless you run sa -u, in which case you get every terminated process but with very little information. You can get a much more detailed list of terminated processes from dump-acct if you have that. It will include the parent PID. That information may help you a lot. It will let you trace the "bad" process through its ancestor processes to whatever is kicking off the thing at 0330.

I did not know about dump-acct. Thanks for that! I will look into how to use it! And yeah, sa seems like a nice utility but it only shows the summary of stuff - not the actual processes - as far as I understand.

Personally I think it's number 1 that's happening, because I should be able to see all the newly created processes using my scripts. It's probably some kind of default installed daemon that wakes up at exactly 03:30, spawns a couple of hundred threads to do something, which wait for something a short period of time - maybe they lock each other somehow - and then they exit (since there's probably nothing to do). The problem is that this gives me a load average of around 50-100 for a short burst, which pushes the 5-min average to 25-50, which in turn makes my monitoring graphs of the system quite unusable. =(

lvm_ · 10-06-2023, 01:54 AM

There are tools for logging system activity e.g. a somewhat clunky standard auditd or a much simpler snoopy.

pan64 · 10-06-2023, 02:28 AM

that can be a disk check tool which will spawn a lot of processes (for example for every partition) and they look around a bit and do nothing. Or a distro update check. Or something similar. I don't know your setup.

MadeInGermany · 10-06-2023, 11:38 AM

Check if a cron job runs at 3:30.
Look in /var/cron/ and /etc/cron.*/

Make a cron job that at 3:30 runs
pidstat >pidstat.out

syg00 · 10-07-2023, 06:38 AM

Why do you care ?. This is a rabbit hole you probably don't want to enter.
Here is a quote from someone that actually understands what loadavg really is.

Quote:

These three numbers are the 1, 5, and 15 minute load averages. Except they aren't really averages, and they aren't 1, 5, and 15 minutes. As can be seen in the source above, 1, 5, and 15 minutes are constants used in an equation, which calculate exponentially-damped moving sums of a five second average. The resulting 1, 5, and 15 minute load averages reflect load well beyond 1, 5, and 15 minutes.

If you only see these abnormal numbers at 03:30 (and not accumulating over say 30 minutes) something is broken - and that something is likely to be kernel code. Push the issue upstream to your distro maintainers.

wpeckham · 10-07-2023, 01:52 PM

Actually those numbers do not reflect CPU load at all as far as I can tell. They appear to be related to queue loading as measured by stack size and/or response time snapshots. Many things can affect those numbers.

I agree that this does not appear any kind of problem and a detailed investigation may not be a good use of your time, but if the machine were one of mine I would be VERY curious!
Checking the I/O, memory, and CPU usage of the processes running during the event window and comparing against results outside the window might prove instructive. IT is certainly something I would try.

syg00 · 10-08-2023, 02:20 AM

Quote:

Originally Posted by wpeckham

... but if the machine were one of mine I would be VERY curious!

Yeah, me too.

But if the attempts the OP has already tried don't show anything, I'd be looking at some kernel traces. Hard to convince me it's more important than my weekend slacking off tho' ....

loadavg is probably the most misunderstood (and pointless) metric in linux IMHO.

elgholm · 10-09-2023, 02:56 AM

Quote:

Originally Posted by MadeInGermany

pidstat >pidstat.out

Did not know about pidstat. Thanks!

elgholm · 10-09-2023, 02:58 AM

Quote:

Originally Posted by syg00

Why do you care ?. This is a rabbit hole you probably don't want to enter.
Here is a quote from someone that actually understands what loadavg really is.If you only see these abnormal numbers at 03:30 (and not accumulating over say 30 minutes) something is broken - and that something is likely to be kernel code. Push the issue upstream to your distro maintainers.

Well, I disagree. This is a rabbit hole I do want to enter.
I like to be on top of things when it comes to our servers and infrastructure.

And also, I understand very well what load average is and how it's calculated thank you.