Process hangs under high CPU load even with its priority set to the maximum
Hi Experts,
Two applications (app1 and app2) are running on a 64-bit server with four quad-core CPUs and RHEL. Code:
# uname -a
I set the nice value of these applications to -5 and their rtprio to RT: Code:
top - 12:30:33 up 49 days, 11:15, 3 users, load average: 6.34, 3.33, 1.95
Even with the high priority of app1 and app2, when I run a dd command that consumes 100% of its core, these two applications hang. Shouldn't the "dd" command yield the CPU to these applications when it runs on the same core as them?
P.S.: app1 and app2 are applications that interact with the SCTP stack, and response time is critical for them. |
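For reference, the settings the poster describes can be applied with renice and chrt. This is only a sketch: the rtprio value of 50 and the pidof lookups are assumptions, and both commands need root for these operations.

```shell
# Raise the scheduling priority of both apps (negative nice needs root)
renice -5 -p "$(pidof app1)"
renice -5 -p "$(pidof app2)"

# Put the apps in the real-time FIFO class; 50 is an arbitrary rtprio
chrt -f -p 50 "$(pidof app1)"
chrt -f -p 50 "$(pidof app2)"

# Verify the scheduling class and priority actually changed
chrt -p "$(pidof app1)"
```

Note that even SCHED_FIFO only guarantees CPU time ahead of other runnable tasks; it does nothing for a process blocked in uninterruptible sleep, which becomes relevant later in this thread.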
I think there will always be some interruptions of app1 & app2, even if they are smaller and less frequent at the highest priority.
If you can't tolerate any interruption, a variant of realtime Linux http://www.linuxfordevices.com/c/a/L...ference-Guide/ could be more suitable than a (more general-purpose) Red Hat box. Of course, if you can manage it, it could be useful to dedicate 2 of your cores to app1 & app2 and use the other 2 for most of the other tasks, like dd. |
Thanks timmeke, there's a procedure to isolate cores on RHEL http://kbase.redhat.com/faq/docs/DOC-15596 and when I start my applications I'll force them to run on the isolated processors.
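Once cores have been isolated (the Red Hat procedure uses the isolcpus= kernel boot parameter), taskset can pin the apps to them. A sketch - the core numbers 2 and 3 and the binary paths are assumptions:

```shell
# Launch each application pinned to an isolated core
taskset -c 2 /opt/app1/app1 &
taskset -c 3 /opt/app2/app2 &

# Or re-pin processes that are already running
taskset -pc 2 "$(pidof app1)"
taskset -pc 3 "$(pidof app2)"
```

The scheduler never places ordinary tasks on isolcpus cores, so only explicitly pinned processes (and interrupt handlers, unless their affinity is also set) will run there.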
AFAIK, irqbalance is responsible for distributing jobs across the CPU cores; can't I configure this process to never run a job on my applications' core? (in case I don't want to isolate them completely) |
If I were you, I'd focus on the posted procedure.
I can understand the reluctance to completely disable irqbalance, as proposed in the procedure, and to rather somehow "reconfigure" it. Do keep in mind that irqbalance does not balance jobs/processes, only the handling of hardware interrupts by the different cores - assigning your jobs to the cores (and keeping other jobs off the same cores) is just part of the story.

From a quick search, there doesn't seem to be a way to configure irqbalance - it only seems to be designed for a sophisticated "equal load" balance. That is, except by disabling it and setting the IRQ affinity to CPUs yourself, as outlined in the procedure. But that's a question better suited for the hardware forum... It's probably either irqbalance or manually set (in /proc/...), not both.

As I understood your goal, you may want to dedicate cores not just to running app1 and/or app2, but maybe even to handling the device interrupts that feed the SCTP stack (which in turn is polled by your apps). Limiting these interrupts to just one core, and keeping other interrupts away from that core, should help you process them in a more timely fashion. The question remains which cores to pick - this is architecture related (as the 4 cores are not always working completely independently of each other).

So, in practice, I'd recommend:
1- Just to be sure, have a bootable CD (Live CD, ...) on stand-by in case things do get messed up.
2- Have a think about which cores to dedicate and how, then try the posted procedure.
3- If you run into trouble with the IRQs, post to the hardware forum here on LQ. |
http://www.redhat.com/docs/en-US/Red...s_Binding.html
Quote:
Question remains which cores to pick - this would be architecture related (as the 4 cores are not always completely working independently from each other).
I'll do my tests and get back to you :) Anyway, many thanks for your reply |
I've seen that article before - and I don't like it. cgroups (aka cpusets) is a better option IMHO.
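A minimal cpuset sketch of the alternative suggested here. The mount point, the "rtapps" group name, and the CPU numbers are all assumptions; on newer cgroup-based mounts the control files are prefixed cpuset. (cpuset.cpus etc.) rather than the bare names used by the legacy cpuset filesystem:

```shell
# Mount the cpuset pseudo-filesystem and create a set owning CPUs 2-3
mkdir -p /dev/cpuset
mount -t cpuset none /dev/cpuset
mkdir /dev/cpuset/rtapps
echo 2-3 > /dev/cpuset/rtapps/cpus
echo 0   > /dev/cpuset/rtapps/mems

# Move each app into the set (the tasks file takes one PID per write)
for pid in $(pidof app1) $(pidof app2); do
    echo "$pid" > /dev/cpuset/rtapps/tasks
done
```

Unlike plain taskset, a cpuset can also be made exclusive, keeping other tasks off those CPUs without an isolcpus reboot.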
Let's see the result of this - "grep -i ^processor /proc/cpuinfo" |
Quote:
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
processor : 8
processor : 9
processor : 10
processor : 11
processor : 12
processor : 13
processor : 14
processor : 15 |
Good - just making sure. Try this to see which tasks are in uninterruptible sleep:
Code:
top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}' |
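An alternative to the top pipeline above, using ps to list only the tasks in uninterruptible sleep (standard procps options; wchan shows the kernel function each task is blocked in, which can hint at the cause):

```shell
# Print the header line plus every task whose state is "D"
ps -eo state,pid,wchan:20,comm | awk 'NR == 1 || $1 == "D"'
```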
Quote:
Code:
top - 11:19:45 up 52 days, 10:04, 4 users, load average: 6.81, 3.96, 2.54 |
I've noticed that these applications always hang after entering this uninterruptible sleep state - will isolating their CPUs solve the issue?
P.S.: the IO activity (the dd command, ...) runs on the same disk |
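Since the bulk copy shares the disk with the apps, one mitigation worth trying is to demote dd to the idle IO scheduling class, so the CFQ scheduler services it only when the disk is otherwise idle. A sketch - the dd arguments are placeholders, and the idle class requires the CFQ elevator (and, on older kernels, root):

```shell
# Run dd in the idle IO class so it yields the disk to the apps
ionice -c3 dd if=/dev/zero of=/tmp/testfile bs=1M count=1024

# Or change the IO class of a dd that is already running
ionice -c3 -p "$(pidof dd)"
```

This addresses disk contention directly, which CPU priorities (nice/rtprio) cannot, since they only govern CPU scheduling.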
Maybe you can tell me a little more about what your apps do on the hard disk? Do they impose a heavy IO load as well?
|
Quote:
Basically the traffic they receive is huge and they impose a heavy IO load, but from the output of "top", the total CPU load on their own processor does not go above 40% and the iowait percentage is 0%, as seen below: Code:
top - 19:01:16 up 52 days, 17:46, 3 users, load average: 1.53, 1.64, 1.79 |
One more piece of info: if I run the shell script below, which consumes a lot of CPU but does no IO at all, my applications keep working fine. So basically the problem is encountered only when there is heavy IO activity on the machine.
Code:
#!/bin/bash Code:
top - 10:24:03 up 53 days, 9:09, 2 users, load average: 2.37, 1.76, 1.45 |
As IO seems to be the bottleneck, try sorting that out first (as this will have the biggest impact).
A suggestion would be to use a (somewhat old-style) ramdisk to store the logfiles. You'll need to figure out a way to sync them to the hard drive occasionally. Once IO has improved, the bottleneck may shift to the CPU, in which case the isolation may become necessary for further gains. Bottom line: don't write the isolation off just yet... it doesn't look as promising now, but it may still come in handy later. |
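The ramdisk suggestion can be sketched with tmpfs. The mount point, size, and sync interval are assumptions; note that anything in tmpfs is lost on a crash or reboot, hence the periodic sync to disk:

```shell
# Mount a RAM-backed filesystem over the log directory (no disk IO on writes)
mount -t tmpfs -o size=256m tmpfs /var/log/app1

# Periodically flush the in-RAM logs to disk, e.g. from cron every 5 minutes:
# */5 * * * * rsync -a /var/log/app1/ /var/disk-backup/app1-logs/
```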
I/O isn't necessarily the problem - uninterruptible sleep is generally thought to be caused by (disk) I/O, but not always. It's just an attribute of a process. And as stated, the %wa is zero - that means no tasks are waiting for a CPU while I/O is outstanding. It could be hard to track anyway with that many online CPUs.
In this case I'd say poor code - presumably one of the applications under discussion, or a device driver. kjournald and pdflush are kernel threads - I wouldn't expect them to be in "D" state under a heavy I/O load.

The fact that there are so many pdflush processes might indicate the (disk) I/O is very bursty. pdflush is spawned as needed to write the data to disk (after a sync, say), and I would expect the extra instances to go away after a period of I/O inactivity. I would guess the SCTP driver is holding up the apps whilst decrypting (or whatever), then dumping a heap of I/O, then doing it all again.

For single-threaded code with that many CPUs, I can't see that binding processes to CPUs is going to help at all. Just guessing, of course. |