LinuxQuestions.org
Old 04-19-2018, 09:09 AM   #1
ams_tschoening
LQ Newbie
 
Registered: Apr 2018
Posts: 10

Rep: Reputation: Disabled
Where do initial task/kernel/cpu scheduler values after cold boot come from?


To make a long story short: I have two physical servers hosting two almost identical VMs, but one of them scales very badly under some workloads. Only some workloads are affected, not always all of them, and even the problematic workloads always run fine right after a restart; things only stop working after some time. I'm unable to reproduce this problem in the other VM and am now comparing differences in the output of "sysctl -a". One of those differences concerns the task/kernel/CPU scheduler.

So I'm wondering: where do those different values come from initially?

E.g., if they are calculated, what do they depend on (perhaps the VM host)? Do they change automatically at runtime? Educated guesses as to whether these concrete differences in my values could have any measurable impact on the overall scaling of a system are welcome as well.

Good vs. Bad VM:

Code:
--- C:/Users/tschoening/Desktop/Good VM.txt Mi 18. Apr 19:24:47 2018
+++ C:/Users/tschoening/Desktop/Bad VM.txt  Mi 18. Apr 19:24:44 2018
@@ -8,3 +8,3 @@ kernel.sched_domain.cpu0.domain0.imbalance_pct = 1
-kernel.sched_domain.cpu0.domain0.max_interval = 4
-kernel.sched_domain.cpu0.domain0.max_newidle_lb_cost = 75519
-kernel.sched_domain.cpu0.domain0.min_interval = 2
+kernel.sched_domain.cpu0.domain0.max_interval = 16
+kernel.sched_domain.cpu0.domain0.max_newidle_lb_cost = 155384
+kernel.sched_domain.cpu0.domain0.min_interval = 8
@@ -15 +15 @@ kernel.sched_domain.cpu0.domain0.wake_idx = 0
-kernel.sched_latency_ns = 12000000
+kernel.sched_latency_ns = 24000000
@@ -17 +17 @@ kernel.sched_migration_cost_ns = 500000
-kernel.sched_min_granularity_ns = 1500000
+kernel.sched_min_granularity_ns = 3000000
@@ -25 +25 @@ kernel.sched_tunable_scaling = 1
-kernel.sched_wakeup_granularity_ns = 2000000
+kernel.sched_wakeup_granularity_ns = 4000000
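A diff like the one above can be produced roughly like this (file names are placeholders, not what I actually used):

```shell
# Sketch (file names are placeholders): dump the scheduler-related sysctls
# on each guest, then compare the sorted dumps with a unified diff.
# "sysctl -a | grep '^kernel\.sched'" would work too; this reads /proc directly.
for f in /proc/sys/kernel/sched_*; do
  printf 'kernel.%s = %s\n' "$(basename "$f")" "$(cat "$f" 2>/dev/null)"
done | sort > good-vm.txt
# run the same on the other guest, save it as bad-vm.txt, then:
# diff -u good-vm.txt bad-vm.txt
```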
Good VM:

Code:
kernel.sched_domain.cpu0.domain0.busy_factor = 32
kernel.sched_domain.cpu0.domain0.busy_idx = 2
kernel.sched_domain.cpu0.domain0.cache_nice_tries = 1
kernel.sched_domain.cpu0.domain0.flags = 4143
kernel.sched_domain.cpu0.domain0.forkexec_idx = 0
kernel.sched_domain.cpu0.domain0.idle_idx = 1
kernel.sched_domain.cpu0.domain0.imbalance_pct = 125
kernel.sched_domain.cpu0.domain0.max_interval = 4
kernel.sched_domain.cpu0.domain0.max_newidle_lb_cost = 75519
kernel.sched_domain.cpu0.domain0.min_interval = 2
kernel.sched_domain.cpu0.domain0.name = DIE
kernel.sched_domain.cpu0.domain0.newidle_idx = 0
kernel.sched_domain.cpu0.domain0.wake_idx = 0

kernel.sched_latency_ns = 12000000
kernel.sched_migration_cost_ns = 500000
kernel.sched_min_granularity_ns = 1500000
kernel.sched_nr_migrate = 32
kernel.sched_rr_timeslice_ms = 25
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_shares_window_ns = 10000000
kernel.sched_time_avg_ms = 1000
kernel.sched_tunable_scaling = 1
kernel.sched_wakeup_granularity_ns = 2000000
Bad VM:

Code:
kernel.sched_domain.cpu0.domain0.busy_factor = 32
kernel.sched_domain.cpu0.domain0.busy_idx = 2
kernel.sched_domain.cpu0.domain0.cache_nice_tries = 1
kernel.sched_domain.cpu0.domain0.flags = 4143
kernel.sched_domain.cpu0.domain0.forkexec_idx = 0
kernel.sched_domain.cpu0.domain0.idle_idx = 1
kernel.sched_domain.cpu0.domain0.imbalance_pct = 125
kernel.sched_domain.cpu0.domain0.max_interval = 16
kernel.sched_domain.cpu0.domain0.max_newidle_lb_cost = 155384
kernel.sched_domain.cpu0.domain0.min_interval = 8
kernel.sched_domain.cpu0.domain0.name = DIE
kernel.sched_domain.cpu0.domain0.newidle_idx = 0
kernel.sched_domain.cpu0.domain0.wake_idx = 0

kernel.sched_latency_ns = 24000000
kernel.sched_migration_cost_ns = 500000
kernel.sched_min_granularity_ns = 3000000
kernel.sched_nr_migrate = 32
kernel.sched_rr_timeslice_ms = 25
kernel.sched_rt_period_us = 1000000
kernel.sched_rt_runtime_us = 950000
kernel.sched_shares_window_ns = 10000000
kernel.sched_time_avg_ms = 1000
kernel.sched_tunable_scaling = 1
kernel.sched_wakeup_granularity_ns = 4000000
 
Old 04-20-2018, 02:49 AM   #2
AwesomeMachine
LQ Guru
 
Registered: Jan 2005
Location: USA and Italy
Distribution: Debian testing/sid; OpenSuSE; Fedora; Mint
Posts: 5,524

Rep: Reputation: 1015
The initial values are based on your hardware and on suggested default values. If the servers differ, i.e. different memory or processors, the default values can differ too. None of what I see here is going to make much real-world difference. I know the numbers look very different, but in real-world terms they're not.
 
1 member found this post helpful.
Old 04-20-2018, 03:56 AM   #3
ams_tschoening
LQ Newbie
 
Registered: Apr 2018
Posts: 10

Original Poster
Rep: Reputation: Disabled
In theory the VM hosts are completely identical: CPUs, memory, HDDs etc.; only the load and the number of VMs differ. I guess that has some influence on at least some of the numbers and is taken into account when a VM starts?

Besides that, I now know for sure that most of the differences simply come from the fact that one VM had 2 vCPUs and the other 8 at the moment I executed "sysctl -a". Most likely there's no wrong global setting or the like, as I had assumed. I see exactly the same differing values for e.g. "*_interval" and "sched_*_ns" on an Ubuntu 14.04 installation I had on my desktop in VMware Workstation with 2 and 8 vCPUs. Completely different hardware and VMs, same numbers.
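For what it's worth, the numbers are consistent with CFS's default "logarithmic" scaling of its base tunables (sched_tunable_scaling = 1 in both dumps): each base value is multiplied by 1 + log2(number of CPUs). A quick sketch of that calculation; the base values of 6 ms and 0.75 ms are my reading of the kernel source of that era, so treat them as assumptions:

```shell
# Sketch of CFS "log" scaling (sched_tunable_scaling = 1):
#   effective_value = base_value * (1 + ilog2(ncpus))
# Base values assumed from kernel/sched/fair.c of that era:
#   sched_latency_ns = 6000000, sched_min_granularity_ns = 750000.
ilog2() {  # integer log base 2 by repeated halving
  local n=$1 l=0
  while [ "$n" -gt 1 ]; do n=$((n / 2)); l=$((l + 1)); done
  echo "$l"
}
for ncpus in 2 8; do
  factor=$((1 + $(ilog2 "$ncpus")))
  echo "$ncpus vCPUs: factor=$factor latency=$((factor * 6000000)) min_granularity=$((factor * 750000))"
done
# -> 2 vCPUs: factor=2 latency=12000000 min_granularity=1500000
# -> 8 vCPUs: factor=4 latency=24000000 min_granularity=3000000
```

That reproduces exactly the 12000000 vs. 24000000 ns sched_latency_ns difference between the two dumps above.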

Last edited by ams_tschoening; 04-20-2018 at 03:58 AM.
 
Old 04-20-2018, 04:15 AM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,145

Rep: Reputation: 4124
Huh?
You said the guests were "almost identical"! I was going to ask if that was like "almost pregnant".
300% more (nominal) compute power is not almost identical.

What about memory? Swapping, I/O contention ... You need to look (initially) at the macro level, not at micro knobs like sysctls.
 
Old 04-20-2018, 05:02 AM   #5
ams_tschoening
LQ Newbie
 
Registered: Apr 2018
Posts: 10

Original Poster
Rep: Reputation: Disabled
I had tested the VMs with exactly the same hardware and wasn't able to reproduce my problem. Afterwards I deliberately reduced the hardware of the test VM back to 2 vCPUs, its default RAM etc., to see what happens when I put the same load on it as with 8 vCPUs. Nothing happened: the strange "slowness" I see in the production VM didn't occur, even with far less computing power. Things simply took longer, of course, but the system responded as expected. Because of that I decided to compare the settings I had at that moment; I really thought it would make spotting important differences easier, and the problem doesn't seem to depend on raw computing power.

htop, sar, iostat etc. didn't reveal any obvious bottleneck. No swapping occurred; plenty of RAM was available at all times, either free or used for caches and buffers. The only thing that sometimes looks somewhat strange is the number of context switches on both VM hosts under load when the problem occurs, and that's exactly why I had a look at sysctl and compared things.
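To put a number on the context switches, one can also sample the system-wide counter from /proc/stat directly (a sketch; sar -w and vmstat report the same counter):

```shell
# Sketch: sample the system-wide context-switch counter ("ctxt" in
# /proc/stat) twice, one second apart, and print the resulting rate.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/s: $((c2 - c1))"
```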
 
Old 04-20-2018, 06:11 AM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,145

Rep: Reputation: 4124
All that info should have been in the initial post.

No hard data, so it's impossible to hazard a guess as to what is happening. With abnormal context-switch counts I usually suspect driver/interrupt-handler issues, but it can be CPU cache misses, TCP queuing, who knows.
But I don't use hypervisors unless I have to - and you haven't even indicated which one you use.
 
Old 04-20-2018, 07:31 AM   #7
ams_tschoening
LQ Newbie
 
Registered: Apr 2018
Posts: 10

Original Poster
Rep: Reputation: Disabled
I deliberately wasn't asking for general debugging help, because such an unspecific question would very likely get me nowhere. For now I prefer to collect data on my own and ask very specific questions about things I don't understand or that seem strange.
 
  

