Re: [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

From: Chen Yu
Date: Wed May 22 2024 - 10:49:19 EST


On 2024-05-14 at 20:53:16 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 5/14/2024 2:48 PM, Chen Yu wrote:
> >>> [..snip..]
> >>> /*
> >>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> >>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> >>> @@ -7384,10 +7402,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >>> if (sched_feat(SIS_UTIL)) {
> >>> sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> >>> if (sd_share) {
> >>> - /* because !--nr is the condition to stop scan */> - nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> >>> + nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan));
> >>> /* overloaded LLC is unlikely to have idle cpu/core */
> >>> - if (nr == 1)
> >>> + if (nr <= 0)
> >>
> >> I was wondering if this would preserve the current behavior with
> >> SIS_FAST toggled off? Since the implementation below still does a
> >> "--nr <= 0" , wouldn't it effectively visit one CPU less overall now?
> >>
> >> Have you tried something similar to the below hunk?
> >>
> >> /* because !--nr is the condition to stop scan */
> >> nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan)) + 1;
> >> if (nr == 1)
> >> return -1;
> >>
> >
> > Yeah, right, to keep the scan depth consistent, the "+1" should be kept.
> >
> >> I agree with Mike that looking at slice to limit scan-depth seems odd.
> >> My experience with netperf is that the workload cares more about the
> >> server-client being co-located on the closest cache domain and by
> >> limiting scan-depth using slice, this is indirectly achieved since all
> >> the wakeups carry the WF_SYNc flag.
> >>
> >
> > Exactly. This is the original motivation.
> >
> >> P.S. have you tried using the slice in __select_idle_cpu()? Similar to
> >> sched_idle_cpu() check, perhaps an additional sched_preempt_short_cpu()
> >> which compares rq->curr->se.slice with the waking task's slice and
> >> returs that cpu if SIS_SHORT can help run the workload quicker?
> >
> > This is a good idea, it seems to be benefit PREEMPT_SHORT. If the customized
> > task slice is introduced, we can leverage this hint for latency related
> > optimization. Task wakeup is one thing, I can also think of other aspects,
> > like idle load balance, etc. I'm not sure what is the proper usage of the
> > task slice though, this is why I sent this RFC.
> >
> >> Note:
> >> This will not work if the SIS scan itself is the largest overhead in the
> >> wakeup cycle and not the task placement itself. Previously during
> >> SIS_UTIL testing, to measure the overheads of scan vs placement, we
> >> would do a full scan but return the result that SIS_UTIL would have
> >> returned to determine the overhead of the search itself.
> >>
> >
> > Regarding the task placement, do you mean the time between a task is enqueued
> > and picked up? Do you have any recommendation which workload can expose the
> > scan overhead most?
>
> Sorry for not being clear here. From what I've observed in the past,
> there are two dimensions to slect_idle_sibling():
>
> i) Placement: Final CPU select_idle_sibling() returns
> ii) Search: Do we find an idle core/CPU in select_idle_sibling()
>
> In case of netperf, I've observed that i) is more important than ii)
> wherin a placement of client on same core/thread as that of the server
> results in better performance vs finding an idle CPU on a remote LLC.
> For hackbench/tbench, when runqueues are under high utilization (~75%),
> reduction in search time ii) seems to be more beneficial.
>

Usually task stacking is not preferred because it hurts latency. But as we
discovered on AMD and Intel's large servers, sometimes the overhead of task
migration offset the benefit of running an idle CPU.

Inspired by your description on SIS_SHORT, and also Mike's feedback, I respin
the SIS_SHORT patch, to consider the task's customized slice:

Leverage the WF_SYNC to achieve better cache locality.
If both the waker and the wakee are short duration tasks, wake up the wakee on
waker's CPU directly. To avoid task stacking as much as possible:

1. Only when has_idle_core is false, SIS_SYNC takes effect.

2. Only if the user has shrinked the task slice, SIS_SYNC takes effect.

3. The waker must be the only runnable task on the runqueue.

4. The duration threshold is based on the CPU number within the LLC.
The more CPUs there are, the lower the bar for the task to be treated as
a short duration one, and vice versa.

threshold = sysctl_sched_migration_cost * llc_weight^2 / 256^2

threshold
LLC_WEIGHT=8 0.5 usec
LLC_WEIGHT=16 2 usec
LLC_WEIGHT=32 8 usec
LLC_WEIGHT=64 31 usec
LLC_WEIGHT=128 125 usec
LLC_WEIGHT=256 500 usec

5. Honor idle-CPU-first for CPUs share the L2 cache(Cluster). Only
there is no idle CPU within the Cluster domain, SIS_SYNC takes
effect.

Benchmark:
Tested on 4 platforms, significant throughput improvement on tbench, netperf,
stress-ng, will-it-scale, and latency reduced of lmbench. No performance
difference was observed in an Online Transaction Processing (OLTP) test.

Platform1, 240 CPUs, 2 sockets Intel(R) Xeon(R)
========================================================================
netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 60-threads 1.00 ( 1.11) +8.22 ( 1.06)
TCP_RR 120-threads 1.00 ( 1.74) +38.71 ( 0.74)
TCP_RR 180-threads 1.00 ( 1.21) +108.12 ( 1.07)
TCP_RR 240-threads 1.00 ( 4.75) +160.98 ( 4.01)
UDP_RR 60-threads 1.00 ( 11.70) +7.17 ( 10.89)
UDP_RR 120-threads 1.00 ( 10.82) +5.54 ( 7.17)
UDP_RR 180-threads 1.00 ( 9.77) +10.29 ( 10.87)
UDP_RR 240-threads 1.00 ( 12.38) +4.45 ( 16.69)

tbench
======
case load baseline(std%) compare%( std%)
loopback 60-threads 1.00 ( 0.22) +8.67 ( 0.12)
loopback 120-threads 1.00 ( 0.20) +42.12 ( 1.16)
loopback 180-threads 1.00 ( 0.16) +96.36 ( 1.34)
loopback 240-threads 1.00 ( 0.23) +122.68 ( 4.87)

schbench
========
No noticeable difference of 99.0th wakeup/request latency, 50.0th RPS percentiles.
schbench -m 2 -r 100

baseline sis_sync

Wakeup Latencies 99.0th usec 27 25
Request Latencies 99.0th usec 15376 15376
RPS percentiles 50.0th 16608 16608

Platform2, 48 CPUs 2 sockets Intel(R) Xeon(R) CPU E5-2697
========================================================================
lmbench3: lmbench3.PIPE.latency.us 33.8% improvement
lmbench3: lmbench3.AF_UNIX.sock.stream.latency.us 30.6% improvement

Platform3: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480
=======================================================================
stress-ng: stress-ng.vm-rw.ops_per_sec 250.8% improvement
will-it-scale: will-it-scale.per_process_ops 42.1% improvement

Platform4: 288 CPUs, 2 sockets, 4 CPUs share the L2 cache, Intel(R) Xeon(R)
=========================================================================
netperf
=======
case load baseline(std%) compare%( std%)
TCP_RR 72-threads 1.00 ( 0.89) +6.14 ( 0.95)
TCP_RR 144-threads 1.00 ( 1.53) +5.18 ( 1.24)
TCP_RR 216-threads 1.00 ( 10.76) +23.31 ( 1.67)
TCP_RR 288-threads 1.00 ( 2.28) +1.72 ( 2.10)
UDP_RR 72-threads 1.00 ( 3.44) +0.42 ( 3.11)
UDP_RR 144-threads 1.00 ( 14.53) +1.40 ( 16.11)
UDP_RR 216-threads 1.00 ( 17.30) +1.18 ( 23.00)
UDP_RR 288-threads 1.00 ( 21.30) -0.15 ( 20.14)


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1437c276c915..b2b838d499ea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1003,7 +1003,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
#include "pelt.h"
#ifdef CONFIG_SMP

-static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu, int sync);
static unsigned long task_h_load(struct task_struct *p);
static unsigned long capacity_of(int cpu);

@@ -7410,16 +7410,36 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd

#endif /* CONFIG_SCHED_SMT */

+static int short_task(struct task_struct *p, int llc)
+{
+ if (!p->se.custom_slice || p->se.slice >= sysctl_sched_base_slice)
+ return 0;
+
+ /*
+ * Scale the threshold by the LLC CPU number.
+ * The more CPUs there are, the more likely the
+ * task is regarded as a small one.
+ *
+ */
+ if ((p->duration_avg << 16) >=
+ (sysctl_sched_migration_cost * llc * llc))
+ return 0;
+
+ return 1;
+}
+
/*
* Scan the LLC domain for idle CPUs; this is dynamically regulated by
* comparing the average scan cost (tracked in sd->avg_scan_cost) against the
* average idle time for this rq (as found in rq->avg_idle).
*/
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target,
+ int sync)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
int i, cpu, idle_cpu = -1, nr = INT_MAX;
struct sched_domain_shared *sd_share;
+ int llc_weight = per_cpu(sd_llc_size, target);

cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

@@ -7458,6 +7478,20 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
}
}

+ /*
+ * When reached here, next step is to scan idle CPUs that do not share L2.
+ * However, the Core-to-Core cache latency could be large on big system.
+ * Give target another chance if waker and wakee are mutually waking up
+ * each other.
+ */
+ if (sched_feat(SIS_SYNC) &&
+ target == smp_processor_id() && !has_idle_core &&
+ sync && this_rq()->nr_running <= 1 &&
+ short_task(p, llc_weight) &&
+ short_task(current, llc_weight)) {
+ return target;
+ }
+
for_each_cpu_wrap(cpu, cpus, target + 1) {
if (has_idle_core) {
i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -7550,7 +7584,7 @@ static inline bool asym_fits_cpu(unsigned long util,
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
-static int select_idle_sibling(struct task_struct *p, int prev, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target, int sync)
{
bool has_idle_core = false;
struct sched_domain *sd;
@@ -7659,7 +7693,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
}
}

- i = select_idle_cpu(p, sd, has_idle_core, target);
+ i = select_idle_cpu(p, sd, has_idle_core, target, sync);
if ((unsigned)i < nr_cpumask_bits)
return i;

@@ -8259,7 +8293,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
new_cpu = sched_balance_find_dst_cpu(sd, p, cpu, prev_cpu, sd_flag);
} else if (wake_flags & WF_TTWU) { /* XXX always ? */
/* Fast path */
- new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
+ new_cpu = select_idle_sibling(p, prev_cpu, new_cpu, sync);
}
rcu_read_unlock();

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..7e5968d01dcb 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -50,6 +50,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
* When doing wakeups, attempt to limit superfluous scans of the LLC domain.
*/
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SYNC, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1



Preparation patch to record the task's duration:


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6..9e8dfdc81048 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
struct callback_head cid_work;
#endif

+ u64 prev_sleep_sum_runtime;
+ u64 duration_avg;
+
struct tlbflush_unmap_batch tlb_ubc;

/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcf2c4cc0522..c4288d613374 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4566,6 +4566,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->migration_pending = NULL;
#endif
init_sched_mm_cid(p);
+ p->prev_sleep_sum_runtime = 0;
+ p->duration_avg = 0;
}

DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..1437c276c915 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6833,6 +6833,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

static void set_next_buddy(struct sched_entity *se);

+static inline void dur_avg_update(struct task_struct *p)
+{
+ u64 dur;
+
+ dur = p->se.sum_exec_runtime - p->prev_sleep_sum_runtime;
+ p->prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+ update_avg(&p->duration_avg, dur);
+}
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
+ if (task_sleep)
+ dur_avg_update(p);
+
hrtick_update(rq);
}

--
2.25.1