• [RFC] sched/fair: Use wake_q length as a hint for wake_wide

    From Brendan Jackman on Mon Oct 2 12:50:04 2017
    Hi PeterZ,

    I just got this in my inbox and noticed I didn't address it to anyone. I
    meant to address it to you.

    On Fri, Sep 29 2017 at 17:05, Brendan Jackman wrote:
    There has been a bit of discussion on this RFC, but before I do any
    more work I'd really like your input on the basic idea.

    Does the approach of feeding outside info like wake_q length into the scheduler's decisions seem basically acceptable?

    Cheers,
    Brendan


    On Fri, Aug 11 2017 at 09:45, Brendan Jackman wrote:
    This patch adds a parameter to select_task_rq, sibling_count_hint,
    allowing the caller, where it has this information, to inform the
    sched_class of the number of tasks that are being woken up as part of
    the same event.

    The wake_q mechanism is one case where this information is available.

    select_task_rq_fair can then use the information to detect that it
    needs to widen the search space for task placement in order to avoid
    overloading the last-level cache domain's CPUs.
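
    For orientation, the call chain this ends up creating (condensed from the
    diff below, nothing new here):

        wake_q_add()            ->  head->count++
        wake_up_q()             ->  try_to_wake_up(task, TASK_NORMAL, 0, head->count)
        try_to_wake_up()        ->  select_task_rq(..., sibling_count_hint)
        select_task_rq_fair()   ->  wake_wide(p, sibling_count_hint)
        wake_wide()             ->  1 if sibling_count_hint >= sd_llc_size (want_affine = 0)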

    * * *

    The reason I am investigating this change is the following use case
    on ARM big.LITTLE (asymmetrical CPU capacity): one task per CPU, all
    of which repeatedly do X amount of work and then call
    pthread_barrier_wait (i.e. sleep until the last task finishes its X
    and hits the barrier). On big.LITTLE, the tasks which get a "big" CPU
    finish faster, and then those CPUs pull over the tasks that are still
    running:

    v CPU v       ->time->

                  -------------
    0 (big)       11111 /333
                  -------------
    1 (big)       22222 /444|
                  -------------
    2 (LITTLE)    333333/
                  -------------
    3 (LITTLE)    444444/
                  -------------

    Now when task 4 hits the barrier (at |) and wakes the others up,
    there are 4 tasks with prev_cpu=<big> and 0 tasks with
    prev_cpu=<little>. want_affine therefore means that we'll only look
    in CPUs 0 and 1 (sd_llc), so tasks will be unnecessarily coscheduled
    on the bigs until the next load balance, something like this:

    v CPU v       ->time->

                  ------------------------
    0 (big)       11111 /333 31313\33333
                  ------------------------
    1 (big)       22222 /444|424\4444444
                  ------------------------
    2 (LITTLE)    333333/       \222222
                  ------------------------
    3 (LITTLE)    444444/         \1111
                  ------------------------
                           ^^^
                     underutilization

    So, I'm trying to get want_affine = 0 for these tasks.
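
    For concreteness, a minimal userspace sketch of the kind of workload
    described above (purely illustrative; thread and loop counts are
    arbitrary and not taken from any real benchmark):

    /*
     * N threads each do a fixed chunk of busy work, then wait on a
     * pthread barrier until the slowest thread catches up. Build with:
     *   gcc -O2 -pthread barrier_workload.c -o barrier_workload
     */
    #include <pthread.h>
    #include <stdlib.h>

    #define NTHREADS   4          /* one task per CPU in the example above */
    #define NITERS     1000
    #define WORK_LOOPS 2000000UL

    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
            volatile unsigned long sink = 0;
            unsigned long j;
            int i;

            (void)arg;
            for (i = 0; i < NITERS; i++) {
                    /* "X amount of work": finishes sooner on a big CPU */
                    for (j = 0; j < WORK_LOOPS; j++)
                            sink += j;
                    /* sleep until the last task finishes its X */
                    pthread_barrier_wait(&barrier);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t threads[NTHREADS];
            int i;

            pthread_barrier_init(&barrier, NULL, NTHREADS);
            for (i = 0; i < NTHREADS; i++)
                    pthread_create(&threads[i], NULL, worker, NULL);
            for (i = 0; i < NTHREADS; i++)
                    pthread_join(threads[i], NULL);
            pthread_barrier_destroy(&barrier);
            return 0;
    }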

    I don't _think_ any incarnation of the wakee_flips mechanism can help
    us here because which task is waker and which tasks are wakees
    generally changes with each iteration.

    However pthread_barrier_wait (or more accurately FUTEX_WAKE) has the
    nice property that we know exactly how many tasks are being woken, so
    we can cheat.
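
    To illustrate where that count comes from, here is a rough, abbreviated
    paraphrase of the FUTEX_WAKE path (not the literal futex.c code; locking
    and error handling omitted, and hb/this/next/key/ret/nr_wake stand for
    the surrounding futex_wake() locals): every wakee is collected on a
    local wake_q first, and only then is the queue flushed, so head->count
    already holds the size of the whole batch when wake_up_q() runs.

    DEFINE_WAKE_Q(wake_q);

    plist_for_each_entry_safe(this, next, &hb->chain, list) {
            if (match_futex(&this->key, &key)) {
                    mark_wake_futex(&wake_q, this);  /* wake_q_add(): head->count++ */
                    if (++ret >= nr_wake)
                            break;
            }
    }

    wake_up_q(&wake_q);  /* head->count is passed on as the sibling hint */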

    It might be a disadvantage that we "widen" _every_ task that's woken in
    an event, while select_idle_sibling would work fine for the first
    sd_llc_size - 1 tasks.

    IIUC, if wake_affine() behaves correctly this trick wouldn't be
    necessary on SMP systems, so it might be best guarded by the presence
    of SD_ASYM_CPUCAPACITY?
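
    To make that concrete, one hypothetical shape the guard could take
    (sched_asym_cpucapacity() is an invented placeholder here, standing for
    some cached check of whether any sched domain has SD_ASYM_CPUCAPACITY
    set; the rest mirrors the wake_wide() hunk below):

    static int wake_wide(struct task_struct *p, int sibling_count_hint)
    {
            unsigned int master = current->wakee_flips;
            unsigned int slave = p->wakee_flips;
            int llc_size = this_cpu_read(sd_llc_size);

            /* Only trust the hint where CPU capacities are asymmetric (hypothetical) */
            if (sched_asym_cpucapacity() && sibling_count_hint >= llc_size)
                    return 1;

            if (master < slave)
                    swap(master, slave);
            if (slave < llc_size || master < slave * llc_size)
                    return 0;
            return 1;
    }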

    * * *

    Final note..

    In order to observe "perfect" behaviour for this use case, I also had
    to disable the TTWU_QUEUE sched feature. Suppose during the wakeup
    above we are working through the wake queue and have placed tasks 3
    and 2, and are about to place task 1:

    v CPU v       ->time->

                  --------------
    0 (big)       11111 /333 3
                  --------------
    1 (big)       22222 /444|4
                  --------------
    2 (LITTLE)    333333/    2
                  --------------
    3 (LITTLE)    444444/     <- Task 1 should go here
                  --------------

    If TTWU_QUEUE is enabled, we will not yet have enqueued task
    2 (having instead sent a reschedule IPI) or attached its load to CPU
    2. So we are likely to also place task 1 on cpu 2. Disabling
    TTWU_QUEUE means that we enqueue task 2 before placing task 1,
    solving this issue. TTWU_QUEUE is there to minimise rq lock
    contention, and I guess that this contention is less of an issue on
    big.LITTLE systems since they have relatively few CPUs, which
    suggests the trade-off makes sense here.
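
    (For reference: with CONFIG_SCHED_DEBUG, the feature can be toggled at
    runtime through debugfs, e.g.

        echo NO_TTWU_QUEUE > /sys/kernel/debug/sched_features

    and re-enabled by writing TTWU_QUEUE back.)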

    Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Josef Bacik <josef@toxicpanda.com>
    Cc: Joel Fernandes <joelaf@google.com>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Matt Fleming <matt@codeblueprint.co.uk>
    ---
    include/linux/sched/wake_q.h |  2 ++
    kernel/sched/core.c          | 34 +++++++++++++++++++++++-----------
    kernel/sched/deadline.c      |  3 ++-
    kernel/sched/fair.c          | 17 +++++++++++------
    kernel/sched/idle_task.c     |  3 ++-
    kernel/sched/rt.c            |  3 ++-
    kernel/sched/sched.h         |  3 ++-
    kernel/sched/stop_task.c     |  3 ++-
    8 files changed, 46 insertions(+), 22 deletions(-)

    diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
    index d03d8a9047dc..607a888eb35b 100644
    --- a/include/linux/sched/wake_q.h
    +++ b/include/linux/sched/wake_q.h
    @@ -33,6 +33,7 @@
    struct wake_q_head {
    struct wake_q_node *first;
    struct wake_q_node **lastp;
    + int count;
    };

    #define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
    @@ -44,6 +45,7 @@ static inline void wake_q_init(struct wake_q_head *head)
    {
    head->first = WAKE_Q_TAIL;
    head->lastp = &head->first;
    + head->count = 0;
    }

    extern void wake_q_add(struct wake_q_head *head,
    diff --git a/kernel/sched/core.c b/kernel/sched/core.c
    index 0869b20fba81..ddf9257b1467 100644
    --- a/kernel/sched/core.c
    +++ b/kernel/sched/core.c
    @@ -438,6 +438,8 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
    if (cmpxchg(&node->next, NULL, WAKE_Q_TAIL))
    return;

    + head->count++;
    +
    get_task_struct(task);

    /*
    @@ -447,6 +449,10 @@ void wake_q_add(struct wake_q_head *head, struct task_struct *task)
    head->lastp = &node->next;
    }

    +static int
    +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
    +               int sibling_count_hint);
    +
    void wake_up_q(struct wake_q_head *head)
    {
    struct wake_q_node *node = head->first;
    @@ -461,10 +467,10 @@ void wake_up_q(struct wake_q_head *head)
    task->wake_q.next = NULL;

    /*
    - * wake_up_process() implies a wmb() to pair with the queueing
    + * try_to_wake_up() implies a wmb() to pair with the queueing
    * in wake_q_add() so as not to miss wakeups.
    */
    - wake_up_process(task);
    + try_to_wake_up(task, TASK_NORMAL, 0, head->count);
    put_task_struct(task);
    }
    }
    @@ -1527,12 +1533,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
    * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
    */
    static inline
    -int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
    +int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
    + int sibling_count_hint)
    {
    lockdep_assert_held(&p->pi_lock);

    if (p->nr_cpus_allowed > 1)
    - cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
    + cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
    + sibling_count_hint);
    else
    cpu = cpumask_any(&p->cpus_allowed);

    @@ -1944,6 +1952,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
    * @p: the thread to be awakened
    * @state: the mask of task states that can be woken
    * @wake_flags: wake modifier flags (WF_*)
    + * @sibling_count_hint: A hint at the number of threads that are being woken up
    + * in this event.
    *
    * If (@state & @p->state) @p->state = TASK_RUNNING.
    *
    @@ -1956,7 +1966,8 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
    * %false otherwise.
    */
    static int
    -try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
    +try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
    +               int sibling_count_hint)
    {
    unsigned long flags;
    int cpu, success = 0;
    @@ -2042,7 +2053,8 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
    atomic_dec(&task_rq(p)->nr_iowait);
    }

    - cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
    + cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
    + sibling_count_hint);
    if (task_cpu(p) != cpu) {
    wake_flags |= WF_MIGRATED;
    set_task_cpu(p, cpu);
    @@ -2130,13 +2142,13 @@ static void try_to_wake_up_local(struct task_struct *p, struct rq_flags *rf)
    */
    int wake_up_process(struct task_struct *p)
    {
    - return try_to_wake_up(p, TASK_NORMAL, 0);
    + return try_to_wake_up(p, TASK_NORMAL, 0, 1);
    }
    EXPORT_SYMBOL(wake_up_process);

    int wake_up_state(struct task_struct *p, unsigned int state)
    {
    - return try_to_wake_up(p, state, 0);
    + return try_to_wake_up(p, state, 0, 1);
    }

    /*
    @@ -2442,7 +2454,7 @@ void wake_up_new_task(struct task_struct *p)
    * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
    * as we're not fully set-up yet.
    */
    - __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
    + __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
    #endif
    rq = __task_rq_lock(p, &rf);
    update_rq_clock(rq);
    @@ -2893,7 +2905,7 @@ void sched_exec(void)
    int dest_cpu;

    raw_spin_lock_irqsave(&p->pi_lock, flags);
    - dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
    + dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
    if (dest_cpu == smp_processor_id())
    goto unlock;

    @@ -3582,7 +3594,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
    int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
    void *key)
    {
    - return try_to_wake_up(curr->private, mode, wake_flags);
    + return try_to_wake_up(curr->private, mode, wake_flags, 1);
    }
    EXPORT_SYMBOL(default_wake_function);

    diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
    index 755bd3f1a1a9..69a9dd407267 100644
    --- a/kernel/sched/deadline.c
    +++ b/kernel/sched/deadline.c
    @@ -1516,7 +1516,8 @@ static void yield_task_dl(struct rq *rq)
    static int find_later_rq(struct task_struct *task);

    static int
    -select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
    +select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
    +                  int sibling_count_hint)
    {
    struct task_struct *curr;
    struct rq *rq;
    diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
    index c95880e216f6..0a9d706b62bf 100644
    --- a/kernel/sched/fair.c
    +++ b/kernel/sched/fair.c
    @@ -5332,15 +5332,18 @@ static void record_wakee(struct task_struct *p)
    * whatever is irrelevant, spread criteria is apparent partner count exceeds
    * socket size.
    */
    -static int wake_wide(struct task_struct *p)
    +static int wake_wide(struct task_struct *p, int sibling_count_hint)
    {
    unsigned int master = current->wakee_flips;
    unsigned int slave = p->wakee_flips;
    - int factor = this_cpu_read(sd_llc_size);
    + int llc_size = this_cpu_read(sd_llc_size);
    +
    + if (sibling_count_hint >= llc_size)
    + return 1;

    if (master < slave)
    swap(master, slave);
    - if (slave < factor || master < slave * factor)
    + if (slave < llc_size || master < slave * llc_size)
    return 0;
    return 1;
    }
    @@ -5869,7 +5872,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
    * preempt must be disabled.
    */
    static int
    -select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
    +select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
    + int sibling_count_hint)
    {
    struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
    int cpu = smp_processor_id();
    @@ -5879,8 +5883,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f

    if (sd_flag & SD_BALANCE_WAKE) {
    record_wakee(p);
    - want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
    - && cpumask_test_cpu(cpu, &p->cpus_allowed);
    + want_affine = !wake_wide(p, sibling_count_hint) &&
    + !wake_cap(p, cpu, prev_cpu) &&
    + cpumask_test_cpu(cpu, &p->cpus_allowed);
    }

    rcu_read_lock();
    diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
    index 0c00172db63e..3c343e055110 100644
    --- a/kernel/sched/idle_task.c
    +++ b/kernel/sched/idle_task.c
    @@ -9,7 +9,8 @@

    #ifdef CONFIG_SMP
    static int
    -select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
    +select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
    +                    int sibling_count_hint)
    {
    return task_cpu(p); /* IDLE tasks as never migrated */
    }
    diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
    index 45caf937ef90..b9937dccb8b3 100644
    --- a/kernel/sched/rt.c
    +++ b/kernel/sched/rt.c
    @@ -1387,7 +1387,8 @@ static void yield_task_rt(struct rq *rq)
    static int find_lowest_rq(struct task_struct *task);

    static int
    -select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
    +select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
    +                  int sibling_count_hint)
    {
    struct task_struct *curr;
    struct rq *rq;
    diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
    index eeef1a3086d1..56ae525618e9 100644
    --- a/kernel/sched/sched.h
    +++ b/kernel/sched/sched.h
    @@ -1419,7 +1419,8 @@ struct sched_class {
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

    #ifdef CONFIG_SMP
    - int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
    + int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
    + int sibling_count_hint);
    void (*migrate_task_rq)(struct task_struct *p);

    void (*task_woken) (struct rq *this_rq, struct task_struct *task);
    diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
    index 9f69fb630853..d0ce4fbb18ef 100644
    --- a/kernel/sched/stop_task.c
    +++ b/kernel/sched/stop_task.c
    @@ -11,7 +11,8 @@

    #ifdef CONFIG_SMP
    static int
    -select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
    +select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
    +                    int sibling_count_hint)
    {
    return task_cpu(p); /* stop tasks as never migrate */
    }
