• [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

    From Peter Zijlstra@21:1/5 to Suravee Suthikulpanit on Mon Oct 2 14:50:03 2017
    Sorry, lost track of this thing..

    On Mon, Aug 14, 2017 at 02:44:59PM +0700, Suravee Suthikulpanit wrote:

    > > Why does the degenerate code fail to remove things?


    > Sorry for confusion. Actually, the degenerate code does remove the
    > duplicate NODE sched-domain.
    >
    > The difference is, without !cpumask_equal(), now the MC sched-domain
    > would have the SD_PREFER_SIBLING flag set by the degenerate code since
    > the flag got transferred down from the NODE to MC sched-domain. Would
    > this be the preferred behavior for MC sched-domain?

    Right, but the 'normal' x86_topology setup will have SD_PREFER_SIBLING
    set on the MC domain by virtue of merging the MC and CPU domains. So the
    first layer of NUMA spreads.

    Now, that appears not to be the case for x86_numa_in_package_topology,
    since that doesn't do the CPU domain, but your patch would effectively
    'fix' that if it were to not have that conditional. But it would keep
    the PREFER_SIBLING at the NODE level; MC would not get it.
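    The hand-down being discussed can be illustrated with a standalone
    sketch (not the kernel source itself; the struct and helper here are
    simplified stand-ins for what cpu_attach_domain() does when
    sd_parent_degenerate() succeeds): when a parent domain degenerates, its
    SD_PREFER_SIBLING bit is moved onto the child before the parent is
    unlinked.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Simplified stand-ins for the kernel's sched-domain flag bits. */
    #define SD_PREFER_SIBLING	0x1
    #define SD_SHARE_PKG_RESOURCES	0x2

    struct sched_domain {
    	unsigned long flags;
    	struct sched_domain *parent;
    };

    /*
     * Sketch of the transfer done when a parent domain degenerates:
     * the SD_PREFER_SIBLING bit is handed down to the child, then the
     * degenerate parent is unlinked from the hierarchy.
     */
    static void degenerate_parent(struct sched_domain *sd)
    {
    	struct sched_domain *parent = sd->parent;

    	if (parent->flags & SD_PREFER_SIBLING)
    		sd->flags |= SD_PREFER_SIBLING;

    	sd->parent = parent->parent;	/* unlink the degenerate parent */
    }

    int main(void)
    {
    	struct sched_domain node = { .flags = SD_PREFER_SIBLING, .parent = NULL };
    	struct sched_domain mc = { .flags = SD_SHARE_PKG_RESOURCES, .parent = &node };

    	/* NODE duplicates MC's span, so it degenerates ... */
    	degenerate_parent(&mc);

    	/* ... and MC ends up with SD_PREFER_SIBLING set. */
    	assert(mc.flags & SD_PREFER_SIBLING);
    	assert(mc.parent == NULL);
    	printf("MC inherited SD_PREFER_SIBLING: %s\n",
    	       (mc.flags & SD_PREFER_SIBLING) ? "yes" : "no");
    	return 0;
    }
    ```

    This is why dropping the !cpumask_equal() conditional changes which
    level ends up carrying the flag.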

    But I'm wondering if we're not doing this all a bit backwards. Having
    SD_PREFER_SIBLING on the L3 makes sense in that it will spread workload
    across the machine and maximize cache utilization. But we don't set it
    directly; instead we inherit it through merging.

    So commit 8e7fbcbc22c1 ("sched: Remove stale power aware scheduling
    remnants and dysfunctional knobs") made a fail by not preserving the
    SD_PREFER_SIBLING for the !power_saving case on both CPU and MC.

    Then commit 6956dc568f34 ("sched/numa: Add SD_PREFER_SIBLING to CPU
    domain") adds it back to the CPU but not MC.

    So I'm thinking it makes sense to restore the explicit MC one and make
    your NODE domain unconditional.

    That would make behaviour consistent between the various modes;
    something like the below. I'll go try and update your Changelog so we
    can get this merged.

    --- a/kernel/sched/topology.c
    +++ b/kernel/sched/topology.c
    @@ -1157,6 +1157,7 @@ sd_init(struct sched_domain_topology_lev
     		sd->smt_gain = 1178;	/* ~15% */
     
     	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
    +		sd->flags |= SD_PREFER_SIBLING;
     		sd->imbalance_pct = 117;
     		sd->cache_nice_tries = 1;
     		sd->busy_idx = 2;
    @@ -1331,6 +1332,10 @@ void sched_init_numa(void)
     	if (!sched_domains_numa_distance)
     		return;
     
    +	/* Includes NUMA identity node at level 0. */
    +	sched_domains_numa_distance[level++] = curr_distance;
    +	sched_domains_numa_levels = level;
    +
     	/*
     	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
     	 * unique distances in the node_distance() table.
    @@ -1378,8 +1383,7 @@ void sched_init_numa(void)
     		return;
     
     	/*
    -	 * 'level' contains the number of unique distances, excluding the
    -	 * identity distance node_distance(i,i).
    +	 * 'level' contains the number of unique distances
     	 *
     	 * The sched_domains_numa_distance[] array includes the actual distance