• [v8 0/4] cgroup-aware OOM killer

    From Tetsuo Handa@21:1/5 to Shakeel Butt on Mon Oct 2 14:00:01 2017
    Shakeel Butt wrote:
    I think Tim has given very clear explanation why comparing A & D makes
    perfect sense. However I think the above example, a single user system
    where a user has designed and created the whole hierarchy and then
    attaches different jobs/applications to different nodes in this
    hierarchy, is also a valid scenario. One solution I can think of, to
    cater to both scenarios, is to introduce a notion of 'bypass oom', that
    is, to not include a memcg in the oom comparison and instead include its
    children in the comparison.

    I haven't been following this thread closely because I don't use memcg.
    But if there are multiple scenarios, what about offloading memcg OOM
    handling to loadable kernel modules (like the many filesystems that are
    called through the VFS interface)? That would let us do trial and error
    more casually.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michal Hocko@21:1/5 to Shakeel Butt on Mon Oct 2 14:30:01 2017
    On Sun 01-10-17 16:29:48, Shakeel Butt wrote:

    Going back to Michal's example, say the user configured the following:

         root
         /  \
        A    D
       / \
      B   C

    A global OOM event happens and we find this:
    - A > D
    - B, C, D are oomgroups

    What the user is telling us is that B, C, and D are compound memory
    consumers. They cannot be divided into their task parts from a memory
    point of view.

    However, the user doesn't say the same for A: the A subtree summarizes
    and controls aggregate consumption of B and C, but without groupoom
    set on A, the user says that A is in fact divisible into independent
    memory consumers B and C.

    If we don't have to kill all of A, but we'd have to kill all of D,
    does it make sense to compare the two?


    I think Tim has given very clear explanation why comparing A & D makes
    perfect sense. However I think the above example, a single user system
    where a user has designed and created the whole hierarchy and then
    attaches different jobs/applications to different nodes in this
    hierarchy, is also a valid scenario.

    Yes and nobody is disputing that, really. I guess the main disconnect
    here is that different people want to have more detailed control over
    the victim selection while the patchset tries to handle the most
    simplistic scenario where no userspace control over the selection is
    required. And I would claim that this will be the vast majority of setups
    and we should address it first.

    More fine-grained control needs some more thinking to come up with a
    sensible and long-term sustainable API. Just look back at the
    oom_score_adj story and how it ended up unusable in the end (well, apart
    from the never/always kill corner cases). Let's not repeat that now.

    I strongly believe that we can come up with something - be it priority
    based, BPF based or module based selection. But let's start simple, with
    the most basic scenario first and the most sensible semantics implemented.

    I believe the latest version (v9) looks sensible from the semantic point
    of view and we should focus on making it into a mergeable shape.
    --
    Michal Hocko
    SUSE Labs

  • From Shakeel Butt@21:1/5 to All on Mon Oct 2 22:50:13 2017
    Yes and nobody is disputing that, really. I guess the main disconnect
    here is that different people want to have more detailed control over
    the victim selection while the patchset tries to handle the most
    simplistic scenario where no userspace control over the selection is
    required. And I would claim that this will be the vast majority of setups
    and we should address it first.

    IMHO the disconnect/disagreement is which memcgs should be compared
    with each other for oom victim selection. Let's forget about oom
    priority and just take size into account. Should the oom selection
    algorithm compare the leaves of the hierarchy or should it compare
    siblings? For the single user system, comparing leaves makes sense
    while in a multi user system, siblings should be compared for victim
    selection.

    Coming back to the same example:

         root
         /  \
        A    D
       / \
      B   C

    Let's view it as a multi user system and some central job scheduler
    has asked a node controller on this system to start two jobs 'A' &
    'D'. 'A' then went on to create sub-containers. Now, on system oom,
    IMO the simplest sensible thing to do from the semantic point of
    view is to compare 'A' and 'D' and, if the usage of 'A' is higher,
    either kill all of 'A' if oom_group is set or recursively find a
    victim memcg taking 'A' as root.

    I have noted before that for single user systems, comparing 'B', 'C' &
    'D' is the most sensible thing to do.
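
    To make the two comparison rules concrete, here is a rough Python
    sketch of both on the tree above. It is only a toy model, not the
    patchset's code: the Memcg class, the function names and the usage
    numbers (B=4, C=2, D=5) are made up for illustration, and oom_group
    and priorities are ignored so that only size matters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Memcg:
    """Toy stand-in for a memory cgroup (hypothetical, not a kernel structure)."""
    name: str
    own_usage: int = 0  # memory charged directly here, arbitrary units
    children: List["Memcg"] = field(default_factory=list)

    def usage(self) -> int:
        """Hierarchical usage: this node plus everything below it."""
        return self.own_usage + sum(c.usage() for c in self.children)


def leaves(node: Memcg) -> List[Memcg]:
    """Collect all leaf memcgs under node (node itself if it has no children)."""
    if not node.children:
        return [node]
    found: List[Memcg] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def pick_by_leaves(root: Memcg) -> Memcg:
    """Single user view: compare every leaf against every other leaf."""
    return max(leaves(root), key=Memcg.usage)


def pick_by_siblings(root: Memcg) -> Memcg:
    """Multi user view: pick the largest sibling at each level, then descend."""
    node = root
    while node.children:
        node = max(node.children, key=Memcg.usage)
    return node


# The example hierarchy with assumed sizes: A = B + C = 6, D = 5.
root = Memcg("root", children=[
    Memcg("A", children=[Memcg("B", 4), Memcg("C", 2)]),
    Memcg("D", 5),
])

print(pick_by_leaves(root).name)    # D: largest of the leaves B, C, D
print(pick_by_siblings(root).name)  # B: A beats D, then B beats C
```

    With these numbers the two rules disagree: comparing leaves picks
    'D', while comparing siblings picks 'A' first and then ends up in
    'B', which is smaller than 'D'.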

    Now, in the multi user system, I can kind of force the comparison of
    'A' & 'D' by setting oom_group on 'A'. IMO that is an abuse of
    'oom_group', as it would then carry two meanings: comparison leader
    and killall. I would humbly suggest having two separate notions
    instead, say 'oom_gang' (or just 'oom_group' if you prefer) and
    'killall'.

    For the single user system example, 'B', 'C' and 'D' will have
    'oom_gang' set and if the user wants killall semantics too, he can set
    it separately.

    For the multi user system, 'A' and 'D' will have 'oom_gang' set. Now,
    let's say 'A' was selected on system oom: if 'killall' was set on 'A'
    then 'A' will be selected as the victim, otherwise the oom selection
    algorithm will recursively take 'A' as root and try to find a victim
    memcg.

    Another major semantic of 'oom_gang' is that the leaves will always be
    treated as 'oom_gang'.
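
    A rough Python sketch of how I read the two flags, under stated
    assumptions: the Memcg class, comparison_units(), select_victim(),
    example_tree() and the sizes (B=4, C=2, D=5) are all hypothetical
    names and numbers for illustration, not anything from the patchset.
    The returned boolean only models "kill the whole memcg"; what happens
    inside a selected leaf without 'killall' is left open here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Memcg:
    """Toy memcg carrying the two proposed flags (hypothetical model)."""
    name: str
    own_usage: int = 0
    oom_gang: bool = False   # treated as one unit when comparing
    killall: bool = False    # kill the whole memcg if it is selected
    children: List["Memcg"] = field(default_factory=list)

    def usage(self) -> int:
        return self.own_usage + sum(c.usage() for c in self.children)


def comparison_units(root: Memcg) -> List[Memcg]:
    """Descend from root until an oom_gang memcg or a leaf is reached.

    Leaves always count as units, matching 'leaves are always oom_gang'.
    """
    units: List[Memcg] = []
    for child in root.children:
        if child.oom_gang or not child.children:
            units.append(child)
        else:
            units.extend(comparison_units(child))
    return units


def select_victim(root: Memcg) -> Tuple[Memcg, bool]:
    """Return (victim memcg, whether the whole memcg is to be killed)."""
    units = comparison_units(root)
    if not units:  # root has no children, nothing left to compare
        return root, root.killall
    victim = max(units, key=Memcg.usage)
    if victim.killall or not victim.children:
        return victim, victim.killall
    # No killall on a non-leaf gang: take it as the new root and recurse.
    return select_victim(victim)


def example_tree(gangs: set, killalls: set) -> Memcg:
    """Build root -> (A -> (B, C), D) with assumed sizes B=4, C=2, D=5."""
    def cg(name, usage=0, children=()):
        return Memcg(name, usage, name in gangs, name in killalls, list(children))
    return cg("root", children=[
        cg("A", children=[cg("B", 4), cg("C", 2)]),
        cg("D", 5),
    ])


# Multi user: 'A' and 'D' are gangs; 'A' (6) beats 'D' (5).
victim, whole = select_victim(example_tree({"A", "D"}, {"A"}))
print(victim.name, whole)
# Single user: 'B', 'C', 'D' are gangs and are compared directly.
victim, whole = select_victim(example_tree({"B", "C", "D"}, set()))
print(victim.name, whole)
```

    In this model the multi user case with 'killall' on 'A' selects all
    of 'A'; without 'killall' the search continues below 'A' and lands
    on 'B'. The single user case compares 'B', 'C' and 'D' and selects
    'D'.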
