Re: [PATCH] mm/oom: Add killed process selection information

From: Edward Chron
Date: Thu Aug 08 2019 - 18:15:27 EST


In our experience far more (99.9%+) OOM events are not kernel issues,
they're user task memory issues.
Properly maintained Linux kernel only rarely have issues.
So useful information about the killed task, displayed in a manner
that can be quickly digested, is very helpful.
But it turns out the totalpages parameter is also critical to make
sense of what is shown.

So if we report the fooWidget task was using ~15% of memory (I know
this is just an approximation but it is often an adequate metric) we
often can tell just from that the number is larger than expected so we
can start there.
Even though the % is a ballpark number, if you are familiar with the
tasks on your system and approximately how much memory you expect them
to use you can often tell if memory usage is excessive.
This is not always the case but it is a fair amount of the time.
So the % of memory field is helpful. But we've found we need totalpages as well.
The totalpages effects the % of memory the task uses.

You're an OOM expert so I don't need to tell you, but just for
clarity, that if a system we're expecting to have swap space has it's
swap diminshed or removed, that can have a significant effect both on
the available memory and the % of memory/swap the task consumes.
Just as if you run a task of a fixed memory size on a system with half
the memory it's % of memory jumps up as does it's oom_score.
Often we know the Memory Size and how much swap space a system is
expected to have, printing totalpages allows us to confirm that this
was in fact the case at the time of the oom event.
Also the size of totalpages is very important to being to tell
approximately how much memory/swap, the task was using because we have
the % and we can quickly get an idea of usage.
For systems that come in a variety of sizes that is important, the
percentage number in conjunction with totalpages is essential since
they are dependent on each other.

With memcg usage increasing, having totalpages readily available is
even more important because the memory container caps the value and it
is helpful to know the value that was in use at the time of the oom
event.

The oom_score tells us how Linux calculated the score for the task,
the oom_score_adj effects this so it is helpful to have that in
conjunction with the oom_score.
If the adjust is high it can tell us that the task was acting as a
canary and so it's oom_score is high even though it's memory
utilization can be modest or low.
In that case we may need more information about what was going on
because the task selected was not necessarily using much memory.
But at least we know why it was selected. The kill message with a high
oom_score_adjust and high oom_score makes that obvious.

Just by adding a few values to the kill message we're often able to
quickly get an idea of what the cause of an oom event was, or at least
we have a better idea where to start looking.

Since we're running a business and so are our customers anything we
can do speed up the triage process saves money and makes people more
productive, so we find it valuable.

What other justification is needed? Let me know.

Thanks!

P.S.

By the way, just for feedback, the recent reorganization of the OOM
sections and print output, for those of us that do have to wade
through OOM output, was appreciated:

commit ef8444ea01d7442652f8e1b8a8b94278cb57eafd (v5.0-rc1-107^2~63)
Author: yuzhoujian <yuzhoujian@xxxxxxxxxxxxxxx>
Date: Fri Dec 28 00:36:07 2018 -0800

mm, oom: reorganize the oom report in dump_header



On Thu, Aug 8, 2019 at 1:07 PM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> [please do not top-post]
>
> On Thu 08-08-19 12:21:30, Edward Chron wrote:
> > It is helpful to the admin that looks at the kill message and records this
> > information. OOMs can come in bunches.
> > Knowing how much resource the oom selected process was using at the time of
> > the OOM event is very useful, these fields document key process and system
> > memory/swap values and can be quite helpful.
>
> I do agree and we already print that information. rss with a break down
> to anonymous, file backed and shmem, is usually a large part of the oom
> victims foot print. It is not a complete information because there might
> be a lot of memory hidden by other resource (open files etc.). We do not
> print that information because it is not considered in the oom
> selection. It is also not guaranteed to be freed upon the task exit.
>
> > Also can't you disable printing the oom eligible task list? For systems
> > with very large numbers of oom eligible processes that would seem to be
> > very desirable.
>
> Yes that is indeed the case. But how does the oom_score and
> oom_score_adj alone without comparing it to other eligible tasks help in
> isolation?
>
> [...]
>
> > I'm not sure that change would be supported upstream but again in our
> > experience we've found it helpful, since you asked.
>
> Could you be more specific about how that information is useful except
> for recording it? I am all for giving an useful information in the OOM
> report but I would like to hear a sound justification for each
> additional piece of information.
>
> E.g. this helped us to understand why the task has been selected - this
> is usually dump_tasks portion of the report because it gives a picture
> of what the OOM killer sees when choosing who to kill.
>
> Then we have the summary to give us an estimation on how much
> memory will get freed when the victim dies - rss is a very rough
> estimation. But is a portion of the overal memory or oom_score{_adj}
> important to print as well? Those are relative values. Say you get
> memory-usage:10%, oom_score:42 and oom_score_adj:0. What are you going
> to tell from that information?
> --
> Michal Hocko
> SUSE Labs