Re: [PATCH] psi: annotate refault stalls from IO submission
From: Johannes Weiner
Date: Mon Jul 22 2019 - 19:35:33 EST
On Mon, Jul 22, 2019 at 03:26:07PM -0700, Andrew Morton wrote:
> On Mon, 22 Jul 2019 16:13:37 -0400 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> > psi tracks the time tasks wait for refaulting pages to become
> > uptodate, but it does not track the time spent submitting the IO. The
> > submission part can be significant if backing storage is contended or
> > when cgroup throttling (io.latency) is in effect - a lot of time is
> > spent in submit_bio(). In that case, we underreport memory pressure.
>
> It's a somewhat broad patch. How significant is this problem in the
> real world? Can we be confident that the end-user benefit is worth the
> code changes?
The error scales with how aggressively IO is throttled compared to the
device's capability.
For example, we have system maintenance software throttled down pretty
hard on IO compared to the workload. When memory is contended, the
system software starts thrashing cache, but since the backing device
is actually pretty fast, the majority of "io time" is from injected
throttling delays during submit_bio().
As a result we barely see memory pressure, when the reality is that
there is almost no progress due to the thrashing and we should be
killing misbehaving stuff.
> > Annotate the submit_bio() paths (or the indirection through readpage)
> > for refaults and swapin to get proper psi coverage of delays there.
> >
> > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > ---
> > fs/btrfs/extent_io.c | 14 ++++++++++++--
> > fs/ext4/readpage.c | 9 +++++++++
> > fs/f2fs/data.c | 8 ++++++++
> > fs/mpage.c | 9 +++++++++
> > mm/filemap.c | 20 ++++++++++++++++++++
> > mm/page_io.c | 11 ++++++++---
> > mm/readahead.c | 24 +++++++++++++++++++++++-
>
> We touch three filesystems. Why these three? Are all other
> filesystems OK or will they need work as well?
These are the ones that I found open-coding add_to_page_cache_lru()
followed by submit_bio() instead of going through generic code like
mpage, use read_cache_pages(), implement ->readpage only.
> > @@ -2753,11 +2763,14 @@ static struct page *do_read_cache_page(struct address_space *mapping,
> > void *data,
> > gfp_t gfp)
> > {
> > + bool refault = false;
> > struct page *page;
> > int err;
> > repeat:
> > page = find_get_page(mapping, index);
> > if (!page) {
> > + unsigned long pflags;
> > +
>
> That was a bit odd. This?
>
> --- a/mm/filemap.c~psi-annotate-refault-stalls-from-io-submission-fix
> +++ a/mm/filemap.c
> @@ -2815,12 +2815,12 @@ static struct page *do_read_cache_page(s
> void *data,
> gfp_t gfp)
> {
> - bool refault = false;
> struct page *page;
> int err;
> repeat:
> page = find_get_page(mapping, index);
> if (!page) {
> + bool refault = false;
> unsigned long pflags;
>
> page = __page_cache_alloc(gfp);
> _
>
It's so that when we jump to 'filler:' from outside the branch, the
'refault' variable is initialized from the first time through:
bool refault = false;
struct page *page;
page = find_get_page(mapping, index);
if (!page) {
__page_cache_alloc()
add_to_page_cache_lru()
refault = PageWorkingset(page);
filler:
if (refault)
psi_memstall_enter(&pflags);
readpage()
if (refault)
psi_memstall_leave(&pflags);
}
lock_page()
if (PageUptodate())
goto out;
goto filler;