Re: [PATCH v2 0/4] Dump off-cpu samples directly

From: Howard Chu
Date: Thu May 23 2024 - 12:34:53 EST


Hello,

On Thu, May 23, 2024 at 12:34 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Wed, May 15, 2024 at 9:56 PM Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >
> > On Wed, May 15, 2024 at 9:24 PM Howard Chu <howardchu95@xxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > Here is a little update on --off-cpu.
> > >
> > > > > It would be nice to start landing this work so I'm wondering what the
> > > > > minimal way to do that is. It seems putting behavior behind a flag is
> > > > > a first step.
> > >
> > > The flag to determine output threshold of off-cpu has been implemented.
> > > If the accumulated off-cpu time exceeds this threshold, output the sample
> > > directly; otherwise, save it later for off_cpu_write.
> > >
> > > But adding an extra pass to handle off-cpu samples introduces performance
> > > issues. Here is the processing rate of --off-cpu sampling with the extra
> > > pass (which extracts the raw sample data) and without it. The
> > > --off-cpu-threshold is in nanoseconds.
> > >
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | perf record options                                 | type                                  | process rate         |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 4999 -a                                          | regular samples (w/o extra pass)      | 13128.675 samples/ms |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 1 -a --off-cpu --off-cpu-threshold 100           | offcpu samples (extra pass)           | 2843.247 samples/ms  |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 4999 -a --off-cpu --off-cpu-threshold 100        | offcpu & regular samples (extra pass) | 3910.686 samples/ms  |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 4999 -a --off-cpu --off-cpu-threshold 1000000000 | few offcpu & regular (extra pass)     | 4661.229 samples/ms  |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
>
> What does the process rate mean? Is the sample for the
> off-cpu event or other (cpu-cycles)? Is it from a single CPU
> or system-wide or per-task?

Process rate is just a silly name for how quickly record__pushfn() writes
data from the ring buffer, in samples per millisecond. record__pushfn() is
where I added the extra pass that strips the off-cpu samples out of the
original raw samples collected by eBPF's perf_output.

With the -a option it runs on all CPUs, system-wide. Sorry that I only
tested extreme cases.

I ran perf record with `-F 4999 -a`, `-F 1 -a --off-cpu
--off-cpu-threshold 100`, `-F 4999 -a --off-cpu --off-cpu-threshold 100`,
and `-F 4999 -a --off-cpu --off-cpu-threshold 1000000000`.

`-F 4999 -a` collects only cpu-cycles samples and is the fastest (13128.675
samples/ms) at writing samples to perf.data, because there is no extra pass
stripping extra data from BPF's raw samples.

`-F 1 -a --off-cpu --off-cpu-threshold 100` is mostly off-cpu samples, which
take considerably more time to strip, so it is the slowest (2843.247
samples/ms).

`-F 4999 -a --off-cpu --off-cpu-threshold 100` is roughly half and half. It
has lots of cpu-cycles samples, which need no extra handling (although
there is still the cost of copying memory back and forth), so it is a
little faster than the previous one (3910.686 samples/ms).

`-F 4999 -a --off-cpu --off-cpu-threshold 1000000000` is a blend of a large
number of cpu-cycles samples and only a couple of off-cpu samples. It is the
fastest of the --off-cpu runs (4661.229 samples/ms) but still nowhere near
the original one, which doesn't have the extra pass of off_cpu_strip().
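
To put the four rates side by side, here is a rough sketch that only does
arithmetic on the numbers quoted above (Python for brevity; the variable
names are made up for this mail and are not part of perf):

```python
# Rates quoted in this mail, in samples/ms.
baseline = 13128.675  # -F 4999 -a, no extra pass

# Runs that go through the extra pass in record__pushfn().
rates = {
    "-F 1 -a --off-cpu --off-cpu-threshold 100": 2843.247,
    "-F 4999 -a --off-cpu --off-cpu-threshold 100": 3910.686,
    "-F 4999 -a --off-cpu --off-cpu-threshold 1000000000": 4661.229,
}

# Slowdown factor relative to the run without the extra pass.
slowdown = {cmd: baseline / rate for cmd, rate in rates.items()}

for cmd, factor in slowdown.items():
    print(f"{factor:.2f}x slower: {cmd}")
```

Even the run with only a couple of off-cpu samples comes out a bit under
three times slower than the baseline.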

What I'm trying to say is that stripping/handling off-cpu samples at runtime
is a bad idea, and the extra pass of off_cpu_strip() should be reconsidered.
Reading events one by one, putting samples together, and checking sample_id
and so on introduces a lot of overhead. It should be done at save time.

By the way, the default off_cpu_write() is perfectly fine.

Sorry about the horrible data table and explanation; they will be more
readable next time.

>
> > >
> > > It's not ideal. I will find a way to reduce the overhead, for example by
> > > processing the samples at save time, as Ian mentioned.
> > >
> > > > > To turn the bpf-output samples into off-cpu events there is a pass
> > > > > added to the saving. I wonder if that can be more generic, like a save
> > > > > time perf inject.
> > >
> > > And I will find a default value for such a threshold based on performance
> > > and common use cases.
> > >
> > > > Sounds good. We might add an option to specify the threshold to
> > > > determine whether to dump the data or to save it for later. But ideally
> > > > it should be able to find a good default.
> > >
> > > These will be done before the GSoC kick-off on May 27.
> >
> > This all sounds good. 100ns seems like quite a low threshold and 1s
> > extremely high; it's a shame that such a high threshold makes only a
> > marginal difference to the context switch performance change. I wonder
> > if 100 microseconds may be a more sensible threshold. It's 100 times
> > larger than the cost of 1 context switch but considerably less than a
> > frame redraw at 60 FPS (16 milliseconds).
>
> I don't know what the sensible default is. But 1 msec could be
> another candidate for a similar reason. :)

Sure, I'll give them all a test and see the overhead they cause.
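
For reference, a small sketch of the candidates converted to nanoseconds
(the unit --off-cpu-threshold takes, per the update quoted above), together
with the rule from that update: dump the sample directly when the
accumulated off-cpu time exceeds the threshold, otherwise leave it for
off_cpu_write(). This is Python for illustration only; dumped_directly()
and the 500 us example task are hypothetical and not perf code.

```python
NS_PER_US = 1_000
NS_PER_MS = 1_000_000
NS_PER_S = 1_000_000_000

# Thresholds mentioned in this thread, in nanoseconds.
candidates_ns = {
    "100 ns (tested)": 100,
    "100 us (Ian)": 100 * NS_PER_US,
    "1 ms (Namhyung)": 1 * NS_PER_MS,
    "1 s (tested)": 1 * NS_PER_S,
}


def dumped_directly(off_cpu_ns, threshold_ns):
    """Hypothetical helper: True if the sample would be output directly,
    False if it would be left for off_cpu_write()."""
    return off_cpu_ns > threshold_ns


# Assumed example: a task that accumulated 500 us of off-cpu time.
example_ns = 500 * NS_PER_US

routing = {
    name: dumped_directly(example_ns, threshold)
    for name, threshold in candidates_ns.items()
}

for name, direct in routing.items():
    where = "direct sample" if direct else "off_cpu_write()"
    print(f"{name:>16} -> {where}")
```

So the same 500 us of off-cpu time would be dumped directly with the 100 ns
and 100 us thresholds, and left for off_cpu_write() with 1 ms and 1 s.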

I understand that all I'm talking about is optimization, and that premature
optimization is the root of all evil. However, being almost three times
slower for only a few dozen direct off-CPU samples sounds weird to me.

Thanks,
Howard
>
> Thanks,
> Namhyung