Re: [PATCH 1/2] cpuidle : auto-promotion for cpuidle states

From: Abhishek
Date: Thu Apr 04 2019 - 07:10:57 EST

Next message: Jann Horn: "[PATCH] x86/microcode: Refactor Intel microcode loading"
Previous message: Peter Zijlstra: "Re: WARN_ON_ONCE() hit at kernel/events/core.c:330"
In reply to: Gautham R Shenoy: "Re: [PATCH 1/2] cpuidle : auto-promotion for cpuidle states"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 04/04/2019 03:51 PM, Daniel Lezcano wrote:

Hi Abhishek,

thanks for taking the time to test the different scenarios and give us
the numbers.

On 01/04/2019 07:11, Abhishek wrote:

On 03/22/2019 06:56 PM, Daniel Lezcano wrote:

On 22/03/2019 10:45, Rafael J. Wysocki wrote:

On Fri, Mar 22, 2019 at 8:31 AM Abhishek Goel
<huntbag@xxxxxxxxxxxxxxxxxx> wrote:

Currently, the cpuidle governors (menu/ladder) determine what idle state
an idling CPU should enter into based on heuristics that depend on the
idle history on that CPU. Given that no predictive heuristic is perfect,
there are cases where the governor predicts a shallow idle state, hoping
that the CPU will be busy soon. However, if no new workload is scheduled
on that CPU in the near future, the CPU will end up in the shallow state.

In case of POWER, this is problematic, when the predicted state in the
aforementioned scenario is a lite stop state, as such lite states will
inhibit SMT folding, thereby depriving the other threads in the core from
using the core resources.

I can understand that an idle state can prevent other threads from using the
core resources. But why doesn't a deeper idle state prevent this as well?

To address this, such lite states need to be autopromoted. The cpuidle-core
can queue a timer to correspond with the residency value of the next
available state, thus leading to auto-promotion to a deeper idle state as
soon as possible.

Isn't the tick stopping avoidance sufficient for that?

I was about to ask the same :)

Thanks for the review.
I performed experiments for three scenarios to collect some data.

case 1 :
Without this patch and without the tick retained, i.e. on an upstream kernel,
it could take more than a second to get out of stop0_lite.

case 2 : With tick retained (as suggested)

Generally, we have a sched tick every 4ms (CONFIG_HZ = 250). Ideally, I expected
it to take 8 sched ticks to get out of stop0_lite. Experimentally, the
observation was:

===================================
min         max         99th percentile
4ms         12ms        4ms
===================================
*ms = milliseconds

It would take at least one sched tick to get out of stop0_lite.
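
For reference, the tick arithmetic behind these numbers can be sketched as
follows. This is illustrative Python only, not kernel code, and the helper
name is made up for the example. With HZ = 250 one tick is 4ms, so the
expected 8 ticks would be 32ms, while the observed 4ms to 12ms corresponds to
1 to 3 ticks.

```python
def tick_period_ms(hz):
    """Length of one scheduler tick in milliseconds for a given HZ value."""
    return 1000 / hz


period = tick_period_ms(250)                # 4.0 ms per tick
expected_ms = 8 * period                    # 32.0 ms if all 8 ticks were needed
observed_ticks = (4 / period, 12 / period)  # min and max from the table, in ticks

print(period, expected_ms, observed_ticks)
```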

case 3 : With this patch (not stopping the tick, but explicitly queuing a
timer)

===================================
min         max         99.5th percentile
144us       192us       144us
===================================
*us = microseconds

In this patch, we queue a timer just before entering into a stop0_lite
state. The timer fires at (residency of next available state + exit
latency of next available state * 2).
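
A minimal sketch of that timeout computation, in illustrative Python rather
than the actual patch code (the function and parameter names are
hypothetical). The example values are the stop0 figures quoted further down
in this message: 20us residency and 2us exit latency.

```python
def auto_promotion_timeout_us(next_residency_us, next_exit_latency_us):
    """Timer timeout: residency of the next available state plus twice
    its exit latency, as described in the message."""
    return next_residency_us + 2 * next_exit_latency_us


# stop0 figures from this message: residency 20us, exit latency 2us
print(auto_promotion_timeout_us(20, 2))  # 24
```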

So for the context, we have a similar issue but from the power
management point of view where a CPU can stay in a shallow state for a
long period, thus consuming a lot of energy.

The window was reduced by not stopping the tick when a shallow
state is selected. Unfortunately, if the tick is stopped and we
exit/enter again and we select a shallow state, the situation is the same.

A solution with a timer, like this patch, was proposed and merged some
years ago, but there were complaints about a bad performance impact, so
it was reverted.

Let's say the next state (stop0) is available and has a residency of 20us. It
should then get out in as low as (20 + 2*2) * 8 = 192us [based on the formula
(residency + 2 x latency) * history length]. Ideally we would expect 8
iterations, but it was observed to get out in 6-7 iterations.
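
The arithmetic can be checked with a small sketch (illustrative Python, not
the patch code; the names are hypothetical and the history length of 8 is
taken from the formula above). Six iterations give 144us and eight give
192us, which lines up with the min and max reported for the patched case.

```python
HISTORY_LENGTH = 8  # iterations assumed by the formula in this message


def promotion_delay_us(residency_us, exit_latency_us, iterations=HISTORY_LENGTH):
    """Total delay before promotion: (residency + 2 * latency) * iterations."""
    return (residency_us + 2 * exit_latency_us) * iterations


# stop0: residency 20us, exit latency 2us
for n in (6, 7, 8):
    print(n, promotion_delay_us(20, 2, n))
# 6 144
# 7 168
# 8 192
```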

Can you explain the formula? I don't get the rationale. Why use the
exit latency, and why multiply it by 2?

Why is the timer not set to the next state's target residency value?

The idea behind multiplying by 2 is that entry latency + exit latency =
2 * exit latency, i.e., assuming entry latency = exit latency.
So in effect, we are using target residency + 2 * exit latency as the
timeout of the timer. Latency is generally <= 10% of residency. I have tried
to be conservative by including the latency factor in the timeout
computation. Thus, this formula gives a slightly greater value than
directly using the residency of the target state.

--Abhishek
