Re: [PATCH v1 3/7] sched/isolation: Adjust affinity of hrtimers according to change of housekeeping cpumask
From: Thomas Gleixner
Date: Fri May 17 2024 - 19:42:51 EST
On Thu, May 16 2024 at 22:04, Costa Shulyupin wrote:
> Adjust affinity of watchdog_cpumask, hrtimers according to
> change of housekeeping.cpumasks[HK_TYPE_TIMER].
>
> Function migrate_hrtimer_list_except() is prototyped from
> migrate_hrtimer_list() and is more generic.
>
> Potentially it can be used instead of migrate_hrtimer_list.
>
> Function hrtimers_resettle_from_cpu() is blindly prototyped
> from hrtimers_cpu_dying(). local_irq_disable() is used because
> cpuhp_thread_fun() uses it before cpuhp_invoke_callback().
I'm again impressed by the depth of analysis.
Q: What the heck has cpuhp_thread_fun() to do with hrtimers_cpu_dying()?
A: Absolutely nothing.
> Core test snippets without infrastructure:
>
> 1. Create hrtimer on specific cpu with:
>
> set_cpus_allowed_ptr(current, cpumask_of(test_cpu));
> hrtimer_init(&test_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> test_hrtimer.function = test_hrtimer_cb;
> hrtimer_start(&test_hrtimer, -1, HRTIMER_MODE_REL);
>
> 2. Call housekeeping_update()
>
> 3. Assure that there is only tick_nohz_handler on specified cpu
> in /proc/timer_list manually or with script:
>
> grep -E 'cpu| #[0-9]' /proc/timer_list | \
> awk "/cpu:/{y=0};/cpu: $test_cpu\$/{y=1};y"
>
> Another alternative solution to migrate hrtimers:
> 1. Use cpuhp to set sched_timer offline
> 2. Resettle all hrtimers likewise migrate_hrtimer_list
> 3. Use cpuhp to set sched_timer online
Another pile of incomprehensible word salad which pretends to be a
change log.
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -126,10 +126,12 @@ static void resettle_all_timers(cpumask_var_t enable_mask, cpumask_var_t disable
>
> for_each_cpu(cpu, enable_mask) {
> timers_prepare_cpu(cpu);
> + hrtimers_prepare_cpu(cpu);
And yet another lockless re-initialization of an active in use data
structure ...
> +/*
> + * migrate_hrtimer_list_except - migrates hrtimers from one base to another,
> + * except specified one.
> + */
> +static void migrate_hrtimer_list_except(struct hrtimer_clock_base *old_base,
> + struct hrtimer_clock_base *new_base, struct hrtimer *except)
> +{
> + struct hrtimer *timer;
> + struct timerqueue_node *node;
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variable-declarations
> + node = timerqueue_getnext(&old_base->active);
> + while (node) {
> + timer = container_of(node, struct hrtimer, node);
> + node = timerqueue_iterate_next(node);
> + if (timer == except)
> + continue;
What's the rationale that there is only a single timer to except from
the migration?
> + BUG_ON(hrtimer_callback_running(timer));
Q: What prevents that a hrtimer callback is running on the CPU which was
not isolated before?
A: Nothing. Ergo this is prone to kill a perfectly correct system just
because of blindly copying something.
At least your attempt of a changelog is clear about that...
> + debug_deactivate(timer);
> +
> + /*
> + * Mark it as ENQUEUED not INACTIVE otherwise the
> + * timer could be seen as !active and just vanish away
> + * under us on another CPU
> + */
> + __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
> + timer->base = new_base;
> + /*
> + * Enqueue the timers on the new cpu. This does not
> + * reprogram the event device in case the timer
> + * expires before the earliest on this CPU, but we run
> + * hrtimer_interrupt after we migrated everything to
> + * sort out already expired timers and reprogram the
> + * event device.
> + */
> + enqueue_hrtimer(timer, new_base, HRTIMER_MODE_ABS);
> + }
> +}
> +
> +/**
> + * hrtimers_resettle_from_cpu - resettles hrtimers from
> + * specified cpu to housekeeping cpus.
> + */
> +void hrtimers_resettle_from_cpu(unsigned int isol_cpu)
> +{
> + int ncpu, i;
> + struct tick_sched *ts = tick_get_tick_sched(isol_cpu);
> + struct hrtimer_cpu_base *old_base, *new_base;
> +
> + local_irq_disable();
> + ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
> +
> + old_base = &per_cpu(hrtimer_bases, isol_cpu);
> + new_base = &per_cpu(hrtimer_bases, ncpu);
> +
> + /*
> + * The caller is globally serialized and nobody else
> + * takes two locks at once, deadlock is not possible.
> + */
> + raw_spin_lock(&old_base->lock);
> + raw_spin_lock_nested(&new_base->lock, SINGLE_DEPTH_NESTING);
> + for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
> + migrate_hrtimer_list_except(&old_base->clock_base[i],
> + &new_base->clock_base[i],
> + &ts->sched_timer);
> + }
> +
> + /*
> + * The migration might have changed the first expiring softirq
> + * timer on this CPU. Update it.
> + */
> + __hrtimer_get_next_event(new_base, HRTIMER_ACTIVE_SOFT);
> +
> + raw_spin_unlock(&new_base->lock);
> + raw_spin_unlock(&old_base->lock);
> + local_irq_enable();
> +
> + /* Tell the other CPU to retrigger the next event */
> + smp_call_function_single(ncpu, retrigger_next_event, NULL, 0);
> +}
We clearly need another copy of hrtimers_cpu_dying() and
migrate_hrtimer_list() to add a local_irq_dis/enable() pair and a
completely ill defined exclusion mechanism which assumes that the tick
hrtimer is the only one which has to be excluded from migration.
Even if the exlusion mechanism would be correct there is ZERO
justification for this copy and pasta orgy even if you marked this
series RFC. RFC != POC hackery.
Thanks,
tglx