Re: [RESEND PATCH v6 08/12] x86/fsgsbase/64: Use the per-CPU base as GSBASE at the paranoid_entry

From: Thomas Gleixner
Date: Fri Apr 05 2019 - 04:36:29 EST


On Mon, 25 Mar 2019, Thomas Gleixner wrote:
> On Fri, 15 Mar 2019, Chang S. Bae wrote:
> > ENTRY(paranoid_exit)
> > UNWIND_HINT_REGS
> > DISABLE_INTERRUPTS(CLBR_ANY)
> > TRACE_IRQS_OFF_DEBUG
> > + ALTERNATIVE "jmp .Lparanoid_exit_no_fsgsbase", "nop",\
> > + X86_FEATURE_FSGSBASE
> > + wrgsbase %rbx
> > + jmp .Lparanoid_exit_no_swapgs;
>
> Again. A few newlines would make it more readable.
>
> This modifies the semantics of paranoid_entry and paranoid_exit. Looking at
> the usage sites there is the following code in the nmi maze:
>
> /*
> * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
> * as we should not be calling schedule in NMI context.
> * Even with normal interrupts enabled. An NMI should not be
> * setting NEED_RESCHED or anything that normal interrupts and
> * exceptions might do.
> */
> call paranoid_entry
> UNWIND_HINT_REGS
>
> /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
> movq %rsp, %rdi
> movq $-1, %rsi
> call do_nmi
>
> /* Always restore stashed CR3 value (see paranoid_entry) */
> RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
>
> testl %ebx, %ebx /* swapgs needed? */
> jnz nmi_restore
> nmi_swapgs:
> SWAPGS_UNSAFE_STACK
> nmi_restore:
> POP_REGS
>
> I might be missing something, but how is that supposed to work when
> paranoid_entry uses FSGSBASE? I think it's broken, but if it's not then
> there is a big fat comment missing explaining why.

So this _is_ broken.

On entry:

rbx = rdgsbase()
wrgsbase(KERNEL_GS)

On exit:

if (ebx == 0)
swapgs

The resulting matrix:

| ENTRY GS | RBX | EXIT | GS on IRET | RESULT
| | | | |
1 | KERNEL_GS | KERNEL_GS | EBX == 0 | USER_GS | FAIL
| | | | |
2 | KERNEL_GS | KERNEL_GS | EBX != 0 | KERNEL_GS | ok
| | | | |
3 | USER_GS | USER_GS | EBX == 0 | USER_GS | ok
| | | | |
4 | USER_GS | USER_GS | EBX != 0 | KERNEL_GS | FAIL


#1 Just works by chance because it's unlikely that the lower 32bits of a
per CPU kernel GS are all 0.

But it's just a question of probability that this turns into a
non-debuggable once per year crash (think KASLR).

#4 This can happen when the NMI hits the kernel in some other entry code
_BEFORE_ or _AFTER_ swapgs.

User space using GS addressing with GS[31:0] != 0 will crash and burn.

IIRC FSGSBASE is about fast user space GS switching with (almost) no
limits on the value ...

Oh well.

tglx