Re: [PATCH v2] KVM: SEV-ES: Don't intercept MSR_IA32_DEBUGCTLMSR for SEV-ES guests
From: Ravi Bangoria
Date: Fri May 17 2024 - 02:19:02 EST
Hi Sean,
Apologies for the delay in reply.
On 08-May-24 12:37 AM, Sean Christopherson wrote:
> On Mon, May 06, 2024, Ravi Bangoria wrote:
>> On 03-May-24 5:21 AM, Sean Christopherson wrote:
>>> On Tue, Apr 16, 2024, Ravi Bangoria wrote:
>>>> Currently, LBR Virtualization is dynamically enabled and disabled for
>>>> a vcpu by intercepting writes to MSR_IA32_DEBUGCTLMSR. This helps by
>>>> avoiding unnecessary save/restore of LBR MSRs when nobody is using it
>>>> in the guest. However, SEV-ES guest mandates LBR Virtualization to be
>>>> _always_ ON[1] and thus this dynamic toggling doesn't work for SEV-ES
>>>> guest, in fact it results into fatal error:
>>>>
>>>> SEV-ES guest on Zen3, kvm-amd.ko loaded with lbrv=1
>>>>
>>>> [guest ~]# wrmsr 0x1d9 0x4
>>>> KVM: entry failed, hardware error 0xffffffff
>>>> EAX=00000004 EBX=00000000 ECX=000001d9 EDX=00000000
>>>> ...
>>>>
>>>> Fix this by never intercepting MSR_IA32_DEBUGCTLMSR for SEV-ES guests.
>>>
>>> Uh, what? I mean, sure, it works, maybe, I dunno. But there's a _massive_
>>> disconnect between the first paragraph and this statement.
>>>
>>> Oh, good gravy, it "works" because SEV already forces LBR virtualization.
>>>
>>> svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
>>>
>>> (a) the changelog needs to call that out.
>>
>> Sorry, I should have called that out explicitly.
>>
>>> (b) KVM needs to disallow SEV-ES if
>>> LBR virtualization is disabled by the admin, i.e. if lbrv=false.
>>
>> That's what I initially thought. But since KVM currently allows booting SEV-ES
>> guests even when lbrv=0 (by silently ignoring lbrv value), erroring out would
>> be a behavior change.
>
> IMO, that's totally fine. There are no hard guarantees regarding module params,
Sure. I will prepare a patch to remove lbrv module parameter.
>>> Alternatively, I would be a-ok simply deleting lbrv, e.g. to avoid yet more
>>> printks about why SEV-ES couldn't be enabled.
>>>
>>> Hmm, I'd probably be more than ok. Because AMD (thankfully, blessedly) uses CPUID
>>> bits for SVM features, the admin can disable LBRV via clear_cpuid (or whatever it's
>>> called now). And there are hardly any checks on the feature, so it's not like
>>> having a boolean saves anything. AMD is clearly committed to making sure LBRV
>>> works, so the odds of KVM really getting much value out of a module param is low.
>>
>> Currently, lbrv is not enabled by default with model specific -cpu profiles in
>> qemu. So I guess this is not backward compatible?
>
> I am talking about LBRV being disabled in the _host_ kernel, not guest CPUID.
> QEMU enabling LBRV only affects nested SVM, which is out of scope for SEV-ES.
Got it.
>>> And then when you delete lbrv, please add a WARN_ON_ONCE() sanity check in
>>> sev_hardware_setup() (if SEV-ES is supported), because like the DECODEASSISTS
>>> and FLUSHBYASID requirements, it's not super obvious that LBRV is a hard
>>> requirement for SEV-ES (that's an understatment; I'm curious how some decided
>>> that LBR virtualization is where the line go drawn for "yeah, _this_ is mandatory").
>>
>> I'm not sure. Some ES internal dependency.
>>
>> In any case, the patch simply fixes 'missed clearing MSR Interception' for
>> SEV-ES guests. So, would it be okay to apply this patch as is and do lbrv
>> cleanup as a followup series?
>
> No.
>
> (a) the lbrv module param mess needs to be sorted out.
> (b) this is not a complete fix.
> (c) I'm not convinced it's the right way to fix this, at all.
> (d) there's a big gaping hole in KVM's handling of MSRs that are passed through
> to SEV-ES guests.
> (e) it's not clear to me that KVM needs to dynamically toggle LBRV for _any_ guest.
> (f) I don't like that sev_es_init_vmcb() mucks with the LBRV intercepts without
> using svm_enable_lbrv().
>
> Unless I'm missing something, KVM allows userspace to get/set MSRs for SEV-ES
> guests, even after the VMSA is encrypted. E.g. a naive userspace could attempt
> to migrate MSR_IA32_DEBUGCTLMSR and end up unintentionally disabling LBRV on the
> target. The proper fix for VMSA being encrypted is to likely to disallow
> KVM_{G,S}ET_MSR on MSRs that are contexted switched via the VMSA.
>
> But that doesn't address the issue where KVM will disable LBRV if userspace
> sets MSR_IA32_DEBUGCTLMSR before the VMSA is encrypted. The easiest fix for
> that is to have svm_disable_lbrv() do nothing for SEV-ES guests, but I'm not
> convinced that's the best fix.
Agreed, 1) KVM_GET/SET_MSR for SEV-ES guest after VMSA encrypted and 2) the
window between setting LBRV to VMSA encryption, both are valid issues. I've
prepared a draft patch, attached at the end, can you please review.
> AFAICT, host perf doesn't use the relevant MSRs, and even if host perf did use
> the MSRs, IIUC there is no "stack", and #VMEXIT retains the guest values for
> non-SEV-ES guests. I.e. functionally, running with and without LBRV would be
> largely equivalent as far as perf is concerned. The guest could scribble an MSR
> with garbage, but overall, host perf wouldn't be meaningfully affected by LBRV.
FWIW, AMD has multiple versions of LBRs with virt support:
- Legacy LBR (1 deep, No freeze support on PMI)
- LBR Stack (16 deep, Has freeze support on PMI).
Both are independent and perf uses only LBR Stack.
> So unless I'm missing something, the only reason to ever disable LBRV would be
> for performance reasons. Indeed the original commits more or less says as much:
>
> commit 24e09cbf480a72f9c952af4ca77b159503dca44b
> Author: Joerg Roedel <joerg.roedel@xxxxxxx>
> AuthorDate: Wed Feb 13 18:58:47 2008 +0100
>
> KVM: SVM: enable LBR virtualization
>
> This patch implements the Last Branch Record Virtualization (LBRV) feature of
> the AMD Barcelona and Phenom processors into the kvm-amd module. It will only
> be enabled if the guest enables last branch recording in the DEBUG_CTL MSR. So
> there is no increased world switch overhead when the guest doesn't use these
> MSRs.
>
> but what it _doesn't_ say is what the world switch overhead is when LBRV is
> enabled. If the overhead is small, e.g. 20 cycles?, then I see no reason to
> keep the dynamically toggling.
>
> And if we ditch the dynamic toggling, then this patch is unnecessary to fix the
> LBRV issue. It _is_ necessary to actually let the guest use the LBRs, but that's
> a wildly different changelog and justification.
The overhead might be less for legacy LBR. But upcoming hw also supports
LBR Stack Virtualization[1]. LBR Stack has total 34 MSRs (two control and
16*2 stack). Also, Legacy and Stack LBR virtualization both are controlled
through the same VMCB bit. So I think I still need to keep the dynamic
toggling for LBR Stack virtualization.
[1] AMD64 Architecture Programmer's Manual Pub. 40332, Rev. 4.07 - June
2023, Vol 2, 15.23 Last Branch Record Virtualization
> And if we _don't_ ditch the dynamic toggling, then sev_es_init_vmcb() should be
> using svm_enable_lbrv(), not open coding the exact same thing.
Agreed. The patch below covers this change.
---
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 759581bb2128..7e549ca0a4e9 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -666,6 +666,14 @@ static int __sev_launch_update_vmsa(struct kvm *kvm, struct kvm_vcpu *vcpu,
return ret;
vcpu->arch.guest_state_protected = true;
+
+ /*
+ * SEV-ES guest mandates LBR Virtualization to be _always_ ON. Enable
+ * it after setting guest_state_protected because KVM_SET_MSRS allows
+ * dynamic toggeling of LBRV (for performance reason) on write access
+ * to MSR_IA32_DEBUGCTLMSR when guest_state_protected is not set.
+ */
+ svm_enable_lbrv(vcpu);
return 0;
}
@@ -3034,7 +3042,6 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
struct kvm_vcpu *vcpu = &svm->vcpu;
svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ES_ENABLE;
- svm->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
/*
* An SEV-ES guest requires a VMSA area that is a separate from the
@@ -3086,10 +3093,6 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm)
/* Clear intercepts on selected MSRs */
set_msr_interception(vcpu, svm->msrpm, MSR_EFER, 1, 1);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_CR_PAT, 1, 1);
- set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1);
- set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
- set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
- set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
}
void sev_init_vmcb(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9aaf83c8d57d..4a8bd32dfa96 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -99,6 +99,7 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_IA32_SPEC_CTRL, .always = false },
{ .index = MSR_IA32_PRED_CMD, .always = false },
{ .index = MSR_IA32_FLUSH_CMD, .always = false },
+ { .index = MSR_IA32_DEBUGCTLMSR, .always = false },
{ .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP, .always = false },
@@ -990,7 +991,7 @@ void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
vmcb_mark_dirty(to_vmcb, VMCB_LBR);
}
-static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
+void svm_enable_lbrv(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
@@ -1000,6 +1001,9 @@ static void svm_enable_lbrv(struct kvm_vcpu *vcpu)
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
+ if (sev_es_guest(vcpu->kvm))
+ set_msr_interception(vcpu, svm->msrpm, MSR_IA32_DEBUGCTLMSR, 1, 1);
+
/* Move the LBR msrs to the vmcb02 so that the guest can see them. */
if (is_guest_mode(vcpu))
svm_copy_lbrs(svm->vmcb, svm->vmcb01.ptr);
@@ -1009,6 +1013,8 @@ static void svm_disable_lbrv(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ KVM_BUG_ON(sev_es_guest(vcpu->kvm), vcpu->kvm);
+
svm->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK;
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0);
set_msr_interception(vcpu, svm->msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
@@ -2821,10 +2827,24 @@ static int svm_get_msr_feature(struct kvm_msr_entry *msr)
return 0;
}
+static bool
+sev_es_prevent_msr_access(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ return sev_es_guest(vcpu->kvm) &&
+ vcpu->arch.guest_state_protected &&
+ svm_msrpm_offset(msr_info->index) != MSR_INVALID &&
+ !msr_write_intercepted(vcpu, msr_info->index);
+}
+
static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ if (sev_es_prevent_msr_access(vcpu, msr_info)) {
+ msr_info->data = 0;
+ return 0;
+ }
+
switch (msr_info->index) {
case MSR_AMD64_TSC_RATIO:
if (!msr_info->host_initiated &&
@@ -2975,6 +2995,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
u32 ecx = msr->index;
u64 data = msr->data;
+
+ if (sev_es_prevent_msr_access(vcpu, msr))
+ return 0;
+
switch (ecx) {
case MSR_AMD64_TSC_RATIO:
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 33878efdebc8..7b2c55dd8242 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -30,7 +30,7 @@
#define IOPM_SIZE PAGE_SIZE * 3
#define MSRPM_SIZE PAGE_SIZE * 2
-#define MAX_DIRECT_ACCESS_MSRS 47
+#define MAX_DIRECT_ACCESS_MSRS 48
#define MSRPM_OFFSETS 32
extern u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
extern bool npt_enabled;
@@ -543,6 +543,7 @@ u32 *svm_vcpu_alloc_msrpm(void);
void svm_vcpu_init_msrpm(struct kvm_vcpu *vcpu, u32 *msrpm);
void svm_vcpu_free_msrpm(u32 *msrpm);
void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb);
+void svm_enable_lbrv(struct kvm_vcpu *vcpu);
void svm_update_lbrv(struct kvm_vcpu *vcpu);
int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer);
---
This is just a draft patch. I'll add logic to bail out SEV-ES guest creation
when LBRV is not supported by host, remove lbrv module parameter etc.
Thanks,
Ravi