On Fri, May 10, 2024 at 10:13:02AM -0700, Yang Shi wrote:
> On 5/10/24 5:11 AM, Catalin Marinas wrote:
> > On Tue, May 07, 2024 at 03:35:58PM -0700, Yang Shi wrote:
> > > The atomic RMW instructions, for example, ldadd, actually does load +
> > > add + store in one instruction, it may trigger two page faults, the
> > > first fault is a read fault, the second fault is a write fault.
> > >
> > > Some applications use atomic RMW instructions to populate memory, for
> > > example, openjdk uses atomic-add-0 to do pretouch (populate heap memory
> > > at launch time) between v18 and v22.
> >
> > I'd also argue that this should be optimised in openjdk. Is an LDADD
> > more efficient on your hardware than a plain STR? I hope it only does
> > one operation per page rather than per long. There's also MAP_POPULATE
> > that openjdk can use to pre-fault the pages with no additional fault.
> > This would be even more efficient than any store or atomic operation.
>
> It is not about whether atomic is more efficient than plain store on our
> hardware or not. It is arch-independent solution used by openjdk.

It may be arch independent but it's not a great choice. If you run this
on pre-LSE atomics hardware (ARMv8.0), this operation would involve
LDXR+STXR and there's no way for the kernel to "upgrade" it to a write
operation on the first LDXR fault.

It would be good to understand why openjdk is doing this instead of a
plain write. Is it because it may be racing with some other threads
already using the heap? That would be a valid pattern.
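
For illustration only (this is not openjdk's code and not part of the
patch), the two pre-fault approaches discussed above can be sketched in
user space as follows. The helper names are hypothetical. The atomic
variant touches one word per page with an add of 0, so it cannot
clobber data that another thread may already have written; the
MAP_POPULATE variant asks the kernel to pre-fault the mapping at mmap()
time.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: touch one word per page with an atomic add of 0.
 * The stored value never changes, so this stays safe even if other
 * threads are already using the range.
 */
void pretouch_atomic_add0(void *base, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	for (size_t off = 0; off < len; off += page) {
		_Atomic long *p = (_Atomic long *)((char *)base + off);

		atomic_fetch_add_explicit(p, 0, memory_order_relaxed);
	}
}

/*
 * Hypothetical helper: map anonymous memory and ask the kernel to
 * pre-fault it via MAP_POPULATE. Returns NULL on failure.
 */
void *map_prefaulted(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

How the atomic add is lowered depends on the compiler and target flags:
on arm64 it may become an LSE LDADD or an LDXR/STXR loop, which is the
distinction the reply above turns on.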
> > Not sure the reason for the architecture to report a read fault only on
> > atomics. Looking at the pseudocode, it checks for both but the read
> > permission takes priority. Also in case of a translation fault (which is
> > what we get on the first fault), I think the syndrome write bit is
> > populated as (!read && write), so 0 since 'read' is 1 for atomics.
>
> Yeah, I'm confused too. Triggering write fault in the first place should be
> fine, right? Can we update the spec?

As you noticed, even if we change the spec, we still have the old
hardware. Also, changing the spec would probably need to come with a new
CPUID field since that's software visible. I'll raise it with the
architects, maybe in the future it will allow us to skip the instruction
read.
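
As a toy model of the syndrome behaviour described above (an
assumption taken from the "(!read && write)" remark in this thread, not
the architecture pseudocode; the function name is hypothetical):

```c
#include <stdbool.h>

/*
 * Toy model: the write bit reported for a translation fault, assuming
 * it is computed as (!read && write). An atomic RMW access has both
 * read and write set, so the reported bit comes out as 0.
 */
bool wnr_for_translation_fault(bool read, bool write)
{
	return !read && write;
}
```

Under this model a plain store (read = false, write = true) reports a
write fault, while an atomic RMW (read = true, write = true) reports a
read fault, which is why the kernel cannot tell it apart from a load
without looking at the instruction.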
> > > But the double page fault has some problems:
> > >
> > > 1. Noticeable TLB overhead. The kernel actually installs zero page with
> > >    readonly PTE for the read fault. The write fault will trigger a
> > >    write-protection fault (CoW). The CoW will allocate a new page and
> > >    make the PTE point to the new page, this needs TLB invalidations. The
> > >    tlb invalidation and the mandatory memory barriers may incur
> > >    significant overhead, particularly on the machines with many cores.
> >
> > I can see why the current behaviour is not ideal but I can't tell why
> > openjdk does it this way either.
> >
> > A bigger hammer would be to implement mm_forbids_zeropage() but this may
> > affect some workloads that rely on sparsely populated large arrays.
>
> But we still needs to decode the insn, right? Or you mean forbid zero page
> for all read fault? IMHO, this may incur noticeable overhead for read fault
> since the fault handler has to allocate real page every time.

The current kernel mm_forbids_zeropage() is a big knob irrespective of
the instruction triggering the fault.
> > > diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> > > index db1aeacd4cd9..5d5a3fbeecc0 100644
> > > --- a/arch/arm64/include/asm/insn.h
> > > +++ b/arch/arm64/include/asm/insn.h
> > > @@ -319,6 +319,7 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \
> > >   * "-" means "don't care"
> > >   */
> > >  __AARCH64_INSN_FUNCS(class_branch_sys, 0x1c000000, 0x14000000)
> > > +__AARCH64_INSN_FUNCS(class_atomic, 0x3b200c00, 0x38200000)
> >
> > This looks correct, it covers the LDADD and SWP instructions. However,
> > one concern is whether future architecture versions will add some
> > instructions in this space that are allowed to do a read only operation
> > (e.g. skip writing if the value is the same or fails some comparison).
>
> I think we can know the instruction by decoding it, right? Then we can
> decide whether force write fault or not by further decoding.

Your mask above covers unallocated opcodes, we don't know what else will
get in there in the future, whether we get instructions that only do
reads. We could ask for clarification from the architects but I doubt
they'd commit to allocating it only to instructions that do a write in
this space. The alternative is to check for the individual instructions
already allocated in here (after the big mask check above) but this will
increase the fault cost a bit.

There are CAS and CASP variants that also require a write permission
even if they fail the check. We should cover them as well.
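
To make the mask discussion concrete, here is a user-space sketch of
the test that the quoted __AARCH64_INSN_FUNCS(class_atomic, ...) line
would generate. The mask and value are copied from the quoted diff; the
function name is a hypothetical stand-in, and the encodings in the note
below were assembled by hand for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask/value pair copied from the quoted class_atomic definition. */
#define CLASS_ATOMIC_MASK	0x3b200c00u
#define CLASS_ATOMIC_VALUE	0x38200000u

/*
 * Hypothetical stand-in for the generated helper: true when the
 * instruction word falls inside the encoding space selected by the
 * mask. It says nothing about whether the opcode is allocated.
 */
bool insn_is_class_atomic(uint32_t insn)
{
	return (insn & CLASS_ATOMIC_MASK) == CLASS_ATOMIC_VALUE;
}
```

With this check, LDADD x0, x1, [x2] (0xf8200041) and SWP x0, x1, [x2]
(0xf8208041) match, a plain STR x1, [x2] (0xf9000041) does not, and
neither does CAS x0, x1, [x2] (0xc8a07c41), which is consistent with
the remark that CAS and CASP would need to be covered separately.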
[...]
> > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > > index 8251e2fea9c7..f7bceedf5ef3 100644
> > > --- a/arch/arm64/mm/fault.c
> > > +++ b/arch/arm64/mm/fault.c
> > > @@ -529,6 +529,7 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > >  	unsigned int mm_flags = FAULT_FLAG_DEFAULT;
> > >  	unsigned long addr = untagged_addr(far);
> > >  	struct vm_area_struct *vma;
> > > +	unsigned int insn;
> > >  	if (kprobe_page_fault(regs, esr))
> > >  		return 0;
> > > @@ -586,6 +587,24 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
> > >  	if (!vma)
> > >  		goto lock_mmap;
> > > +	if (mm_flags & (FAULT_FLAG_WRITE | FAULT_FLAG_INSTRUCTION))
> > > +		goto continue_fault;
> > > +
> > > +	pagefault_disable();
> >
> > This prevents recursively entering do_page_fault() but it may be worth
> > testing it with an execute-only permission.
>
> You mean the text section permission of the test is executive only?

Yes. Not widely used though.

A point Will raised was on potential ABI changes introduced by this
patch. The ESR_EL1 reported to user remains the same as per the hardware
spec (read-only), so from a SIGSEGV we may have some slight behaviour
changes:

1. PTE invalid:

   a) vma is VM_READ && !VM_WRITE permission - SIGSEGV reported with
      ESR_EL1.WnR == 0 in sigcontext with your patch. Without this
      patch, the PTE is mapped as PTE_RDONLY first and a subsequent
      fault will report SIGSEGV with ESR_EL1.WnR == 1.

   b) vma is !VM_READ && !VM_WRITE permission - SIGSEGV reported with
      ESR_EL1.WnR == 0, so no change from current behaviour, unless we
      fix the patch for (1.a) to fake the WnR bit which would change the
      current expectations.

2. PTE valid with PTE_RDONLY - we get a normal writeable fault in
   hardware, no need to fix ESR_EL1 up.

The patch would have to address (1) above but faking the ESR_EL1.WnR bit
based on the vma flags looks a bit fragile.
Similarly, we have userfaultfd that reports the fault to user. I think
in scenario (1) the kernel will report UFFD_PAGEFAULT_FLAG_WRITE with
your patch but no UFFD_PAGEFAULT_FLAG_WP. Without this patch, there are
indeed two faults, with the second having both UFFD_PAGEFAULT_FLAG_WP
and UFFD_PAGEFAULT_FLAG_WRITE set.