On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.murphy@xxxxxxx> wrote:
On 18/12/2019 4:39 am, Cong Wang wrote:
The IOVA cache algorithm implemented in IOMMU code does not
exactly match the original algorithm described in the paper
"Magazines and Vmem: Extending the Slab Allocator to Many
CPUs and Arbitrary Resources".
Particularly, it doesn't need to free the loaded empty magazine
when trying to put it back to global depot. To make it work, we
have to pre-allocate magazines in the depot and only recycle them
when all of them are full.
Before this patch, rcache->depot[] contains either full or
freed entries, after this patch, it contains either full or
empty (but allocated) entries.
How much additional memory overhead does this impose (particularly on
systems that may have many domains mostly used for large, long-term
mappings)? I'm wary that trying to micro-optimise for the "churn network
packets as fast as possible" case may penalise every other case,
potentially quite badly. Lower-end embedded systems are using IOMMUs in
front of their GPUs, video codecs, etc. precisely because they *don't*
have much memory to spare (and thus need to scrape together large
buffers out of whatever pages they can find).
The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
which is roughly 192K, per domain.
But on the other hand, if we were to go down this route, then why is
there any dynamic allocation/freeing left at all? Once both the depot
and the rcaches are preallocated, then AFAICS it would make more sense
to rework the overflow case in __iova_rcache_insert() to just free the
IOVAs and swap the empty mag around rather than destroying and
recreating it entirely.
It's due to the algorithm requires a swap(), which can't be done with
statically allocated magzine. I had the same thought initially but gave it
up quickly when realized this.
If you are suggesting to change the algorithm, it is not a goal of this
patchset. I do have plan to search for a better algorithm as the IOMMU
performance still sucks (comparing to no IOMMU) after this patchset,
but once again, I do not want to change it in this patchset.
(My ultimate goal is to find a spinlock-free algorithm, otherwise there is
no way to make it close to no-IOMMU performance.)
Perhaps there's a reasonable compromise wherein we don't preallocate,
but still 'free' empty magazines back to the depot, such that busy
domains will quickly reach a steady-state. In fact, having now dug up
the paper at this point of writing this reply, that appears to be what
fig. 3.1b describes anyway - I don't see any mention of preallocating
the depot.
That paper missed a lot of things, it doesn't even recommend a size
of a depot or percpu cache. For implementation, we still have to
think about those details, including whether to preallocate memory.