Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
From: Amir Goldstein
Date: Wed May 22 2024 - 06:16:33 EST
On Wed, May 22, 2024 at 11:58 AM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>
> On Wed, 22 May 2024 at 04:05, John Groves <John@xxxxxxxxxx> wrote:
> > I'm happy to help with that if you care - ping me if so; getting a VM running
> > in EFI mode is not necessary if you reserve the dax memory via memmap=, or
> > via libvirt xml.
>
> Could you please give an example?
>
> I use a raw qemu command line with a -kernel option and a root fs
> image (not a disk image with a bootloader).
>
>
> > More generally, a famfs file extent is [daxdev, offset, len]; there may
> > be multiple extents per file, and in the future this definitely needs to
> > generalize to multiple daxdev's.
> >
> > Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
> > I think)...
> >
> > A single backing device (daxdev) will contain extents of many famfs
> > files (plus metadata - currently a superblock and a log). I'm not sure
> > it's realistic to have a backing daxdev "open" per famfs file.
>
> That's exactly what I was saying.
>
> The passthrough interface was deliberately done in a way to separate
> the mapping into two steps:
>
> 1) registering the backing file (which could be a device)
>
> 2) mapping from a fuse file to a registered backing file
>
> Step 1 can happen at any time, while step 2 currently happens at open,
> but for various other purposes like metadata passthrough it makes
> sense to allow the mapping to happen at lookup time and be cached for
> the lifetime of the inode.
>
> > In addition there is:
> >
> > - struct dax_holder_operations - to allow a notify_failure() upcall
> > from dax. This provides the critical capability to shut down famfs
> > if there are memory errors. This is filesystem- (or technically daxdev-
> > wide)
>
> This can be hooked into fuse_is_bad().
>
> > - The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
> > and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
> > I strongly suspect that famfs_fuse can't be correct unless it uses
> > this path rather than just the idea of a single backing file.
>
> Agreed.
>
> > - the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
> > character devdax.
>
> You'll need to channel those patches through the respective
> maintainers, preferably before the fuse parts are merged.
>
> > - Note that dax devices, unlike files, don't support read/write - only
> > mmap(). I suspect (though I'm still pretty ignorant) that this means
> > we can't just treat the dax device as an extent-based backing file.
>
> Doesn't matter, it'll use the iomap infrastructure instead of the
> passthrough infrastructure.
>
> But the interfaces for regular passthrough and fsdax could be shared.
> Conceptually they are very similar: there's a backing store indexable
> with byte offsets.
>
> What's currently missing from the API is an extent list in
> fuse_open_out. The format could be:
>
> [ {backing_id, offset, length}, ... ]
>
> allowing each extent to map to a different backing device.
>
> > A dax device to famfs is a lot more like a backing device for a "filesystem"
> > than a backing file for another file. And, as previously mentioned, there
> > is the iomap_ops interface and the holder_ops interface that deal with
> > multiple file tenants on a dax device (plus error notification,
> > respectively)
> >
> > Probably doable, but important distinctions...
>
> Yeah, that's why I suggested to create a new source file for this
> within fs/fuse. Alternatively we could try splitting up fuse into
> modules (core, virtiofs, cuse, fsdax) but I think that can be left as
> a cleanup step.
>
> > First question: can you suggest an example fuse file pass-through
> > file system that I might use as a jumping-off point? Something that
> > gets the basic pass-through capability from which to start hacking
> > in famfs/dax capabilities?
>
> An example is in Amir's libfuse repo at
>
> https://github.com/libfuse/libfuse
>
That's not my repo, it's the official one ;-)
but yeh, my passthrough example got merged last week:
https://github.com/libfuse/libfuse/pull/919
> > I'm confused by the last item. I would think there would be a fuse
> > inode per famfs file, and that multiple of those would map to separate
> > extent lists of one or more backing dax devices.
>
> Yeah.
>
> > Or maybe I misunderstand the meaning of "fuse inode". Feel free to
> > assign reading...
>
> I think Amir meant that each open file could in theory have a
> different mapping. This is allowed by the fuse interface, but is
> disallowed in practice.
>
> I'm in favor of caching the extent map so it only has to be given on
> the first open (or lookup).
Yeh, sorry, that was a bit confusing.
The statement is that, because the simplest plan, as Miklos
suggested, is to pass the extent list in the reply to open,
two different opens of the same inode are not allowed to
pass in different extent lists.
The new iomode.c code does something similar.
Currently fuse_inode has a reference to fuse_backing, which
stores the backing file (that can be the dax device), and it also
has a reference to fuse_inode_dax with an rbtree of fuse_dax_mapping.
Can we reuse fuse_inode_dax for the needs of famfs?
The first open would cache the extent list in fuse_inode and
a second open would verify that the extent list matches.
The last file close could either clear the cached extent list or
keep it - that is an API decision.
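To make the cache-and-verify idea concrete, here is a rough userspace
sketch in C. It is not kernel code and not a proposed API: demo_extent,
demo_inode, demo_open() and demo_release() are all made-up names, the
extent triple just follows the {backing_id, offset, length} format
suggested above, and the -EIO return for a mismatch is an assumption.
The drop_on_last_close flag stands in for the open API decision.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical extent triple: {backing_id, offset, length}. */
struct demo_extent {
	int32_t  backing_id;	/* id of a registered backing device */
	uint64_t offset;	/* byte offset into that device */
	uint64_t length;	/* extent length in bytes */
};

/* Hypothetical stand-in for the per-inode state (not struct fuse_inode). */
struct demo_inode {
	struct demo_extent *extents;	/* NULL until an open caches a list */
	size_t nr_extents;
	unsigned int open_count;
	int drop_on_last_close;		/* policy knob: the "API decision" */
};

/* Compare field by field; memcmp() would also compare padding bytes. */
static int demo_extents_equal(const struct demo_extent *a,
			      const struct demo_extent *b, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (a[i].backing_id != b[i].backing_id ||
		    a[i].offset != b[i].offset ||
		    a[i].length != b[i].length)
			return 0;
	}
	return 1;
}

/* First open caches the list; later opens must pass an identical one. */
int demo_open(struct demo_inode *inode,
	      const struct demo_extent *ext, size_t n)
{
	if (!inode->extents) {
		struct demo_extent *copy;

		if (n == 0)
			return -EINVAL;
		copy = malloc(n * sizeof(*copy));
		if (!copy)
			return -ENOMEM;
		memcpy(copy, ext, n * sizeof(*copy));
		inode->extents = copy;
		inode->nr_extents = n;
	} else if (n != inode->nr_extents ||
		   !demo_extents_equal(inode->extents, ext, n)) {
		return -EIO;	/* assumed error code for a mismatch */
	}
	inode->open_count++;
	return 0;
}

/* Last close optionally drops the cached list. */
void demo_release(struct demo_inode *inode)
{
	if (inode->open_count == 0)
		return;
	if (--inode->open_count == 0 && inode->drop_on_last_close) {
		free(inode->extents);
		inode->extents = NULL;
		inode->nr_extents = 0;
	}
}
```

With drop_on_last_close set, a later open may install a different
extent list once all files are closed; with it clear, the first list
sticks for as long as the demo_inode lives.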
Thanks,
Amir.