Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
From: John Groves
Date: Tue May 21 2024 - 22:05:37 EST
Initial reply to both Amir and Miklos. Sorry for the delay - I took a few
days off after LSFMM and I'm just re-engaging now.
First an observation: these messages are on the famfs v1 patch set thread.
The v2 patch set is at [1]. That is also the default branch now if you clone
the famfs kernel from [2].
Among the biggest changes at v2 is dropping /dev/pmem support and only
supporting /dev/dax (character) devices as backing devs for famfs.
On 24/05/19 08:59AM, Amir Goldstein wrote:
> On Fri, May 17, 2024 at 12:55 PM Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >
> > On Thu, 29 Feb 2024 at 07:52, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> >
> > > I'm not virtiofs expert, but I don't think that you are wrong about this.
> > > IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped
> > > by virtiofs client.
> > >
> > > So what are the gaps between virtiofs and famfs that justify a new filesystem
> > > driver and new userspace API?
> >
> > Let me try to fill in some gaps. I've looked at the famfs driver
> > (even tried to set it up in a VM, but got stuck with the EFI stuff).
I'm happy to help with that if you care - ping me if so; getting a VM running
in EFI mode is not necessary if you reserve the dax memory via memmap=, or
via libvirt xml.
> >
> > - famfs has an extent list per file that indicates how each page
> > within the file should be mapped onto the dax device, IOW it has the
> > following mapping:
> >
> > [famfs file, offset] -> [offset, length]
More generally, a famfs file extent is [daxdev, offset, len]; there may
be multiple extents per file, and in the future this definitely needs to
generalize to multiple daxdevs.
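To make the [daxdev, offset, len] idea concrete, here is a minimal userspace sketch of resolving a file offset through an ordered extent list. All names (sketch_extent, sketch_file, sketch_resolve) are hypothetical and chosen for illustration only; this is not the famfs in-kernel or on-media format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: a contiguous byte range on one dax device. */
struct sketch_extent {
    int      daxdev;  /* index of the backing dax device (assumed) */
    uint64_t offset;  /* byte offset within that device */
    uint64_t len;     /* extent length in bytes */
};

/* Hypothetical file: extents listed in file order. */
struct sketch_file {
    size_t nextents;
    const struct sketch_extent *ext;
};

/*
 * Resolve a file offset to (daxdev, device offset).
 * *contig receives the bytes remaining in the matching extent.
 * Returns 0 on success, -1 if file_off is past the last extent.
 */
static int sketch_resolve(const struct sketch_file *f, uint64_t file_off,
                          int *daxdev, uint64_t *dev_off, uint64_t *contig)
{
    uint64_t start = 0; /* file offset at which the current extent begins */

    assert(f && daxdev && dev_off && contig);
    for (size_t i = 0; i < f->nextents; i++) {
        const struct sketch_extent *e = &f->ext[i];

        /* file_off >= start holds here, so the subtraction cannot wrap */
        if (file_off - start < e->len) {
            *daxdev  = e->daxdev;
            *dev_off = e->offset + (file_off - start);
            *contig  = e->len - (file_off - start);
            return 0;
        }
        start += e->len;
    }
    return -1;
}
```

With two extents of 2MiB and 4MiB, file offset 0x200000 would resolve to the start of the second extent, and any offset at or beyond 0x600000 would fail.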
Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
I think)...
A single backing device (daxdev) will contain extents of many famfs
files (plus metadata - currently a superblock and a log). I'm not sure
it's realistic to have a backing daxdev "open" per famfs file.
In addition there is:
- struct dax_holder_operations - to allow a notify_failure() upcall
from dax. This provides the critical capability to shut down famfs
if there are memory errors. This is filesystem-wide (or, technically,
daxdev-wide).
- The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
I strongly suspect that famfs_fuse can't be correct unless it uses
this path rather than just the idea of a single backing file.
This interface explicitly supports files that map to disjoint ranges
of one or more dax devices.
- the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
character devdax.
- Note that dax devices, unlike files, don't support read/write - only
mmap(). I suspect (though I'm still pretty ignorant) that this means
we can't just treat the dax device as an extent-based backing file.
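To illustrate the last point from userspace: the only way to get at devdax memory is to mmap() the character device. A minimal sketch, assuming a device path such as /dev/dax0.0 and a length that is a multiple of the device's alignment (commonly 2MiB); the helper name sketch_map_dax is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first len bytes of a character dax device.
 * Returns the mapping, or NULL if open() or mmap() fails.
 * Assumption: len is a multiple of the device alignment.
 */
static void *sketch_map_dax(const char *path, size_t len)
{
    void *p;
    int fd;

    assert(path);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* devdax mappings are shared mappings of the device memory */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the fd is closed */

    return p == MAP_FAILED ? NULL : p;
}
```

A caller would do something like sketch_map_dax("/dev/dax0.0", 2UL << 20) and then use plain loads and stores on the returned pointer.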
> >
> > - fuse can currently map a fuse file onto a backing file:
> >
> > [fuse file] -> [backing file]
> >
> > The interface for the latter is
> >
> > backing_id = ioctl(dev_fuse_fd, FUSE_DEV_IOC_BACKING_OPEN, backing_map);
> > ...
> > fuse_open_out.flags |= FOPEN_PASSTHROUGH;
> > fuse_open_out.backing_id = backing_id;
>
> FYI, library and example code was recently merged to libfuse:
> https://github.com/libfuse/libfuse/pull/919
>
> >
> > This looks suitable for doing the famfs file -> dax device mapping as
> > well. I wouldn't extend the ioctl with extent information, since
> > famfs can just use FUSE_DEV_IOC_BACKING_OPEN once to register the dax
> > device. The flags field could be used to tell the kernel to treat
> > this fd as a dax device instead of a regular file.
A dax device to famfs is a lot more like a backing device for a "filesystem"
than a backing file for another file. And, as previously mentioned, there
are the iomap_ops and holder_ops interfaces, which deal with multiple file
tenants on a dax device and with error notification, respectively.
Probably doable, but these are important distinctions...
> >
> > Later, when the file is opened the extent list could be sent in the
> > open reply together with the backing id. The fuse_ext_header
> > mechanism seems suitable for this.
> >
> > And I think that's it as far as API's are concerned.
> >
> > Note: this is already more generic than the current famfs prototype,
> > since multiple dax devices could be used as backing for famfs files,
> > with the constraint that a single file can only map data from a single
> > dax device.
> >
> > As for implementing dax passthrough, I think that needs a separate
> > source file, the one used by virtiofs (fs/fuse/dax.c) does not appear
> > to have many commonalities with this one. That could be renamed to
> > virtiofs_dax.c as it's pretty much virtiofs specific, AFAICT.
> >
> > Comments?
>
> Would probably also need to decouple CONFIG_FUSE_DAX
> from CONFIG_FUSE_VIRTIO_DAX.
>
> What about fc->dax_mode (i.e. dax= mount option)?
>
> What about FUSE_IS_DAX()? does it apply to both dax implementations?
>
> Sounds like a decent plan.
> John, let us know if you need help understanding the details.
I'm certain I will need some help, but I'll try to do my part.
First question: can you suggest an example fuse pass-through file
system that I might use as a jumping-off point? Something that
provides the basic pass-through capability, from which I could start
hacking in famfs/dax capabilities?
When I started on famfs, I used ramfs because it got me all the basic
file system functionality minus a backing store. Then I built the dax
functionality by referring to xfs.
>
> > Am I missing something significant?
>
> Would we need to set IS_DAX() on inode init time or can we set it
> later on first file open?
>
> Currently, iomodes enforces that all opens are either
> mapped to same backing file or none mapped to backing file:
>
> fuse_inode_uncached_io_start()
> {
> ...
> /* deny conflicting backing files on same fuse inode */
>
> The iomodes rules will need to be amended to verify that:
> - IS_DAX() inode open is always mapped to backing dax device
> - All files of the same fuse inode are mapped to the same range
> of backing file/dax device.
I'm confused by the last item. I would think there would be a fuse
inode per famfs file, and that multiple of those would map to separate
extent lists of one or more backing dax devices.
Or maybe I misunderstand the meaning of "fuse inode". Feel free to
assign reading...
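One possible reading of that rule, sketched as a userspace model: if "files" means the open instances of a single fuse inode, then every open of that inode must agree on the mapping, while different inodes remain free to use different mappings. Everything below (sketch_inode, mapping_id, sketch_open, sketch_release) is hypothetical and is not the fs/fuse iomodes code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-inode state shared by all opens of one inode. */
struct sketch_inode {
    unsigned int open_count; /* opens currently holding this inode */
    int32_t mapping_id;      /* assumed token for device plus range; 0 = none */
};

/*
 * Admit an open only if it agrees with the opens already in place.
 * Returns 0 on success, -1 on a conflicting mapping.
 */
static int sketch_open(struct sketch_inode *ino, int32_t mapping_id)
{
    assert(ino);
    if (ino->open_count > 0 && ino->mapping_id != mapping_id)
        return -1; /* deny conflicting mapping on the same inode */

    ino->mapping_id = mapping_id;
    ino->open_count++;
    return 0;
}

/* Drop one open; the last release clears the inode's mapping. */
static void sketch_release(struct sketch_inode *ino)
{
    assert(ino && ino->open_count > 0);
    if (--ino->open_count == 0)
        ino->mapping_id = 0;
}
```

Under this reading, one fuse inode per famfs file with its own extent list would still satisfy the rule, since the constraint applies per inode rather than across inodes.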
>
> Thanks,
> Amir.
Thanks Miklos and Amir,
John
[1] https://lore.kernel.org/linux-fsdevel/cover.1714409084.git.john@xxxxxxxxxx/T/#m3b11e8d311eca80763c7d6f27d43efd1cdba628b
[2] https://github.com/cxl-micron-reskit/famfs-linux