CephFS gives you POSIX semantics (directories, file locks, ownership) on top of a Ceph cluster without running NFS. Clients talk directly to the cluster’s metadata daemons (MDS) and OSDs, so you avoid the single-server bottleneck of NFS, and with the Ceph kernel client you also skip the user-space FUSE overhead. This article walks through creating the filesystem on an existing Ceph cluster, the MDS placement decisions that determine throughput, the kernel vs FUSE client tradeoffs, and how we apply quotas and snapshots in production.
How to verify
# Filesystem and pool listing
sudo ceph fs ls
sudo ceph fs status
sudo ceph fs dump | head -30
sudo ceph mds stat
# MDS daemon health and placement
sudo ceph orch ps --daemon-type mds
sudo ceph mds metadata
# Client-side check
mount | grep ceph
# Subvolume listing (run on the admin node, not the client)
sudo ceph fs subvolume ls cephfs
What’s happening
A CephFS filesystem requires three things on the cluster: a data pool (where file contents are striped as RADOS objects), a metadata pool (where directory entries, file modes, and ACLs live), and at least one MDS daemon (Metadata Server, which serves metadata operations to clients). Data and metadata pools are separate because their access patterns differ: metadata is small, latency-sensitive, and benefits from SSD-backed OSDs, while data is large, throughput-bound, and can live on cheaper spinning disks. Many production deployments use a CRUSH rule that pins the metadata pool to NVMe OSDs.
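As a rough sketch of the NVMe pinning described above, a device-class CRUSH rule can be created and then assigned to the metadata pool. The helper below only prints the commands so they can be reviewed first. The rule name (meta_<class>) is a hypothetical chosen for this example, and the "default" CRUSH root and "host" failure domain are assumptions about the cluster's CRUSH map.

```shell
#!/bin/sh
# Sketch: print (not run) the commands that pin a pool to one device class.
# Assumptions: CRUSH root "default", failure domain "host"; the default rule
# name "meta_<class>" is a hypothetical chosen for this example.
# Usage: pin_pool_cmds <pool> <device-class> [rule-name]
pin_pool_cmds() {
  pool=$1
  class=$2
  rule=${3:-meta_$class}
  # A replicated rule restricted to OSDs of the given device class
  echo "ceph osd crush rule create-replicated $rule default host $class"
  # Point the pool at that rule; Ceph then migrates the pool's PGs
  echo "ceph osd pool set $pool crush_rule $rule"
}
```

Running pin_pool_cmds cephfs_metadata nvme prints the two commands for the metadata pool used in this article. It is worth checking ceph osd crush class ls first, since the rule can only select a device class that the cluster has actually detected.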
The MDS daemon is the unusual piece. By default a single active MDS handles all metadata operations for a filesystem: ls, stat, rename, lock acquisition. For modest workloads that’s fine, but in heavy multi-tenant environments the MDS becomes the bottleneck and you need multi-active MDS, which splits the metadata namespace into sub-trees, each handled by a different MDS daemon. You enable it with ceph fs set <fs> max_mds 2 (or more), and Ceph distributes sub-trees across the active MDSs (how aggressively it rebalances depends on the release and its balancer settings). The catch is that you still need standby MDS daemons for failover: for N active ranks, run at least N+1 daemons.
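Where the balancer's placement is not predictable enough, CephFS also lets you pin a directory to a specific MDS rank through the ceph.dir.pin extended attribute. The following is a minimal sketch, assuming one top-level directory per tenant; it prints the setfattr commands rather than running them, and the round-robin assignment is only one possible policy, not something the article prescribes.

```shell
#!/bin/sh
# Sketch: print setfattr commands that pin directories to MDS ranks round-robin.
# Paths containing spaces are not handled; this is for review, not automation.
# Usage: pin_round_robin <max_mds> <dir>...
pin_round_robin() {
  max_mds=$1
  shift
  i=0
  for dir in "$@"; do
    # Ranks are numbered 0 .. max_mds-1
    rank=$((i % max_mds))
    echo "setfattr -n ceph.dir.pin -v $rank $dir"
    i=$((i + 1))
  done
}
```

For example, pin_round_robin 2 /mnt/cephfs/tenant-a /mnt/cephfs/tenant-b (tenant-b is a hypothetical path) assigns the first directory to rank 0 and the second to rank 1. Setting ceph.dir.pin to -1 removes the pin and returns the directory to its parent's placement.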
Client choice: the kernel client (mount -t ceph) is in mainline Linux and is faster, but its feature support lags behind the cluster, so newer filesystem features may require a recent kernel. The FUSE client (ceph-fuse) ships with the Ceph release and is therefore feature-current, but it adds userspace overhead per syscall. For VMs and bare metal where you control the kernel, use the kernel client; for containerized workloads with a minimal kernel, fall back to FUSE.
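One way to make that choice scriptable is to compare the running kernel against a minimum version and fall back to ceph-fuse below it. This is a sketch under assumptions: the 6.5 default mirrors the Squid pairing suggested later in this article, the function names are hypothetical, and only the major.minor part of the version is compared.

```shell
#!/bin/sh
# Sketch: pick a CephFS client based on the kernel version.
# kernel_at_least <uname -r string> <major.minor>
# Runs in a subshell "( )" so the helper variables do not leak.
kernel_at_least() (
  have=${1%%-*}              # "6.8.0-45-generic" -> "6.8.0"
  want=$2
  have_major=${have%%.*}
  rest=${have#*.}
  have_minor=${rest%%.*}
  want_major=${want%%.*}
  rest=${want#*.}
  want_minor=${rest%%.*}
  # Numeric comparison, so 5.15 correctly counts as newer than 5.4
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
)

# pick_cephfs_client <uname -r string> [minimum, default 6.5]
pick_cephfs_client() {
  if kernel_at_least "$1" "${2:-6.5}"; then
    echo kernel   # then: mount -t ceph ... as in the procedure
  else
    echo fuse     # then: ceph-fuse --id app01 --client_fs cephfs /mnt/cephfs
  fi
}
```

A typical call is pick_cephfs_client "$(uname -r)". The threshold is a policy decision rather than a hard requirement, so adjust it to the features your workload actually depends on.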
The procedure
- On the admin node, create dedicated metadata and data pools. Use a replicated pool for metadata (with a modest PG count), and either a replicated or an erasure-coded pool for data, depending on the capacity vs durability tradeoff:
    sudo ceph osd pool create cephfs_metadata 32 32 replicated
    sudo ceph osd pool create cephfs_data 128 128 replicated
    sudo ceph osd pool set cephfs_metadata size 3
    sudo ceph osd pool set cephfs_data size 3
- Create the filesystem, which tags both pools for CephFS:
    sudo ceph fs new cephfs cephfs_metadata cephfs_data
    sudo ceph fs ls
- Place MDS daemons (3 total, 1 active + 2 standby to start):
    sudo ceph orch apply mds cephfs --placement="3 ceph1 ceph2 ceph3"
    sudo ceph orch ps --daemon-type mds
- Generate a client keyring scoped to the filesystem, then copy it and ceph.conf to the client:
    sudo ceph fs authorize cephfs client.app01 / rw \
      -o /etc/ceph/ceph.client.app01.keyring
    sudo scp /etc/ceph/ceph.client.app01.keyring root@app01:/etc/ceph/
    sudo scp /etc/ceph/ceph.conf root@app01:/etc/ceph/
- On the client, mount with the kernel client (Ubuntu 24.04 ships a recent-enough kernel):
    sudo apt install -y ceph-common
    sudo mkdir -p /mnt/cephfs
    sudo mount -t ceph ceph1,ceph2,ceph3:/ /mnt/cephfs \
      -o name=app01,secretfile=/etc/ceph/app01.secret,fs=cephfs
  Here /etc/ceph/app01.secret contains the raw key extracted from the keyring; create it before running the mount:
    sudo grep 'key =' /etc/ceph/ceph.client.app01.keyring | awk '{print $3}' | sudo tee /etc/ceph/app01.secret > /dev/null
    sudo chmod 600 /etc/ceph/app01.secret
- Persist the mount in /etc/fstab:
    ceph1,ceph2,ceph3:/ /mnt/cephfs ceph name=app01,secretfile=/etc/ceph/app01.secret,fs=cephfs,_netdev 0 0
- Set a quota on a subdirectory, here 50 GiB and 100,000 files (CephFS quotas are cooperative and enforced by the client, and the kernel client supports them on recent Linux):
    sudo setfattr -n ceph.quota.max_bytes -v 53687091200 /mnt/cephfs/tenant-a
    sudo setfattr -n ceph.quota.max_files -v 100000 /mnt/cephfs/tenant-a
    getfattr -n ceph.quota.max_bytes /mnt/cephfs/tenant-a
- Enable multi-active MDS once a single active daemon shows saturation:
    sudo ceph fs set cephfs max_mds 2
    sudo ceph orch apply mds cephfs --placement="4 ceph1 ceph2 ceph3 ceph4"
    sudo ceph fs status
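The byte count in the quota step is easy to get wrong by hand, so a small helper can derive it. The sketch below converts GiB to bytes and prints the two setfattr commands for a directory. The function names are hypothetical, and nothing is applied until you run the printed commands yourself on a mounted client.

```shell
#!/bin/sh
# Sketch: convert GiB to the byte count that ceph.quota.max_bytes expects.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print (not run) the quota commands for one directory.
# Usage: quota_cmds <dir> <GiB> <max_files>
quota_cmds() {
  echo "setfattr -n ceph.quota.max_bytes -v $(gib_to_bytes "$2") $1"
  echo "setfattr -n ceph.quota.max_files -v $3 $1"
}
```

For instance, quota_cmds /mnt/cephfs/tenant-a 50 100000 reproduces the values used in the procedure. Setting either attribute to 0 removes that quota.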
Common pitfalls
- Mounting while the cluster still reports MDS_ALL_DOWN (before an MDS has recovered to active) will hang. After a cluster reboot, wait for ceph fs status to show an active MDS before mounting clients.
- Quotas are client-enforced; a misbehaving client can exceed them. Pair them with monitoring (ceph daemon mds.<id> dump cache and ceph fs status) to catch over-quota tenants.
- Multi-active MDS introduces sub-tree migration (“re-export”) that briefly stalls metadata operations on affected paths; watch ceph fs status while sub-trees migrate, and don’t enable it lightly.
- Snapshots on CephFS (mkdir .snap/<name> inside any directory) are not enabled by default in older clusters; confirm with ceph fs get cephfs | grep flags (look for allow_snaps) and enable via ceph fs set cephfs allow_new_snaps true.
- Kernel client and Ceph cluster version mismatches can produce subtle bugs; pair a Squid (19.x) cluster with Linux 6.5+ for stable feature parity.
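The first pitfall can be guarded against with a short polling loop before any client mount. This is a sketch only: it assumes ceph mds stat prints the string up:active for an active rank, and MDS_STAT_CMD is a hypothetical override added here so the loop can be exercised without a cluster.

```shell
#!/bin/sh
# Sketch: block until an MDS reports active, or give up after N tries.
# Assumption: `ceph mds stat` output contains "up:active" for an active rank.

# Reads MDS status text on stdin; succeeds if any rank is active.
has_active_mds() {
  grep -q 'up:active'
}

# Usage: wait_for_active_mds [tries] [delay_seconds]
# MDS_STAT_CMD (hypothetical) replaces the status command, e.g. for testing.
wait_for_active_mds() {
  tries=${1:-30}
  delay=${2:-2}
  while [ "$tries" -gt 0 ]; do
    if ${MDS_STAT_CMD:-ceph mds stat} 2>/dev/null | has_active_mds; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```

On a client with the fstab entry in place, something like wait_for_active_mds 60 5 && sudo mount /mnt/cephfs would defer the mount until the filesystem is serviceable, and fail cleanly instead of hanging if no MDS becomes active.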
In the engagements we run, CephFS replaces NFS when the shared-storage workload has many concurrent writers and a single NFS server has become the bottleneck. We co-locate the data and metadata pools on the same Ceph cluster used for RBD and RGW, monitor MDS request latency and cache size in Prometheus, and stage multi-active MDS only after the customer’s workload has demonstrated single-MDS saturation — over-provisioning MDS daemons too early just shifts complexity without solving any actual problem.