Ceph’s RBD (RADOS Block Device) layer presents thin-provisioned, snapshot-capable block images backed by the same RADOS object store that sits underneath everything else in a Ceph cluster. RBD is what you wire to a hypervisor (libvirt + QEMU consume it natively), a Kubernetes CSI driver, or a bare-metal Linux host that needs a network block device with replicated storage. This article walks through pool creation, image provisioning, the two map paths (kernel rbd map vs. userspace rbd-nbd), and how we structure snapshots and clones for the customer environments we run on top.
How to verify
# Pool, image, and pool stats
sudo ceph osd pool ls detail | grep rbd-data
sudo rbd pool stats rbd-data
sudo rbd ls --pool rbd-data --long
sudo rbd info rbd-data/vm-disk-01
# Map status on the client host
sudo rbd showmapped
sudo rbd device list
lsblk | grep rbd
# Snapshot inventory
sudo rbd snap ls rbd-data/vm-disk-01
sudo rbd snap ls --all rbd-data/vm-disk-01
What’s happening
An RBD image is a logical volume striped across many RADOS objects (4 MiB each by default). When you provision a 100 GiB image, Ceph creates one image header object and lazily writes 4 MiB data objects as the image fills, so a freshly created image consumes nearly zero capacity until something writes to it. Striping spreads I/O across many OSDs in parallel, which is why RBD scales bandwidth with cluster size in a way that single-disk volumes can’t. Replication and durability are handled by the pool the image lives in: a replicated pool with size=3 keeps three full copies; an erasure-coded pool with k=4,m=2 stores 4 data chunks plus 2 coding chunks per object and can lose any 2 of the 6 without losing data.
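To make the numbers above concrete, here is a small back-of-envelope sketch. The function names are illustrative only (they are not part of any Ceph tool), and the arithmetic ignores metadata objects and per-object overhead.

```shell
# Back-of-envelope capacity math for one RBD image.
# Function names are hypothetical helpers, not Ceph commands.

# Number of RADOS data objects backing a fully written image.
# $1 = image size in MiB, $2 = object size in MiB (defaults to 4)
rbd_object_count() {
  local size_mib="$1" obj_mib="${2:-4}"
  echo $(( (size_mib + obj_mib - 1) / obj_mib ))   # round up
}

# Raw capacity (MiB) used in a replicated pool once $1 MiB are written.
# $2 = pool size (replica count)
raw_mib_replicated() {
  echo $(( $1 * $2 ))
}

# Raw capacity (MiB) used in an erasure-coded pool once $1 MiB are written.
# $2 = k (data chunks), $3 = m (coding chunks); overhead factor is (k+m)/k.
raw_mib_erasure() {
  local written="$1" k="$2" m="$3"
  echo $(( written * (k + m) / k ))
}

rbd_object_count 102400        # 100 GiB image, 4 MiB objects -> 25600
raw_mib_replicated 102400 3    # size=3 -> 307200 MiB raw
raw_mib_erasure 102400 4 2     # k=4,m=2 -> 153600 MiB raw
```

Under these assumptions, a fully written 100 GiB image costs 300 GiB raw in the size=3 pool and 150 GiB raw in the k=4,m=2 pool.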
The map path matters in practice. The rbd kernel module gives you a block device (/dev/rbd0) that the OS sees like any other disk, with the lowest latency. The downside is feature compatibility: older kernels don’t understand newer image features like object-map or fast-diff, so you have to leave them off, either at image-create time with --image-feature or afterwards with rbd feature disable. The rbd-nbd userspace path supports all features at the cost of a userspace round trip on every I/O; we use it for clients running older kernels or where a feature is non-negotiable.
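The choice between the two map paths can be sketched as a small helper that compares an image's feature list against an allow-list. KERNEL_SAFE below is an assumption for illustration; the real set depends on the client's kernel version, and the function names are hypothetical.

```shell
# Report image features outside a "kernel-safe" allow-list and pick a map path.
# KERNEL_SAFE is an assumed allow-list; adjust it to what your kernel supports.
KERNEL_SAFE="${KERNEL_SAFE:-layering exclusive-lock}"

# $1 = comma-separated feature list, as shown on the "features:" line
#      of `rbd info` (e.g. "layering, exclusive-lock, object-map")
unsupported_features() {
  local feat out=""
  # Turn commas into spaces, then rely on word splitting to break the list up.
  for feat in $(echo "$1" | tr ',' ' '); do
    case " $KERNEL_SAFE " in
      *" $feat "*) ;;                   # allowed, skip
      *) out="${out:+$out }$feat" ;;    # collect everything else
    esac
  done
  echo "$out"
}

# Kernel client if every feature is allowed, rbd-nbd otherwise.
choose_map_path() {
  if [ -z "$(unsupported_features "$1")" ]; then
    echo "krbd"
  else
    echo "rbd-nbd"
  fi
}

choose_map_path "layering, exclusive-lock"
choose_map_path "layering, exclusive-lock, object-map, fast-diff, deep-flatten"
```

The first call prints krbd and the second prints rbd-nbd. On a real client the feature list would come from the features line of rbd info for the image in question.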
Snapshots are copy-on-write within the same pool. A snapshot is a point-in-time reference to the image’s RADOS objects; the first write to an object after the snapshot preserves the snapshotted version instead of overwriting it. This makes snapshots near-instant and storage-cheap until the live image diverges far enough that most objects have new versions. Clones go a step further: a clone is a writable child image layered on a snapshot, useful for stamping a “golden image” template out to many VMs (rbd clone). A clone keeps reading unmodified data from its parent, so run rbd flatten after creating it if you want the clone to be independent of the parent for long-term ops.
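The golden-image flow described above can be sketched as one function. The image and snapshot names are hypothetical, and the helper only prints the commands unless DRY_RUN=0 is set, so the sequence can be reviewed before it touches a cluster.

```shell
# Stamp a clone out of a golden image, optionally detaching it from the parent.
# Names are hypothetical. Commands are printed, not executed, unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = parent image spec (pool/image), $2 = snapshot name,
# $3 = clone image spec, $4 = "flatten" to copy parent data into the clone
clone_from_golden() {
  local parent="$1" snap="$2" clone="$3" mode="${4:-}"
  run rbd snap create "$parent@$snap"
  run rbd snap protect "$parent@$snap"   # protect before cloning
  run rbd clone "$parent@$snap" "$clone"
  if [ "$mode" = "flatten" ]; then
    run rbd flatten "$clone"             # clone stops depending on the parent
  fi
  run rbd children "$parent@$snap"       # list clones still tied to the snapshot
}

clone_from_golden rbd-data/golden base rbd-data/vm-disk-02 flatten
```

Leaving out the fourth argument keeps the clone thin and dependent on the parent snapshot, which is cheaper up front but ties the two images together.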
The procedure
- On the admin node, create a replicated RBD pool with a reasonable PG count (pg_num ≈ (OSDs × 100) / pool_size):
    sudo ceph osd pool create rbd-data 128 128 replicated
    sudo ceph osd pool application enable rbd-data rbd
    sudo rbd pool init rbd-data
- Create the first image (100 GiB, lazily allocated):
    sudo rbd create --pool rbd-data --size 102400 vm-disk-01
    sudo rbd info rbd-data/vm-disk-01
- Generate a client keyring with restricted access to this pool:
    sudo ceph auth get-or-create client.app01 \
      mon 'profile rbd' \
      osd 'profile rbd pool=rbd-data' \
      -o /etc/ceph/ceph.client.app01.keyring
    sudo scp /etc/ceph/ceph.client.app01.keyring root@app01:/etc/ceph/
    sudo scp /etc/ceph/ceph.conf root@app01:/etc/ceph/
- On the client host app01, install the RBD tools and map:
    sudo apt install -y ceph-common rbd-nbd
    sudo rbd map --pool rbd-data vm-disk-01 --id app01 --keyring /etc/ceph/ceph.client.app01.keyring
    lsblk | grep rbd
- Format and mount (the device path is /dev/rbd0 for the first map):
    sudo mkfs.xfs /dev/rbd0
    sudo mkdir -p /mnt/app-data
    sudo mount /dev/rbd0 /mnt/app-data
  For persistent mapping on boot, use /etc/ceph/rbdmap:
    rbd-data/vm-disk-01 id=app01,keyring=/etc/ceph/ceph.client.app01.keyring
  And in /etc/fstab:
    /dev/rbd/rbd-data/vm-disk-01 /mnt/app-data xfs _netdev,noauto,x-systemd.automount 0 0
- Take a snapshot before a risky change:
    sync
    sudo fsfreeze --freeze /mnt/app-data
    sudo rbd snap create rbd-data/vm-disk-01@before-upgrade
    sudo fsfreeze --unfreeze /mnt/app-data
    sudo rbd snap ls rbd-data/vm-disk-01
  fsfreeze ensures the snapshot is filesystem-consistent; without it, you get a crash-consistent snapshot that an XFS journal can still recover, but possibly with lost in-flight writes.
- Rollback or clone:
    # rollback (destructive: image goes back to snapshot state)
    sudo umount /mnt/app-data
    sudo rbd unmap /dev/rbd0
    sudo rbd snap rollback rbd-data/vm-disk-01@before-upgrade
    # clone to a new image
    sudo rbd snap protect rbd-data/vm-disk-01@before-upgrade
    sudo rbd clone rbd-data/vm-disk-01@before-upgrade rbd-data/vm-disk-01-test
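The freeze, snapshot, unfreeze sequence in the procedure leaves the filesystem frozen if the snapshot command fails partway. A minimal sketch of a wrapper that always thaws the mount is shown below; the function name is hypothetical, it assumes it is run as root, and it only prints the commands unless DRY_RUN=0 is set.

```shell
# Take a filesystem-consistent snapshot and always thaw the mount afterwards,
# even if `rbd snap create` fails. Commands are printed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = mount point, $2 = snapshot spec (pool/image@snap)
# The body runs in a subshell so the EXIT trap fires when the function ends.
frozen_snapshot() (
  mnt="$1"
  snapspec="$2"
  run sync
  run fsfreeze --freeze "$mnt" || exit 1
  # From here on, thaw the filesystem no matter how the subshell exits.
  trap 'run fsfreeze --unfreeze "$mnt"' EXIT
  run rbd snap create "$snapspec"
)

frozen_snapshot /mnt/app-data rbd-data/vm-disk-01@before-upgrade
```

Because the trap is registered only after the freeze succeeds, a failed freeze does not trigger a stray unfreeze, and the function's exit status still reflects whether the snapshot was created.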
Common pitfalls
- Image features mismatch between create-time and kernel client: if rbd map returns “feature set mismatch,” recreate the image with --image-feature layering,exclusive-lock and re-test, or use rbd-nbd to bypass kernel feature checks.
- pg_num set too low silently caps cluster bandwidth: under 64 for a serious pool means objects hash into so few placement groups that data and I/O concentrate on a handful of OSDs. Use the Ceph PG calculator before pool creation.
- rbdmap only maps images at boot and unmaps them at shutdown through its systemd unit; without it, mount-on-boot will fail after the next reboot. Enable rbdmap.service.
- Cloned images that haven’t been flattened pin the parent snapshot: you can’t unprotect or delete the snapshot until every clone is flattened (rbd flatten) or removed.
- Discard support: ext4/XFS over RBD won’t reclaim space unless discards reach the image, so mount with the discard option (or run fstrim periodically), ideally on an image with the object-map feature. Otherwise the pool grows monotonically.
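As a quick sanity check on pg_num before reaching for the PG calculator, the rule of thumb from the procedure can be sketched as follows. Rounding to the nearest power of two is an assumption here (some guidance rounds up instead), the function name is hypothetical, and the sketch ignores how data is shared between multiple pools.

```shell
# Rule-of-thumb pg_num: (OSDs x 100) / pool_size, rounded to the nearest
# power of two. Not a replacement for the PG calculator, which also weights
# each pool's share of the cluster's data.

# $1 = number of OSDs, $2 = pool size (replica count)
suggest_pg_num() {
  local osds="$1" size="$2"
  local target=$(( osds * 100 / size ))
  local pg=1
  # Find the largest power of two that does not exceed the target...
  while [ $(( pg * 2 )) -le "$target" ]; do
    pg=$(( pg * 2 ))
  done
  # ...then step up if the next power of two is closer to the target.
  if [ $(( target - pg )) -gt $(( pg * 2 - target )) ]; then
    pg=$(( pg * 2 ))
  fi
  echo "$pg"
}

suggest_pg_num 4 3     # target 133 -> 128
suggest_pg_num 12 3    # target 400 -> 512
```

A result below 64 is a hint that the pool falls into the low-pg_num pitfall described above.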
In the engagements we run, RBD is the block storage layer for clustered environments and KVM hypervisor fleets — libvirt consumes RBD URIs natively (rbd:rbd-data/vm-disk-01:id=app01) and our snapshot tooling integrates with the per-tenant retention model. We size pools to expected workload, set PG counts using the official calculator (not defaults), and monitor RBD-specific metrics like image fragmentation and clone depth in Prometheus alongside the OSD-level signals the rest of the platform watches.