Ceph’s RBD (RADOS Block Device) layer presents thin-provisioned, snapshot-capable block images backed by the same RADOS object store that sits underneath everything else in a Ceph cluster. RBD is what you wire to a hypervisor (libvirt + QEMU consume it natively), a Kubernetes CSI driver, or a bare-metal Linux host that needs a network block device with replicated storage. This article walks through pool creation, image provisioning, the two map paths (kernel rbd map vs userspace rbd-nbd), and how we structure snapshots and clones for the customer environments we run on top.

How to verify

# Pool, image, and pool stats
sudo ceph osd pool ls detail | grep rbd-data
sudo rbd pool stats rbd-data
sudo rbd ls --pool rbd-data --long
sudo rbd info rbd-data/vm-disk-01

# Map status on the client host
sudo rbd showmapped
sudo rbd device list
lsblk | grep rbd

# Snapshot inventory
sudo rbd snap ls rbd-data/vm-disk-01
sudo rbd snap ls --all rbd-data/vm-disk-01

What’s happening

An RBD image is a logical volume striped across many RADOS objects (4 MiB each by default). When you provision a 100 GiB image, Ceph creates the image header object and lazily writes 4 MiB objects as the image fills, so a freshly created image consumes nearly zero capacity until something writes to it. Striping spreads I/O across many OSDs in parallel, which is why RBD bandwidth scales with cluster size in a way that single-disk volumes can’t. Replication and durability are handled by the pool the image lives in: a replicated pool with size=3 keeps three copies; an erasure-coded pool with k=4,m=2 splits each object into 6 chunks (4 data, 2 coding) and survives the loss of any 2. An image that keeps its data in an erasure-coded pool still needs a replicated pool for its metadata (rbd create --data-pool), and the erasure-coded pool must have allow_ec_overwrites enabled.
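To make those numbers concrete, here is a small shell sketch of the arithmetic. The function names are illustrative helpers, not Ceph commands, and the results are upper bounds: a thin-provisioned image only reaches the full object count once every region has been written.

```shell
# Maximum number of RADOS data objects behind an image.
# Usage: image_object_count SIZE_GIB [OBJECT_SIZE_MIB]   (object size defaults to 4)
image_object_count() {
    local size_gib="$1" object_mib="${2:-4}"
    # Convert GiB to MiB, then divide by the object size, rounding up.
    echo $(( (size_gib * 1024 + object_mib - 1) / object_mib ))
}

# Raw capacity consumed per 100 units of user data in a replicated pool.
# Usage: replicated_raw_percent POOL_SIZE
replicated_raw_percent() {
    echo $(( $1 * 100 ))
}

# Raw capacity consumed per 100 units of user data in an erasure-coded pool.
# Usage: ec_raw_percent K M
ec_raw_percent() {
    echo $(( ($1 + $2) * 100 / $1 ))
}

image_object_count 100      # prints 25600
replicated_raw_percent 3    # prints 300
ec_raw_percent 4 2          # prints 150
```

So a fully written 100 GiB image is at most 25,600 objects, and it costs 300 GiB of raw capacity at size=3 versus 150 GiB at k=4,m=2.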

The map path matters in practice. The rbd kernel module gives you a block device (/dev/rbd0) that the OS sees like any other disk, with the lowest latency. The downside is feature compatibility: older kernels don’t understand newer image features like object-map or fast-diff, so you either create the image with a reduced set (--image-feature) or turn the unsupported features off afterwards with rbd feature disable. The rbd-nbd userspace path supports all features at the cost of a userspace round trip on every I/O; we use it for clients running older kernels or where a feature is non-negotiable.
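One way to decide between the two map paths is to compare an image's feature list against what the client kernel is known to handle. The sketch below is a plain set difference; unsupported_features is a hypothetical helper, and the client-side list is something you supply yourself for your kernel version (the values shown are placeholders, not an authoritative compatibility matrix).

```shell
# Print each image feature that is missing from the client's supported list.
# Usage: unsupported_features "IMAGE FEATURES" "CLIENT FEATURES"
# Both arguments are space-separated feature names.
unsupported_features() {
    local image_features="$1" client_features="$2" f
    for f in $image_features; do
        case " $client_features " in
            *" $f "*) ;;                   # client understands this feature
            *) printf '%s\n' "$f" ;;       # client does not: report it
        esac
    done
}

# Placeholder values; take the image side from `rbd info rbd-data/vm-disk-01`.
unsupported_features "layering exclusive-lock object-map fast-diff" \
                     "layering exclusive-lock"
```

If the function prints anything, either disable those features (for example, sudo rbd feature disable rbd-data/vm-disk-01 fast-diff object-map) and map with the kernel client, or leave the image alone and map it with rbd-nbd instead.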

Snapshots are copy-on-write within the same pool. A snapshot is a point-in-time reference to the image’s RADOS objects; the first write to an object after the snapshot preserves the old version instead of overwriting it. This makes snapshots near-instant and storage-cheap until the live image diverges far enough that most objects have new versions. Clones go a step further: a clone is a writable child image layered on a protected snapshot, useful for stamping out many VMs from a “golden image” template (rbd clone). A clone reads unmodified data from its parent, so run rbd flatten if you want it to be independent of the parent for long-term operations.
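The golden-image pattern can be sketched as a short loop. This is a minimal sketch under stated assumptions: clone_fleet is a hypothetical helper, rbd-data/golden@base is a made-up template snapshot that has already been protected with rbd snap protect, and the caller has credentials for the pool (run it as root or pass --id/--keyring as elsewhere in this article).

```shell
# Create COUNT clones of an already-protected snapshot.
# Usage: clone_fleet PARENT_SNAP_SPEC CHILD_PREFIX COUNT
clone_fleet() {
    local parent_snap="$1" prefix="$2" count="$3" i=1
    while [ "$i" -le "$count" ]; do
        # Children are named <prefix>-01, <prefix>-02, ...
        rbd clone "$parent_snap" "$(printf '%s-%02d' "$prefix" "$i")" || return 1
        i=$(( i + 1 ))
    done
}

# Example (hypothetical names): three VM disks from one template snapshot.
# clone_fleet rbd-data/golden@base rbd-data/vm 3
```

Each clone starts out sharing all of its data with the template, so the fleet costs almost nothing until the VMs begin writing.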

The procedure

On the admin node, create a replicated RBD pool with a reasonable PG count (pg_num ≈ (OSDs × 100) / pool_size, rounded to a power of two):

sudo ceph osd pool create rbd-data 128 128 replicated
sudo ceph osd pool application enable rbd-data rbd
sudo rbd pool init rbd-data
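The pg_num rule of thumb used above can be worked through in a few lines of shell. This is a sketch of the heuristic only: suggest_pg_num is a hypothetical helper, rounding to the nearest power of two is an assumption about how to apply the formula, and the result treats the pool as the only significant pool on those OSDs. The Ceph PG calculator remains the better source for real sizing.

```shell
# Suggest a pg_num from the (OSDs * 100) / pool_size heuristic,
# rounded to the nearest power of two (ties go to the lower value).
# Usage: suggest_pg_num OSD_COUNT POOL_SIZE
suggest_pg_num() {
    local osds="$1" pool_size="$2" target lower=1 upper
    target=$(( osds * 100 / pool_size ))
    # Find the largest power of two that does not exceed the target.
    while [ $(( lower * 2 )) -le "$target" ]; do
        lower=$(( lower * 2 ))
    done
    upper=$(( lower * 2 ))
    if [ $(( target - lower )) -le $(( upper - target )) ]; then
        echo "$lower"
    else
        echo "$upper"
    fi
}

suggest_pg_num 4 3     # prints 128
suggest_pg_num 12 3    # prints 512
```

With 4 OSDs and size=3 the raw figure is 133, which rounds to the 128 used in the pool create command above.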

Create the first image (100 GiB, lazily allocated; --size is in MiB by default):

sudo rbd create --pool rbd-data --size 102400 vm-disk-01
sudo rbd info rbd-data/vm-disk-01

Generate a client keyring with restricted access to this pool:

sudo ceph auth get-or-create client.app01 \
  mon 'profile rbd' \
  osd 'profile rbd pool=rbd-data' \
  -o /etc/ceph/ceph.client.app01.keyring
sudo scp /etc/ceph/ceph.client.app01.keyring root@app01:/etc/ceph/
sudo scp /etc/ceph/ceph.conf root@app01:/etc/ceph/

On the client host app01, install the RBD tools and map:

sudo apt install -y ceph-common rbd-nbd
sudo rbd map --pool rbd-data vm-disk-01 --id app01 --keyring /etc/ceph/ceph.client.app01.keyring
lsblk | grep rbd

Format and mount (the device path is /dev/rbd0 for the first map):

sudo mkfs.xfs /dev/rbd0
sudo mkdir -p /mnt/app-data
sudo mount /dev/rbd0 /mnt/app-data

For persistent mapping on boot, use /etc/ceph/rbdmap:

rbd-data/vm-disk-01    id=app01,keyring=/etc/ceph/ceph.client.app01.keyring

And in /etc/fstab:

/dev/rbd/rbd-data/vm-disk-01 /mnt/app-data xfs _netdev,noauto,x-systemd.automount 0 0
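The rbdmap entry and the fstab device path have to agree, and a typo in either shows up only at the next reboot. The sketch below derives the /dev/rbd/<pool>/<image> path for every entry in an rbdmap file so you can compare it with fstab. rbdmap_device_paths is a hypothetical helper, and it assumes each entry is written as pool/image, as in the example above.

```shell
# Print the udev-style device path for each entry in an rbdmap file.
# Usage: rbdmap_device_paths [FILE]   (defaults to /etc/ceph/rbdmap)
rbdmap_device_paths() {
    local file="${1:-/etc/ceph/rbdmap}" spec _opts
    while read -r spec _opts; do
        case "$spec" in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        printf '/dev/rbd/%s\n' "$spec"
    done < "$file"
}

# Example: flag rbdmap entries that have no matching fstab line.
# for dev in $(rbdmap_device_paths); do
#     grep -q "^$dev[[:space:]]" /etc/fstab || echo "no fstab entry for $dev"
# done
```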

Take a snapshot before a risky change:
```
sync
sudo fsfreeze --freeze /mnt/app-data
sudo rbd snap create rbd-data/vm-disk-01@before-upgrade
sudo fsfreeze --unfreeze /mnt/app-data
sudo rbd snap ls rbd-data/vm-disk-01
```
fsfreeze ensures the snapshot is filesystem-consistent — without it, you get a crash-consistent snapshot that an XFS journal can still recover, but possibly with lost in-flight writes.
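A frozen filesystem blocks every writer until it is thawed, so it is worth making sure the unfreeze runs even when the snapshot command fails. The following is a minimal sketch of that idea, not part of Ceph: snapshot_frozen is a hypothetical helper and it assumes it is run as root, so it calls fsfreeze and rbd directly rather than through sudo.

```shell
# Freeze a filesystem, snapshot the image underneath it, and always thaw.
# Usage: snapshot_frozen MOUNTPOINT IMAGE_SNAP_SPEC
snapshot_frozen() {
    local mountpoint="$1" snap_spec="$2" rc=0
    fsfreeze --freeze "$mountpoint" || return 1
    # Record the snapshot's exit status instead of returning early,
    # so the thaw below runs on both success and failure.
    rbd snap create "$snap_spec" || rc=$?
    fsfreeze --unfreeze "$mountpoint" || return 1
    return "$rc"
}

# Example, using the names from this article:
# snapshot_frozen /mnt/app-data rbd-data/vm-disk-01@before-upgrade
```

The function returns the snapshot command's exit status, so a caller can still tell whether the snapshot was actually taken.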

Rollback or clone:

# rollback (destructive — image goes back to snapshot state)
sudo umount /mnt/app-data
sudo rbd unmap /dev/rbd0
sudo rbd snap rollback rbd-data/vm-disk-01@before-upgrade

# clone to a new image
sudo rbd snap protect rbd-data/vm-disk-01@before-upgrade
sudo rbd clone rbd-data/vm-disk-01@before-upgrade rbd-data/vm-disk-01-test
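Once a snapshot has clones, it stays protected until they no longer depend on it. The sketch below shows one way to release it: flatten every child, then unprotect and remove the snapshot. release_snapshot is a hypothetical helper, it assumes image names contain no whitespace, and flattening copies all parent data into each clone, so expect it to take time and consume capacity on large images.

```shell
# Flatten all clones of a snapshot, then unprotect and delete the snapshot.
# Usage: release_snapshot IMAGE_SNAP_SPEC
release_snapshot() {
    local snap_spec="$1" children child
    # `rbd children` prints one pool/image per line for each clone.
    children=$(rbd children "$snap_spec") || return 1
    for child in $children; do
        rbd flatten "$child" || return 1
    done
    rbd snap unprotect "$snap_spec" || return 1
    rbd snap rm "$snap_spec"
}

# Example, using the names from this article:
# release_snapshot rbd-data/vm-disk-01@before-upgrade
```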

Common pitfalls

Image features mismatch between create-time and kernel client: if rbd map returns “feature set mismatch,” disable the unsupported features with rbd feature disable, or recreate the image with --image-feature layering,exclusive-lock, and re-test. Alternatively, map with rbd-nbd, which doesn’t depend on kernel feature support.
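pg_num set too low silently caps cluster bandwidth: with too few placement groups, objects land unevenly on a handful of OSDs, so a few disks run hot while the rest sit idle. Treat anything under 64 as suspect for a serious pool, and use the Ceph PG calculator before pool creation.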
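The /etc/ceph/rbdmap file does nothing on its own: rbdmap.service is what maps the listed images at boot and unmaps them at shutdown. Without it the device never appears after a reboot and the mount fails, and an unclean shutdown can leave the image still claimed by the old client. Enable rbdmap.service.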
Cloned images that haven’t been flattened pin the parent snapshot: you can’t unprotect or delete the snapshot until every clone is flattened (rbd flatten) or removed.
Discard support: ext4/XFS over RBD won’t return freed blocks to the pool unless the filesystem issues discards, either through the discard mount option or a periodic fstrim. Otherwise the pool’s usage only ever grows, even as files are deleted.
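For the discard pitfall, a periodic trim is often gentler than mounting with discard. The sketch below runs fstrim on every mounted filesystem whose device is a kernel-mapped RBD image. trim_rbd_mounts is a hypothetical helper, it assumes it runs as root, and it only matches /dev/rbd* devices (rbd-nbd mappings appear as /dev/nbd* and would need their own rule).

```shell
# Run fstrim on each mounted filesystem backed by a /dev/rbd* device.
# Usage: trim_rbd_mounts [MOUNTS_FILE]   (defaults to /proc/mounts)
trim_rbd_mounts() {
    local mounts_file="${1:-/proc/mounts}" dev mnt _rest
    # /proc/mounts columns: device, mountpoint, fstype, options, dump, pass
    while read -r dev mnt _rest; do
        case "$dev" in
            /dev/rbd*) fstrim -v "$mnt" || return 1 ;;
        esac
    done < "$mounts_file"
}

# Example: call it from a weekly cron job or systemd timer.
# trim_rbd_mounts
```

On systemd hosts, enabling the stock fstrim.timer achieves much the same result for every filesystem that supports discard.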

In the engagements we run, RBD is the block storage layer for clustered environments and KVM hypervisor fleets — libvirt consumes RBD URIs natively (rbd:rbd-data/vm-disk-01:id=app01) and our snapshot tooling integrates with the per-tenant retention model. We size pools to expected workload, set PG counts using the official calculator (not defaults), and monitor RBD-specific metrics like image fragmentation and clone depth in Prometheus alongside the OSD-level signals the rest of the platform watches.