Ceph RBD block devices: provisioning a pool, mapping a volume, snapshotting

Provision a Ceph RBD pool, create and map block images on a Linux client, take snapshots, and apply the operational practices we use for RBD-backed VM and container storage.

Ceph’s RBD (RADOS Block Device) layer presents thin-provisioned, snapshot-capable block images backed by the same RADOS object store that underlies everything else in a Ceph cluster. RBD is what you wire to a hypervisor (libvirt + QEMU consume it natively), a Kubernetes CSI driver, or a bare-metal Linux host that needs a replicated network block device. This article walks through pool creation, image provisioning, the two map paths (kernel rbd map vs. userspace rbd-nbd), and how we structure snapshots and clones for the customer environments we run on top.

How to verify

# Pool, image, and pool stats
sudo ceph osd pool ls detail | grep rbd-data
sudo rbd pool stats rbd-data
sudo rbd ls --pool rbd-data --long
sudo rbd info rbd-data/vm-disk-01

# Map status on the client host
sudo rbd showmapped
sudo rbd device list
lsblk | grep rbd

# Snapshot inventory
sudo rbd snap ls rbd-data/vm-disk-01
sudo rbd snap ls --all rbd-data/vm-disk-01

What’s happening

An RBD image is a logical volume striped across many RADOS objects (4 MiB each by default). When you provision a 100 GiB image, Ceph creates only a small amount of image metadata up front and writes the 4 MiB data objects lazily as the image fills, so a freshly created image consumes nearly zero capacity until something writes to it. Striping spreads I/O across many OSDs in parallel, which is why RBD scales bandwidth with cluster size in a way that single-disk volumes can’t. Replication and durability are handled by the pool the image lives in: a replicated pool with size=3 keeps three copies; an erasure-coded pool with k=4,m=2 splits each object into 4 data chunks plus 2 coding chunks and survives the loss of any 2. Note that an image whose data lives in an erasure-coded pool still needs a replicated pool for its metadata (the image is created with --data-pool) and overwrites enabled on the erasure-coded pool.
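To make the sizing arithmetic concrete, here is a small sketch in plain shell arithmetic. It assumes the default 4 MiB object size, an image size that is a multiple of the object size, and it ignores metadata objects and per-object overhead. The function names are illustrative and not part of any Ceph tool.

```shell
# Illustrative helpers only; these names are hypothetical, not Ceph commands.

# Number of data objects a fully written image occupies.
# $1 = image size in GiB, $2 = object size in MiB (default 4)
rbd_object_count() {
  echo $(( $1 * 1024 / ${2:-4} ))
}

# Raw capacity (GiB) consumed by a fully written image in a replicated pool.
# $1 = image size in GiB, $2 = pool size (replica count)
raw_gib_replicated() {
  echo $(( $1 * $2 ))
}

# Raw capacity (GiB) consumed in an erasure-coded pool: data * (k + m) / k.
# $1 = image size in GiB, $2 = k, $3 = m
raw_gib_erasure() {
  echo $(( $1 * ($2 + $3) / $2 ))
}

rbd_object_count 100        # prints 25600
raw_gib_replicated 100 3    # prints 300
raw_gib_erasure 100 4 2     # prints 150
```

The same 100 GiB of data costs 300 GiB raw at size=3 but 150 GiB raw at k=4,m=2, which is the usual reason to consider erasure coding for capacity-heavy workloads.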

The map path matters in practice. The rbd kernel module gives you a block device (/dev/rbd0) that the OS sees like any other disk, with the lowest latency. The downside is feature compatibility: older kernels don’t understand newer image features like object-map or fast-diff, so you either create the image with a restricted feature set (--image-feature) or turn the offending features off afterwards with rbd feature disable. The rbd-nbd userspace path goes through librbd and supports all features, at the cost of a userspace round trip on every I/O; we use it for clients running older kernels or where a feature is non-negotiable.
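As a rough illustration of the feature-compatibility check, the sketch below compares an image's feature list against the features a client kernel is assumed to support and prints the rbd feature disable command you would review before running. The function name and the "supported" list are assumptions for the example; which features a given kernel actually supports has to be checked for that kernel.

```shell
# Print the image features that are NOT in the supported list.
# $1 = image features, comma- or space-separated (for example, copied from
#      the "features:" line of `rbd info`)
# $2 = features the client kernel is assumed to support, space-separated
unsupported_features() {
  out=""
  for feat in $(printf '%s' "$1" | tr ',' ' '); do
    case " $2 " in
      *" $feat "*) ;;                 # supported: leave it enabled
      *) out="${out:+$out }$feat" ;;  # unsupported: candidate to disable
    esac
  done
  printf '%s\n' "$out"
}

image_features="layering, exclusive-lock, object-map, fast-diff, deep-flatten"
kernel_features="layering exclusive-lock"   # assumption for an older kernel

to_disable=$(unsupported_features "$image_features" "$kernel_features")
# Only print the command; run it by hand once the list looks right.
echo "rbd feature disable rbd-data/vm-disk-01 $to_disable"
```

If disabling features is not acceptable for the workload, mapping through rbd-nbd avoids the kernel feature check altogether.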

Snapshots are copy-on-write within the same pool. A snapshot is a point-in-time, read-only reference to the image’s RADOS objects; the first write to an object after the snapshot preserves the old version rather than overwriting it. This makes snapshots near-instant and storage-cheap until the live image diverges far enough that most objects have new versions. Clones go a step further: a clone is a writable child image layered on a protected snapshot, useful for stamping out many VMs from a “golden image” template (rbd clone). A clone keeps reading unmodified data from its parent, so run rbd flatten on it if it needs to be independent of the parent in the long term.
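A minimal sketch of the golden-image pattern follows. It is a dry run by default: the run helper prints each rbd command instead of executing it, so the sequence can be reviewed first. The image and snapshot names (golden-image, base, vm-disk-02, vm-disk-03) and both function names are hypothetical.

```shell
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to run against a real cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Clone one snapshot of a template image into several child images.
# $1 = pool, $2 = template image, $3 = snapshot name, remaining args = clone names
clone_from_golden() {
  pool=$1 template=$2 snap=$3
  shift 3
  run rbd snap create "$pool/$template@$snap"
  run rbd snap protect "$pool/$template@$snap"
  for child in "$@"; do
    run rbd clone "$pool/$template@$snap" "$pool/$child"
  done
}

# Detach a clone from its parent once it needs to live independently.
# $1 = pool, $2 = clone name
detach_clone() {
  run rbd flatten "$1/$2"
}

clone_from_golden rbd-data golden-image base vm-disk-02 vm-disk-03
detach_clone rbd-data vm-disk-02
```

Flattening copies all parent data into the clone, so it trades the space savings of layering for independence from the parent snapshot.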

The procedure

  1. On the admin node, create a replicated RBD pool with a reasonable PG count (rule of thumb: pg_num ≈ (OSDs × 100) / pool_size, rounded to a power of two, as a budget shared by all pools on those OSDs):

    sudo ceph osd pool create rbd-data 128 128 replicated
    sudo ceph osd pool application enable rbd-data rbd
    sudo rbd pool init rbd-data
  2. Create the first image (100 GiB, thin-provisioned; --size is in MiB by default, so 102400 = 100 GiB):

    sudo rbd create --pool rbd-data --size 102400 vm-disk-01
    sudo rbd info rbd-data/vm-disk-01
  3. Generate a client keyring with restricted access to this pool:

    sudo ceph auth get-or-create client.app01 \
      mon 'profile rbd' \
      osd 'profile rbd pool=rbd-data' \
      -o /etc/ceph/ceph.client.app01.keyring
    sudo scp /etc/ceph/ceph.client.app01.keyring root@app01:/etc/ceph/
    sudo scp /etc/ceph/ceph.conf root@app01:/etc/ceph/
  4. On the client host app01, install the RBD tools and map:

    sudo apt install -y ceph-common rbd-nbd
    sudo rbd map --pool rbd-data vm-disk-01 --id app01 --keyring /etc/ceph/ceph.client.app01.keyring
    lsblk | grep rbd
  5. Format and mount (the device path is /dev/rbd0 for the first map):

    sudo mkfs.xfs /dev/rbd0
    sudo mkdir -p /mnt/app-data
    sudo mount /dev/rbd0 /mnt/app-data

    For persistent mapping on boot, use /etc/ceph/rbdmap:

    rbd-data/vm-disk-01    id=app01,keyring=/etc/ceph/ceph.client.app01.keyring

    And in /etc/fstab:

    /dev/rbd/rbd-data/vm-disk-01 /mnt/app-data xfs _netdev,noauto,x-systemd.automount 0 0
  6. Take a snapshot before a risky change:

    sync
    sudo fsfreeze --freeze /mnt/app-data
    sudo rbd snap create rbd-data/vm-disk-01@before-upgrade
    sudo fsfreeze --unfreeze /mnt/app-data
    sudo rbd snap ls rbd-data/vm-disk-01

    fsfreeze flushes dirty data and blocks new writes, so the snapshot is filesystem-consistent. Without it you get a crash-consistent snapshot: XFS journal replay will recover it on mount, but writes still in flight at snapshot time may be lost.

  7. Rollback or clone:

    # rollback (destructive: the image reverts to the snapshot and later writes are lost)
    sudo umount /mnt/app-data
    sudo rbd unmap /dev/rbd0
    sudo rbd snap rollback rbd-data/vm-disk-01@before-upgrade
    # then map and mount again as in steps 4 and 5

    # clone to a new image (protect the snapshot first so it cannot be deleted under the clone)
    sudo rbd snap protect rbd-data/vm-disk-01@before-upgrade
    sudo rbd clone rbd-data/vm-disk-01@before-upgrade rbd-data/vm-disk-01-test
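
Step 6 runs freeze, snapshot, and unfreeze as separate commands, so if rbd snap create fails partway through, the filesystem stays frozen until someone thaws it by hand. The sketch below is one way to wrap the sequence so the unfreeze always runs after a successful freeze. The function name is hypothetical, the run helper makes it a dry run by default, and it does not cover the script being killed mid-snapshot.

```shell
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 (and run as root) to execute for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Freeze, snapshot, and always thaw, even if the snapshot command fails.
# $1 = mount point, $2 = snapshot spec (pool/image@snap)
frozen_snapshot() {
  mnt=$1 spec=$2
  run sync
  # If the freeze itself fails there is nothing to thaw; stop here.
  run fsfreeze --freeze "$mnt" || return 1
  rc=0
  run rbd snap create "$spec" || rc=$?
  # Thaw unconditionally, then report the first failure (if any).
  run fsfreeze --unfreeze "$mnt" || rc=$?
  return "$rc"
}

frozen_snapshot /mnt/app-data rbd-data/vm-disk-01@before-upgrade
```

Keeping the frozen window to a single rbd snap create call also limits how long application writes are blocked.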

Common pitfalls

  • Image features mismatch between create-time and kernel client: if rbd map returns “feature set mismatch,” either disable the unsupported features on the existing image with rbd feature disable, or recreate it with --image-feature layering,exclusive-lock and re-test. Alternatively, use rbd-nbd to bypass kernel feature checks.
  • pg_num set too low caps cluster bandwidth: with too few placement groups (under 64 for a serious pool), objects hash into a handful of PGs, so data and I/O pile onto a subset of OSDs while the rest sit idle. Use the Ceph PG calculator before pool creation.
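  • /etc/ceph/rbdmap does nothing without its systemd unit: unless rbdmap.service is enabled, the listed images are neither mapped at boot nor unmapped at shutdown, so mount-on-boot fails after the next reboot and shutdown can stall on a device that is still in use. Enable rbdmap.service.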
  • Cloned images that haven’t been flattened pin the parent snapshot: you can’t unprotect or delete the snapshot until every clone is flattened (rbd flatten) or removed.
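  • Discard support: ext4/XFS over RBD won’t return freed blocks to the pool unless the filesystem issues discards, either through the discard mount option or a periodic fstrim. The object-map feature is not required for this. Without discards, pool usage only ever grows.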
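
The pg_num rule of thumb from step 1 of the procedure, (OSDs × 100) / pool_size, can be sketched as a small shell function. This version rounds to the nearest power of two, which is one common convention and may differ from the rounding the official PG calculator applies. The result is a budget for all pools sharing those OSDs, not a per-pool figure, and the function name is hypothetical.

```shell
# Estimate a PG count from the rule of thumb (OSDs * target) / pool_size,
# rounded to the nearest power of two.
# $1 = number of OSDs
# $2 = pool size (replica count, or k+m for erasure coding)
# $3 = target PGs per OSD (default 100, the figure used in the rule of thumb)
pg_num_estimate() {
  target=$(( $1 * ${3:-100} / $2 ))
  p=1
  # Find the largest power of two that does not exceed the target.
  while [ $(( p * 2 )) -le "$target" ]; do
    p=$(( p * 2 ))
  done
  # Choose between p and 2p, whichever is closer to the target.
  if [ $(( target - p )) -gt $(( p * 2 - target )) ]; then
    p=$(( p * 2 ))
  fi
  echo "$p"
}

pg_num_estimate 4 3     # prints 128
pg_num_estimate 12 3    # prints 512
```

With 4 OSDs and size=3 the estimate matches the 128 used in the pool-creation command; at 12 OSDs the same rule suggests 512, which is why the count should be revisited as the cluster grows.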

In the engagements we run, RBD is the block storage layer for clustered environments and KVM hypervisor fleets — libvirt consumes RBD URIs natively (rbd:rbd-data/vm-disk-01:id=app01) and our snapshot tooling integrates with the per-tenant retention model. We size pools to expected workload, set PG counts using the official calculator (not defaults), and monitor RBD-specific metrics like image fragmentation and clone depth in Prometheus alongside the OSD-level signals the rest of the platform watches.