ext4 vs XFS vs Btrfs: choosing a filesystem for Linux production workloads

Choosing a filesystem for a Linux server still matters, even though ext4 is “good enough” for most cases. The differences are real at the edges: XFS scales much better past 16 TB and handles parallel writes well; ext4 has shorter fsck times on small-to-mid volumes and the most predictable behavior; Btrfs offers snapshots and checksums but has operational scars that make us cautious. This article lays out the decision criteria we use when we provision a new Linux host for a customer, the production-relevant comparison points, and the workloads each filesystem fits.

How to verify

```
# What's mounted where, with type and options
findmnt -t ext4,xfs,btrfs
mount | grep -E 'ext4|xfs|btrfs'

# Per-filesystem health and tuning
sudo tune2fs -l /dev/sdb1 | head -20         # ext4
sudo xfs_info /mnt/bigdata                    # XFS (pass the mount point)
sudo btrfs filesystem show                    # Btrfs
sudo btrfs filesystem df /mnt/data            # Btrfs space report

# Generic capacity
df -hT
```
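To turn that listing into something a provisioning check can act on, a small sketch can count mounts per filesystem type. It assumes `findmnt` raw output (`-r`), which escapes spaces in paths so awk can split on whitespace; `fs_inventory` is a hypothetical helper name, and each mounted Btrfs subvolume counts as a separate mount.

```shell
# Count ext4/XFS/Btrfs mounts from `findmnt -rn -o TARGET,FSTYPE` on stdin.
# fs_inventory is a hypothetical helper, not part of util-linux.
fs_inventory() {
  # Field 2 is FSTYPE; keep only the three filesystems compared here.
  awk '$2 == "ext4" || $2 == "xfs" || $2 == "btrfs" { n[$2]++ }
       END { for (t in n) print t, n[t] }' | sort
}

# Usage on a live host:
#   findmnt -rn -o TARGET,FSTYPE | fs_inventory
```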

What’s happening

ext4 has been Ubuntu’s default filesystem since 9.10. It’s mature, predictable, and well understood; every Linux kernel since 2.6.28 supports it natively. Its design refines ext3 with extents, delayed allocation, and persistent preallocation. It supports volumes up to 1 EiB in theory, but in practice metadata performance starts to degrade past about 16 TiB. fsck.ext4 is reasonably fast on small volumes (minutes for 100 GB), but its run time grows roughly linearly with volume size: a 10 TB ext4 volume can take hours to check after a bad boot.
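As a worked check on the 16 TiB figure: it is also ext4’s hard limit for a filesystem created without the `64bit` feature, assuming the common 4 KiB block size, because block numbers are then 32 bits wide:

```latex
2^{32}\ \text{blocks} \times 4\ \text{KiB} = 2^{32} \times 2^{12}\ \text{B} = 2^{44}\ \text{B} = 16\ \text{TiB}
```

Filesystems created with `64bit` (listed under “Filesystem features” in `tune2fs -l`) lift that ceiling; the metadata and fsck costs described above still apply.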

XFS was originally SGI’s high-end filesystem. It’s optimized for large files, parallel I/O, and large volumes; past 16 TiB it consistently outperforms ext4 on metadata-heavy workloads. Its allocation-group design lets multiple writers proceed in parallel on different parts of the disk without contention. XFS can’t shrink (you can’t reduce volume size, period), which is a constraint to know about. After an unclean shutdown, XFS replays its journal at mount time rather than scanning the whole filesystem, so the boot-time fsck.xfs is essentially a no-op; when a real repair is needed, xfs_repair is generally much faster than fsck.ext4 on a volume of the same size.

Btrfs is a copy-on-write filesystem with snapshots, checksums, in-filesystem RAID-5/6, and subvolume management. On paper it’s the most feature-rich; in practice the RAID-5/6 implementation has known data-loss bugs and is still listed as “unstable” in the kernel docs, metadata performance degrades sooner than XFS’s as volumes grow, and the fsck story is “if your filesystem is corrupt, restore from backup.” We use Btrfs for boot disks with snapshots (the Fedora/openSUSE approach is well established) but rarely as the data filesystem on a production server.
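Because a RAID-5/6 profile is easy to inherit on an existing host, a quick check of the block-group profiles is worth scripting. A minimal sketch, assuming the `Type, PROFILE: total=..., used=...` line format that `btrfs filesystem df` prints; `btrfs_has_raid56` is a hypothetical helper name.

```shell
# Return 0 (true) if `btrfs filesystem df` output on stdin shows any block
# group (Data, Metadata, System) using the RAID5 or RAID6 profile.
# btrfs_has_raid56 is a hypothetical helper, not part of btrfs-progs.
btrfs_has_raid56() {
  # Matches "Data, RAID5:" or "Metadata, RAID6:"; RAID1C3/RAID1C4 don't match.
  grep -Eq ', RAID[56]:'
}
```

On a live host: `sudo btrfs filesystem df /mnt/data | btrfs_has_raid56 && echo 'RAID5/6 profile in use' >&2`.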

The decision criteria we actually use:

- Default for application servers (databases, web tier, queue workers): ext4. Predictable, well instrumented, and understood by every monitoring tool.
- Large data volumes (>16 TB), file servers, parallel writes: XFS. The performance lead is real, and the operational maturity matches.
- Snapshot-required boot disks where ZFS isn’t an option: Btrfs. Subvolume snapshots are atomic and fast.
- Replication, snapshots, or checksums on data volumes: ZFS, not Btrfs. Btrfs’s data-volume features are years behind ZFS in maturity.
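The criteria above fit in a small function that provisioning scripts can use as a default. This is an illustrative sketch: `pick_fs` and its role names (`app`, `data`, `boot-snapshots`, `data-snapshots`) are hypothetical, and the size argument is in whole TB.

```shell
# Map a volume's size (whole TB) and role to a filesystem, following the
# decision criteria above. pick_fs and the role names are hypothetical.
pick_fs() {
  local size_tb=$1 role=$2
  case "$role" in
    # Application servers: ext4, unless the volume is past 16 TB.
    app) if [ "$size_tb" -gt 16 ]; then echo xfs; else echo ext4; fi ;;
    # Large data volumes, file servers, parallel writers.
    data) echo xfs ;;
    # Boot disks that need snapshots where ZFS isn't an option.
    boot-snapshots) echo btrfs ;;
    # Replication/snapshots/checksums on data volumes.
    data-snapshots) echo zfs ;;
    *) echo "unknown role: $role" >&2; return 2 ;;
  esac
}
```

Keeping the rules in one function means a later policy change (say, a different size threshold) happens in one place rather than in every host template.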

The procedure

Identify the disk and its size, then pick the filesystem:
```
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
```
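`lsblk -b -dn -o SIZE /dev/sdb` prints a disk’s size in bytes, which avoids parsing human-readable suffixes. A sketch that converts it to whole TiB; `bytes_to_tib` is a hypothetical helper. Note that a disk sold as 20 TB is only about 18 TiB, which matters when comparing against a 16 TiB threshold.

```shell
# Convert a byte count (e.g. from `lsblk -b -dn -o SIZE <disk>`) to whole TiB,
# rounding down. bytes_to_tib is a hypothetical helper name.
bytes_to_tib() {
  # 1 TiB = 2^40 bytes; shell arithmetic is 64-bit, enough for any disk.
  echo $(( $1 / 1099511627776 ))
}

# Usage:
#   size=$(lsblk -b -dn -o SIZE /dev/sdb)
#   bytes_to_tib "$size"
```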

ext4 for a typical app-server volume (≤16 TB, no snapshot requirement):

```
sudo mkfs.ext4 -L data -E lazy_itable_init=0,lazy_journal_init=0 /dev/sdb1
sudo mkdir -p /mnt/data
echo 'LABEL=data /mnt/data ext4 defaults,noatime 0 2' | sudo tee -a /etc/fstab
sudo systemctl daemon-reload   # systemd hosts: pick up the edited fstab
sudo mount /mnt/data
sudo tune2fs -l /dev/sdb1 | head -20
```

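The lazy_*=0 flags initialize the inode tables and zero the journal at format time instead of deferring the work. That costs a few minutes up front but avoids the ext4lazyinit kernel thread, which otherwise zeroes inode tables in the background after the first mount and competes for I/O, skewing early capacity tests.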
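The `tee -a` step appends unconditionally, so re-running the provisioning script duplicates the fstab entry. A minimal idempotent variant, assuming it runs as root; `add_fstab_line` is a hypothetical helper, and its optional second argument (default `/etc/fstab`) exists so the logic can be tried on a scratch file.

```shell
# Append a line to fstab only if an identical line isn't already there.
# add_fstab_line is a hypothetical helper; run it as root for /etc/fstab.
add_fstab_line() {
  local line=$1 fstab=${2:-/etc/fstab}
  # -q quiet, -x whole-line match, -F fixed string (no regex metacharacters)
  if ! grep -qxF -- "$line" "$fstab"; then
    printf '%s\n' "$line" >> "$fstab"
  fi
}

# Usage (as root):
#   add_fstab_line 'LABEL=data /mnt/data ext4 defaults,noatime 0 2'
```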

XFS for a large data volume:

```
sudo mkfs.xfs -L bigdata -d agcount=64 -i size=512 /dev/sdc1
sudo mkdir -p /mnt/bigdata
echo 'LABEL=bigdata /mnt/bigdata xfs defaults,noatime,inode64 0 0' | sudo tee -a /etc/fstab
sudo systemctl daemon-reload   # systemd hosts: pick up the edited fstab
sudo mount /mnt/bigdata
sudo xfs_info /mnt/bigdata
```

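agcount=64 splits the volume into 64 allocation groups, so allocations can proceed in parallel across 64 regions. -i size=512 gives 512-byte inodes, leaving room to keep extended attributes (e.g., NFSv4 ACLs) inline, which speeds up xattr-heavy workloads; current xfsprogs already default to 512-byte inodes, so the flag mainly documents intent. Likewise, inode64 has been the default since Linux 3.7.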
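Choosing agcount comes down to simple arithmetic: each allocation group covers the volume size divided by agcount, and XFS caps a single group at 1 TiB. A sketch of that check; `ag_size_gib` is a hypothetical helper, with sizes in GiB.

```shell
# Print the per-allocation-group size in GiB (rounded up) and return
# non-zero if it exceeds XFS's 1 TiB (1024 GiB) per-group limit.
# ag_size_gib is a hypothetical helper name.
ag_size_gib() {
  local vol_gib=$1 agcount=$2
  local per_ag=$(( (vol_gib + agcount - 1) / agcount ))
  echo "$per_ag"
  [ "$per_ag" -le 1024 ]
}

# Usage: a 32 TiB volume with agcount=64 gives 512 GiB groups.
#   ag_size_gib 32768 64
```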

Btrfs for a boot disk with snapshots:

```
sudo mkfs.btrfs -L btrfs-root /dev/sda1
sudo mkdir -p /mnt/temp
sudo mount /dev/sda1 /mnt/temp
sudo btrfs subvolume create /mnt/temp/@
sudo btrfs subvolume create /mnt/temp/@home
sudo btrfs subvolume create /mnt/temp/@snapshots
sudo umount /mnt/temp
# Then mount @ as root, @home as /home, etc. (mount option subvol=@, subvol=@home)
```
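The last step usually ends up as fstab entries that select each subvolume with the `subvol=` mount option. A sketch that prints them for the `btrfs-root` label created above; `btrfs_fstab_lines` is hypothetical, and mounting `@snapshots` at `/.snapshots` is an assumption (it is snapper’s convention, but check your distribution).

```shell
# Print fstab entries for the @, @home and @snapshots subvolumes.
# btrfs_fstab_lines is a hypothetical helper; the label defaults to the
# one passed to mkfs.btrfs above.
btrfs_fstab_lines() {
  local label=${1:-btrfs-root}
  printf 'LABEL=%s /           btrfs subvol=@,noatime          0 0\n' "$label"
  printf 'LABEL=%s /home       btrfs subvol=@home,noatime      0 0\n' "$label"
  printf 'LABEL=%s /.snapshots btrfs subvol=@snapshots,noatime 0 0\n' "$label"
}
```

Review the output before adding it to the target system’s /etc/fstab. The last field is 0 because fsck.btrfs does nothing at boot.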

Common mount options:
- noatime: don’t update access times. A cheap win on all three filesystems; the only catch is legacy software that reads atime.
- discard (ext4, XFS, Btrfs): inline TRIM on every delete. Good for SSDs, but it can hurt some workloads; we prefer the weekly fstrim.timer instead.
- barrier (ext4: barrier=1, the default): write barriers keep journal ordering correct across power loss, so leave them on. XFS always issues cache flushes and dropped its barrier/nobarrier options in Linux 4.19.
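To confirm an option actually took effect, `findmnt -no OPTIONS /mnt/data` prints the live option string. A small sketch that checks for one option as a whole comma-separated token; `has_mount_opt` is a hypothetical helper.

```shell
# Return 0 if option $2 appears as a whole token in the comma-separated
# option string $1. has_mount_opt is a hypothetical helper name.
has_mount_opt() {
  local opts=$1 want=$2
  # Pad with commas so "noatime" can't match inside another option name.
  case ",$opts," in
    *",$want,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   has_mount_opt "$(findmnt -no OPTIONS /mnt/data)" noatime || echo 'noatime missing'
```

For the weekly-TRIM alternative, `sudo systemctl enable --now fstrim.timer` turns on the timer that util-linux ships.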

Verify health periodically:

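```
# ext4
sudo tune2fs -l /dev/sdb1 | grep -i 'last checked'
# Force a full check of a data volume (unmount first):
sudo umount /mnt/data && sudo e2fsck -f /dev/sdb1 && sudo mount /mnt/data
# Root filesystem on systemd hosts: boot once with fsck.mode=force on the
# kernel command line (/forcefsck is deprecated there)

# XFS (unmount first; -n = no-modify mode, report only)
sudo xfs_repair -n /dev/sdc1

# Btrfs (runs online)
sudo btrfs scrub start /mnt/data
sudo btrfs scrub status /mnt/data
```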
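For ext4, the quickest health signal in `tune2fs -l` is the “Filesystem state” field: `clean` is healthy, while a value containing `with errors` means the kernel recorded an error and the volume should get an offline check. A sketch that extracts it; `ext4_state` is a hypothetical helper.

```shell
# Print the "Filesystem state" value from `tune2fs -l` output on stdin.
# ext4_state is a hypothetical helper name.
ext4_state() {
  sed -n 's/^Filesystem state:[[:space:]]*//p'
}

# Usage:
#   sudo tune2fs -l /dev/sdb1 | ext4_state
```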

Common pitfalls

- ext4 silently allows volumes above 16 TB, but metadata performance degrades there and fsck runs become long. Above 16 TB, switch to XFS.
- XFS volumes cannot be shrunk. Plan partition sizes accordingly: you can grow with xfs_growfs, but never shrink.
- Btrfs RAID-5/6 is documented as unstable in the kernel; don’t use it for production data. Btrfs RAID-1 is stable.
- noatime vs relatime on a database volume: noatime is usually the right choice, but verify your application doesn’t depend on atime (very few do).
- Running fsck on a mounted filesystem can corrupt it. Always unmount first or boot from rescue media.
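The last pitfall is easy to guard against in scripts. A sketch, with `is_mounted` as a hypothetical helper: it compares the literal device path against field 1 of a mounts file (default `/proc/self/mounts`) and does not resolve symlinked names such as `/dev/disk/by-label/...` or `/dev/mapper/...`.

```shell
# Return 0 if device $1 appears in the mount table. is_mounted is a
# hypothetical helper; $2 overrides the mounts file (default /proc/self/mounts).
is_mounted() {
  local dev=$1 mounts=${2:-/proc/self/mounts}
  # Field 1 of each mounts line is the source device.
  awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"
}
```

Typical use before a repair: `is_mounted /dev/sdb1 && { echo 'unmount /dev/sdb1 first' >&2; exit 1; }`, followed by `sudo e2fsck -f /dev/sdb1`.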

In the engagements we run, the filesystem choice is part of the host provisioning automation — ext4 for application VMs, XFS for storage VMs and data volumes, and Btrfs only when snapshots are a requirement we can’t satisfy with ZFS-on-root or volume-manager snapshots. We document the choice in the customer runbook so capacity expansions and recovery procedures match the filesystem’s actual semantics — there’s nothing more frustrating than a “we need to shrink the volume” ticket on an XFS filesystem at 2 AM.