
Velero disaster recovery: restoring a Kubernetes cluster

How Velero captures Kubernetes resources and persistent volumes, the BackupStorageLocation + VolumeSnapshotLocation model, and the cross-cluster restore drill we run on engagements.

Velero is the open-source Kubernetes backup tool we install by default on managed clusters that need cluster-level disaster recovery rather than just GitOps-driven redeploys. It serializes Kubernetes resources to an object-storage bucket, snapshots persistent volumes via the cloud’s snapshot API or via restic/Kopia for non-cloud volumes, and can restore the same backup into a different cluster — which is the actual definition of cluster DR. This article covers the install, the dual-location model, and the restore drill the support desk runs every quarter.

How to verify

After install, the operational surface lives in the velero namespace and in the CLI:

kubectl -n velero get pods
velero version
velero backup-location get
velero snapshot-location get
velero backup get | head
velero backup describe <name> --details | head -30

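backup-location and snapshot-location are the two pillars: where the serialized YAML lives and how PV data is captured. The backup location must report Available for backups to function; the snapshot location is not validated the same way, so a wrong region or provider there tends to surface only when a snapshot is attempted.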
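The availability check above can be scripted. The following is a minimal sketch, not part of Velero: the helper name unavailable_locations is hypothetical, and it assumes the BackupStorageLocation objects expose their state in .status.phase, which is the same value the CLI prints in its PHASE column.

```shell
# Reads "name<TAB>phase" lines on stdin, prints every location whose phase
# is not "Available", and returns non-zero if any were found.
# (Hypothetical helper; not shipped with Velero.)
unavailable_locations() {
  awk -F'\t' '
    NF && $2 != "Available" { print $1; bad = 1 }
    END { exit bad }
  '
}

# Usage against a live cluster (assumes kubectl access to the velero namespace):
# kubectl -n velero get backupstoragelocations \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
#   | unavailable_locations
```

Keeping the parsing separate from the kubectl call makes the check easy to reuse in a monitoring probe or a pre-backup gate.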

What’s happening

A Velero backup is two artifacts: a tarball of every Kubernetes resource (namespaced + cluster-scoped, filtered by the backup spec) and a set of volume snapshots referenced by name in that tarball. When you run velero restore create --from-backup, Velero replays the YAML through the API server (with optional namespace/label remapping), then rehydrates the PVs from the snapshots and rewires PVCs to point at the new volumes.

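BackupStorageLocation is the bucket where tarballs land. Most clusters use one (S3, GCS, Azure Blob, MinIO, Wasabi, or any S3-compatible store). VolumeSnapshotLocation configures the snapshotting backend used by the provider plugin, typically the cloud’s native snapshot API (EBS snapshots on AWS, for example). CSI snapshots are configured separately through VolumeSnapshotClass objects and do not use a VolumeSnapshotLocation. For environments where snapshots don’t exist (on-prem with local-path PVs), Velero ships a “Pod Volume Backup” mode (file-system backup in the current docs’ terminology) that uses restic or Kopia to back up volume contents file-by-file.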

Restic/Kopia mode trades cloud snapshot performance for portability: backups taken on AWS can be restored on bare-metal Kubernetes because the data is in the bucket, not in cloud-provider snapshot space.
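File-level backup can also be enabled volume by volume rather than for a whole backup. Below is a minimal sketch, assuming Velero’s opt-in pod annotation backup.velero.io/backup-volumes; the helper name opt_in_fs_backup and the pod and volume names in the usage line are hypothetical.

```shell
# Allow the kubectl binary to be swapped out (for dry runs or tests).
KUBECTL=${KUBECTL:-kubectl}

# Opts the named volumes of one pod into file-system (restic/Kopia) backup.
# Arguments: namespace, pod name, then one or more volume names.
opt_in_fs_backup() {
  local ns="$1" pod="$2"
  shift 2
  local volumes
  volumes=$(IFS=,; echo "$*")   # join the remaining arguments with commas
  "$KUBECTL" -n "$ns" annotate "pod/$pod" --overwrite \
    "backup.velero.io/backup-volumes=$volumes"
}

# Usage (hypothetical pod and volume names):
# opt_in_fs_backup acme-web web-0 data uploads
```

An annotation placed directly on a pod is lost when the pod is recreated, so for Deployments and StatefulSets the same annotation would normally go into the pod template.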

The procedure

  1. Provision the bucket and the IAM credentials. For AWS, create an S3 bucket and an IAM user/role with s3:*Object on the bucket’s objects, s3:ListBucket on the bucket itself, plus the EC2 snapshot permissions (ec2:CreateSnapshot, ec2:DeleteSnapshot, ec2:DescribeSnapshots, etc.).

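  2. Install Velero via the CLI installer (Helm chart is also supported). The --plugins flag adds the cloud-provider plugin to the Velero deployment as an init container; pick the plugin version that matches your Velero release.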

    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.10.0 \
      --bucket acme-velero-prod \
      --backup-location-config region=us-east-1 \
      --snapshot-location-config region=us-east-1 \
      --secret-file ./velero-credentials \
      --use-node-agent \
      --uploader-type kopia

    --use-node-agent deploys the DaemonSet that handles file-level PV backup; --uploader-type kopia is the modern default (restic still works).

  3. Take a backup of one namespace as a smoke test.

    velero backup create acme-web-smoke \
      --include-namespaces acme-web \
      --default-volumes-to-fs-backup
    velero backup describe acme-web-smoke --details

    --default-volumes-to-fs-backup tells Velero to capture every PV in that namespace via the node-agent rather than cloud snapshots. Pick one model per backup spec.

  4. Schedule the recurring backups via Schedule resources (Velero’s native cron, not Kubernetes CronJob):

    velero schedule create daily-all \
      --schedule "0 2 * * *" \
      --ttl 720h0m0s \
      --default-volumes-to-fs-backup
    velero schedule get

    --ttl 720h is 30 days. Velero garbage-collects backups past TTL.

  5. Cross-cluster restore drill. The drill is what proves DR. Spin up a second cluster, install Velero with the same bucket configured as the BackupStorageLocation (read-only is enough), then:

    velero backup-location create acme-prod-readonly \
      --provider aws \
      --bucket acme-velero-prod \
      --access-mode ReadOnly
    velero backup get
    velero restore create dr-test --from-backup daily-all-20260601020000 \
      --include-namespaces acme-web \
      --namespace-mappings acme-web:acme-web-dr
    velero restore describe dr-test --details

  6. Verify the restored workload. Pods, PVCs, ConfigMaps, Secrets, Services, Ingresses, and custom resources all need to be present in the mapped namespace (acme-web-dr); CRDs are cluster-scoped, so check those separately. Run smoke tests; the support desk validates the application behavior, not just the kubectl output.
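For the drill in steps 5 and 6, it helps to wait for the restore to reach a terminal phase before inspecting the namespace. This is a minimal sketch under stated assumptions: the function names are hypothetical, the phase is read from .status.phase on the Restore object in the velero namespace, and the retry count and delay are arbitrary defaults.

```shell
# Prints the phase of a Velero Restore object (empty until Velero sets it).
get_restore_phase() {
  kubectl -n velero get restore "$1" -o jsonpath='{.status.phase}'
}

# Polls until the restore reaches a terminal phase.
# Arguments: restore name, max attempts (default 60), delay in seconds (default 10).
# Returns 0 on Completed, 1 on a failed phase, 2 on timeout.
wait_for_restore() {
  local name="$1" tries="${2:-60}" delay="${3:-10}"
  local i=0 phase=""
  while [ "$i" -lt "$tries" ]; do
    phase=$(get_restore_phase "$name") || phase=""
    case "$phase" in
      Completed)
        echo "$name: $phase"; return 0 ;;
      PartiallyFailed|Failed|FailedValidation)
        echo "$name: $phase" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$name: timed out (last phase: ${phase:-unknown})" >&2
  return 2
}

# Usage on the DR cluster, with the names from the drill above:
# wait_for_restore dr-test && kubectl -n acme-web-dr get pods,pvc,svc,ingress
```

A PartiallyFailed restore is treated as a failure here on purpose: in a drill, any item that did not restore is a finding worth reading in velero restore describe.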

Operational notes

  • Velero restores CRDs ahead of most other resources by default. If something else has to come up earlier (validating webhooks, or custom resources that other objects depend on), override the order with the server’s --restore-resource-priorities flag. The default order is usually right but you’ll meet a CRD ordering bug eventually, so read the controller logs when restores stall.
  • Hooks: the pre.hook.backup.velero.io/command and post.hook.restore.velero.io/command pod annotations (each paired with a matching .../container annotation) let you quiesce databases (call pg_start_backup, which PostgreSQL 15 renamed to pg_backup_start; run a mysqldump; etc.). Without quiesce, PV snapshots of a running database produce crash-consistent state; see Backup app-consistent vs crash-consistent.
  • Velero does not back up the cluster’s own etcd. On managed services like EKS/GKE/AKS, the cloud provider owns etcd; on self-managed control planes, take etcd snapshots separately.
  • Restore namespace remapping (--namespace-mappings) is the trick we use for DR drills — restore the prod backup into a dr-test namespace alongside everything else so the drill doesn’t disturb production.
  • Velero’s plugin ecosystem includes restore item actions for renaming Ingress hosts, switching storage classes, and rewriting node selectors, which are typical needs when restoring across regions or providers.
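  • Kopia repos can grow significantly faster than expected if PV churn is high. Monitor bucket size, and rather than applying --default-volumes-to-fs-backup to everything, decide per workload: high-churn scratch volumes can be left out with the backup.velero.io/backup-volumes-excludes pod annotation.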
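The backup hooks mentioned in the notes above are set as pod annotations. Here is a minimal sketch, assuming Velero’s pre.hook.backup.velero.io and post.hook.backup.velero.io annotation keys; the helper name add_freeze_hooks is hypothetical, and fsfreeze stands in for whatever quiesce command the workload actually needs (it also assumes the target container ships fsfreeze and has the privileges to run it).

```shell
# Allow the kubectl binary to be swapped out (for dry runs or tests).
KUBECTL=${KUBECTL:-kubectl}

# Annotates a pod so Velero freezes a filesystem before backing it up
# and unfreezes it afterwards.
# Arguments: namespace, pod name, container name, mount path to freeze.
add_freeze_hooks() {
  local ns="$1" pod="$2" container="$3" mount="$4"
  "$KUBECTL" -n "$ns" annotate "pod/$pod" --overwrite \
    "pre.hook.backup.velero.io/container=$container" \
    "pre.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--freeze\", \"$mount\"]" \
    "post.hook.backup.velero.io/container=$container" \
    "post.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--unfreeze\", \"$mount\"]"
}

# Usage (hypothetical pod, container, and path):
# add_freeze_hooks acme-web db-0 fsfreeze /var/lib/data
```

The command value is a JSON array, which is why the quotes are escaped. For a database, the freeze pair would usually be replaced by the engine’s own flush or backup-mode commands.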

In the engagements we run, Velero is the cluster DR baseline. Daily backup to S3 with object lock, weekly cross-cluster restore drill into a parked DR cluster, and a runbook the support desk can execute on its own. The full integration into the operating model is at /en/services/managed-operations/, and the multi-region picture for clustered workloads sits in Clustered Environments.