Velero is the open-source Kubernetes backup tool we install by default on managed clusters that need cluster-level disaster recovery rather than just GitOps-driven redeploys. It serializes Kubernetes resources to an object-storage bucket, snapshots persistent volumes via the cloud’s snapshot API or via restic/Kopia for non-cloud volumes, and can restore the same backup into a different cluster — which is the actual definition of cluster DR. This article covers the install, the dual-location model, and the restore drill the support desk runs every quarter.
How to verify
After install, the operational surface lives in the velero namespace and in the CLI:
kubectl -n velero get pods
velero version
velero backup-location get
velero snapshot-location get
velero backup get | head
velero backup describe <name> --details | head -30
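backup-location and snapshot-location are the two pillars: where the serialized resources live and how PV data is captured. The backup location must report Available, and a snapshot location must exist for the provider and region you snapshot in; otherwise backups will not function.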
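The availability check can be scripted for monitoring or a pre-drill gate. The following is a minimal sketch, not part of Velero: the function name check_backup_locations is hypothetical, and it assumes Velero runs in the velero namespace and that kubectl is already pointed at the right cluster. It reads the phase field from each BackupStorageLocation instead of parsing the CLI table.

```shell
# Returns 0 only if every BackupStorageLocation reports phase "Available".
# Assumption: Velero is installed in the "velero" namespace.
check_backup_locations() {
  phases=$(kubectl -n velero get backupstoragelocations.velero.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}') || return 2
  if [ -z "$phases" ]; then
    echo "no BackupStorageLocation found" >&2
    return 1
  fi
  # Collect the names of locations whose phase is anything but Available.
  bad=$(printf '%s\n' "$phases" | awk -F'\t' '$2 != "Available" { print $1 }')
  if [ -n "$bad" ]; then
    echo "unavailable backup locations: $bad" >&2
    return 1
  fi
  echo "all backup locations Available"
}
```

Calling check_backup_locations from a cron job or CI step gives a non-zero exit code as soon as one location stops validating, which is easier to alert on than the table output.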
What’s happening
A Velero backup is two artifacts: a tarball of every Kubernetes resource (namespaced + cluster-scoped, filtered by the backup spec) and a set of volume snapshots referenced by name in that tarball. When you run velero restore create --from-backup <name>, Velero replays the serialized resources through the API server (with optional namespace/label remapping), then rehydrates the PVs from the snapshots and rewires PVCs to point at the new volumes.
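To see what the resource tarball actually holds, you can download it with velero backup download <name> and list its entries. The helper below is a rough sketch: the function name list_backup_resource_types is hypothetical, and it assumes the archive stores objects under a resources/<type>/ prefix, which is worth confirming against a real backup from your Velero version.

```shell
# Lists the resource types stored in a downloaded Velero backup tarball.
# Assumption: entries are laid out as resources/<type>/... inside the archive.
# Usage: list_backup_resource_types <path-to-tarball>
list_backup_resource_types() {
  tar -tzf "$1" |
    sed 's|^\./||' |
    awk -F/ '$1 == "resources" && $2 != "" { print $2 }' |
    sort -u
}
```

Running it against the tarball of the smoke-test backup is a quick way to confirm that the filters in the backup spec captured the resource types you expected before relying on the backup for a restore.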
BackupStorageLocation is the bucket where tarballs land. Most clusters use one (S3, GCS, Azure Blob, MinIO, Wasabi, or any S3-compatible). VolumeSnapshotLocation is the snapshotting backend, typically the cloud provider's native volume snapshot API (EBS snapshots on AWS, for example) reached through the provider plugin. For environments where snapshots don't exist (on-prem with local-path PVs), Velero ships a file-level mode, called Pod Volume Backup in older releases and File System Backup in current ones, that uses restic or Kopia to back up volume contents file by file.
Restic/Kopia mode trades cloud snapshot performance for portability: backups taken on AWS can be restored on bare-metal Kubernetes because the data is in the bucket, not in cloud-provider snapshot space.
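File-level backup does not have to be all-or-nothing. Velero also honors the backup.velero.io/backup-volumes pod annotation, which opts named volumes into file-system backup. The wrapper below is an illustrative sketch: the function name fs_backup_opt_in and the volume names in the usage line are hypothetical.

```shell
# Opts selected volumes of one pod into file-system backup (opt-in approach).
# Usage: fs_backup_opt_in <namespace> <pod> <volume>[,<volume>...]
# The volume names are the entries under spec.volumes in the pod, not PVC names.
fs_backup_opt_in() {
  kubectl -n "$1" annotate pod "$2" --overwrite \
    "backup.velero.io/backup-volumes=$3"
}

# Example (hypothetical pod and volume names):
#   fs_backup_opt_in acme-web web-0 data,uploads
```

An annotation set this way is lost when the pod is recreated, so for anything long-lived the same key belongs in the pod template of the Deployment or StatefulSet.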
The procedure
- Provision the bucket and the IAM credentials. For AWS, create an S3 bucket and an IAM user/role with s3:*Object on the bucket plus the EC2 snapshot permissions (ec2:CreateSnapshot, ec2:DeleteSnapshot, ec2:DescribeSnapshots, etc.).
- Install Velero via the CLI installer (Helm chart is also supported). The installer adds the cloud-provider plugin named in --plugins to the Velero deployment.
  velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.10.0 \
    --bucket acme-velero-prod \
    --backup-location-config region=us-east-1 \
    --snapshot-location-config region=us-east-1 \
    --secret-file ./velero-credentials \
    --use-node-agent \
    --uploader-type kopia
  --use-node-agent deploys the DaemonSet that handles file-level PV backup; --uploader-type kopia is the modern default (restic still works).
- Take a backup of one namespace as a smoke test.
  velero backup create acme-web-smoke \
    --include-namespaces acme-web \
    --default-volumes-to-fs-backup
  velero backup describe acme-web-smoke --details
  --default-volumes-to-fs-backup tells Velero to capture every PV in that namespace via the node-agent rather than cloud snapshots. Pick one model per backup spec.
- Schedule the recurring backups via Schedule resources (Velero's native cron, not Kubernetes CronJob):
  velero schedule create daily-all \
    --schedule "0 2 * * *" \
    --ttl 720h0m0s \
    --default-volumes-to-fs-backup
  velero schedule get
  --ttl 720h0m0s is 30 days. Velero garbage-collects backups past TTL.
- Cross-cluster restore drill. The drill is what proves DR. Spin up a second cluster, install Velero with the same bucket configured as the BackupStorageLocation (read-only is enough), then:
  velero backup-location create acme-prod-readonly \
    --provider aws \
    --bucket acme-velero-prod \
    --access-mode ReadOnly
  velero backup get
  velero restore create dr-test --from-backup daily-all-20260601020000 \
    --include-namespaces acme-web \
    --namespace-mappings acme-web:acme-web-dr
  velero restore describe dr-test --details
- Verify the restored workload. Pods, PVCs, ConfigMaps, Secrets, Services, and Ingresses all need to be present in the new namespace, along with the cluster-scoped CRDs they depend on. Run smoke tests; the support desk validates the application behavior, not just the kubectl output.
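The drill is easier to repeat when the backup name is not typed by hand. The sketch below picks the newest Completed backup produced by a schedule and restores one namespace from it. Both function names and the -dr namespace suffix are assumptions made for illustration; the velero.io/schedule-name label is what Velero attaches to backups created from a Schedule, and the sort relies on RFC 3339 timestamps ordering lexicographically.

```shell
# Prints the newest Completed backup created by the given Velero schedule.
# Usage: latest_completed_backup <schedule-name>
latest_completed_backup() {
  kubectl -n velero get backups.velero.io -l "velero.io/schedule-name=$1" \
    -o jsonpath='{range .items[*]}{.status.phase}{"\t"}{.status.completionTimestamp}{"\t"}{.metadata.name}{"\n"}{end}' |
    awk -F'\t' '$1 == "Completed" { print $2 "\t" $3 }' |
    sort | tail -n 1 | cut -f2
}

# Restores one namespace from that backup into "<namespace>-dr" and waits.
# Usage: run_restore_drill <schedule-name> <namespace>
run_restore_drill() {
  backup=$(latest_completed_backup "$1")
  if [ -z "$backup" ]; then
    echo "no Completed backup found for schedule $1" >&2
    return 1
  fi
  velero restore create "dr-test-$(date +%Y%m%d%H%M%S)" \
    --from-backup "$backup" \
    --include-namespaces "$2" \
    --namespace-mappings "$2:$2-dr" \
    --wait
}
```

With the names used in this article, run_restore_drill daily-all acme-web restores into acme-web-dr. The timestamped restore name keeps repeated drills from colliding with an earlier dr-test object.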
Operational notes
- Restore order is governed by the Velero server's --restore-resource-priorities setting: cluster-scoped CRDs and validating webhooks restore before the resources that depend on them only if that order puts them first. The default order is usually right, but you'll meet a CRD ordering bug eventually; read the controller logs when restores stall.
- Hooks: pre-backup and post-restore hook annotations on pods (pre.hook.backup.velero.io/command, post.hook.restore.velero.io/command) let you quiesce databases (call pg_start_backup, renamed pg_backup_start in PostgreSQL 15, run a mysqldump, etc.). Without quiesce, PV snapshots of a running database produce crash-consistent state; see Backup app-consistent vs crash-consistent.
- Velero does not back up the cluster's own etcd. For cluster-state DR on managed services like EKS/GKE/AKS, the cloud provider owns etcd; for self-managed control planes, take etcd snapshots separately.
- Restore namespace remapping (--namespace-mappings) is the trick we use for DR drills: restore the prod backup into a dr-test namespace alongside everything else so the drill doesn't disturb production.
- The velero plugin ecosystem includes restore item actions for renaming Ingress hosts, switching storage classes, and rewriting node selectors, all typical needs when restoring across regions or providers.
- Kopia repos can grow significantly faster than expected if PV churn is high. Monitor bucket size; tune --default-volumes-to-fs-backup per workload.
In the engagements we run, Velero is the cluster DR baseline. Daily backup to S3 with object lock, weekly cross-cluster restore drill into a parked DR cluster, and a runbook the support desk can execute on its own. The full integration into the operating model is at /en/services/managed-operations/, and the multi-region picture for clustered workloads sits in Clustered Environments.