The Ceph RADOS Gateway (RGW) speaks the Amazon S3 API on top of the same RADOS object store that backs RBD and CephFS. For applications that already use the AWS SDK, that means swapping the endpoint and the credentials; the SDK calls don’t change. RGW handles bucket policies, presigned URLs, multipart uploads, lifecycle rules, and object versioning. This article walks through RGW daemon placement, bootstrapping the first user and bucket, the gotchas around endpoint configuration and TLS termination, and the integration points we wire up in production.
How to verify
# Daemon placement and health
sudo ceph orch ps --daemon-type rgw
sudo ceph -s | grep rgw
# Local API check (replace with your endpoint hostname; add :8000 to hit a gateway directly instead of the load balancer)
curl -i http://s3.example.internal/
# Look for: HTTP/1.1 200 OK and an XML body listing buckets (empty initially)
# User and bucket listing via radosgw-admin
sudo radosgw-admin user list
sudo radosgw-admin bucket list
sudo radosgw-admin bucket stats --bucket=app-uploads
# Once a user exists, use the AWS CLI for full S3 verification
aws --endpoint-url http://s3.example.internal s3 ls
What’s happening
An RGW deployment exposes a stateless gateway daemon (radosgw) that translates S3 (or Swift) API calls into RADOS reads and writes. Each gateway daemon is independent and sits behind a load balancer; all state lives in RADOS pools that hold bucket data, bucket indexes, and metadata. The default pool layout creates <zone>.rgw.buckets.data (objects), <zone>.rgw.buckets.index (per-bucket index), and <zone>.rgw.meta (user and bucket metadata). For high-bucket-count workloads, the bucket index pool benefits from SSD placement: the index is hot and small, the data pool is cold and large.
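To check a cluster against the pool layout described above, a small helper can compare the expected names with a pool listing. This is a minimal sketch: the function names rgw_default_pools and missing_rgw_pools are hypothetical, and it assumes the listing arrives one pool name per line on stdin. On a fresh deployment some pools may not exist until the first bucket or object is written, so a missing pool is not necessarily a fault.

```shell
# Print the default RGW pool names for a zone (layout as described above).
rgw_default_pools() {
  zone="${1:-default}"
  for suffix in rgw.buckets.data rgw.buckets.index rgw.meta; do
    printf '%s.%s\n' "$zone" "$suffix"
  done
}

# Read a pool listing on stdin (one name per line) and print the
# expected RGW pools that do not appear in it.
missing_rgw_pools() {
  zone="${1:-default}"
  listing="$(cat)"
  rgw_default_pools "$zone" | while IFS= read -r pool; do
    printf '%s\n' "$listing" | grep -qxF "$pool" || printf '%s\n' "$pool"
  done
}

# Usage on an admin node:
#   sudo ceph osd pool ls | missing_rgw_pools default
```

Empty output means all three pools are present for that zone.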
User and bucket management happens through radosgw-admin on the cluster side, or through the S3 API (with appropriate credentials) on the client side. Users are local to the cluster’s “zone” (the Ceph unit that multi-site replication is built from); for a single-site deployment you have one zone in one zonegroup in one realm, and you mostly don’t think about it. When you create a user, RGW issues an access key and secret key with the same shape as AWS credentials (a 20-character access key and a 40-character secret, though without the AKIA prefix), and the SDK accepts them; the only difference is the endpoint URL.
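Because the endpoint and the key pair are the only client-side differences, they can live in a named AWS CLI profile instead of being repeated on every command. The sketch below is one way to do that. The function name write_rgw_profile and the profile name are hypothetical, and it assumes an AWS CLI release recent enough to read endpoint_url from the shared config file; older releases still need --endpoint-url on each call.

```shell
# Append a named profile for an RGW endpoint to AWS CLI config files.
# Arguments: profile name, endpoint URL, access key, secret key,
# and an optional target directory (defaults to ~/.aws).
write_rgw_profile() {
  profile="$1"; endpoint="$2"; access_key="$3"; secret_key="$4"
  dir="${5:-$HOME/.aws}"
  mkdir -p "$dir"
  cat >> "$dir/config" <<EOF
[profile $profile]
endpoint_url = $endpoint
EOF
  cat >> "$dir/credentials" <<EOF
[$profile]
aws_access_key_id = $access_key
aws_secret_access_key = $secret_key
EOF
  # Keep the secret key readable by the owner only.
  chmod 600 "$dir/credentials"
}

# Usage:
#   write_rgw_profile rgw http://s3.example.internal:8000 "<access_key>" "<secret_key>"
#   aws --profile rgw s3 ls
```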
The “S3-compatible” claim has limits. Most S3 SDK operations work, but some advanced features lag or are missing: S3 Object Lambda, AWS-specific signing edge cases, certain lifecycle policy combinations. Test the specific SDK calls your application relies on, especially around multipart upload concurrency and lifecycle transitions. RGW also has its own non-AWS extensions (the “admin API” at /admin/) that let you script user creation and quota management more cleanly than the S3 API does.
The procedure
- On the admin node, place RGW daemons across the cluster. Three behind a load balancer is the production minimum:
    sudo ceph orch apply rgw default --realm=default --zone=default \
      --placement="3 ceph1 ceph2 ceph3" --port=8000
    sudo ceph orch ps --daemon-type rgw
- Create the first S3 user:
    sudo radosgw-admin user create --uid=app01 --display-name="App01 Service Account" \
      --email=<email>
  Output includes access_key and secret_key. Save them in your secrets store; anyone with radosgw-admin access can read them back later with radosgw-admin user info --uid=app01.
- Set a quota on the user (optional but recommended; 100 GiB and one million objects here):
    sudo radosgw-admin quota set --uid=app01 --quota-scope=user \
      --max-size=107374182400 --max-objects=1000000
    sudo radosgw-admin quota enable --uid=app01 --quota-scope=user
- Test from a workstation with the AWS CLI:
    export AWS_ACCESS_KEY_ID=<access_key>
    export AWS_SECRET_ACCESS_KEY=<secret_key>
    aws --endpoint-url http://s3.example.internal:8000 s3 mb s3://app-uploads
    echo "hello world" > /tmp/hello.txt
    aws --endpoint-url http://s3.example.internal:8000 s3 cp /tmp/hello.txt s3://app-uploads/
    aws --endpoint-url http://s3.example.internal:8000 s3 ls s3://app-uploads/
- Generate a presigned URL for a one-shot download:
    aws --endpoint-url http://s3.example.internal:8000 s3 presign s3://app-uploads/hello.txt --expires-in 3600
- Apply a bucket policy that grants public read on public/*:
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "PublicReadPrefix",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::app-uploads/public/*"
      }]
    }
  Save it as /tmp/policy.json and apply with:
    aws --endpoint-url http://s3.example.internal:8000 \
      s3api put-bucket-policy --bucket app-uploads --policy file:///tmp/policy.json
- Front the RGW endpoints with HAProxy or Caddy for TLS termination. radosgw can serve HTTPS directly, but a real LB gives you certificate rotation, observability, and connection limits in one place.
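The byte value passed to --max-size in the quota step is easier to audit when it is derived from a human-readable size. Here is a minimal sketch with two hypothetical helpers, gib_to_bytes and quota_commands. It only prints the radosgw-admin commands so they can be reviewed before anyone runs them with sudo on the admin node.

```shell
# Convert GiB to bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print (do not run) the quota commands for a user.
# Arguments: uid, size limit in GiB, object count limit.
quota_commands() {
  uid="$1"; gib="$2"; max_objects="$3"
  printf 'radosgw-admin quota set --uid=%s --quota-scope=user --max-size=%s --max-objects=%s\n' \
    "$uid" "$(gib_to_bytes "$gib")" "$max_objects"
  printf 'radosgw-admin quota enable --uid=%s --quota-scope=user\n' "$uid"
}

# Usage:
#   gib_to_bytes 100        # prints 107374182400, the value used in the quota step
#   quota_commands app01 100 1000000
```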
Common pitfalls
- The S3 SDK’s “virtual-hosted style” addressing (bucket.s3.example.internal) requires wildcard DNS pointing at the RGW LB and a TLS certificate that covers the wildcard. “Path-style” (s3.example.internal/bucket) avoids that, but AWS has announced plans to deprecate it on its own service, so SDK defaults may shift over time; set the addressing style explicitly in the client config.
- Bucket index sharding: without resharding, a single bucket index becomes a hotspot at high object counts. RGW reshards bucket indexes dynamically in Pacific and later, but buckets where dynamic resharding is disabled or unavailable need a manual radosgw-admin bucket reshard.
- TLS endpoints in client SDK config: Boto3 and other SDKs verify the server certificate against the hostname they connect to, so the certificate’s SAN must match the hostname in the endpoint URL, or you’ll see SSL errors that are easy to mistake for auth errors.
- Multipart upload garbage: aborted or abandoned multipart uploads leave orphan parts in the pool. Configure a lifecycle rule (AbortIncompleteMultipartUpload) to clean them up, or storage grows monotonically.
- Lifecycle rules are processed asynchronously by the gateway’s lifecycle thread, by default inside a nightly work window (rgw_lifecycle_work_time); in some Ceph versions the LC processor stalls and rules don’t run. Monitor with radosgw-admin lc list, and trigger a pass manually with radosgw-admin lc process.
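Two of the pitfalls above lend themselves to small helpers; this is a sketch with hypothetical function names. suggest_shards estimates a shard count for a manual reshard, assuming a target of roughly 100,000 objects per shard (the value commonly cited as the rgw_max_objs_per_shard default; confirm it for your release). write_abort_mpu_lifecycle writes a lifecycle configuration in the JSON shape the AWS CLI accepts for put-bucket-lifecycle-configuration.

```shell
# Estimate an index shard count: ceil(objects / per_shard), minimum 1.
# Arguments: object count, optional objects-per-shard target.
suggest_shards() {
  objects="$1"; per_shard="${2:-100000}"
  shards=$(( (objects + per_shard - 1) / per_shard ))
  if [ "$shards" -lt 1 ]; then
    shards=1
  fi
  echo "$shards"
}

# Write a lifecycle configuration that aborts incomplete multipart
# uploads after the given number of days.
# Arguments: days, output file path.
write_abort_mpu_lifecycle() {
  days="$1"; out="$2"
  cat > "$out" <<EOF
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": $days }
    }
  ]
}
EOF
}

# Usage (object count taken from radosgw-admin bucket stats):
#   sudo radosgw-admin bucket reshard --bucket=app-uploads \
#     --num-shards="$(suggest_shards 2500000)"
#   write_abort_mpu_lifecycle 7 /tmp/lifecycle.json
#   aws --endpoint-url http://s3.example.internal:8000 s3api \
#     put-bucket-lifecycle-configuration --bucket app-uploads \
#     --lifecycle-configuration file:///tmp/lifecycle.json
```

The seven-day window is an arbitrary example; pick a value longer than your slowest legitimate upload.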
In the engagements we run, RGW is the object storage tier for self-hosted environments where the customer’s application speaks the S3 SDK but the data can’t leave the customer’s perimeter. We expose the gateway behind HAProxy with TLS termination, monitor request rate and bucket-level metrics in Prometheus, and write lifecycle rules during deployment instead of waiting for a capacity ticket. RGW backed by a Ceph cluster gives you AWS-shaped object storage with the operating model of an on-prem cluster.