Application performance monitoring used to mean buying a vendor SDK and shipping data to a single backend. OpenTelemetry breaks that pattern — a vendor-neutral SDK in the application emits traces, metrics, and logs over OTLP to a Collector, which then exports to whichever backend(s) you want (Jaeger, Tempo, Honeycomb, Datadog, Prometheus, Loki, all at once). The catch: the Collector configuration is where everything goes right or wrong. This article walks through a working Collector deployment, an SDK in a Python or Node app, the sampling strategy that keeps trace volume sane, and the resource attributes that make a trace traceable to a real host and version.
How to verify
After the Collector is up and an app is sending, all of the following should pass.
sudo systemctl status otelcol-contrib --no-pager
ss -lntp | grep -E ':(4317|4318|13133|8888)\b'
# Collector self-health
curl -fsS http://127.0.0.1:13133
# Collector's own metrics
curl -fsS http://127.0.0.1:8888/metrics | grep -E '^otelcol_(receiver|exporter|processor)_(accepted|sent|send_failed|dropped)'
# A trace landed in the backend (Jaeger as example)
curl -fsS 'http://jaeger:16686/api/traces?service=demo-app&limit=1' | jq '.data[0] | {traceID, spans: (.spans | length)}'
# Sampler settings, if the app publishes them on its own /metrics endpoint
curl -fsS http://demo-app/metrics | grep 'otel_traces_sampler'
A common failure: otelcol_exporter_send_failed_spans is non-zero — usually the backend hostname is wrong, TLS is misconfigured, or the auth header is missing. The Collector keeps trying until its queue fills, then drops.
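The send-failure check can also be scripted. The following is a minimal sketch, assuming the Collector's self-metrics are served in Prometheus text format at the address used in this article; the helper names (sum_by_metric, failed_sends, fetch_metrics) are made up for illustration and are not part of any OpenTelemetry tooling.

```python
import re
import urllib.request
from collections import defaultdict

# One sample line of the Prometheus text format, e.g.
#   otelcol_exporter_send_failed_spans{exporter="otlp/tempo"} 12
# Simplification: label values containing "}" are not handled.
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def sum_by_metric(text: str, prefix: str) -> dict:
    """Sum sample values per metric name, for names starting with `prefix`."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_LINE.match(line)
        if match and match.group(1).startswith(prefix):
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


def failed_sends(text: str) -> dict:
    """Return only the send_failed counters that are above zero."""
    totals = sum_by_metric(text, "otelcol_exporter_send_failed")
    return {name: value for name, value in totals.items() if value > 0}


def fetch_metrics(url: str = "http://127.0.0.1:8888/metrics") -> str:
    """Fetch the Collector's self-metrics (address assumed from this article's config)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

Calling failed_sends(fetch_metrics()) on the Collector host returns an empty dict when every exporter is healthy, which makes it easy to wire into a cron job or a deploy check.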
What’s happening
OpenTelemetry has three pieces. The SDK in the application creates spans, metrics, and logs and ships them to a configured endpoint (typically OTLP/gRPC on port 4317 or OTLP/HTTP on port 4318). The Collector is a separate Go process with a pipeline model: receivers accept data (OTLP, Jaeger, Zipkin, Prometheus scrape, syslog, …), processors modify the data in flight (batch, filter, attribute, sample), and exporters send it to one or more backends. Backends store the data and serve queries (Jaeger, Tempo, Prometheus, Loki, Elasticsearch, vendor APM).
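The pipeline model is easier to reason about with a toy version in hand. The sketch below is only an illustration of the receiver, processor, exporter flow; it is not the Collector's Go API, and the Pipeline class, the two processors, and the /healthz route are all hypothetical. In the real Collector, deployment.environment would live on the resource rather than on each span.

```python
from dataclasses import dataclass, field


@dataclass
class Pipeline:
    processors: list = field(default_factory=list)  # callables: batch in, batch out
    exporters: list = field(default_factory=list)   # callables: deliver a batch

    def receive(self, spans: list) -> None:
        """What a receiver would call once it has decoded incoming data."""
        for process in self.processors:   # processors run in the listed order
            spans = process(spans)
            if not spans:                 # everything was filtered out
                return
        for export in self.exporters:     # every exporter sees the same batch
            export(list(spans))


def drop_health_checks(spans: list) -> list:
    """Processor: filter out spans for a hypothetical /healthz route."""
    return [s for s in spans if s["name"] != "GET /healthz"]


def upsert_environment(spans: list) -> list:
    """Processor: set deployment.environment on every span, like an upsert action."""
    return [
        {**s, "attributes": {**s["attributes"], "deployment.environment": "prod"}}
        for s in spans
    ]


tempo_store, jaeger_store = [], []  # stand-ins for two backends
pipeline = Pipeline(
    processors=[drop_health_checks, upsert_environment],
    exporters=[tempo_store.extend, jaeger_store.extend],
)
pipeline.receive([
    {"name": "GET /orders", "attributes": {}},
    {"name": "GET /healthz", "attributes": {}},
])
print(tempo_store == jaeger_store, len(tempo_store))
```

Both stand-in backends end up with the same single span, which is the property that lets one Collector feed several backends at once.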
The Collector matters because it decouples the SDK from the backend. Apps emit OTLP and never need to change when you switch a backend or add a second one. The Collector handles sampling, retry, batching, attribute redaction (PII), and routing to multiple destinations — all configuration, no SDK redeploy.
Two settings determine the cost and the usefulness of a tracing setup. Sampling sets the cost: head-based sampling decides at trace start (cheap and predictable), while tail-based sampling decides after the spans arrive (more expensive, because whole traces must be buffered first, but it lets you keep all errors and a sample of successes). Resource attributes set the usefulness: service.name, service.version, host.name, deployment.environment, and cloud.region tie a trace to a host and a release. Without them, all you know is “a request was slow.”
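The difference between the two sampling styles fits in a few lines. This is a simplified sketch, assuming trace IDs are integers and each span is a dict with a hypothetical "status" field; the head-based rule is similar in spirit to a trace-ID-ratio sampler but is not copied from any SDK, and real tail sampling also needs a buffering window that is left out here.

```python
MASK_64 = (1 << 64) - 1


def head_sample(trace_id: int, ratio: float) -> bool:
    """Head-based: decide from the trace ID alone, before any span has finished."""
    # Compare the low 64 bits of the ID against ratio * 2**64, so every
    # service that sees the same trace ID reaches the same decision.
    return (trace_id & MASK_64) < int(ratio * (1 << 64))


def tail_sample(spans: list, ratio: float) -> bool:
    """Tail-based: decide once the spans of a trace have been buffered."""
    if not spans:
        return False
    if any(span.get("status") == "ERROR" for span in spans):
        return True                                   # keep every trace with an error
    return head_sample(spans[0]["trace_id"], ratio)   # sample the successes


failing_trace = [{"trace_id": MASK_64, "status": "ERROR"}]
print(head_sample(MASK_64, 0.1), tail_sample(failing_trace, 0.1))
```

The same trace ID that head-based sampling would drop at 10 % is kept by the tail-based rule because one of its spans failed. The price is that the sampler has to hold every span until the trace is complete.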
The procedure
- Install the OpenTelemetry Collector Contrib distribution. The Core distribution lacks the receivers and exporters most teams need; Contrib is what most production deployments run.

    OTEL_VERSION="0.110.0"
    curl -fLO "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v${OTEL_VERSION}/otelcol-contrib_${OTEL_VERSION}_linux_amd64.deb"
    sudo dpkg -i "otelcol-contrib_${OTEL_VERSION}_linux_amd64.deb"

- Write the Collector config: an OTLP receiver, a memory limiter and a batch processor, 10 % probabilistic sampling of traces, and routing of traces to Tempo, metrics to Prometheus remote-write, and logs to Loki.

    # /etc/otelcol-contrib/config.yaml
    receivers:
      otlp:
        protocols:
          grpc: { endpoint: 0.0.0.0:4317 }
          http: { endpoint: 0.0.0.0:4318 }

    processors:
      batch:
        send_batch_size: 8192
        timeout: 5s
      memory_limiter:
        check_interval: 1s
        limit_mib: 1500
        spike_limit_mib: 300
      resource:
        attributes:
          - { key: deployment.environment, value: prod, action: upsert }
          - { key: cloud.region, value: ca-central-1, action: upsert }
      attributes/redact:
        actions:
          - { key: http.request.header.authorization, action: delete }
          - { key: db.statement.params, action: delete }
      probabilistic_sampler:
        sampling_percentage: 10

    exporters:
      otlp/tempo:
        endpoint: tempo.internal:4317
        tls: { insecure: true }   # plaintext gRPC; use real TLS across untrusted networks
      prometheusremotewrite:
        endpoint: http://prometheus.internal:9090/api/v1/write
      loki:
        endpoint: http://loki.internal:3100/loki/api/v1/push
      debug:
        verbosity: basic          # defined for troubleshooting, not wired into a pipeline

    extensions:
      health_check: { endpoint: 127.0.0.1:13133 }
      pprof: { endpoint: 127.0.0.1:1777 }

    service:
      extensions: [health_check, pprof]
      telemetry:
        metrics: { address: 127.0.0.1:8888 }
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, resource, attributes/redact, probabilistic_sampler, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [loki]

  memory_limiter is not optional in production: without it, a slow backend backs data up in memory until the Collector is OOM-killed. Keep it first in every pipeline so it can refuse data before the other processors do any work on it.

- Run the Collector with systemd. If the service was already running before you wrote the config, restart it so the new file is loaded.

    sudo systemctl enable --now otelcol-contrib
    sudo systemctl restart otelcol-contrib   # only needed if the service was already running
    sudo journalctl -u otelcol-contrib -f

- Add the SDK to your application. Python example with auto-instrumentation:

    pip install opentelemetry-distro opentelemetry-exporter-otlp
    opentelemetry-bootstrap -a install

    OTEL_SERVICE_NAME=demo-app \
    OTEL_RESOURCE_ATTRIBUTES=service.version=1.4.2,deployment.environment=prod \
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.internal:4317 \
    OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
    OTEL_TRACES_SAMPLER=parentbased_traceidratio \
    OTEL_TRACES_SAMPLER_ARG=1.0 \
    opentelemetry-instrument python app.py

  At the app, sample at 100 % (OTEL_TRACES_SAMPLER_ARG=1.0) and let the Collector apply the 10 % probabilistic_sampler. This lets you change the sample rate centrally without redeploying the app. The trade-off is that every span crosses the network to the Collector before 90 % of traces are dropped.

- Add the same instrumentation in Node.js apps.

    // tracing.js: load before any other module
    // Written against the 1.x SDK packages, which export the Resource class.
    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
    const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
    const { Resource } = require('@opentelemetry/resources');
    const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

    const sdk = new NodeSDK({
      resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: 'demo-app',
        [SemanticResourceAttributes.SERVICE_VERSION]: '1.4.2',
        [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'prod',
      }),
      traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector.internal:4317' }),
      instrumentations: [getNodeAutoInstrumentations()],
    });

    sdk.start();

  Start the app with node --require ./tracing.js server.js.

- Verify the trace shows up. Send a few dozen requests to the app (at 10 % sampling, any single request is likely to be dropped), then query the backend. In Tempo: tempo-cli search --service demo-app --limit 5. In Jaeger UI: select the service demo-app, click Find Traces, and confirm the spans contain service.version and deployment.environment.
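The pipeline conventions from the config step can be checked mechanically before a restart. The sketch below assumes the YAML has already been parsed into a Python dict by a parser of your choice; lint_pipelines is a hypothetical helper, and "batch last" is the convention this article's config follows, not a rule the Collector enforces. The Collector also ships its own validate subcommand for syntax and schema errors, which this does not replace.

```python
def lint_pipelines(config: dict) -> list:
    """Return a list of human-readable problems found in service.pipelines."""
    problems = []
    defined = {
        kind: set(config.get(kind) or {})
        for kind in ("receivers", "processors", "exporters")
    }
    pipelines = (config.get("service") or {}).get("pipelines") or {}
    for name, pipe in pipelines.items():
        # Every component a pipeline references must be defined at the top level.
        for kind in ("receivers", "processors", "exporters"):
            for component in pipe.get(kind) or []:
                if component not in defined[kind]:
                    problems.append(f"{name}: {kind[:-1]} '{component}' is not defined")
        processors = pipe.get("processors") or []
        if not processors or processors[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter should be the first processor")
        if "batch" in processors and processors[-1] != "batch":
            problems.append(f"{name}: batch should be the last processor")
    return problems


# A deliberately broken fragment: undefined processor, wrong ordering.
broken = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}, "memory_limiter": {}},
    "exporters": {"otlp/tempo": {}},
    "service": {"pipelines": {"traces": {
        "receivers": ["otlp"],
        "processors": ["batch", "tail_sampling"],
        "exporters": ["otlp/tempo"],
    }}},
}
for problem in lint_pipelines(broken):
    print(problem)
```

Run against the config from this article, the function returns an empty list; run against the broken fragment, it reports the undefined processor and both ordering problems.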
Operational notes
- A trace without resource attributes is anonymous. When an SRE asks “which version of demo-app produced this?”, the answer should come from resource.attributes.service.version, not a guess.
- Head-based sampling at 10 % is fine for steady traffic; for a service with rare errors, switch to tail-based sampling (the tail_sampling processor) so all errors are kept regardless of rate.
- The Collector’s batch processor delays spans by up to its timeout. Set it to 5s, not 30s, or traces lag noticeably behind the request.
- The attributes/redact processor is the only thing between PII (Authorization headers, query parameters) and the backend; an SDK that auto-instruments HTTP clients can send these without the application code ever asking it to.
- For high-volume apps, run a Collector as a DaemonSet or sidecar on every node rather than sending everything straight to one centralized Collector: the local Collector handles batching and a single central one handles routing.
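To make the redaction note concrete, here is a small sketch of what a delete action does to a span's attributes, plus one gap it leaves open. It assumes attributes are a flat dict; delete_keys and strip_query are hypothetical helpers, and the url.full attribute and its value are invented for the example. Deleting keys does not touch secrets embedded in a URL's query string, so that case is handled separately here.

```python
from urllib.parse import urlsplit

# Keys deleted by the attributes/redact processor in this article's config.
REDACTED_KEYS = frozenset({"http.request.header.authorization", "db.statement.params"})


def delete_keys(attributes: dict, keys=REDACTED_KEYS) -> dict:
    """Mimic the delete action: drop the listed attribute keys, keep the rest."""
    return {key: value for key, value in attributes.items() if key not in keys}


def strip_query(url: str) -> str:
    """Remove the query string from a URL, leaving scheme, host, and path."""
    return urlsplit(url)._replace(query="").geturl()


span_attributes = {
    "http.request.method": "GET",
    "http.request.header.authorization": "Bearer abc123",
    "url.full": "https://api.internal/orders?token=abc123",
}
cleaned = delete_keys(span_attributes)
cleaned["url.full"] = strip_query(cleaned["url.full"])
print(cleaned)
```

If query strings matter in your environment, the equivalent rule has to exist in the Collector config as well; the delete actions shown in this article only cover the two listed keys.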
For the trace storage side, see jaeger-tracing. For the existing Tempo deployment that consumes this Collector’s exports, see grafana-tempo-tracing-s3-deploy.
Stack Harbor stands up Collectors on the managed operations tier — pipeline, sampling, and redaction reviewed before any application points at them, and backends chosen to fit the retention and query patterns the team actually needs.