OpenTelemetry for APM — collector, SDKs, and the signals you actually need

A practical first OpenTelemetry deployment — Collector configured with OTLP receivers, OTLP/Jaeger/Prometheus exporters, an SDK in an application, and the sampling and resource attributes that make traces useful.

Application performance monitoring used to mean buying a vendor SDK and shipping data to a single backend. OpenTelemetry breaks that pattern — a vendor-neutral SDK in the application emits traces, metrics, and logs over OTLP to a Collector, which then exports to whichever backend(s) you want (Jaeger, Tempo, Honeycomb, Datadog, Prometheus, Loki, all at once). The catch: the Collector configuration is where everything goes right or wrong. This article walks through a working Collector deployment, an SDK in a Python or Node app, the sampling strategy that keeps trace volume sane, and the resource attributes that make a trace traceable to a real host and version.

How to verify

After the Collector is up and an app is sending, all of the following should pass.

sudo systemctl status otelcol-contrib --no-pager
ss -lntp | grep -E ':(4317|4318|13133|8888)\b'
# Collector self-health
curl -fsS http://127.0.0.1:13133
# Collector's own metrics
curl -fsS http://127.0.0.1:8888/metrics | grep -E '^otelcol_(receiver|exporter|processor)_(accepted|refused|sent|send_failed|dropped)'
# A trace landed in the backend (Jaeger as example)
curl -fsS 'http://jaeger:16686/api/traces?service=demo-app&limit=1' | jq '.data[0] | {traceID, spans: (.spans | length)}'
# Sampler setting, only if the app itself exports one as a metric (the name below is app-specific, not a standard SDK metric)
curl -fsS http://demo-app/metrics | grep 'otel_traces_sampler'

A common failure: otelcol_exporter_send_failed_spans is non-zero. The usual causes are a wrong backend hostname, misconfigured TLS, or a missing auth header. The Collector retries with backoff and buffers in its sending queue; once the queue is full or the retry window expires, it drops the data.
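
The same check can be scripted rather than eyeballed. The sketch below is a minimal parser for the Prometheus text that the Collector serves on port 8888; the function name and the list of "loss" substrings are assumptions made for illustration, and exact metric names vary between Collector versions.

```python
def failing_counters(metrics_text: str) -> dict[str, float]:
    """Return the non-zero Collector counters whose names suggest lost data."""
    # Substrings assumed to mark loss; extend to match your Collector version.
    markers = ("send_failed", "enqueue_failed", "refused", "dropped")
    found: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Text format: <name>{<labels>} <value> [<timestamp>]
        if "}" in line:
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        name = series.split("{", 1)[0]
        value = float(fields[0])
        if name.startswith("otelcol_") and value > 0 and any(m in name for m in markers):
            found[series] = value
    return found


# Hypothetical sample of what the metrics endpoint might return.
SAMPLE = """\
# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="otlp/tempo"} 1200
otelcol_exporter_send_failed_spans{exporter="otlp/tempo"} 37
otelcol_receiver_refused_spans{receiver="otlp",transport="grpc"} 0
"""

for series, value in failing_counters(SAMPLE).items():
    print(f"{series} = {value}")
```

In practice the input would be the body of http://127.0.0.1:8888/metrics; an empty result means none of the watched counters has moved.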

What’s happening

OpenTelemetry has three pieces. The SDK in the application creates spans, metrics, and logs and ships them to a configured endpoint (typically OTLP gRPC on port 4317 or OTLP HTTP on port 4318). The Collector is a separate Go process with a pipeline model: receivers accept data (OTLP, Jaeger, Zipkin, Prometheus scrape, syslog, …), processors modify the data in flight (batch, filter, attribute, sample), and exporters send it to one or more backends. Backends store and query the data (Jaeger, Tempo, Prometheus, Loki, Elasticsearch, vendor APM).
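
As a mental model only, the pipeline can be pictured as a chain of functions over a batch of spans. The sketch below is not Collector code: the span dictionaries, function names, and the two in-memory "backends" are all hypothetical stand-ins chosen to show processors running in order and exporters fanning out the same data.

```python
from typing import Callable

Span = dict[str, object]
Processor = Callable[[list[Span]], list[Span]]
Exporter = Callable[[list[Span]], None]


def add_environment(spans: list[Span]) -> list[Span]:
    """Processor: stamp every span with a deployment.environment attribute."""
    return [{**span, "deployment.environment": "prod"} for span in spans]


def drop_authorization(spans: list[Span]) -> list[Span]:
    """Processor: remove a sensitive attribute before anything is exported."""
    key = "http.request.header.authorization"
    return [{k: v for k, v in span.items() if k != key} for span in spans]


def run_pipeline(received: list[Span], processors: list[Processor],
                 exporters: list[Exporter]) -> list[Span]:
    """Apply processors in order, then hand the same batch to every exporter."""
    batch = received
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(batch)
    return batch


# Two lists stand in for two backends receiving the same processed batch.
tempo: list[Span] = []
jaeger: list[Span] = []
run_pipeline(
    [{"name": "GET /orders", "http.request.header.authorization": "Bearer abc"}],
    processors=[add_environment, drop_authorization],
    exporters=[tempo.extend, jaeger.extend],
)
print(tempo)
```

Both stand-in backends end up with the same span, environment added and Authorization header removed, without the application knowing either backend exists.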

The Collector matters because it decouples the SDK from the backend. Apps emit OTLP and never need to change when you switch a backend or add a second one. The Collector handles sampling, retry, batching, attribute redaction (PII), and routing to multiple destinations — all configuration, no SDK redeploy.

Two settings determine the cost of a tracing setup. Sampling: head-based (decide at trace start, cheap and predictable) vs. tail-based (decide after the spans arrive, more expensive but lets you keep all errors and a sample of successes). Resource attributes: service.name, service.version, host.name, deployment.environment, cloud.region — these make a trace traceable to a release. Without them, all you know is “a request was slow.”
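
The difference between the two sampling styles is easier to see in code. This is a simplified sketch, not the algorithm any particular SDK or Collector processor uses: the threshold comparison and the span dictionaries with a "status" field are assumptions made for illustration.

```python
import random
from typing import Callable

LOW_64_BITS = (1 << 64) - 1


def head_sample(trace_id: int, ratio: float) -> bool:
    """Head-based: decide from the trace ID alone, before any span is recorded.

    The same trace ID always yields the same answer, so every service that
    sees the trace agrees without coordinating.
    """
    threshold = int(ratio * (1 << 64))
    return (trace_id & LOW_64_BITS) < threshold


def tail_sample(spans: list[dict], success_ratio: float,
                rng: Callable[[], float] = random.random) -> bool:
    """Tail-based: decide after the whole trace has been buffered.

    Every trace containing an error is kept; the rest are kept at success_ratio.
    """
    if any(span.get("status") == "ERROR" for span in spans):
        return True
    return rng() < success_ratio


trace_ids = [random.getrandbits(128) for _ in range(10_000)]
kept = sum(head_sample(t, 0.10) for t in trace_ids)
print(f"head-based kept {kept} of {len(trace_ids)} traces")
print("error trace kept:", tail_sample([{"status": "OK"}, {"status": "ERROR"}], 0.10))
```

The head decision needs only the trace ID, which is why it is cheap; the tail decision needs the finished spans, which is why the component making it has to hold traces in memory until they complete.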

The procedure

  1. Install the OpenTelemetry Collector Contrib distribution. The Core distribution lacks the receivers and exporters most teams need; Contrib is what production uses.

    OTEL_VERSION="0.110.0"
    curl -fLO "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v${OTEL_VERSION}/otelcol-contrib_${OTEL_VERSION}_linux_amd64.deb"
    sudo dpkg -i "otelcol-contrib_${OTEL_VERSION}_linux_amd64.deb"

  2. Write the Collector config. It defines OTLP receivers, a memory limiter, resource and redaction processors, a 10 % probabilistic sampler, and a batch processor, then routes traces to Tempo, metrics to Prometheus remote write, and logs to Loki.

    # /etc/otelcol-contrib/config.yaml
    receivers:
      otlp:
        protocols:
          grpc: { endpoint: 0.0.0.0:4317 }
          http: { endpoint: 0.0.0.0:4318 }
    
    processors:
      batch:
        send_batch_size: 8192
        timeout: 5s
      memory_limiter:
        check_interval: 1s
        limit_mib: 1500
        spike_limit_mib: 300
      resource:
        attributes:
          - { key: deployment.environment, value: prod, action: upsert }
          - { key: cloud.region, value: ca-central-1, action: upsert }
      attributes/redact:
        actions:
          - { key: http.request.header.authorization, action: delete }
          - { key: db.statement.params, action: delete }
      probabilistic_sampler:
        sampling_percentage: 10
    
    exporters:
      otlp/tempo:
        endpoint: tempo.internal:4317
        tls: { insecure: true }
      prometheusremotewrite:
        endpoint: http://prometheus.internal:9090/api/v1/write
      loki:
        endpoint: http://loki.internal:3100/loki/api/v1/push
      debug:
        verbosity: basic
    
    extensions:
      health_check: { endpoint: 127.0.0.1:13133 }
      pprof: { endpoint: 127.0.0.1:1777 }
    
    service:
      extensions: [health_check, pprof]
      telemetry:
        metrics: { address: 127.0.0.1:8888 }
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, resource, attributes/redact, probabilistic_sampler, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [prometheusremotewrite]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [loki]

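    memory_limiter is non-optional in production: without it, a slow backend backs telemetry up in memory until the Collector is OOM-killed. Keep it first in every pipeline so it can refuse data before later processors allocate more. Prometheus accepts the remote-write traffic only when it is started with --web.enable-remote-write-receiver.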

  3. Run the Collector with systemd.

    sudo systemctl enable --now otelcol-contrib
    sudo journalctl -u otelcol-contrib -f

  4. Add the SDK to your application. Python example with auto-instrumentation:

    pip install opentelemetry-distro opentelemetry-exporter-otlp
    opentelemetry-bootstrap -a install
    OTEL_SERVICE_NAME=demo-app \
    OTEL_RESOURCE_ATTRIBUTES=service.version=1.4.2,deployment.environment=prod \
    OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.internal:4317 \
    OTEL_EXPORTER_OTLP_PROTOCOL=grpc \
    OTEL_TRACES_SAMPLER=parentbased_traceidratio \
    OTEL_TRACES_SAMPLER_ARG=1.0 \
      opentelemetry-instrument python app.py

    At the app, sample at 100 % (OTEL_TRACES_SAMPLER_ARG=1.0) and let the Collector's probabilistic_sampler keep 10 %. You can then change the sample rate centrally without redeploying the app, at the cost of sending every span from the app to the Collector.

  5. Add the same instrumentation in Node.js apps.

    // tracing.js: load before any other module.
    // Written against the 1.x JS SDK packages; 2.x replaces `new Resource(...)`
    // with `resourceFromAttributes(...)`.
    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
    const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
    const { Resource } = require('@opentelemetry/resources');
    const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
    
    const sdk = new NodeSDK({
      // Resource attributes identify the service, release, and environment on every span.
      resource: new Resource({
        [SemanticResourceAttributes.SERVICE_NAME]: 'demo-app',
        [SemanticResourceAttributes.SERVICE_VERSION]: '1.4.2',
        [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'prod',
      }),
      // OTLP over gRPC to the Collector's 4317 receiver.
      traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector.internal:4317' }),
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();
    
    // Flush buffered spans before the process exits.
    process.on('SIGTERM', () => {
      sdk.shutdown().finally(() => process.exit(0));
    });

    Start the app with node --require ./tracing.js server.js.

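  6. Verify the trace shows up. Because the Collector keeps only about 10 % of traces, send a few dozen requests rather than one, then query the backend. In Tempo: search with TraceQL, for example { resource.service.name = "demo-app" }, in Grafana Explore or through Tempo's /api/search HTTP endpoint. In Jaeger UI: select the service demo-app, click Find Traces, and confirm the spans carry service.version and deployment.environment.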
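
The attribute check in step 6 can also be automated against the JSON that Jaeger's /api/traces endpoint returns. This is a sketch under assumptions: it expects resource attributes to appear as tags under each trace's "processes" map, and the function name, the required-attribute list, and the sample response are hypothetical.

```python
REQUIRED = ("service.version", "deployment.environment")


def missing_resource_attributes(jaeger_response: dict) -> dict[str, list[str]]:
    """Map each service in the response to the required attributes it lacks."""
    missing: dict[str, list[str]] = {}
    for trace in jaeger_response.get("data", []):
        for process in trace.get("processes", {}).values():
            present = {tag["key"] for tag in process.get("tags", [])}
            absent = [key for key in REQUIRED if key not in present]
            if absent:
                missing[process["serviceName"]] = absent
    return missing


# Hypothetical response: the service reports its version but not its environment.
SAMPLE_RESPONSE = {
    "data": [
        {
            "traceID": "abc123",
            "spans": [{"operationName": "GET /orders", "processID": "p1"}],
            "processes": {
                "p1": {
                    "serviceName": "demo-app",
                    "tags": [{"key": "service.version", "type": "string", "value": "1.4.2"}],
                }
            },
        }
    ]
}

print(missing_resource_attributes(SAMPLE_RESPONSE))
```

An empty dictionary means every service in the sampled traces carries both attributes; anything else names the service that still needs them set in the SDK or in the Collector's resource processor.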

Operational notes

  • A trace without resource attributes is anonymous — when an SRE asks “which version of demo-app produced this?”, the answer should come from resource.attributes.service.version, not a guess.
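  • Head-based sampling at 10 % is fine for steady traffic; for a service with rare errors, switch to tail-based sampling (the tail_sampling processor) so all errors are kept regardless of rate. Tail sampling buffers whole traces in memory, and every span of a trace must reach the same Collector instance for the decision to be correct.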
  • The Collector’s batch processor can hold spans for up to its timeout before sending them. Keep the timeout at 5s rather than 30s, or traces lag noticeably behind the request.
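  • The attributes/redact processor is the only thing between sensitive values (Authorization headers, SQL parameters, query strings) and the backend. HTTP auto-instrumentation records full request URLs by default and records headers once header capture is turned on, so extend the action list to cover whatever your SDKs actually emit.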
  • For high-volume apps, run a Collector as a DaemonSet or sidecar next to the workloads, in front of the central Collector rather than sending everything straight to it: the local Collector handles batching and the central one handles routing.
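
To size the Collector tier, it helps to put numbers on what sampling at 100 % in the SDK and 10 % in the Collector means. The figures below (500 requests per second, 12 spans per trace) are hypothetical, and the function is a back-of-the-envelope average, not a measurement.

```python
def span_rates(requests_per_second: float, spans_per_trace: float,
               sdk_ratio: float, collector_percentage: float) -> tuple[float, float]:
    """Return (spans/s arriving at the Collector, spans/s exported to the backend)."""
    into_collector = requests_per_second * spans_per_trace * sdk_ratio
    exported = into_collector * collector_percentage / 100
    return into_collector, exported


# Hypothetical service: 500 req/s, 12 spans per trace, SDK at 1.0, Collector at 10 %.
into_collector, exported = span_rates(500, 12, sdk_ratio=1.0, collector_percentage=10)
print(f"Collector ingests {into_collector:.0f} spans/s, backend receives {exported:.0f} spans/s")
# prints "Collector ingests 6000 spans/s, backend receives 600 spans/s"
```

The backend sees a tenth of the traffic, but the Collector still receives all of it, which is the load the memory_limiter settings and any node-local Collectors have to absorb.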

For the trace storage side, see jaeger-tracing. For the existing Tempo deployment that consumes this Collector’s exports, see grafana-tempo-tracing-s3-deploy.


Stack Harbor stands up Collectors on the managed operations tier — pipeline, sampling, and redaction reviewed before any application points at them, and backends chosen to fit the retention and query patterns the team actually needs.