
The TIG stack — Telegraf, InfluxDB, Grafana for time-series without Prometheus

A working TIG deployment for time-series data — Telegraf agents, InfluxDB 2 bucket and retention, Grafana datasource with Flux, and the retention policy that keeps the disk bounded.

The TIG stack (Telegraf + InfluxDB + Grafana) is the push-based answer to the Prometheus question. Telegraf agents collect metrics and push to InfluxDB; InfluxDB stores them in buckets with a per-bucket retention; Grafana queries them with Flux or InfluxQL. It is the right shape when you have devices that cannot host a /metrics endpoint (network gear, MQTT sensors, app frameworks with native StatsD), or when you want a push model with strict per-tenant retention. This article walks through a working TIG install with the operational settings that prevent the disk and the cardinality from running away.

How to verify

After install, the three components should be talking and a Telegraf agent’s metrics should be queryable in Grafana.

sudo systemctl status influxdb telegraf grafana-server --no-pager
ss -lntp | grep -E ':(8086|3000)\b'
# InfluxDB health
curl -fsS http://127.0.0.1:8086/health | jq
influx ping
# A bucket exists and holds recent points
influx bucket list
influx query 'from(bucket:"telegraf") |> range(start: -5m) |> count()' --token "$INFLUX_TOKEN"

# Telegraf is sending
sudo journalctl -u telegraf -n 50 --no-pager | grep -E 'Wrote|error'

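A line such as Wrote batch of 12 metrics in 23.4ms in Telegraf’s log means the agent is pushing successfully. Telegraf logs these at debug level, so set debug = true under [agent] if the grep shows nothing. An error about failing to write metrics to the bucket means the token, bucket name, or org is wrong; the message says which.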
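For a cron job or a monitoring probe, the health check can be reduced to an exit code. The sketch below assumes the /health response is JSON with a "status" field that reads "pass" when the instance is ready; influx_health_ok is a hypothetical helper name, and grep is used instead of jq only so the check has no extra dependency.

```shell
# Hypothetical helper: read a /health JSON body on stdin and
# succeed only when it reports "status": "pass".
influx_health_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"pass"'
}

# Usage against a live instance (same URL as the curl check above):
#   curl -fsS http://127.0.0.1:8086/health | influx_health_ok && echo healthy
```

The function returns non-zero for any other status, and curl -f already fails on HTTP errors, so the pipeline can gate an alert directly.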

What’s happening

Telegraf is a Go agent with a plug-in architecture: dozens of inputs (system metrics, MySQL, Nginx, SNMP, Kafka, MQTT, StatsD, the list keeps growing) feed a buffer; that buffer is flushed to one or more outputs (InfluxDB, Prometheus remote-write, Kafka, file, anything). InfluxDB 2 is a time-series database with buckets (containers with a retention policy), measurements (analogous to tables), tags (indexed labels, low-cardinality), and fields (the actual values, not indexed).

The cardinality trap is the same as Loki and Prometheus. A tag with high cardinality (request ID, user ID) creates a new series per value and the index grows quickly; the right place for those is in a field. The retention story is the strength of InfluxDB compared to Prometheus — set a bucket to 30 days and the database evicts older points without intervention; multiple buckets with different retentions let you keep one-day raw and one-year downsampled in the same store.
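To make the tag-versus-field split concrete, here is a minimal sketch of a point in InfluxDB line protocol. The measurement http_request, the route tag, and the lp_point helper are hypothetical names chosen for illustration; the point is that the bounded value (route) goes in the tag set and the per-request value (request ID) goes in the field set.

```shell
# Hypothetical helper: print one line-protocol point.
#   $1 route        -> tag   (indexed; few distinct values)
#   $2 request id   -> field (not indexed; unbounded values are fine here)
#   $3 duration ms  -> field (float)
#   $4 timestamp in nanoseconds
# No escaping is done, so inputs must not contain spaces, commas, or quotes.
lp_point() {
  printf 'http_request,route=%s request_id="%s",duration_ms=%s %s\n' "$1" "$2" "$3" "$4"
}

lp_point /checkout 7f3a9c 41.7 1700000000000000000
```

Written this way, the series count grows with the number of routes, not with the number of requests. Moving request_id before the space (into the tag set) would create one series per request.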

The procedure

  1. Install InfluxDB 2 from the official APT repo.

    curl -fsSL https://repos.influxdata.com/influxdata-archive_compat.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/influxdata.gpg >/dev/null
    echo 'deb https://repos.influxdata.com/debian stable main' | sudo tee /etc/apt/sources.list.d/influxdata.list
    sudo apt update
    sudo apt install -y influxdb2 influxdb2-cli
    sudo systemctl enable --now influxdb
  2. Bootstrap a primary org, user, and bucket.

    ADMIN_PW="$(openssl rand -base64 32)"   # record this; it is not shown again
    influx setup \
      --username sh-admin \
      --password "$ADMIN_PW" \
      --org stackharbor \
      --bucket telegraf \
      --retention 720h \
      --force
    # setup saves the admin token in the CLI config (~/.influxdbv2/configs); influx auth list shows it.
    influx config ls

    --retention 720h is 30 days. Adjust to your real retention requirement.

  3. Create a dedicated write token for Telegraf (do not use the admin token).

    BUCKET_ID=$(influx bucket list --org stackharbor --name telegraf --json | jq -r '.[0].id')
    influx auth create --org stackharbor \
      --write-bucket "$BUCKET_ID" \
      --description 'telegraf write'
    # Capture the printed token; this goes into /etc/default/telegraf as INFLUX_TOKEN
  4. Install and configure Telegraf.

    sudo apt install -y telegraf
    # /etc/telegraf/telegraf.conf
    [agent]
      interval = "10s"
      round_interval = true
      metric_batch_size = 1000
      metric_buffer_limit = 10000
      flush_interval = "10s"
      flush_jitter = "2s"
      omit_hostname = false
    
    [global_tags]
      env = "prod"
      cluster = "ca-central-1"
    
    [[outputs.influxdb_v2]]
      urls = ["http://127.0.0.1:8086"]
      token = "$INFLUX_TOKEN"
      organization = "stackharbor"
      bucket = "telegraf"
    
    [[inputs.cpu]]
      percpu = true
      totalcpu = true
    [[inputs.mem]]
    [[inputs.disk]]
      ignore_fs = ["tmpfs", "devtmpfs", "overlay"]
    [[inputs.diskio]]
    [[inputs.net]]
    [[inputs.system]]
    [[inputs.systemd_units]]
      # space-separated globs, not a regex; on Ubuntu the SSH unit is ssh.service
      pattern = "nginx* postgresql* telegraf* ssh*"

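    $INFLUX_TOKEN is read from /etc/default/telegraf, which the packaged systemd unit loads as an environment file:

    # /etc/default/telegraf
    INFLUX_TOKEN=<write-token-from-step-3>

    Restart the agent so it picks up the new config and token:

    sudo systemctl enable telegraf
    sudo systemctl restart telegraf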
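  5. Wire Grafana as a datasource against InfluxDB 2. Use the Flux query language for new dashboards; InfluxQL still works against 2.x, but only through the 1.x compatibility API, which relies on a DBRP mapping from database and retention-policy names to the bucket.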

    # /etc/grafana/provisioning/datasources/influxdb.yaml
    apiVersion: 1
    datasources:
      - name: InfluxDB
        type: influxdb
        url: http://127.0.0.1:8086
        access: proxy
        jsonData:
          version: Flux
          organization: stackharbor
          defaultBucket: telegraf
          tlsSkipVerify: false
        secureJsonData:
          token: <a-read-token-created-the-same-way>
  6. Set per-bucket retention and downsampling. A common pattern: keep telegraf (10 s resolution) for 30 days, downsample to a telegraf_1m bucket for one year.

    influx bucket create --org stackharbor --name telegraf_1m --retention 8760h
    # A task that aggregates every 1m and writes to the long-retention bucket:
    influx task create -f /etc/influxdb/tasks/downsample-1m.flux --org stackharbor

    // /etc/influxdb/tasks/downsample-1m.flux
    option task = { name: "downsample-telegraf-1m", every: 1m }
    from(bucket: "telegraf")
      |> range(start: -2m, stop: -1m)
      // mean() rejects string values; inputs.system emits uptime_format as a string
      |> filter(fn: (r) => r._field != "uptime_format")
      |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
      |> to(bucket: "telegraf_1m", org: "stackharbor")
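The retention values in the procedure are day counts expressed in hours, and the two-bucket pattern is easier to judge once the points kept per series are worked out. The arithmetic can be sketched as below; days_to_hours and points_per_series are hypothetical helper names, and the calculation assumes every series reports at exactly the stated resolution.

```shell
# Convert a retention in days to the hour form that --retention takes.
days_to_hours() { echo $(( $1 * 24 )); }

# Points kept per series: $1 = retention in hours, $2 = resolution in seconds.
points_per_series() { echo $(( $1 * 3600 / $2 )); }

echo "30 d  = $(days_to_hours 30)h"
echo "365 d = $(days_to_hours 365)h"
echo "telegraf    (10 s, 720h):  $(points_per_series 720 10) points/series"
echo "telegraf_1m (60 s, 8760h): $(points_per_series 8760 60) points/series"
```

Under those assumptions the raw bucket holds 259200 points per series and the one-year bucket 525600, so a full year at one-minute resolution costs roughly twice the 30-day raw window.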

Operational notes

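  • A high-cardinality tag (anything per-request) makes InfluxDB’s TSI index grow without bound. InfluxDB 2 has no 1.x-style _internal database; watch storage_shard_disk_size and the series counts on the /metrics endpoint, or measure a bucket directly with Flux’s influxdb.cardinality(), to catch this early.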
  • Telegraf’s metric_buffer_limit is per-output — exceed it and the agent starts dropping metrics with a log line; size it for your worst plausible InfluxDB outage.
  • InfluxDB 2 uses bcrypt for passwords and the setup --password flag accepts plaintext — rotate the admin password through influx user password after setup so the bash history does not hold it.
  • The retention enforcement is async — a bucket with --retention 1h does not delete points exactly at 1h; expect a few minutes lag.
  • Grafana panels written in InfluxQL keep working but cannot use the Flux-only features (joins, pivots); document which dashboards are stuck on the old query language.
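Sizing metric_buffer_limit for an outage is simple arithmetic once the agent's metric rate is known. The sketch below uses an assumed rate of 120 metrics per 10 s interval, which is a placeholder and not a measurement from this setup; both function names are hypothetical.

```shell
# Smallest metric_buffer_limit that rides out an outage without drops.
#   $1 = metrics gathered per collection interval (assumed; measure your own)
#   $2 = collection interval in seconds
#   $3 = longest outage to survive, in seconds
buffer_limit_for_outage() {
  intervals=$(( ($3 + $2 - 1) / $2 ))   # round up to whole intervals
  echo $(( intervals * $1 ))
}

# Seconds of outage a given buffer covers at the same rate.
#   $1 = metric_buffer_limit, $2 = metrics per interval, $3 = interval seconds
outage_covered() {
  echo $(( $1 / $2 * $3 ))
}

buffer_limit_for_outage 120 10 1800   # 30-minute outage
outage_covered 10000 120 10           # the limit used in step 4
```

At the assumed rate, the limit of 10000 from step 4 covers about 830 seconds, under 14 minutes, while a 30-minute outage needs 21600. Substitute the per-interval metric count from your own agent before relying on either figure.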

For the pull-based alternative — Prometheus on the same host — see prometheus-install-ubuntu. For a Netdata-based local agent that can push to InfluxDB, see netdata-install.


Stack Harbor runs TIG for clients whose source devices cannot host a /metrics endpoint, as part of the managed operations tier — retention buckets sized to the audit window, write tokens scoped to a bucket, and cardinality on every series watched alongside the standard health metrics.