Every team eventually hits the gap: a metric they want is not in node_exporter, not in cAdvisor, not in any of the dozens of off-the-shelf exporters. The default move is to write a custom exporter. Often it is the right move, and often it is the wrong one — the textfile collector (a script writing a .prom file periodically) or the blackbox exporter (a probe of an external service) covers the case with less code and less to operate. This article walks through when a real exporter is warranted, the skeleton we use in Python and Go, and the metric-type rules that prevent your dashboards from lying.
How to verify
Whatever you build, four checks confirm it is well-formed and Prometheus is reading it.
```sh
# Exporter is up and responding
ss -lntp | grep <port>
curl -fsS http://127.0.0.1:<port>/metrics | head -20

# Metrics parse cleanly
curl -fsS http://127.0.0.1:<port>/metrics | promtool check metrics

# Prometheus is scraping it (URL quoted so the shell does not treat ? as a glob)
curl -fsS 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | select(.scrapePool=="my-exporter") | {url:.scrapeUrl, lastError, health, lastScrape}'

# A series shows up
curl -fsS 'http://prometheus:9090/api/v1/query?query=my_custom_metric' | jq '.data.result'
```
promtool check metrics catches most common errors: duplicate types, conflicting labels, malformed lines. Run it in the exporter’s own test suite, not just in CI for the Prometheus repo.
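One way to wire promtool into an exporter's test suite is a small helper that pipes exposition text into promtool check metrics on stdin. This is a minimal sketch, not a prescribed setup: the function name `lint_metrics` is hypothetical, it assumes promtool is on PATH, and it assumes a non-zero exit code means promtool reported a problem. It returns None when the binary is missing so a test can skip rather than fail.

```python
import shutil
import subprocess
from typing import Optional


def lint_metrics(text: str, promtool: str = "promtool") -> Optional[subprocess.CompletedProcess]:
    """Pipe exposition text through `promtool check metrics`.

    Returns None when the promtool binary is not installed, so the caller
    (for example a test) can decide to skip instead of fail.
    """
    binary = shutil.which(promtool)
    if binary is None:
        return None
    return subprocess.run(
        [binary, "check", "metrics"],
        input=text,           # promtool reads the metrics from stdin
        capture_output=True,
        text=True,
        timeout=10,
    )


if __name__ == "__main__":
    sample = (
        "# HELP my_custom_metric Example gauge.\n"
        "# TYPE my_custom_metric gauge\n"
        "my_custom_metric 1\n"
    )
    result = lint_metrics(sample)
    if result is None:
        print("promtool not found on PATH; skipping")
    else:
        # Assumption: a non-zero exit code means promtool found a problem.
        print("exit code:", result.returncode)
        print(result.stdout + result.stderr)
```

In a real test the sample string would be replaced by the body fetched from the exporter's /metrics endpoint.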
What’s happening
Prometheus scrapes endpoints that expose plain-text metrics in a specific format: a # HELP line, a # TYPE line (counter, gauge, histogram, summary), and one or more sample lines per series with optional labels. An “exporter” is just any HTTP server that serves that format. The Prometheus client libraries (official ones for Go, Java, Python, Ruby, and Rust, plus community-maintained ones for Node.js and others) handle the format, the registry, the HTTP server, and a few cross-cutting concerns (process metrics, runtime metrics).
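To make the format concrete, here is a minimal stdlib-only sketch that renders one metric family as exposition text. The helper names `render_family` and `escape_label_value` are hypothetical, and the sketch skips details a client library handles (HELP-text escaping, timestamps, histogram series), so treat it as an illustration of the line layout rather than a replacement for prometheus_client.

```python
from typing import Dict, Iterable, Tuple

# One sample: a label set and a numeric value.
Sample = Tuple[Dict[str, str], float]


def escape_label_value(value: str) -> str:
    # Label values must escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_family(name: str, help_text: str, metric_type: str,
                  samples: Iterable[Sample]) -> str:
    """Render a # HELP line, a # TYPE line, and one line per sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_family(
        "widget_count", "Number of widgets", "gauge",
        [({"region": "eu", "status": "ok"}, 12),
         ({"region": "us", "status": "ok"}, 7)],
    ), end="")
```

Each sample line is the metric name, an optional brace-enclosed label set, and the value; everything an exporter serves is a sequence of such families.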
The first decision is whether you need a long-running process at all. Two cheaper alternatives often work:
- Textfile collector (part of node_exporter): a shell or cron script writes a .prom file to /var/lib/node_exporter/textfile/, and node_exporter exposes it. Right for metrics from periodic jobs (backups, cert renewal, batch counts).
- Blackbox exporter: probes external endpoints (HTTP, TCP, ICMP, DNS, gRPC) and reports up/down and latency. Right for “is this service reachable” without instrumenting the service.
You need a real exporter when the source is an internal queryable thing (a database, an internal API, a vendor’s web admin page) and the metrics need to be fresh on each scrape.
The procedure
- Decide between textfile, blackbox, or real exporter. A simple rule: if the source data changes only when a script runs (backup completion, log roll), use the textfile collector. If you want to probe externally (is the URL up), use blackbox_exporter. Otherwise, write an exporter.
- Python exporter skeleton. Use prometheus_client and the Collector interface for “scraped on demand” metrics.

```python
# exporter.py
import time

import requests
from prometheus_client import REGISTRY, start_http_server
from prometheus_client.core import GaugeMetricFamily


class WidgetCollector:
    """Scrapes the internal widget API on each Prometheus scrape."""

    def collect(self):
        try:
            r = requests.get("http://widget-api.internal/stats", timeout=5)
            r.raise_for_status()
            data = r.json()
        except Exception:
            # Surface scrape failure as a metric instead of returning nothing
            g = GaugeMetricFamily('widget_scrape_success', 'Whether the last scrape succeeded')
            g.add_metric([], 0)
            yield g
            return

        g = GaugeMetricFamily('widget_scrape_success', 'Whether the last scrape succeeded')
        g.add_metric([], 1)
        yield g

        total = GaugeMetricFamily('widget_count', 'Number of widgets', labels=['region', 'status'])
        for row in data['rows']:
            total.add_metric([row['region'], row['status']], row['count'])
        yield total

        lat = GaugeMetricFamily('widget_latency_seconds', 'Median widget processing latency', labels=['region'])
        for r2 in data['latency_p50_by_region']:
            lat.add_metric([r2['region']], r2['p50_ms'] / 1000.0)
        yield lat


if __name__ == '__main__':
    REGISTRY.register(WidgetCollector())
    start_http_server(9333, addr='127.0.0.1')
    # The HTTP server runs in a background thread; keep the main thread alive
    while True:
        time.sleep(60)
```

Pin the request timeout. A scrape that hangs because the upstream API is slow runs into Prometheus's scrape_timeout (10 seconds by default) and the target is marked down for that scrape, so keep the upstream timeout below it.
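- Go exporter skeleton. Same idea, lighter footprint.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// upstreamStats mirrors the JSON shape the Python skeleton reads.
type upstreamStats struct {
	Rows []struct {
		Region string `json:"region"`
		Status string `json:"status"`
		Count  int64  `json:"count"`
	} `json:"rows"`
	LatencyP50ByRegion []struct {
		Region string  `json:"region"`
		P50Ms  float64 `json:"p50_ms"`
	} `json:"latency_p50_by_region"`
}

// fetchUpstream is an illustrative helper: it queries the widget API with a hard timeout.
func fetchUpstream(timeout time.Duration) (*upstreamStats, error) {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get("http://widget-api.internal/stats")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("upstream returned %s", resp.Status)
	}
	var s upstreamStats
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

type widgetCollector struct {
	up      *prometheus.Desc
	count   *prometheus.Desc
	latency *prometheus.Desc
}

func newCollector() *widgetCollector {
	return &widgetCollector{
		up:      prometheus.NewDesc("widget_scrape_success", "1 if scrape succeeded", nil, nil),
		count:   prometheus.NewDesc("widget_count", "widgets", []string{"region", "status"}, nil),
		latency: prometheus.NewDesc("widget_latency_seconds", "median latency", []string{"region"}, nil),
	}
}

func (c *widgetCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.up
	ch <- c.count
	ch <- c.latency
}

// Collect runs on every scrape and reports failure as widget_scrape_success 0.
func (c *widgetCollector) Collect(ch chan<- prometheus.Metric) {
	data, err := fetchUpstream(5 * time.Second)
	if err != nil {
		ch <- prometheus.MustNewConstMetric(c.up, prometheus.GaugeValue, 0)
		return
	}
	ch <- prometheus.MustNewConstMetric(c.up, prometheus.GaugeValue, 1)
	for _, row := range data.Rows {
		ch <- prometheus.MustNewConstMetric(c.count, prometheus.GaugeValue, float64(row.Count), row.Region, row.Status)
	}
	for _, r2 := range data.LatencyP50ByRegion {
		ch <- prometheus.MustNewConstMetric(c.latency, prometheus.GaugeValue, r2.P50Ms/1000.0, r2.Region)
	}
}

func main() {
	prometheus.MustRegister(newCollector())
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("127.0.0.1:9333", nil))
}
```

- Pick the right metric type for each number. This is the rule most exporters get wrong.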
  - Counter: monotonically increasing total (requests served, bytes transmitted, errors). Reset is allowed only on process restart.
  - Gauge: a current value that goes up or down (current connections, queue depth, free memory).
  - Histogram: distribution of observations, exposed as buckets, sum, count. Use for response times you will compute percentiles on.
  - Summary: also a distribution, with pre-computed quantiles. Avoid in custom exporters, because quantiles cannot be aggregated across instances. Prefer histograms.
- Make scrape failures observable. A widget_scrape_success gauge (0 or 1) lets you alert on the exporter being broken, as distinct from the exporter being down (which up == 0 catches). Without it, a misconfigured exporter that returns 200 OK with no data looks healthy.
- Run it under systemd, behind a localhost bind. Same template as node_exporter. Add a Prometheus scrape job, and an alert on up == 0 for the exporter itself.
- Use the textfile collector instead, when you can. For a backup script:

```sh
# At the end of /usr/local/bin/backup-postgres
TMP=/var/lib/node_exporter/textfile/backup.prom.$$
cat > "$TMP" <<EOF
# HELP backup_last_success_timestamp_seconds Unix epoch of last successful backup
# TYPE backup_last_success_timestamp_seconds gauge
backup_last_success_timestamp_seconds{job="postgres"} $(date +%s)
# HELP backup_size_bytes Size of the last backup in bytes
# TYPE backup_size_bytes gauge
backup_size_bytes{job="postgres"} ${BACKUP_BYTES}
# HELP backup_duration_seconds Duration of the last backup in seconds
# TYPE backup_duration_seconds gauge
backup_duration_seconds{job="postgres"} ${BACKUP_DURATION}
EOF
mv "$TMP" /var/lib/node_exporter/textfile/backup.prom
```

Rename atomically rather than writing in place; otherwise node_exporter can read a partially written file. No HTTP server, no port, no auth surface.
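When the periodic job is itself a Python script, the same write-then-rename pattern can be done without shelling out. The following is a minimal sketch under stated assumptions: the function name `write_prom_file` is hypothetical, and the temp file must live in the same directory as the target so that the rename stays on one filesystem.

```python
import os
import tempfile


def write_prom_file(path: str, body: str) -> None:
    """Write body to path atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp name does not end in .prom, so node_exporter ignores it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".partial")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(body)
            f.flush()
            os.fsync(f.fileno())
        # mkstemp creates the file with mode 0600; node_exporter must be able to read it.
        os.chmod(tmp_path, 0o644)
        # os.replace is atomic on POSIX when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave stray temp files behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A backup script would build the same HELP, TYPE, and sample lines as the shell version and call write_prom_file with the path /var/lib/node_exporter/textfile/backup.prom as its last step.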
Common pitfalls
- Choosing Summary over Histogram for latency; quantiles cannot be aggregated and your fleet-wide p95 calculation is wrong.
- Forgetting to expose a _scrape_success gauge; a broken exporter looks identical to a working one with no data.
- Doing expensive work in the metric handler (calling a slow API on every scrape); the scrape timeout fires and Prometheus marks the target down. Cache or use a background refresh and serve the last-known value.
- Per-scrape side effects (such as a database query) that count as load on the upstream system; scrape every 15s and you put real CPU load on the source.
- Exposing the exporter on 0.0.0.0 with no auth; /metrics often leaks internal hostnames and configuration values.
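For the expensive-work and upstream-load pitfalls, one option is to put a small time-based cache between the collector and the upstream call. The sketch below is an assumption-laden illustration, not the article's implementation: the class name `CachedFetch`, its `ttl` parameter, and the injected `clock` are all hypothetical. It refreshes at most once per ttl seconds, keeps the last-known value when a refresh fails, and reports the value's age so staleness can be exported as its own gauge.

```python
import threading
import time
from typing import Any, Callable, Optional, Tuple


class CachedFetch:
    """Serve a cached upstream result, attempting a refresh at most once per ttl seconds."""

    def __init__(self, fetch: Callable[[], Any], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()  # scrapes may arrive on several threads
        self._value: Any = None
        self._fetched_at: Optional[float] = None
        self._attempted_at: Optional[float] = None
        self.last_refresh_ok = False

    def get(self) -> Tuple[Any, Optional[float]]:
        """Return (value, age_seconds); value is None until the first success."""
        with self._lock:
            now = self._clock()
            due = self._attempted_at is None or now - self._attempted_at >= self._ttl
            if due:
                # Record the attempt even on failure so a broken upstream
                # is not hammered on every scrape.
                self._attempted_at = now
                try:
                    self._value = self._fetch()
                    self._fetched_at = now
                    self.last_refresh_ok = True
                except Exception:
                    # Keep serving the last-known value.
                    self.last_refresh_ok = False
            age = None if self._fetched_at is None else now - self._fetched_at
            return self._value, age
```

A collector would call get() inside collect(), emit the cached numbers, and expose last_refresh_ok and the age as gauges so that stale data is visible rather than silent. With a 15s scrape interval and a ttl of 60, the upstream sees one query a minute instead of four.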
For the textfile pattern in detail, see prometheus-node-exporter. For probing external endpoints, the standard blackbox_exporter covers most cases without code.
Stack Harbor reaches for a custom exporter only after eliminating the textfile and blackbox alternatives, on the managed operations tier — Python or Go skeletons in version control, scrape-failure gauges wired to the same alerting tree as the rest of the metrics, and the upstream load from scrapes audited before the exporter ships.