Prometheus Notes

From Federal Burro of Information

PromQL

node exporter:

# fraction of memory available / free per instance:
node_memory_MemAvailable_bytes{job=~"myjob.*"} / on ( instance ) node_memory_MemTotal_bytes{job=~"myjob.*"}
node_memory_MemFree_bytes{job=~"myjob.*"} / on ( instance ) node_memory_MemTotal_bytes{job=~"myjob.*"}

kube-state-metrics, percent of cluster CPU capacity requested:

sum(kube_pod_container_resource_requests_cpu_cores) / sum(kube_node_status_capacity_cpu_cores) * 100


topk(
10,
count({job="prometheus"}) by (__name__)
)

Hourly average below previous daily average

Rate of successful activity in the last hour is less than 50% of the 24h average.

sum by (group) (increase(activity_metric_duration_count{status="success"}[1h]))
/
(sum by (group) (increase(activity_metric_duration_count{status="success"}[24h]))/24) < 0.5
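The same ratio can be dropped into an alerting rule. A sketch, with placeholder group and alert names:

```yaml
groups:
- name: activity
  rules:
  - alert: ActivityRateDropped   # placeholder name
    expr: |
      sum by (group) (increase(activity_metric_duration_count{status="success"}[1h]))
        /
      (sum by (group) (increase(activity_metric_duration_count{status="success"}[24h])) / 24) < 0.5
    for: 15m
    annotations:
      summary: 'hourly success rate for {{ $labels.group }} is below 50% of its 24h average'
```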

renaming metrics

scrape_configs:
- job_name: sql
  static_configs:
  - targets: ['172.21.132.39:41212']
  metric_relabel_configs:
  - source_labels: ['prometheus_metric_name']
    target_label: '__name__'
    regex: '(.*[^_])_*'
    replacement: '${1}'
  - regex: prometheus_metric_name
    action: labeldrop

turns this:

query_result_dm_os_performance_counters{
  counter_instance="ex01",
  counter_name="log file(s) size (kb)",
  prometheus_metric_name="sqlserver_databases",
}

into:

sqlserver_databases{
  counter_instance="ex01",
  counter_name="log file(s) size (kb)",
}
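What the relabel regex does to the source label value can be checked in Python. Prometheus fully anchors relabel regexes, so '(.*[^_])_*' behaves like ^(.*[^_])_*$, i.e. it strips any trailing underscores:

```python
import re

# Prometheus relabel regexes are fully anchored, so '(.*[^_])_*'
# behaves like ^(.*[^_])_*$: capture everything up to the last
# non-underscore character, discarding trailing underscores.
pattern = re.compile(r'^(.*[^_])_*$')

def new_metric_name(value: str) -> str:
    m = pattern.match(value)
    return m.group(1) if m else value

print(new_metric_name('sqlserver_databases'))    # unchanged
print(new_metric_name('sqlserver_databases__'))  # trailing underscores stripped
```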

dirty install node exporter

pick the tarball for your architecture:

amd64:

curl -L -o /tmp/node_exporter-1.0.1.linux-amd64.tar.gz https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar zxvf /tmp/node_exporter-1.0.1.linux-amd64.tar.gz -C /tmp/
cp /tmp/node_exporter-1.0.1.linux-amd64/node_exporter /usr/bin/prometheus-node-exporter

armv6:

curl -L -o /tmp/node_exporter-1.0.1.linux-armv6.tar.gz https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-armv6.tar.gz
tar zxvf /tmp/node_exporter-1.0.1.linux-armv6.tar.gz -C /tmp/
cp /tmp/node_exporter-1.0.1.linux-armv6/node_exporter /usr/bin/prometheus-node-exporter

chmod 755 /usr/bin/prometheus-node-exporter
chown root:root /usr/bin/prometheus-node-exporter

cat << 'EOF' > /etc/default/prometheus-node-exporter
ARGS="--collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$ \
      --collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/) \
      --collector.netclass.ignored-devices=^lo$ \
      --collector.systemd"
EOF
  
chown root:root /etc/default/prometheus-node-exporter
chmod 644 /etc/default/prometheus-node-exporter

cat << 'EOF' > /lib/systemd/system/prometheus-node-exporter.service
[Unit]
Description=Prometheus exporter for machine metrics
Documentation=https://github.com/prometheus/node_exporter
[Service]
Restart=always
User=nobody
EnvironmentFile=/etc/default/prometheus-node-exporter
ExecStart=/usr/bin/prometheus-node-exporter $ARGS
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no
[Install]
WantedBy=multi-user.target
EOF

chown root:root /lib/systemd/system/prometheus-node-exporter.service
chmod 644 /lib/systemd/system/prometheus-node-exporter.service

systemctl daemon-reload
systemctl enable prometheus-node-exporter.service
systemctl start prometheus-node-exporter.service

Exposing metrics

with Python, use the prometheus_client library:

import prometheus_client

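A minimal sketch (the metric names here are made up): register a counter and a gauge, then either serve them over HTTP or render the exposition text directly:

```python
from prometheus_client import Counter, Gauge, generate_latest, start_http_server

# made-up example metrics
REQUESTS = Counter('myapp_requests', 'Total requests handled')
IN_FLIGHT = Gauge('myapp_in_flight_requests', 'Requests currently in flight')

REQUESTS.inc()
IN_FLIGHT.set(3)

# start_http_server(8000)  # would serve /metrics on :8000 in a background thread

# render the same text format Prometheus scrapes:
print(generate_latest().decode())
```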

example output:

# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 21217
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 307200

cpu usage from cpu seconds

usage by job:

100 - (avg by (job) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

reference: https://www.robustperception.io/understanding-machine-cpu-usage
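The arithmetic behind that query, sketched in Python: node_cpu_seconds_total{mode="idle"} counts cumulative idle seconds, so its per-second rate is the fraction of time spent idle, and 100 minus that percentage is the busy share:

```python
# Sketch of 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100.
# irate takes the per-second increase between the last two samples; for an
# idle-seconds counter that increase is the fraction of time spent idle.
def cpu_busy_percent(idle_prev, idle_curr, t_prev, t_curr):
    idle_fraction = (idle_curr - idle_prev) / (t_curr - t_prev)
    return 100.0 - idle_fraction * 100.0

# a CPU that accumulated 45 idle seconds over a 60 second window was 25% busy:
print(cpu_busy_percent(1000.0, 1045.0, 0.0, 60.0))  # 25.0
```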

count of CPUs per system, by job:

avg(count(node_cpu_seconds_total) without (cpu)) by (job)

when you have tagged your node pools by colour:

100 - (avg by (colour) (irate(node_cpu_seconds_total{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)

Plotting more than one metric

Union several aggregations with "or", using label_replace to attach a synthetic "aggregation" label so each series can be told apart:

	label_replace(
		max(process_open_handles{kubernetes_namespace="mynamespace"}), 
		"aggregation", "max", "", ""
	)
	or
	label_replace(
		quantile(0.95, process_open_handles{kubernetes_namespace="mynamespace"}), 
		"aggregation", "p95", "", ""
	)
	or
	label_replace(
		quantile(0.5, process_open_handles{kubernetes_namespace="mynamespace"}),  
		"aggregation", "p50", "", ""
	)
	or
	label_replace(
		avg(process_open_handles{kubernetes_namespace="mynamespace"}),  
		"aggregation", "avg", "", ""
	)

On Cardinality and Metrics

top 10 metric names by series count:

topk(10, count by (__name__)({__name__=~".+"}))

agg by job:

topk(10, count by (__name__, job)({__name__=~".+"}))


From: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/889

It is also useful to perform the following query in order to determine scrape targets exporting the maximum number of metrics:

max_over_time(scrape_samples_scraped[1h])
topk(10, sum(max_over_time(scrape_samples_scraped[1h])) by (job))

synonym?

topk(10, count by (job)({__name__=~".+"}))

Additionally to that the following query could be useful for determining scrape targets that introduce the most of time series churn rate:

max_over_time(scrape_series_added[1h])
topk(10, sum(max_over_time(scrape_series_added[1h])) by (job))

check your config

If the config is bad, you have to run this in the window between the container coming up and it crashing after reading the config file.

kns exec "cluster-metrics-prometheus-server-0" -c prometheus-server -- /bin/promtool check config /etc/config/prometheus.yml

kns is just kubectl that uses the $KNS env var for its namespace. A little hack I created.

help

/prometheus help

resources

https://timber.io/blog/promql-for-humans/

https://www.weave.works/blog/promql-queries-for-the-rest-of-us/

https://promcon.io/2018-munich/slides/taking-advantage-of-relabeling.pdf

https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085

https://www.robustperception.io/extracting-full-labels-from-consul-tags

https://blog.freshtracks.io/prometheus-relabel-rules-and-the-action-parameter-39c71959354a

/Prometheus Internal Metrics

On joining