Prometheus Notes
PromQL
node exporter:
node_memory_MemAvailable_bytes{job=~"myjob.*"} / on ( instance ) node_memory_MemTotal_bytes{job=~"myjob.*"}
node_memory_MemFree_bytes{job=~"myjob.*"} / on ( instance ) node_memory_MemTotal_bytes{job=~"myjob.*"}
sum(kube_pod_container_resource_requests_cpu_cores) / sum(kube_node_status_capacity_cpu_cores) * 100
topk( 10, count({job="prometheus"}) by (__name__) )
Hourly average below previous daily average
Rate of successful activity in the last hour is less than 50% of the 24h average.
sum by (group) (increase(activity_metric_duration_count{status="success"}[1h]))
/
(sum by (group) (increase(activity_metric_duration_count{status="success"}[24h])) / 24) < 0.5
renaming metrics
scrape_configs:
  - job_name: sql
    static_configs:
      - targets: ['172.21.132.39:41212']
    metric_relabel_configs:
      - source_labels: ['prometheus_metric_name']
        target_label: '__name__'
        regex: '(.*[^_])_*'
        replacement: '${1}'
      - regex: prometheus_metric_name
        action: labeldrop
turns this:
query_result_dm_os_performance_counters{
  counter_instance="ex01",
  counter_name="log file(s) size (kb)",
  prometheus_metric_name="sqlserver_databases",
}
into:
sqlserver_databases{
  counter_instance="ex01",
  counter_name="log file(s) size (kb)",
}
dirty install node exporter
# amd64:
curl -L -o /tmp/node_exporter-1.0.1.linux-amd64.tar.gz https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-amd64.tar.gz
tar zxvf /tmp/node_exporter-1.0.1.linux-amd64.tar.gz -C /tmp/
cp /tmp/node_exporter-1.0.1.linux-amd64/node_exporter /usr/bin/prometheus-node-exporter

# or armv6:
curl -L -o /tmp/node_exporter-1.0.1.linux-armv6.tar.gz https://github.com/prometheus/node_exporter/releases/download/v1.0.1/node_exporter-1.0.1.linux-armv6.tar.gz
tar zxvf /tmp/node_exporter-1.0.1.linux-armv6.tar.gz
cp /tmp/node_exporter-1.0.1.linux-armv6/node_exporter /usr/bin/prometheus-node-exporter

chmod 755 /usr/bin/prometheus-node-exporter
chown root:root /usr/bin/prometheus-node-exporter

# quoted heredoc so the shell does not expand anything while writing the files
cat << 'EOF' > /etc/default/prometheus-node-exporter
ARGS="--collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$ \
  --collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/) \
  --collector.netclass.ignored-devices=^lo$ \
  --collector.systemd"
EOF
chown root:root /etc/default/prometheus-node-exporter
chmod 644 /etc/default/prometheus-node-exporter

cat << 'EOF' > /lib/systemd/system/prometheus-node-exporter.service
[Unit]
Description=Prometheus exporter for machine metrics
Documentation=https://github.com/prometheus/node_exporter

[Service]
Restart=always
User=nobody
EnvironmentFile=/etc/default/prometheus-node-exporter
ExecStart=/usr/bin/prometheus-node-exporter $ARGS
ExecReload=/bin/kill -HUP $MAINPID
TimeoutStopSec=20s
SendSIGKILL=no

[Install]
WantedBy=multi-user.target
EOF
chown root:root /lib/systemd/system/prometheus-node-exporter.service
chmod 644 /lib/systemd/system/prometheus-node-exporter.service

systemctl daemon-reload
systemctl enable prometheus-node-exporter.service
systemctl start prometheus-node-exporter.service
Exposing metrics
with python, use:
import prometheus_client
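a minimal sketch of serving custom metrics with prometheus_client (the metric names and port 8000 are illustrative, not from these notes):

import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# hypothetical example metrics
REQUESTS = Counter("myapp_requests_total", "Total requests handled")
QUEUE_DEPTH = Gauge("myapp_queue_depth", "Items currently queued")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics on :8000 in a background thread
    while True:
        REQUESTS.inc()                          # counters only go up
        QUEUE_DEPTH.set(random.randint(0, 10))  # gauges can go up and down
        time.sleep(5)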
example output:
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 21217
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 307200
cpu usage from cpu seconds
usage by job:
100 - (avg by (job) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
reference: https://www.robustperception.io/understanding-machine-cpu-usage
count of cpus on system, by job:
avg(count(node_cpu_seconds_total) without (cpu)) by (job)
when you have tagged your node pools by colour:
100 - (avg by (colour) (irate(node_cpu_seconds_total{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)
Plotting more than one metric
label_replace(max(process_open_handles{kubernetes_namespace="mynamespace"}), "aggregation", "max", "", "")
or
label_replace(quantile(0.95, process_open_handles{kubernetes_namespace="mynamespace"}), "aggregation", "p95", "", "")
or
label_replace(quantile(0.5, process_open_handles{kubernetes_namespace="mynamespace"}), "aggregation", "p50", "", "")
or
label_replace(avg(process_open_handles{kubernetes_namespace="mynamespace"}), "aggregation", "avg", "", "")
On Cardinality and Metrics
references:
- https://www.robustperception.io/which-are-my-biggest-metrics
- https://promlabs.com/blog/2020/12/17/promql-queries-for-exploring-your-metrics/
topk(10, count by (__name__)({__name__=~".+"}))
agg by job:
topk(10, count by (__name__, job)({__name__=~".+"}))
From: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/889
It is also useful to run the following queries to determine which scrape targets export the most metrics:
max_over_time(scrape_samples_scraped[1h])
topk(10,sum(max_over_time(scrape_samples_scraped[1h]))by(job))
synonym?
topk(10, count by (job)({__name__=~".+"}))
In addition, the following queries can be useful for determining which scrape targets introduce the most time series churn:
max_over_time(scrape_series_added[1h])
topk(10,sum(max_over_time(scrape_series_added[1h]))by(job))
check your config
you have to run this in the window between the container coming up and it failing after reading the config file.
kns exec "cluster-metrics-prometheus-server-0" -c prometheus-server -- /bin/promtool check config /etc/config/prometheus.yml
kns is just kubectl that uses the $KNS env var for its namespace. A little hack I created.
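the wrapper itself isn't shown in these notes; a rough, purely illustrative Python guess at what it does:

#!/usr/bin/env python3
# hypothetical "kns" wrapper: run kubectl in the namespace named by $KNS
import os
import subprocess
import sys

namespace = os.environ.get("KNS", "default")
# pass all arguments straight through to kubectl, adding --namespace
sys.exit(subprocess.call(["kubectl", "--namespace", namespace, *sys.argv[1:]]))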
help
resources
https://timber.io/blog/promql-for-humans/
https://www.weave.works/blog/promql-queries-for-the-rest-of-us/
https://promcon.io/2018-munich/slides/taking-advantage-of-relabeling.pdf
https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085
https://www.robustperception.io/extracting-full-labels-from-consul-tags
https://blog.freshtracks.io/prometheus-relabel-rules-and-the-action-parameter-39c71959354a