What are /r/PrometheusMonitoring's favorite Products & Services?

I probably wouldn't do this with a webhook, rather poll Prometheus with a query that tells you how many X you need. That will allow you to auto scale up and down as needed.

This is how things like keda works to auto-scale on Kubernetes.

Yes, Prometheus 2.25.0 introduced remote write receive. You need to enable it with a feature flag.

--enable-feature=remote-write-receiver

See the docs.

As far as I know, you don't. That's just the way Grafana works. Since most of the logic for that kind of thing is done in the javascript UI code, the server just acts as a dumb proxy.

What you're asking about is an Enterprise Feature. You need to pay a license.

You can use inhibition rules to suppress alerts when a condition is met:

https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule

E.g. drop all "warning" alerts when critical alert (e g. "site down") is firing.

When using Prometheus' file-based service discovery mechanism, the Prometheus instance will listen for changes to the file and automatically update the scrape target list, without requiring an instance restart.

https://prometheus.io/docs/guides/file-sd/

You can automate adding a new target with something like Ansible and lineinfile

Prometheus is not involved, its just mentioned in the documentation as the service discovery mechanism is the same, ie prometheus has screap configs as does Promtail.
Fyi grafana is running a webinars that go through all these questions.
https://grafana.com/docs/loki/latest/clients/promtail/
https://grafana.com/go/webinar/opinionated-observability-stack-prometheus-loki-tempo/?pg=videos&plcmt=featured-1

you probably need to use an arduino http client library to query the prometheus api https://prometheus.io/docs/prometheus/latest/querying/api/

I don't know any lib which helps you do this. A short Google query found mostly only links to ingesting data from arduinos into prometheus.

Hi hi! Cortex author here :-)

I recommend these two talks for comparing the two projects; and old one with Bartek + I and a more recent one with Bartek + Marco:

https://grafana.com/blog/2019/11/21/promcon-recap-two-households-both-alike-in-dignity-cortex-and-thanos/

https://grafana.com/blog/2020/07/16/how-the-cortex-and-thanos-projects-collaborate-to-make-scaling-prometheus-better-for-all/

(links to write up on our blog, but feel free to just watch the youtube)

In your particular case you call out multiple tenants, and I'd argue (with all my biases) that this is something Cortex might do slightly better at than Thanos - its baked in from the start and some of the isolation primitives (QoS on query path, per-tenant limits, shuffle sharding) are super cool. Thanos docs are better and it has a bigger community of end users though - so you'll probably find that easier to get started with.

Its marginal though - and the two systems are way more similar than you might think: both use Prometheus TSDB, both use the PromQL engine, both even use the same code for query optimisation! And with the Thanos receiver both do remote write now.

Let me know if you have any questions.

Rather than have a different dashboard for each environment, you can use Grafana variables to select which environment you want to see on a single dashboard.

Then in your Prometheus query, you use the Grafana variable in the query string like foo_metric{env="$env"}.

Here's an example, but it selects different nodes: https://grafana.demo.do.prometheus.io/d/DP0Yo9PWk/use-method-node

Hi, I read through the article and it was good with the 'how', but I'm missing the 'why'. We have the prometheus-adapter & HPA custom metrics so I am unsure as to what keda adds to the mix. Otherwise really clear & straight-forward post!

It's not wrong, just that if you want to bind a service account other than default you need to configure your pods with that service account.

https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/

You should be able to read at least: https://www.robustperception.io/accessing-data-from-prometheus-1-x-in-prometheus-2-0. I haven’t tried it, but writing might also work.

Here’s a list of integrations that support remote read / write: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage

The Prometheus documentation gives some suggestions about what might be best practices for writing alert rules.

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

This suggests having two standardized annotations.

summary -- Brief summary of the problem with limited to no templating. Example: "HAProxy backend failed health check". I used the "summary" to display as the subject of the email blasts, or subject of the iOS notification push.
description -- Detailed, human readable message describing the problem with templating.

The only label I added to alerts was the "severity" label. This label made rules possible in Alertmanager to create priority based alerts. High priority alerts can wake you, while low priority alerts should be seen and acted upon the next day. Otherwise, I let the labels on the alert metric flow through as they might be useful to the person on the receiving end of the page.

With the clients I work with I usually suggest they standardize on one or more annotations as well:

runbook -- Link into the wiki with notes or instructions for the alert. I made this a required annotation in git hooks to help folks document their alerting and common issues.
grafana -- Optional. Link to the service's Grafana dashboard. This can be useful in debugging the service.
elk -- Link to relevant Kibana search. Use this to pivot from this alert directly to relevant logs in your ELK stack.

Alertmanager's templating features only extend to the text/api calls that Alertmanager sends to PagerDuty, Slack, Email, or similar. It doesn't do any fancy templating to make alerts look better in the Alertmanager UI. So I would suggest using an annotation approach so that information is easily visible in Alertmanager's UI and can also be templated into the PagerDuty alert incidents.

The regexp engine in golang doesn't support lookarounds: https://golang.org/pkg/regexp/syntax/

Also not_match_re is not a valid field for Alertmanager config: https://prometheus.io/docs/alerting/configuration/#route

So I guess you are out of luck here. 🤷‍♂️

The alertmanager supports inhibit rules that can be used suppress alerts if certain conditions are met.

Keep in mind that Prometheus and Nagios implement different styles of monitoring. Porting a Nagios configuration directly to Prometheus may not give the best results. For example, it may be better to install the node_exporter on "host 1" and alert on the up metric rather than using ICMP.

You may also wish to investigate the SNMP exporter if you are monitoring networking equipment.

You don’t really say why AGPL is a problem for you, and I’d encourage you to reconsider - IMO all unmodified usage of Grafana should be fine.

But if you work for an organisation where they’ve implemented a blanket ban on AGPL (due to all the fud about the license), we also make a free-as-in-beer version of our Grafana enterprise product available:

> for those that don’t intend to modify the code, simply use our Enterprise download. This is a free-to-use, proprietary-licensed, compiled binary that matches the features of the AGPL version

https://grafana.com/blog/2021/04/20/qa-with-our-ceo-on-relicensing/

Maybe that’s what you need?

If I'm understanding correctly, it sounds like you are looking for event monitoring for that case. Maybe something like Loki would work better for that:
https://grafana.com/oss/loki/

It sounds like the issue isn't PromQL as much as knowing which specific metrics to evaluate and how. That's the job.

One thing that might help is load up some dashboards from Grafana prometheus/node-exporter and then edit the graphs to see the PromQL & metric names behind the graphs you find interesting or want to create alerts around.

Ideally, a monitored system should retrieve the current value for a metric at the moment it is scraped by Prometheus, rather than getting the value on some set schedule and returning the most recent value to Prometheus.

See the official docs on writing exporters.

1 don’t pushgateway metrics have a time stamp with each push? (It’s been a while since I used it)? If so you can query on that.

https://prometheus.io/docs/prometheus/latest/querying/functions/%23absent_over_time

What version did you upgrade from? The startup hasn't changed much. The only big change that I remember recently was memory-snapshot-on-shutdown. But this is disabled by default.

I think pprof is now available during startup, so you can take CPU and memory profiles.

The main question is, how many WAL files is it having to replay at startup? I've seen some servers get into bad crash / recovery loops where they generate many hundreds of files in the wal dir. After a clean startup, Prometheus should cleanup this situation.

You need to have some sort of aggregation function ^[1] in order to use by() or without()

For example: sum without(dest) (rate(node_nat_traffic[5m])) or sum by(src) (rate(node_nat_traffic[5m]))

its awesome that you want to learn this and jump right in and get things working but...

might i suggest that you take a step back, read and learn how prometheus and exporters work, then move forward with setting up a small environment and build from there.

start here - prometheus-docs

Prometheus model is scraping / pulling data from targets such as the exporters you have running in the pcs.

But if for some reasons ( security / network / other ) you have the need to push data to Prometheus just install Prometheus on each pc running it with specific agent options ( check https://prometheus.io/blog/2021/11/16/agent/ ) , so the scraping will be local, then configure them to push data to a dedicated Prometheus that will collect everything.

It should be noted that scraping targets is a best practice. However…Prometheus does support pushing. It does have some compromises which you can read about here: https://prometheus.io/docs/practices/pushing/

From what I've gathered(by no means an expert, so please correct me if I'm wrong); If something gives you a numeric value, you can make a metric out of it. You might have to script it yourself though.

Here's a list of exporters already available on the official Prometheus site.

If you have set up Alertmanager you can add some or all of the rules on this page.

If you don't want to use Alertmanager just use the expressions provided with your own dashboard and setup Grafana Alerts

Re: Elasticsearch check out https://www.elastic.co/guide/en/elasticsearch/reference/7.4/data-rollup-transform.html.

Admittedly the idea to use ES is still half baked (less really). From what I gather from a quick scan of the doco the functionality in ES can take values from one index and generate summaries in another. I'm thinking a single document with min/max/p50/p75/p95/p99 values to summarize n documents of higher resolution. Unsure if it can be chained and not sure how the various indices would be queried. Perhaps using a read alias though I suspect a small app would be needed to craft the ES query and probably maintain the various indices and aliases. Also not sure how to get metrics from isolated accounts (eg. production) into a shared ES cluster. Probably another small app. Starts to look a bit like Thanos or Cortex.

Have you tried this from that link?

>Bot messages: To notify members with a bot message, the message must contain <!channel> or <!everyone>.

I'm not sure but openITCOCKPIT could be a solution for you. It is based on Naemon (fork of Nagios) for "classic" status checks and also has a Prometheus integration. So you get the classical status information(ok, warning, critical) for Prometheus monitored metrics/services. Both combined in one web interface.

See this blog for more information:

https://openitcockpit.io/2020/2020/10/20/openitcockpit-4-1-with-prometheus-integration/

This is actually how Prometheus itself works. Each http connection to a target is kept open, and every scrape interval only a GET /mtrics is sent.

What u/coentin is talking about is this section of writing exporters best practice doc. When Prometheus does the GET, you should have your exporter poll on the fly.

https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace

https://prometheus.io/docs/prometheus/latest/querying/functions/#label_join

Prometheus QL allows you to dynamically fabricate new labels at query time.

We have not found anything that these two label_* functions can’t come up with unless, of course, you ingested some forbidden strings like credentials into the label and want to annihilate them from the tsdb.

For that, then you probably need write a standalone tool to unpack, replace the string, and repack the blocks of the tsdb.

I re-read this comment.

Remote write, by the mechanism it operates, should always provide 100% identical data. It just physically can't work any other way. It takes the samples collected and sends them with the timestamps stored in Prometheus.

The only way to mess this up is to have a non-compliant receiver that changes the data. Prometheus to Pormetheus is, of course, compliant with itself.

See these test results

I thought perhaps I could do some calculation on that (albeit an approximation) with the histogram_quantile function found on the prometheus docs page. Perhaps substituting my metrics into

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) but I have tried a few variations of this and not managed to return anything meaningful.

What are my options? I can create whatever metrics I need to expose by modifying my query in the shell script on the target? Perhaps abandon the summaries and use gauges??

Agreed - Prometheus remote write is easier to use than Thanos querier + Thanos sidecars. There are many remote storage integrations for Prometheus. The most interesting are Cortex, Victoria Metrics and M3DB.

What you need is relabel_config per job and per instance/shard/AZ-group to drop targets that were discovered based on specific labels. relabel_config happens after discovery and before scraping targets.

Default retention time is 15 days: https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects

You can increase this value, you just have to think about the required disk space for storage, depending on your scrape interval, the number of metrics and your total retention time:

> Prometheus stores an average of only 1-2 bytes per sample. Thus, to plan the capacity of a Prometheus server, you can use the rough formula: > needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample

The problem is we will be writing to two different VictoriaMetrics DBs behind two different zones: Central and East. More write locations means more system resources will be used. In order to minimize this, we would like to only write to one location unless that connection gets interrupted. Is there a way to use sharding with Prometheus to accomplish this? I'm reading from https://prometheus.io/docs/practices/remote_write/ to try to understand.

One option is to use the file_sd_config (https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config) and read the list of targets from a file. One benefit is that you can update this list via code-generation. It should make your config a bit more readable as well.

Sure!

route: receiver: 'default'

routes: - receiver: 'support' match_re: my_label: my|regex continue: true

- receiver: 'managers' match_re: my_label: my|regex

receivers: - name: 'default' email_config: - to: '' html: '{{ template "my.template.default.html" . }}'

- name: 'support' email_config: - to: '' html: '{{ template "my.template.support.html" . }}'

- name: 'managers' email_config: - to: '' html: '{{ template "my.template.managers.html" . }}'

The key is continue: true

https://prometheus.io/docs/alerting/latest/configuration/

Hello,

https://prometheus.io/docs/alerting/latest/configuration/

You define a path to where you store templates using:

# Files from which custom notification template definitions are read. # The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'. templates: [ - <filepath> ... ]

So like

templates: - 'templates/*.tmpl'

These template files contain definitions for Go templates. Here is a template definition I wrote for slack messages:

{{ define "slack.custom.title" }}[{{ .CommonLabels.alertname }}] {{ .CommonLabels.instance }}{{ end }}

{{ define "slack.custom.text" }}{{ range .Alerts.Firing }}- {{ .Annotations.description }}\n{{ end }}{{ end }}

Here I define "slack.custom.title" as simply putting the alertname inside brackets, and adding the instance. So the title for the slack message would look like "[low disk] host-123.tld"

I can then use "slack.custom.title" as part of my receiver configuration

- name: "slack" slack_configs: - api_url: "{{ slack_api_url }}" send_resolved: true title: '{{ template "slack.custom.title" . }}'

Hope this helps!

I've never used this function myself, but it sounds like changes() might be what you are looking for. https://prometheus.io/docs/prometheus/latest/querying/functions/#changes

Maybe something like this :

changes(my_metric[1h]) > 1

i do not have direct experience with helm charts and blackbox, sorry - however it sounds like your issue is a discovery problem - you might be able to solve it by setting your targets to use something like https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config or https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read ?

This page lists a ton of Prometheus exporters for databases. Perhaps one of them suits your needs? https://prometheus.io/docs/instrumenting/exporters/

Writing a custom exporter is not hard. If the exporters in the list above don't suit your needs, you can write your own.

Read up on range queries.

The actual query will probably be something like avg(metric_name{label="value"}) or maybe avg_over_time. Not sure. Start and stop are required for a range query AFAIK. Unless you use delta to build your query.

Step is the time between observations. So, if you request data starting at 03:00:00 with a 30s step you will get: 03:00:00, 03:00:30, 03:01:00, etc. Stopping whenever you specify the stop parameter.

Would something like

expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit &gt; 0.8
description: |
  conntrack table is ${{ labels.value | humanizePercentage }} full.
  ({{ with query "node_nf_conntrack_entries{instance='$labels.instance'} }}
    {{- . | first | value -}}
  {{ end }}
  our of 
  ({{ with query "node_nf_conntrack_entries_limit{instance='$labels.instance'} }}
    {{- . | first | value -}}
  {{ end }}

be helpful?

https://prometheus.io/docs/prometheus/latest/configuration/template_examples/ might be helpful

Nice. A couple of suggestion. The temperature metric should be temperature_celsius. See the documentation on base units.

It would also be cool to add the /sys/class/apex parsing to https://github.com/prometheus/procfs.

Routes work on being able to match labels that are on the alert. (The alert is really just another time series.) Somewhere, you need your alerts to acquire a label of 'app' with the correct value.

I think the problem is getting that correct value of AAB out of the mountpoint label. You could use the label_replace() PromQL function in the query. Or, probably simpler, use the templating feature to build the value of the "app" label in the alert rule. There are some additional template functions like "rePlaceAll" that will do regex substitutions.

https://prometheus.io/docs/prometheus/latest/configuration/template_reference/

Not tested, but something like this as part of your alert definition:

labels: app: "{{ rePlaceAll '^(/var/log/') '' $labels.mountpoint }}"

To clean up old stale data you might not need anymore, you can look into the storage retention policies if you're using local storage, which can be used to control the amount of stored data.

Prometheus still remembers the old metric data, so that's what Grafana also knows about.

You might want to go back and delete the old data with the Delete Series API endpoint.

You can use https://prometheus.io/webtools/alerting/routing-tree-editor/ to visualize your routing tree and test where each label set will go.

Write your own exporter (there are Prometheus client libraries for various languages) that queries the API and exposes its data in the Prometheus format:

https://prometheus.io/docs/instrumenting/clientlibs/ https://prometheus.io/docs/instrumenting/writing_exporters/

I doubt Druid is the best option for remote storage for Prometheus. See the official list of remote storage systems available for Prometheus. Read the linked docs about these systems, try setting up and operating them and then choose the best solution for your needs. Note that Prometheus supports concurrent data replication into multiple distinct remote storage systems, so you can evaluate multiple systems at once.

Did you set up the Prometheus targets? Prometheus needs to know when and where the webservers should be scraped. In a simple environment this is probably easiest done with a static file configuration.

In more dynamic environments there are ways to automatically discover the targets.

Yes it is, https://prometheus.io/docs/prometheus/latest/configuration/configuration/

Look out for "external_labels"

Hash of the grouping labels for the alerts.

The data structure is the same as what notification templating uses internally, so you can use those docs: https://prometheus.io/docs/alerting/notifications/

How does this improve over using Pushgateway, especially in regards to the downsides?

Single point of failure & bottleneck
You lose up metric
Series churn / forgetting metrics

Started again and just followed:

https://prometheus.io/docs/prometheus/latest/installation/

If I just run docker run -p 9090:9090 prom/prometheus then it all starts, but where do I configure the .yml files?

Thanks

https://prometheus.io/docs/prometheus/latest/storage/

I'd guess that:

1) your retention window is very short , makes you sensitive to some of the ground realities.

2) retention blocks are 2h long, so a significant portion of your retention window. Your retention window is 1.5 blocks long. I expect you will never see it under 4h in reality since they delete things at the block level (assumption)

3) this line about "he initial two-hour blocks are eventually compacted into longer blocks in the background." I suspect you have one 2h 'current block' and 1 4h 'longer block' it's possible that a 6h retention window may be as low as you can actually go if you use time based retention.

4) it's possible that past retention blocks get deleted eventually and that you just need to wait a bit.

I have a feeling it's just a combo of 2 and 3 that you are seeing. I'm curious what happens if your set your retention window to 2h...

It should work with targets behind a proxy with TLS and basic auth.

EC2 SD is part of scrape_config, and so is authentication configuration, among other things. SD will discover the targets, and connect to them based on what you set for basic auth within the same scrape_config.

It's all in the docs:

Prometheus scrape_config

Prometheus ec2_sd_config

The link you shared was for Prometheus 1.8, which is obsolete now. We completely re-wrote the TDSB compression so that the downsides of that version are gone.

As for Prometheus performance, the last benchmark from a while ago was on the order of 250k samples/second/core. Since Prometheus is written in Go, it scales well for CPU. I've heard of people ingesting > 1 million samples/second on a single instance.

What you read about memory is also incorrect, and about the obsolete version. Also, that number is misleading, it's a "target", not a limit. Prometheus uses exactly as much memory as it needs, there are no fixed limits. Even with the old version, it would happily use more memory if it needed to in order to ingest the data it was being asked to.

Prometheis is not "zero config", it's explicitly not, because the goal is monitoring and alerting. The problem with "zero config" is that how do you tell the difference between a failed discovery, and no discovery. Prometheus requires you configure service discovery in order to collect data from targets. This allows you to detect when things have failed. Relying on zero config is fragile, and IMO, broken.

The node_exporter not equivalent to nedata's host agent. That's more like InfluxDB's Telegraf agent. It's one exporter for one thing, which is node metrics. For other services, you should look at the list of exporters and integrations.

I've been reviewing Prometheus federation.

Yes, this sounds like an option! I was thinking the capability would exist under node_exporter, but a federated Prometheus approach works as well. Thanks dfndoe!

I downgraded to 2.2 and the "HTTP_requests_total" is there, but still no "node_cpu". I imagine this has something to do with the versioning then, I just realized that 2.3 was released 4 days ago. Strange they would remove this from 2.3 though and not update their getting started guide..

Anyone have any thoughts on this?

Edit: even the getting started guide mentions the 'node_cpu' query

https://prometheus.io/docs/introduction/first_steps/

>For example, you can see the node's CPU usage via the node_cpu metric.

https://prometheus.io/docs/introduction/faq/#can-prometheus-be-made-highly-available?

Don't say it doesn't do HA.

HA in the sense of replicating databases will never happen because it's totally against their philosophy.

is -storage.local.retention the only valid flag for 2.0 still? And I assume there is no migration path from 1.x?

I haven't been able to find an updated version of this blog post for 2.0 https://prometheus.io/blog/2017/04/10/promehteus-20-sneak-peak/

If the two queries you want to display on the same graph have disjoint label sets, you can use or, for instance:

sum(rate(scope_request_duration_seconds_count[5m])) by (job) or sum(rate(cortex_request_duration_seconds_count[5m])) by (job)

(See https://prometheus.io/docs/querying/operators/#logical-set-binary-operators for more detail)

You should consider using something like Grafana to produce dashboards (in front of Prometheus), which allows you to add multiple queries to a graph without resorting to such tricks.

Yes, in Prometheus you can either change the global scrape interval or the interval by job in your config. Defaults:

scrape_interval: 15s scrape_timeout: 10s evaluation_interval: 1m

For rate queries in Grafana, you can also use its interval and range variables when you change the time range on the dashboard:

https://grafana.com/docs/grafana/latest/features/datasources/prometheus/#using-interval-and-range-variables

For reference, heres a great comparison of the two https://grafana.com/blog/2019/11/21/promcon-recap-two-households-both-alike-in-dignity-cortex-and-thanos/

actually re-reading your question i think i may be misunderstanding. i assumed you were using label_values for templating (in variables of your grafana dashboard settings) but it appears you may be using it in a graphing query? im not sure you should be doing that tbh.

https://grafana.com/docs/grafana/latest/features/datasources/prometheus/#query-variable

the point of using label_values in templating is so you can build your graphs to have queries like:

node_load1{instance=~"$host"}

where $host is a variable that uses label_values in your dashboard -> settings -> variables (in this case my $host variable is set to label_values(node_load1, instance)

https://banzaicloud.com/blog/grafana-templating/ is also helpful for explaining what im trying to say clearer :P

> Data Retention

The current best long term storage solutions are Thanos and Cortex.

Thanos is pretty easy to use if you have a Google GCS or Amazon S3 compatible bucket storage service. But it sounds like you don't. One option would be to setup Ceph/Rook or Minio to handle that. But again, your hosting provider's storage doesn't sound stable.

Cortex is a bit harder to deploy, but, there are providers that can just do this for you. For example, Grafana Cloud can host this for you. This can also solve your resource utilization problem by moving those long-term queries to Grafana's hosted service.

What are /r/PrometheusMonitoring's favorite Products & Services?
From 3.5 billion Reddit comments

The most popular Services mentioned in /r/PrometheusMonitoring:

Prometheus

Grafana

Kubernetes

Speedtest by Ookla

Zabbix

InfluxDB

openITCOCKPIT

Notion

O'REILLY Safari

Slack

elasticsearch

The most popular reviews in /r/PrometheusMonitoring:

What are /r/PrometheusMonitoring's favorite Products & Services? From 3.5 billion Reddit comments