Can't believe no one has mentioned Prometheus? It is the standard when it comes to monitoring anything in Kubernetes.
This helm-chart installs everything that you need, including alerts and dashboards: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#kube-prometheus-stack . It can obviously also be extended.
Loki is great for logs; it doesn't do metrics. Although you can extract metrics from logs.
My smokeping_prober costs nothing and can send pings as fast or as slow as you want.
You can easily collect this into Prometheus and set all of the alerting thresholds you want.
Prometheus can also replace your SNMP monitoring if you want. You can use it to define much better alerting than typical SNMP monitoring, eliminating the need for a low-skill NOC altogether.
I spent years in Nagios-land, and now I'm in deep with Prometheus, which I view as a combination of Nagios and Graphite. I think Prometheus is really solid, and am particularly excited about the integrations with Kubernetes (kube-prometheus, prometheus-operator), so if monitoring Kubernetes is a need for you, Prometheus is a strong option.
Check out Prometheus's list of exporters, which is how metrics are exposed to Prometheus for scraping. It's quite extensive. I'm happy to try to answer questions you might have.
As far as "resolving issues itself", Prometheus can send alerts to a webhook to take desired actions. I haven't walked down that path, yet.
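If you do head that way, the Alertmanager side is just a webhook receiver. A minimal sketch, where the receiver name and URL are hypothetical and the remediation service is something you'd write yourself:

```
receivers:
  - name: 'auto-remediation'
    webhook_configs:
      # Alertmanager POSTs the firing alerts as JSON to this URL (placeholder)
      - url: 'http://remediator.example:8080/hooks/restart-service'
```

You'd then point a route at that receiver for just the alerts you want auto-handled.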
So, just a disclaimer, I work on an open source monitoring software called Prometheus. But I've been a Linux sysadmin type for 20+ years.
There are a lot of monitoring solutions out there, with many styles and many categories.
Over the last 13 years I've mostly focused on metrics-based monitoring solutions. SNMP is a style of metrics-based monitoring, as is Prometheus.
The key advantage of metrics-based solutions is that we can get both performance information, and alerting, out of the same set of data. Systems that are "check-based", like Nagios, are only able to poke at your systems with a very coarse stick by comparison.
Now, syslog, or any other log-event-driven monitoring, tends to be extremely useful for debugging. But it is much more difficult to use for creating alerts. So you end up having to take event streams and turn them into metrics anyway.
TL;DR:
Metrics-based systems (SNMP, Prometheus) can be used to generate alerts that tell you where, and in what timeframe, to look in your event logs (syslog, Apache, etc).
EDIT: FYI, Prometheus has a very nice SNMP agent/converter that allows you to ingest SNMP so that you can visualize and write alerts for your SNMP devices.
Honestly, just write more code. Practice, practice, practice.
IMO, the best way to do this is to contribute to SRE-related open source projects. I started contributing lots to one 6 years ago or so, and my coding skills have greatly improved.
Shameless plug for the project. There are tons of open issues where you could add things, fix things, and get good code feedback on.
I don't know what you mean by "multi-tiered" in this context.
You say you don't want open source, but this really sounds like a job for Prometheus. It will cover everything from bare metal, to containers, to apps themselves.
It has built-in service discovery that covers a number of use cases, and can easily be driven by your existing config management or your cloud provider's APIs.
It deals with both performance metrics and alerting.
It can scale easily to the infra you're talking about, and with a little work can handle billions of metrics from 100k targets. The sysadmin overhead is very low.
Hey! This is one of my specialties. We run an in-cluster Prometheus that does service discovery within the cluster, using a service account token that we grant cluster-wide read permission.
Then, there is a central Prometheus server that uses the /federate endpoint of each cluster Prometheus to gather data centrally and add labels like the cluster name to each metric.
You can read more about federation in Prometheus here: https://prometheus.io/docs/prometheus/latest/federation/
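A rough sketch of what such a federation job can look like on the central server (hostnames and the cluster label values are placeholders; you'd normally narrow the match[] selector rather than pulling everything):

```
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'   # which series to pull from the downstream Prometheus
    static_configs:
      - targets: ['prometheus.cluster-a.example:9090']
        labels:
          cluster: cluster-a   # the label stamped onto every federated metric
      - targets: ['prometheus.cluster-b.example:9090']
        labels:
          cluster: cluster-b
```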
You need to define your alert rules (https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/). Once your alert rules are defined, Prometheus evaluates them periodically to determine which alerts are firing.
To send out alerts (such as email), you will also need to configure your Alertmanager (https://prometheus.io/docs/alerting/alertmanager/).
You can read more on this here: (https://prometheus.io/docs/alerting/overview/)
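For a concrete feel, a minimal rule file might look like this (the alert name and the 5-minute hold are just illustrative):

```
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} has been down for more than 5 minutes"
```

Prometheus evaluates the expr on its rule evaluation interval and hands anything that fires to Alertmanager.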
Prometheus and node-exporter work well for my environment. Grafana also has support for Prometheus data sources out of the box. There are plenty of other exporters but you can also instrument existing code fairly easily or write a custom sidecar.
Shameless plug for the monitoring system I work on: Prometheus
Free, open source, scales from my Raspberry Pi to as big as you could imagine.
It's also metrics-based, which means instead of checks, you can gather all the data from a service endpoint, visualize (Grafana) and alert on it.
The traditional answer is something like what /u/redteamalphamale mentions.
A newer answer would be something like osquery (https://osquery.io/) or netdata, as per /u/derprondo.
The "new hotness" would be something like Prometheus (https://prometheus.io/) paired with Grafana.
Personally, having used all of them, Prometheus is where I'm happily at now. It uses a pull model, can do metrics from machines all the way into applications, and makes for a pretty complete metrics solution; it does alerting, etc. as well.
Prometheus is configured with scrape configs that can use service discovery backends (Kubernetes, Consul) or cloud provider APIs (EC2). In the case of cloud provider APIs, instances can be tagged so that specific services get scraped, etc.
In the case of Kubernetes, installing the helm chart gets you pretty auto-magical config out of the box.
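As a sketch of the cloud-provider case (the region, port, and tag names are assumptions, not anything standard):

```
scrape_configs:
  - job_name: 'ec2-nodes'
    ec2_sd_configs:
      - region: us-east-1
        port: 9100                                 # node_exporter port on each instance
    relabel_configs:
      # only keep instances tagged Monitoring=enabled (hypothetical tag)
      - source_labels: [__meta_ec2_tag_Monitoring]
        regex: enabled
        action: keep
      # use the Name tag as the instance label, if you set one
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
```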
Prometheus + Grafana are the go-to solutions when it comes to K8s. Also see here, a pretty good article on getting started with Prometheus and K8s monitoring.
There's no time like the present to learn yourself some Linux. I'm a Windows person and will always choose Windows when there aren't damned good reasons not to but sometimes a little Linux is the right tool. And if IT is your job, knowing your way around multiple platforms makes you much more valuable than someone who just does Windows.
But... Prometheus runs on Windows. And so does Grafana. So in theory you can do the same thing on Windows, you just won't be as trendy as everyone else running InfluxDB.
Learning security is a good thing. The major standard today for communication is "Mutual TLS". Each endpoint is secured with a TLS cert, and each client needs a signed "client certificate" in order to make a connection.
This ensures that every connection between systems is safe.
The design comes from a "Zero Trust" model. You make no assumptions about the network being secure.
As for your other questions: look into learning "Infra as Code", using configuration management tools like Ansible and infra declaration tools like Terraform and Pulumi.
The other thing you could look into is learning about service orchestration. Rather than Docker Compose, you can build the infra with Kubernetes.
The next thing to learn is Prometheus if you haven't already.
Grafana and Prometheus are great choices (although they might be a bit heavy if you're just trying to get data from a single webapp).
The way it works is:
Yup, starting everything out with a CM tool is really good advice.
I'd also start out by dropping legacy monitoring systems and go with something like Prometheus. No need for SNMP. :-)
But the deployment is key, as you'll have to push the node_exporter to all nodes.
I probably wouldn't do this with a webhook, rather poll Prometheus with a query that tells you how many X you need. That will allow you to auto scale up and down as needed.
This is how things like KEDA work to auto-scale on Kubernetes.
Yes, Prometheus 2.25.0 introduced remote write receive. You need to enable it with a feature flag.
--enable-feature=remote-write-receiver
See the docs.
I've been using Prometheus/Grafana for a while; they are (in my opinion) the standard stack for Kubernetes clusters, so running in containers is not a concern. Prometheus gets its data by making requests, so if you are not using k8s you might need to expose a service that serves metrics. As for speed, I'm not sure if it's faster, but it does use fewer resources on the hosts it monitors, which is always good for a tool you're using because you care about performance on your instances.
Here you have a list of exporters. Exporters expose an endpoint in the format Prometheus expects; most of them are quite simple to implement.
The first thing you should think about is monitoring. CPU utilization, IO utilization, and application performance.
Typically I start by explaining the RED Method. Yes, the article says microservices, but that's just for buzzword matching. It applies to all kinds of services.
The thing you'll probably run into first is memory saturation. On our systems, I came up with a neat metrics query based on Prometheus data we collect.
instance:node_memory_available:ratio * 100 < 5 and rate(node_vmstat_pgmajfault[1m]) > 1000
Basically what this means is "If the available memory is less than 5%, and there are more than 1000 major page faults per second, for 15 minutes, alert".
The trick I discovered was that just having 5% available memory was too noisy, as many servers can run just fine like that. What happens is that as the memory pressure increased, the kernel starts having to page out application code from the page cache, which means that next time a different bit of code runs, it has to page that in from disk to execute. This is a "Major Page Fault", and a pretty clear indicator that things are sucking.
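Put together as a rule, roughly (the alert name is mine; the expression, the 5% / 1000-faults thresholds, and the 15-minute hold are the ones described above, and instance:node_memory_available:ratio is assumed to be a recording rule you already have):

```
groups:
  - name: memory
    rules:
      - alert: MemoryPressure
        expr: 'instance:node_memory_available:ratio * 100 < 5 and rate(node_vmstat_pgmajfault[1m]) > 1000'
        for: 15m
        labels:
          severity: warning
```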
Even though it's designed to scale to large containerized networks, Prometheus will happily run on a Raspberry Pi-sized machine. I use it at home to monitor a few machines, plus some custom stuff like my real-time circuit-breaker panel power monitor.
Prometheus is stupid simple to set up, just one binary and a config file. The agents (exporters) are plentiful and easy to use.
I used to be a huge supporter of Sensu, but it never really took off and the barrier to entry was not worth it. When they came out with the paid version I thought they would get better, but for some reason the main selling point was that you could now run it with Java instead of Ruby, which made me ask why they didn't push that to the open-source version.
I wish them luck, but if anyone asks me for advice on monitoring I have 3 answers:
Each component should be actively monitored. Servers/Services going down are covered individually. Same goes for load balancers, etc.
You also want synthetic probes to cover your end-to-end needs. This will help catch the unknown-unknown problems.
The thing you want to avoid is alerting on things that users don't care about. Users don't care if the traffic rate drops to zero. Users only care about their requests. There's a subtle difference there.
In other words: Your best bet is to instrument the code and publish a /metrics endpoint.
You can find Prometheus libraries for many languages at: clientlibs
Prometheus does not use SQL (which would indeed be rather unsuited for time series), but rather a custom data format.
If you want to check it out, the documentation actually explains it quite well:
https://prometheus.io/docs/prometheus/latest/storage/
There you will also find info about all the remote storage options to solve HA :)
You can use inhibition rules to suppress alerts when a condition is met:
https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule
E.g. drop all "warning" alerts when critical alert (e g. "site down") is firing.
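A minimal sketch of that in the Alertmanager config, assuming both alerts carry a severity label and a common site label (newer Alertmanager versions; older ones use source_match/target_match):

```
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['site']   # only mute warnings that share the same site as the critical alert
```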
When using Prometheus' file-based service discovery mechanism, the Prometheus instance will listen for changes to the file and automatically update the scrape target list, without requiring an instance restart.
https://prometheus.io/docs/guides/file-sd/
You can automate adding a new target with something like Ansible and lineinfile
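Roughly, that combination might look like this (file paths and hostnames are placeholders):

```
# prometheus.yml
scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files:
          - 'targets/nodes.yml'   # Prometheus watches this file and picks up changes

# targets/nodes.yml -- the file Ansible (lineinfile or a template) would manage
- targets: ['host1.example:9100', 'host2.example:9100']
  labels:
    env: prod
```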
Ahh, okie. I was a little confused.
I would focus less on vendor certifications, those come and go, and IMO don't hold a lot of value.
I would also recommend focusing on looking for internship positions, or finding an open source project or similar organization to volunteer for. What kind of project depends on what you're interested in.
Shameless plug, but Prometheus is always looking for more contributors. :-)
And this is why predict_linear() exists.
Instead of waiting for the temp to go from 68 to 75, you do something like
predict_linear(server_room_temp[15m], 60*60) > 73
If the temp trend over the last 15 minutes points to the temp going up at 5 degrees/hour from your normal set point, you basically gain 45 minutes of reaction time.
Of course, you combine this alert with normal threshold alerts, but having metrics-driven alerts is pretty useful.
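Stitched together as rules, that might look something like this (rule names are mine; the thresholds and metric name follow the example above):

```
groups:
  - name: server-room
    rules:
      - alert: ServerRoomTempRisingFast
        expr: 'predict_linear(server_room_temp[15m], 60*60) > 73'   # predictive alert
        for: 5m
      - alert: ServerRoomTempHigh
        expr: 'server_room_temp > 75'                               # plain threshold backstop
```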
> Also, do you think that’s an unfair question to ask an intern?
"Fair" kinda depends on what position you were applying for. If you're applying for an infra monitoring company or some sort of ops/devops position, you might want to know a few fundamentals about infra monitoring and configuration management. At the very least, the popular tools available.
But how I would answer that is likely some combination of Prometheus driven alerts/monitoring and Chef or Ansible deployed agents. I've also had a lot of experience working with relatively large SolarWinds and Nagios/Icinga2 setups but Prometheus is miles better IMO. Granted, if you want to poll custom metrics it's less trivial to write a Prometheus exporter than a Nagios plugin.
It’s not exactly straightforward to get going (assuming no knowledge) but it’s very well built and has demonstrated operational resilience for us over the last ~1 year we’ve been using it.
Works exceptionally well for what you want.
We support OpenBSD in the Prometheus node_exporter. It's not as feature complete as the Linux code, but it should get you started. Of course, we're always looking for more help with OpenBSD support. :)
FYI, you should really include a rate in uses of histogram_quantile, as you probably don't want the 90th percentile latency over all time:
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m]))
See https://prometheus.io/docs/querying/functions/#histogram_quantile()
Well, it's a good thing Prometheus is just an open source software system. You'll never be able to do business with us because it's not a business. :-)
What some people call buzzwords, I call high level terminology. Click past the front page and enjoy some reasonably good (IMO) reference documentation.
If you want something less reading intensive, here's a video from dockercon, but not really docker-specific. https://www.youtube.com/watch?v=PDxcEzu62jk
Or skip right to the big "getting started" button on the home page.
If you run the node_exporter and Prometheus, it's a pretty simple matter of graphing the netstat values.
For example: rate(node_netstat_Tcp_RetransSegs[1m]) / rate(node_netstat_Tcp_OutSegs[1m]) * 100
You probably need to use an Arduino HTTP client library to query the Prometheus API: https://prometheus.io/docs/prometheus/latest/querying/api/
I don't know of any lib that helps you do this. A short Google search found mostly links about ingesting data from Arduinos into Prometheus.
Yes, add Prometheus instrumentation to your application and use histogram metrics (https://prometheus.io/docs/concepts/metric_types/#histogram).
Alternatively you can deploy a service mesh and get the info out of there, but personally I would rather instrument the application directly. (https://istio.io/latest/docs/tasks/observability/metrics/)
Those are the default values. The configuration in the UI is the full internal representation, so you know what default values are injected into your config.
No data will be "spammed" if you don't configure a receiver to send alerts. Alertmanager doesn't maintain persistent connections to any downstream service.
If you want to test end-to-end, you'll need to have something to measure that. I don't know what your actual app is going over this connection, so I can't say anything about it.
For example, if it's HTTP-based, you can use the blackbox_exporter running on your laptop. It can measure the end-to-end to the home server.
If it's a custom app, you can measure every request over the wire with one of the Prometheus client libraries.
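For the blackbox_exporter route, the probe scrape config usually follows this shape (the target URL and exporter address are placeholders):

```
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]                          # probe module defined in blackbox.yml
    static_configs:
      - targets: ['https://home-server.example/health']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target              # hand the URL to the exporter
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115               # blackbox_exporter running on the laptop
```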
If that's the concern, you can reduce the frequency that Prometheus scrapes data. It's normally scraped every 20 seconds, but you could bump that up to 5 minutes, and reduce the amount of storage and processing by 15x.
If you're worried about the amount of disk usage, you can do an approximate check of the amount of storage required with the following formula:
> needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
> To lower the rate of ingested samples, you can either reduce the number of time series you scrape (fewer targets or fewer series per target), or you can increase the scrape interval. However, reducing the number of series is likely more effective, due to compression of samples within a series.
You can approximate bytes_per_sample as 1.3, which many people say is the average they see after Prometheus has done compression on their metrics.
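To make that concrete with made-up numbers: 1,000 samples/second retained for 15 days (1,296,000 seconds) at ~1.3 bytes per sample works out to roughly 1.7 GB of disk.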
Generally, Prometheus is pretty efficient.
There's a nice thread on this from r/linuxadmin that covers this nicely.
TL;DR, Prometheus was designed as an SRE-friendly Nagios replacement.
EDIT: Another thread from r/devops about why you should just use Prometheus.
Sorta, the design of Prometheus is that targets are found through "Service Discovery".
Targets don't call out to Prometheus themselves, because you may have more than one Prometheus monitoring at any time. Targets are designed to be "dumb", they just have to know about themselves.
So Prometheus supports a whole bunch of discovery methods:
* Read the list from the config file (SIGHUP reloadable)
* Read lists from files automatically.
* Read from DNS (SRV/A records)
* Read from cloud providers (AWS, GCP, Azure, etc)
* Read from container platforms (Kubernetes, Marathon, etc)
So, it highly depends on how you deploy your stuff. For example, we "register" things in Prometheus using our Chef server. The Prometheus cookbook writes out the list of stuff to monitor by doing chef search and dumping the output into a text file. Prometheus uses file watches to automatically reload these files.
Yes, up/down checks are "free" in Prometheus. This is part of why it uses a polling model.
Any target you monitor automatically has an up metric you can watch. This is mentioned in the concepts jobs and instances section.
You can do ICMP, HTTP, TCP ping-type probes with the blackbox_exporter.
Shameless plug, I wrote a simple smokeping_prober that basically runs the equivalent of "ping" in a background routine and monitors the results. Prometheus collects histogram data so you can see the same kind of heatmaps you would get from smokeping, but with higher granularity. Instead of 10 pings every minute, it sends a continuous stream, so you can see latency spikes that might only happen for a few seconds every minute. It can also send pings every X milliseconds if you really want to get data on short latency bursts.
ELK is good for log monitoring, but not much else. For easy monitoring and alerting, you might want to try Prometheus. It's open source like Zabbix/ELK/etc, but is far more efficient and easier to set up.
Disclaimer, I work on Prometheus.
We use Prometheus, most of our dashboards are public too.
Same thing with our alerts, they're in a public repo.
Prometheus is what you're looking for.
The blackbox exporter reports the cert times so you can alert when they expire soon.
The data is multipurpose, so it's not just one check, like Nagios would do. You get uptime, latency, etc.
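A sketch of such an alert, assuming the blackbox_exporter's probe_ssl_earliest_cert_expiry metric and a 14-day warning window:

```
- alert: TLSCertExpiringSoon
  expr: 'probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600'   # under 14 days left
  for: 1h
  labels:
    severity: warning
```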
What is "correctness" in this context? Verifying the correctness of APIs needs to be done with unit testing.
But it sounds like what you're really looking for is monitoring, for which I would recommend Prometheus. You can generate internal application metrics to count every API request, the response status, and the duration.
You might want to look at the RED Method.
https://prometheus.io/docs/visualization/grafana/ If you want some awesome value-add out of Prometheus, set up Grafana to work with it. We are using it to do dashboards and collect usage data over time to better tune our VM resource allocation.
We've been setting up multi-platform builds for testing some Prometheus components. So far we have FreeBSD setup using BuildKite. This has been working pretty well.
The current setup was built with Ansible, so it was pretty easy to get rolling.
They play different roles: Statsd is an aggregator, and Prometheus is more like a storage/database with data querying and a built-in alerting system.
They can even be used together, i.e. Prometheus as a Statsd backend service (see https://github.com/prometheus/statsd_exporter).
Usually Prometheus is compared to Graphite, for example (https://prometheus.io/docs/introduction/comparison/).
You should be able to read at least: https://www.robustperception.io/accessing-data-from-prometheus-1-x-in-prometheus-2-0. I haven’t tried it, but writing might also work.
Here’s a list of integrations that support remote read / write: https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage
You probably want to use a metrics-based monitoring system like Prometheus. It can measure all of the OS basics with the node_exporter.
> Most important metrics would be network latency
This is easy, but what do you mean by "network latency"? From where to where? You need to be able to describe what exactly you're trying to measure.
Yep. You run the node or windows exporter on each server and scrape them all with Prometheus (see the prometheus.yml config). Same thing for mtail and/or promtail for logs.
For multiple sites things get a bit complicated; see the Prometheus docs.
I really like Prometheus and have several projects that use it. However, I wouldn't recommend it for this use-case and it's actually the only example Prometheus has listed on their website under "When does it not fit".
To give a few reasons:
Many of the built-in functions in Prometheus extrapolate, which isn't good for this type of data; see https://github.com/prometheus/prometheus/issues/3746
All values in Prometheus are float64 under the hood.
Prometheus obtains current samples and stores them in its database to create a time series. It's designed for systems where delayed scrapes and missing points of data are fine. (e.g. microservices where if you miss a scrape, it's impossible to tell what samples would have been at that time)
Time series are mostly immutable. You can't just go into the database and adjust the value of something from yesterday. E.g. if you wanted to add the price data of AAPL for the past year, you can't. If you made a trade yesterday and wanted to add it via some service you'd have to expose, it would be timestamped at whatever time it's ingested.
Overall using Prometheus for this type of data can be fine, but you have to be very careful around how you use it. If you just want to write some alerts or show some pretty graphs, Prometheus is fine. But you can clearly see the system is not designed for accuracy, which would likely make other TSDBs or even a traditional RDBMS better for this type of work.
Recording rules run the query you specify on a timer, and write the output of that query to the TSDB.
So if you configure a rule like this:
- record: job:up:avg
  expr: avg without (instance) (up)
It will take metrics like this:
up{job="node",instance="foo:9100"} 1.0
up{job="node",instance="bar:9100"} 1.0
up{job="web",instance="foo:8080"} 1.0
up{job="web",instance="bar:8080"} 0.0
And store this in the TSDB:
job:up:avg{job="node"} 1.0
job:up:avg{job="web"} 0.5
It will run this at whatever interval you specify in the rule group.
There are many ways to solve this:
You could go against the Prometheus API and look at the metric ALERTS{alertstate="firing"} and process that.
You could use the value of startsAt from the alert itself (see https://prometheus.io/docs/alerting/latest/clients/), combined with a known value for repeat_interval (see https://prometheus.io/docs/alerting/latest/configuration/#route).
It sounds like you want to store time series data. So you're not looking to store a single value for 24 hours, but rather all temperature values over the last 24 hours?
This is not what MQTT was designed for, but you can collect data from MQTT and store it in another system for looking at data over time later.
Prometheus can do this, check out mqtt2prometheus for details on that.
A simpler solution might be to set up Home Assistant and let it collect and store your data. An added bonus with Home Assistant is you could use the data collected to fire off automations, or even notifications if the temperature gets too high.
Grafana is just a UI/frontend (generally used to query a backend for time series data).
Prometheus is a timeseries 'backend datasource'
With NodeJS you spin up a server (typically using express.js). Prometheus then scrapes any metrics you expose on that server via an endpoint.
Client library to set up that endpoint (listed on their page of client libraries): https://github.com/siimon/prom-client
I'd check out their example for the easiest start.
https://github.com/siimon/prom-client/blob/master/example/server.js
Once you have that endpoint set up, you configure Prometheus to 'scrape' it, which stores those metrics within Prometheus itself.
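The scrape side can be as small as this (job name and port are assumptions; use whatever your express server listens on):

```
scrape_configs:
  - job_name: 'nodejs-app'
    static_configs:
      - targets: ['localhost:3000']   # host:port where /metrics is exposed
```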
Lastly, once all that is done, you set up a 'datasource' in Grafana to query those metrics.
https://prometheus.io/docs/visualization/grafana/#creating-a-prometheus-data-source
The Prometheus documentation gives some suggestions about what might be best practices for writing alert rules.
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
This suggests having two standardized annotations.
The only label I added to alerts was the "severity" label. This label made it possible to write rules in Alertmanager that create priority-based alerts. High priority alerts can wake you, while low priority alerts should be seen and acted upon the next day. Otherwise, I let the labels on the alert metric flow through, as they might be useful to the person on the receiving end of the page.
With the clients I work with I usually suggest they standardize on one or more annotations as well:
Alertmanager's templating features only extend to the text/api calls that Alertmanager sends to PagerDuty, Slack, Email, or similar. It doesn't do any fancy templating to make alerts look better in the Alertmanager UI. So I would suggest using an annotation approach so that information is easily visible in Alertmanager's UI and can also be templated into the PagerDuty alert incidents.
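To give a rough sense of how the severity label drives routing, a sketch of the Alertmanager route (receiver names are placeholders; older Alertmanager versions use match: instead of matchers:):

```
route:
  receiver: 'email-low-priority'     # default: looked at the next day
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'pagerduty'          # high priority: pages someone
```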
The regexp engine in golang doesn't support lookarounds: https://golang.org/pkg/regexp/syntax/
Also not_match_re
is not a valid field for Alertmanager config:
https://prometheus.io/docs/alerting/configuration/#route
So I guess you are out of luck here. 🤷‍♂️
The Alertmanager supports inhibit rules that can be used to suppress alerts if certain conditions are met.
Keep in mind that Prometheus and Nagios implement different styles of monitoring. Porting a Nagios configuration directly to Prometheus may not give the best results. For example, it may be better to install the node_exporter on "host 1" and alert on the up metric rather than using ICMP.
You may also wish to investigate the SNMP exporter if you are monitoring networking equipment.
Alerting was a goal of Prometheus from the beginning; the Alertmanager component has been production-usable since around 2015 or so, but it got some major rework in 2016.
Early on we wrote a check script for Nagios, which we still maintain to help people migrate.
Yes, the InfluxDB TICK stack, or TIG stack completes the set for that setup.
Yea, we're already making progress. Like I said, it's already the new de facto standard for server software. Prometheus was the second project to graduate from the Cloud Native Computing Foundation. Many of the network vendors are already talking to us, and the RFC work is in progress.
There's the snmp_exporter to provide a bridge into SNMP. It's pretty nice because it makes translating MIBs into something more human friendly pretty easy.
There's also a huge list of integrations for popular software.
I find the article's treatment of histograms vs. summaries to be a little sketchy. See the Prometheus docs: Histograms and summaries for a good treatment of the differences.
I suppose my view is that histograms are usually the way to go over summaries, including for quantile calculations, because summaries can't be aggregated in a way that makes sense, generally.
That's because, in my experience, most of the time the relevant metrics are split across many different Prometheus targets. For example, if I care about API response latency for an application, requests to that application are probably serviced by a whole fleet of servers.
So, if you tried to use a summary, which you might do if you've only read this article, you'd end up in a situation where determining the 'true' 95th percentile request latency wasn't possible.
The difference there can be very important - for example, if an application server is just coming online, in the Java world it'll often be slow for its first few requests as classes are loaded, JVM JIT optimization kicks in, etc. If you do "avg(quantiles)" with summaries, this ends up showing as a huge spike in "request latency". If you do a proper histogram_quantile setup, this barely makes a blip at all.
I don't have time to read the rest of the article right now - food is now ready - but I might do so later.
I use Prometheus+Grafana for monitoring my Python applications. The nice thing about Prometheus is that it polls stats using a very simple text format over HTTP (see https://prometheus.io/docs/instrumenting/exposition_formats/). No need for a library.
# bare-bones Flask example: expose a counter in Prometheus' text format
from flask import Flask

app = Flask(__name__)

class Stats:
    some_counter = 0

def my_function():
    Stats.some_counter += 1

@app.route('/metrics')
def metrics():
    return "some_counter %s" % Stats.some_counter
When you say "instrumenting your code and pulling that into e.g. Prometheus", is this what you are referring to https://prometheus.io/docs/practices/instrumentation/?
This is an open source solution I know. I've tried it once, but didn't really play with it too much: https://prometheus.io/
A commercial, cloud-based solution is also Azure App Insights: https://docs.microsoft.com/en-us/azure/application-insights/app-insights-nodejs (your app doesn't need to run on Azure; also, disclaimer: I work for the Azure team)
Thanks for reading my blog post! Let me clarify this a bit. I guess the wording for the particular query example was not the best, so I updated the article to make it clearer and avoid confusion. The idea with the Linux kernel update example was to show that sometimes metadata might not be available in Prometheus and it might be really cumbersome to get it in there (e.g. live kernel update information or anything else that is not present in the exporter). That's where using PostgreSQL/TimescaleDB might help a lot, as it's just a simple SQL insert command to get metadata in.
Also, the official Prometheus documentation says that labels should not be overused, due to the fact that data stored in Prometheus is denormalized (https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels). It's worth saying that PromQL is a read-only language, so you cannot modify your metadata, which can be limiting for some use cases.
Since you seem more into open source solutions, a couple of ideas.
For a VM platform, you might want to check out Ganeti. I use it for a small server coop to run our own VPS platform. We use KVM + Ceph RBD. This allows us to do live VM migrations between nodes, and have no SPoF for storage.
And of course, I always recommend Prometheus for monitoring, but I'm biased there as I work on it. :-)
VictoriaMetrics is in development and isn't published yet. I'm planning to open source a single-node solution when it's ready for production use. The cluster solution will be provided as SaaS. Initially it will support the remote storage API for Prometheus; later, more sophisticated solutions will be built on top of the core engine.
I'm assuming you've looked at the manuals for these applications, but here's a friendly RTFM reminder for Telegraf, Prometheus, and Grafana. Now, here's where it gets a little annoying: you have to plan what you're going to show and work out how you're going to gather, process, and present that data. Don't go into this willy-nilly, because then you're going to end up with a mess. Look at your sensors, buy compatible environmental sensors, then read up on how to read them, play with the APIs, and build out the panel slowly while adhering to your plan.
I hope this helps.
Sounds like you need better monitoring software. In Prometheus we have the concept of a "group interval" that rolls up alerts of the same type into a single notification. You can also set threshold intervals in your config to avoid notification flapping.
Prometheus was designed with a bit of a decentralized nature in mind. For example, where we originally built it we had over 1500 servers (each one 4x larger than the type you listed) and 5 major groups of software developers. Each group may run a dozen different services.
Each service might have 100-200 machines or more, metrics from the servers, API servers, memcache, databases, etc.
Each team would run their own Prometheus server(s), that way they don't have any conflicts with other team's monitoring. We built it all with templates so there wasn't a lot to do there but say "I want monitoring for my services", the rest was automatic.
At the service level, they would use Prometheus recording rules to simplify the possibly hundreds of thousands of metrics down to some key service indicator metrics, maybe a few hundred.
Those few hundred metrics would be federated up to an organization-wide Prometheus to provide easy access to the global company health.
This is of course, much more complicated, but we're talking about a company with a large amount of compute power, for example we had 10 PB of storage just for key application logs used to do user trend reporting.
This was all in the original design, because it was easy to implement. Now that we have Thanos for a more universal clustering layer, it's a lot less necessary.
Of course, Prometheus scales down very efficiently, I run it on a Raspberry Pi style machine at home to monitor my local network, just for fun.
Where I work now, we have several Prometheus servers monitoring different parts of our app. One app alone produces nearly 1 million metrics. We're probably a bit overkill for metrics, but it allows us very fast performance and failure debugging. Most places don't need this level of detail.
I wouldn't call Grafana part of the TICK stack. It's not maintained by InfluxData, and the Prometheus documentation literally tells you to use Grafana.
Yea, it sounds good.
Why no line-interactive? Typically double conversion is pretty inefficient. Do you have any specs on how efficient your AC-DC-AC conversion is?
Have you thought about non-SNMP telemetry, maybe Prometheus format?
We have an internal SLO of 99.9% uptime, currently measured by pingdom, but we're moving towards measuring various individual endpoints with Prometheus. The new definition is measuring the 5xx error percent as seen by the front-end load balancers. But I want to get us to have more granular SLOs for different services.
We also want to have a 99th percentile latency SLO of 1 second.
I'm of the opinion that Monitoring and Management are separate systems. Too many times I've seen half baked systems that try and do both in the same software and you end up getting one OK solution and one shit solution.
Personally, I'm a big fan of Prometheus for monitoring. But I'm biased, I liked it enough that I work on the project.
I'm not sure what you mean by support, maybe you can go into more detail about what problems you're trying to solve.
Disclaimer, I work on the project. :-)
For monitoring, Prometheus can deal with everything in AWS and Azure.
Prometheus can automatically read the list of all your instances and monitor/alert for them.
> Just to clarify, you're saying that structured logging is bad when it's used instead of monitoring metrics; is that correctly understood?
It's not that it's strictly bad, but it's a little like trying to hammer nails with a screwdriver. I can imagine it being detrimental to both kinds of output, because when log statements == monitoring events, you are compelled to elide logging that doesn't have associated counter bumps, thus making the logs less useful for debugging.
> I'm also curious if you have any good examples of how to do monitoring metrics? Anything we can learn from?
For a good example, take a look at the Prometheus project and the types of metrics defined by their libraries. (The page about the entire data model is also interesting for a broader perspective).
Prometheus and the snmp_exporter can be used to build dashboards and write alerts. Basically anything you can graph, you can alert on.
It can also monitor servers/services this way. :-)
Prometheus with the blackbox_exporter.
You can feed Prometheus a list of IPs from a text file pretty easily; the input format is YAML, using file_sd_configs.
I have a little setup where I send a ping every 1 second, and use a recording rule to give me percentiles over 15 seconds.
job:probe_loss:avg15s = 1 - avg_over_time(probe_success{job="blackbox_icmp"}[15s])
job:probe_duration_seconds:min15s = min_over_time(probe_duration_seconds{job="blackbox_icmp"}[15s])
job:probe_duration_seconds:q25_15s = quantile_over_time(0.25, probe_duration_seconds{job="blackbox_icmp"}[15s])
job:probe_duration_seconds:q50_15s = quantile_over_time(0.50, probe_duration_seconds{job="blackbox_icmp"}[15s])
job:probe_duration_seconds:q75_15s = quantile_over_time(0.75, probe_duration_seconds{job="blackbox_icmp"}[15s])
job:probe_duration_seconds:max15s = max_over_time(probe_duration_seconds{job="blackbox_icmp"}[15s])
So we went with Prometheus for our application level metrics and alerting. Initially focused on The Four Golden Signals.
Some reading:
https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html
https://landing.google.com/sre/book/chapters/practical-alerting.html
Monitoring. The thing that I took for granted, and have learned in the last few years, is that monitoring is the first step to understanding a system.
I'm biased as fuck, so I would suggest Prometheus. But, my real goal is that you should understand monitoring, so almost anything will do.
Look at the Google SRE hierarchy of needs:
Depends on what you mean by WMI information. Are you wanting monitoring software? Or do you want to get inventory information?
For monitoring, there is the wmi_exporter for Prometheus. This will let you collect metrics from Windows machines into Prometheus running on whatever OS you want.
> Trying to make sure we're not getting into new anti patterns though.
The Best Practices section is pretty good, and I'd also recommend the doc on writing exporters which covers a lot of the details around instrumentation.
Ideally, a monitored system should retrieve the current value for a metric at the moment it is scraped by Prometheus, rather than getting the value on some set schedule and returning the most recent value to Prometheus.
Is instrumenting the application itself an option?
https://prometheus.io/docs/instrumenting/clientlibs/
"Write the exposition yourself" is usually the only option if your needs can't be met by existing exporters. Clever Bash and the node_exporter textfile module is the "quick and dirty" I see most people do.
Don't Pushgateway metrics have a timestamp with each push? (It's been a while since I used it.) If so, you can query on that.
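If I remember right, the Pushgateway itself also exposes a push_time_seconds metric per push group; assuming that's present in your version, a staleness check could look like:

```
- alert: StalePushgatewayGroup
  expr: 'time() - push_time_seconds > 3600'   # nothing pushed for over an hour
  for: 10m
```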
Doh, sorry, I missed the swappiness setting.
So, you may not be aware, but the meaning of 0 changed in Linux 3.5. Where 0 used to mean "don't swap unless really needed", it now means "don't swap at all". I haven't experienced/tested this myself, but I've read that having some swap enabled and vm.swappiness=0 can cause kswapd to chew CPU time. If you intend to have swap and actually use it, set it to 1.
I too have been using / admining Linux since the old Slackware / Linux 1.2.x days.
Like I said, I'm interested in this problem. I work extensively on monitoring systems, especially OS level monitoring.
Besides the slab stuff, I'd also like to see graphs of your systems for the more normal meminfo memory metrics like MemFree, MemAvailable, Buffers, Cached, Slab, etc.
I'm going to try and get some of this slabinfo monitoring done to see if I can see similar differences on my production systems (1000s of nodes).
Perhaps I got confused because the disk format is identical, and that TSDB is not what Prometheus cares about.
https://prometheus.io/docs/introduction/comparison/#prometheus-vs-opentsdb
Prometheus is not downloaded from a yum repository. You have to download the files from their website.
If you want to install Prometheus:
https://prometheus.io/download/#prometheus
What version did you upgrade from? The startup hasn't changed much. The only big change that I remember recently was memory-snapshot-on-shutdown. But this is disabled by default.
I think pprof is now available during startup, so you can take CPU and memory profiles.
The main question is, how many WAL files is it having to replay at startup? I've seen some servers get into bad crash/recovery loops where they generate many hundreds of files in the wal dir. After a clean startup, Prometheus should clean up this situation.
relabel_configs is for modifying targets for scraping.
metric_relabel_configs is for modifying/filtering metrics discovered during scraping.
Documentation can be found here.
Key excerpt:
```
relabel_configs:
  [ - <relabel_config> ... ]

metric_relabel_configs:
  [ - <relabel_config> ... ]
```
Also, you likely want to use __name__ to filter on metric name. I've never seen __meta_name before and cannot find documentation on it.
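A sketch of filtering by metric name at scrape time (job, target, and regex are placeholders):

```
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # drop anything whose name matches; note the filter is on __name__
      - source_labels: [__name__]
        regex: 'node_netstat_.*'
        action: drop
```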
For future reference, I recommend looking up how to do things on the Robust Perception blog, as the posts there are usually by the writers of Prometheus.
You need to have some sort of aggregation function ^[1] in order to use by() or without().
For example:
sum without(dest) (rate(node_nat_traffic[5m]))
or
sum by(src) (rate(node_nat_traffic[5m]))
It's awesome that you want to learn this and jump right in and get things working, but...
Might I suggest that you take a step back, read and learn how Prometheus and exporters work, then move forward with setting up a small environment and build from there.
Start here: prometheus-docs
Prometheus' model is scraping/pulling data from targets such as the exporters you have running on the PCs.
But if for some reason (security / network / other) you need to push data to Prometheus, just install Prometheus on each PC and run it with the agent options (check https://prometheus.io/blog/2021/11/16/agent/) so the scraping stays local, then configure those agents to push data to a dedicated Prometheus that collects everything.
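As a rough sketch, each per-PC Prometheus would run in agent mode and carry a remote_write block (the hostname is a placeholder; the central server needs remote-write receiving enabled, e.g. the --enable-feature=remote-write-receiver flag mentioned elsewhere in this thread):

```
# prometheus.yml on each PC, started with: prometheus --enable-feature=agent
scrape_configs:
  - job_name: 'local-exporters'
    static_configs:
      - targets: ['localhost:9100']
remote_write:
  - url: 'http://central-prometheus.example:9090/api/v1/write'
```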
It should be noted that scraping targets is a best practice. However…Prometheus does support pushing. It does have some compromises which you can read about here: https://prometheus.io/docs/practices/pushing/