What are /r/devops' favorite Products & Services?
From 3.5 billion Reddit comments

Search for a Subreddit:

Products

Services

Android Apps

VPNs

The most popular Products mentioned in /r/devops:

The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations

➔ 10 /r/devops comments

Check price

➔ View 10 /r/devops comments

Check price

The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win

➔ 9 /r/devops comments

Check price

➔ View 9 /r/devops comments

Check price

Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations

➔ 6 /r/devops comments

Check price

➔ View 6 /r/devops comments

Check price

Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Addison-Wesley Signature Series (Fowler))

➔ 5 /r/devops comments

Check price

➔ View 5 /r/devops comments

Check price

Practice of Cloud System Administration, The: DevOps and SRE Practices for Web Services, Volume 2

➔ 3 /r/devops comments

Check price

➔ View 3 /r/devops comments

Check price

The Manager's Path: A Guide for Tech Leaders Navigating Growth and Change

➔ 2 /r/devops comments

Check price

➔ View 2 /r/devops comments

Check price

#10

UNIX and Linux System Administration Handbook, 4th Edition

➔ 2 /r/devops comments

Check price

➔ View 2 /r/devops comments

Check price

#11

Kubernetes: Up and Running: Dive into the Future of Infrastructure

➔ 2 /r/devops comments

Check price

➔ View 2 /r/devops comments

Check price

#12

Interconnections: Bridges, Routers, Switches, and Internetworking Protocols

➔ 2 /r/devops comments

Check price

➔ View 2 /r/devops comments

Check price

#13

Systems Performance: Enterprise and the Cloud

➔ 2 /r/devops comments

Check price

➔ View 2 /r/devops comments

Check price

#14

The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win

➔ 2 /r/devops comments

Check price

➔ View 2 /r/devops comments

Check price

#15

The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices

➔ 2 /r/devops comments

Check price

➔ View 2 /r/devops comments

Check price

The most popular Services mentioned in /r/devops:

Kubernetes

159 /r/devops comments

elasticsearch

99 /r/devops comments

GitLab

90 /r/devops comments

Jenkins

86 /r/devops comments

Amazon Web Services

76 /r/devops comments

Prometheus

74 /r/devops comments

Grafana

60 /r/devops comments

Git Bash

40 /r/devops comments

CircleCI

31 /r/devops comments

#10

Drone.io

28 /r/devops comments

#11

DigitalOcean

27 /r/devops comments

#12

Concourse

25 /r/devops comments

#13

Datadog

24 /r/devops comments

#14

Vagrant

24 /r/devops comments

#15

Let's Encrypt

18 /r/devops comments

The most popular Android Apps mentioned in /r/devops:

Sync for reddit

1 /r/devops comment

Free

Termux

1 /r/devops comment

Free

Cloud Pros- AWS Certified Arch

1 /r/devops comment

$6.99

The most popular VPNs mentioned in /r/devops:

Anonymity score

3 /r/devops comments

$8.32 / month

The most popular reviews in /r/devops:

>Gitlab doesn't have very sensitive data ( I am assuming it would be mostly code)

Umm, if you're a company that sells software products (i.e. the sort of company that needs a source code repository, not a manufacturing company or something), then the code is probably the most sensitive thing you own because the code becomes your product, i.e. the thing that makes you the money.

So from a security point of view, you (as a company) might want to retain complete control over the repositories to make sure your code is never exposed for someone to get a competitive advantage over you. It means bad actors can't pick it apart for vulnerabilities. And also from a reliability point of view due to incidents like https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/. To my knowledge it has only happened one time, but that's still very bad.

My question would be "if they're the same price, why would any company NOT self-host" - my experience is usually that self-hosted products cost more and therefore the argument is more balanced. In this case the price is a win, and self-hosting is a win (yes it's effort, but my experience running Gitlab CE suggests that it's not that difficult)

> Cloudflare has been ISO 27701 certified as a PII Processor and PII Controller since 2021 and the certificate is available upon request.

and even more importantly

> Cloudflare maintains PCI DSS Level 1 compliance

https://www.cloudflare.com/privacy-and-compliance/certifications/

I keep [this]( https://kubernetes.io/docs/reference/kubectl/cheatsheet/ ) (official k8s docs) bookmarked, mostly for all the filtering/formatting options.

Networking is networking. There's no difference who does it.

Regardless, this is a timeless book: https://www.amazon.com/TCP-Illustrated-Protocols-Addison-Wesley-Professional/dp/0321336313

If you try Tectonic Sandbox, you get a turnkey local version (thanks to Vagrant).

CoreOS has a good guide for getting started: https://coreos.com/tectonic/docs/latest/tutorials/sandbox/first-app.html

Other distributions should have something similar. And if all else fails, Kubernetes.io also has good tutorials: https://kubernetes.io/docs/tutorials/

If you're confused about "Why" you'd use Kubernetes, you may not need it yet. For me, the biggest reason to use it is for when you start running a lot of containers. With Kubernetes, it's often as easy to manage 3 containers as it is 300.

That's the real gist of it. Once you start thinking about updating 300 containers running on say, systemd instead, it's a headache. It's a long, manual, and error prone process. Kubernetes just mostly figures things out for you after you tell it what you want.

The tutorials won't show you that side of stuff, if you haven't already experienced the pain of large-scale distributed management. The tutorials will introduce you to Kubernetes primitives and the basics for getting stuff done.

Full disclosure: I work for CoreOS.

We're paying for 1Password Business - https://1password.com/teams/pricing/

There's desktop apps, browser plugins, and a website.

For us being able to share some passwords was a huge requirement. We've got ~240 different hosting providers, most of whom don't allow team accounts. So we needed a good way of handling those credentials. Adding people to vaults and taking them away is pretty easy (we do that for the vault containing corporate credit cards for people who don't have one).

I wish they had a real linux application, but their browser plugin for FF on linux does the job there.

Hence why I prefaced with not the lightest weight.

Argo is resource heavy if you also need to run a k8s cluster just to get CI. Hell Jenkins on a single small VM would be a lightweight option.

Light weight weight could mean a git server running with a commit hook that runs a bash scripts that rebuilds and runs a daemon on push

Even lighter still would be sftp to straight to /var/www under Apache, assuming you're running a wsgi compatible application stack

I still maintain that gitlabs benefits far outweigh any of these 'lightweight' options. Look at the installation options on their page, https://about.gitlab.com/install/#ubuntu. Step 1 have a server. Step 2 install a package. Step 3 login. Sounds like a no brainier to me

I spent years in Nagios-land, and now I'm in deep with Prometheus, which I view as a combination of Nagios and Graphite. I think Prometheus is really solid, and am particularly excited about the integrations with Kubernetes (kube-prometheus, prometheus-operator), so if monitoring Kubernetes is a need for you, Prometheus is a strong option.

Check out Prometheus's list of exporters, which is how metrics are exposed to Prometheus for scraping. It's quite extensive. I'm happy to try to answer questions you might have.

As far as "resolving issues itself", Prometheus can send alerts to a webhook to take desired actions. I haven't walked down that path, yet.

Phoenix Project is a solid high-level overview of the concepts, and leads directly into the author's next book, The DevOps Handbook, which really digs into the details.

If you were a financial institution, you'd know that Cloudflare has a bunch of relevant certificates. Since I assume you are not a financial institution, I don't know what regulations you have to follow, but chances are that Cloudflare can handle your data.

However no one knows since it's your data and regulations might say you need to get this confirmed for all vendors you use, which would include Cloudflare.

But technically you are right: Cloudflare receives the plain text data from the backend server (transport might be HTTPS, but it's repackaged). See also here. Whether this is an actual problem or not depends on your regulator and the certificates Cloudflare has.

What is usually the bigger problem is that PIs can be access by wrong people (e.g. I log in and see your PI).

Signed host keys with an ssh ca. There are a bunch of ways to do this, but you basically need one line of @cert-authority in your known_hosts and never have to update it again.

You also might want to look into cashier.

You don't seem to list any software actually used by distros. Any reason for that?

For example debian uses auto-builder and wannabuild https://www.debian.org/devel/buildd/ Fedora uses Koji https://fedoraproject.org/wiki/Infrastructure/KojiBuildSystem etc.

The package build systems are going to be different than generic CI, but possibly better adjusted to what your want to do.

Theres been a few talks about their backend, they go to PyCon occasionally since the whole backend is python.

http://highscalability.com/eve-online-architecture

Here's an article from them back in 2015:

https://www.eveonline.com/news/view/tranquility-tech-3

Docker complicates things more than you need right now.

I suggest using Vagrant (https://www.vagrantup.com/) for development environments instead of a central dev server, and writing Ansible, Chef, or Puppet code to provision both the Vagrant dev environments and the production server.

That way, dev and prod look very similar to each other.

This tool may help you get started with the provisioning part: http://phansible.com/

I read an article recently and they run the entire db on a crazy high spec single server

https://letsencrypt.org/2021/01/21/next-gen-database-servers.html

“We run the CA against a single database in order to minimize complexity”

You can always role your own using something like https://grafana.com (there are other dashboard builders), but most have their own dashboards (Travis, Jenkins, GitLab, etc). Another common tactic is to update folks via Slack or similar and/or use embedded status images in a wiki or project web page (e.g., the green [build|passing] icon you often see on github). I myself have used all of the above, depending on how much info my peeps needed to see.

I recently thought of the same question and this Stack Overflow by Craig Mcluckie who was one of the cofounders of K8s and is recently the founder of Heptio helped me differentiate between the two and once I did that I realized what role they both can play.

AWS has a quick-start for VPC architecture that you can look at to see how they create a full stack with subnets, route tables, etc. ( https://aws.amazon.com/quickstart/architecture/vpc/). You might also look at the other quick start examples they have available. All of the quick starts have sample templates you can look at and see how they define the details and properties for each resource.

Terraform is (I think) one of the best alternatives to CloudFormation (https://www.terraform.io). You might look at that and see if it does what you need, however, that will then introduce something else to learn.

Hopefully this helps!

Most of the time I just need logs or have to exec into a container. Therefore I enhanced my zshrc file and since Im already an fzf user it wasnt much I had to add. You can find my stuff here: https://gist.github.com/zepptron/9635568b9d90d858daca7780feb8c4b7

Fzf: https://github.com/junegunn/fzf

My company ran an unusually large elasticsearch cluster on ec2. (We had indexes ranging from 5TB to over 7TB at any given time). While our use case is not common, we pushed elasticsearch to several limits that show what kind of issues you could run into managing your own cluster:

I/O Wait with EBS Volumes: Elasticsearch talks to disk a lot. We tried every class of EBS (sometimes, magnetic is the way to go: https://logz.io/blog/benchmarking-elasticsearch-magnetic-ebs/). We consistently hit the EBS bandwidth caps.

Elasticsearch assumes it has unfettered access to the disk, so when you are out of EBS burst balance your cluster grinds to a halt. Instances will fail to respond to other instances, other instances will start promoting replicas -- leading to more bandwidth demands and, usually, a cascading failure of the whole cluster.

We ended yup with SSD ephemeral storage -- which cancelled out any savings we got from rolling our own cluster.

Garbage Collection Pauses: We continued to see long-pauses on indexing which turned out to be garbage collection. Garbage Collection in ES is a "stop the world" event. We were running large instances and giving half the memory to the heap. It turns out this is a bad strategy if your total memory is 60GB. (https://www.elastic.co/blog/a-heap-of-trouble)

These considerations may or may apply to you. We killed as many Elasticsearch Service clusters as self-hosted clusters as we grew. In the end, our desire to tweak and optimize won out and we ran our own instances and handled our own fault-tolerance and backups.

Unless you are planning on massive scale, the Service is worth the extra few cents an hour.

We have kibana pointed at three clusters with tribe nodes. 200 billion documents or so. 210 data nodes (across the three clusters).

When querying data, the complexity of the query will limit the number of panels on a dashboard, as will attention paid to cache/query sizes, file system cache etc. Lots of aggregations will tank query performance.

We also outright ban leading wildcard queues due to cost/complexity.

Edit: added links to conf talks about doing this which have some of our scaling notes.

https://www.elastic.co/elasticon/conf/2018/sf/scaling-log-aggregation-at-fitbit

https://m.youtube.com/watch?v=Vp0W78-__BQ

The Jez Humble / David Farley book on Continuous Delivery is a must read from a standpoint of teams that deliver solutions in an automated way. More oriented towards software developers than operations / IT but really a must read for both types of folks for us to all come together as "DevOps".

edit: Amazon Link: https://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912

I have a background in (mechanical/electrical) systems engineering and product development. I knew network basics, can read and debug code, but not I'm not a developer by any means. A few years ago I really wanted to get to know the cloud (and devops) as I realized that the specialists (often developers) around me (and working for me) don't know it that well either.

Here is what I did (mainly AWS in my case, so all this applies to AWS, but this is also applicable to other cloud providers):
- Read the AWS well architected framework. It gives you a larger picture.
- Get a Acloudguru subscription and get cloud certification. I don't care at all about certs. But I got mine just as a challenge to learn. I even didn't mind renewing mine.
- I got onto Upwork.com as a freelancer and taking small jobs. Low or no pay first. I did mainly problem solving for people that are stuck somewhere. Trust me, there are people with serious problems you can solve (my developer disappeared and need someone to explain to me how this works type of job). Over time I was able to bill $250/h (I'm based in the US)
- I watched tons of youtube movies: 'My architecture' are good ones and also reInvent sessions. I get in my car or on a bus and watch youtube. You get a very deep understanding of how systems work, which is hard to get from reading documentation.
- Learn IAM inside out. Most of the learning curve and banging your head against the whole is IAM related.

https://www.codewars.com/kata/latest

even more fun when doing them together with colleagues on a Friday afternoon before the weekend!

(and it keeps people from breaking stuff right before the weekend)

So basically RunDeck?

You should look into existing solutions, rolling your own is a massive effort. But to answer your question: You don't "need" JavaScript for simple web sites that reload the page on every action. JavaScript is used to make the page more interactive and faster - see how on Reddit, comments are saved instantly and you don't need to wait for a whole page load? That's enabled by JavaScript.

These days, web applications are often single page applications (SPAs), meaning there are no page reloads and all content is displayed and made interactive through JavaScript. These type of applications are what frameworks such as React, Angular or Aurelia are for. Instead of emitting a full HTML page, the backend would become a JSON API, which then is also accessible by other applications.

Set the value node.availability_zone in each host's Elasticsearch config and use allocation awareness to prevent replicas from being routed to other nodes in the same AZ:

node.availability_zone = "us-east-1b" cluster.routing.allocation.awareness.attributes: availability_zone

If you'd prefer not to fight with setting the AZ in each config, you can also use the elasticsearch-cloud-aws plugin to add the attribute for you.

To avoid cluttering your machine with stuff for your dev environment, I’d recommend Vagrant (https://www.vagrantup.com). Plus, Vagrant is great for standardizing your workstation- someone on a different machine running a different OS can get their environment up and running really quickly if you’re using it right.

Plus, a lot of the VagrantFile concepts are similar to things like Dockerfiles, meaning it’s pretty simple to get your head around it all if you’re familiar with that stuff.

AWS just announced and released Fargate which let's you deploy docker containers to ECS without having to manage a cluster of Ec2 instances which might be something for you to consider: https://aws.amazon.com/blogs/aws/aws-fargate/

I've been working with it now since they released it and it is nice, but a little expensive so definitely have to weigh the cost versus having to manage your own instances. Although keep in mind, depending on your setup, letting ECS handle blue-green deployments requires you to have n+1 instances to deploy a new container version unless you are using dynamic port mapping so that is additional compute cost to weigh when looking at fargate.

Gitlab CI is my default one now. With one tool you have basically everything a project needs:

Repositories
CI/CD driven via yaml
Project management
Web/cli interafce
Documentation rendered for many different types
Docker repository
Many many other things

I am maintaining gitlab-runner chocolatey package for Windows that can turn your machine into gitlab node (does the builds) in one shell call.

> All web requests (by pure chance) checked out fine

This sounds like you're doing some kind of sticky sessions, where the same request from the same client is routed to the same worker.

Using error tracking is a good way to determine if you've broken a particular path or one of your workers died. This can be implemented in a number of ways:

Scraping your logs and looking for 500 errors, then pushing it to a monitoring system that alerts when it breaches a threshold
Directly pushing to your monitoring system when a 5xx error occurs and alerting when a threshold is breached
Using a paid service like Rollbar which implements a client library that pushes all errors to a central system and can be configured to alert you when there are several exceptions. This also provides an easy way to trace user errors to find more transient issues.

To followup on this, IPMI. This is what we use to remotely manage our SuperMicro chassis. But most of the aftermarket cards I've found seem to be made by SuperMicro, so I'm not sure if they'd work on a different chassis. Although one answered question on Amazon seems to indicate that it may work as long as you have a 64-bit pic slot on your mobo, which most server boards likely do.

https://www.amazon.com/Supermicro-AOC-SIMLP-B-add-card-KVM-over-lan/dp/B000PYHHTI/ref=pd_lpo_sbs_147_t_1?_encoding=UTF8&psc=1&refRID=PYV76AZD96YKYDEGP05M#customerReviews

But this works similarly to the Dell DRAC wherein you configure the IPMI device with an IP on your network, give it a port on a switch, give the IPMI device a password via the bios, then you can navigate to that IP in a browser and get a little java based console window that acts like a local connection to the server. (remote keyboard/mouse/monitor + power control options)

Ansible is a configuration management tool that would seems to fit the bill for what you want. It would allow you to push config files and restart services idempotently, reboot servers in rolling batches, let you remove a host from load balancer before taking these actions, etc.

I think Jenkins is great, and I use it at work as a glorified cron replacement, but it would be a pain to set up the workflow you want, with the delays and rolling reboots and whatnot.

Note: I've only really used Ansible, so I don't have a good foundation of comparison for it against other config management tools like puppet, chef, or salt, so those may be good options for you as well.

You can check Grafana 8. They already integrated alertmanager and cortex as one of their main alerting rule along with Grafana alerts.

https://plugins.jenkins.io/prometheus will give you an endpoint that prometheus can scrape. Then you can put grafana in front of prometheus and generate some really cool dashboards. https://grafana.com/dashboards/306 is a nice one.

Circle CI has positioned themselves as a modern Jenkins replacement, but I should note I haven't seen them used over a long period of time which is the real measure of quality. Honestly most pipelines seem to be better suited to workflow frameworks like Apache Airflow than these CI/CD specific tools, they may be a little harder to learn up front but they seem to be more resilient to the type of customization that usually gets you in trouble with Jenkins down the line.

Networking is a fundamental skill that you need to have. Without focusing that much on the technical implementations (e.g., how to set up MPLS on a Cisco Switch) the basics will get you very far. In my experience, 80% of the problems that you will encounter are very basic networking problems, like "my device cannot reach that other device, why is that?" Maybe the two are on a different subnet -> look into routing tables -> that table missed an entry.

If I were you, I'd buy the CCNA Guide (https://www.amazon.it/Ccna-Certification-Study-Guide-Administering/dp/1119659183) and start from there.

For linux, I'd recommend developing some simple automation scripts, understand cron and what the different directories and subdirectories mean.

If you're not using autoscaling groups and have no plans to do so, using AWS Lambda to start/stop instances on a schedule using a Cloudwatch Event trigger is the way to go.

AWS has a full tutorial on it posted here: https://aws.amazon.com/premiumsupport/knowledge-center/start-stop-lambda-cloudwatch/

Learn the tools your company already has implemented. Understand those systems and the reasons they were built the way the are.

Shell scripting will be important regardless of what other systems you use. Learn Bash. Learn git, read the whole book. https://git-scm.com/book/en/v2. Ansible is very useful for performing the same actions in multiple systems or repeatedly performing the same actions. Terraform is great for managing cloud infrastructure regardless of your provider. Learn containers: build one from scratch, and learn why it's more secure to do so. Learn Kubernetes for hosting containerized applications.

There's plenty of other tools too, and that's why I first suggest learning those tools which your company already use because you'll see them at scale and encounter real situations that you'll have to deal with.

Promethius + Grafana are the goto solutions when it comes to K8s. Also see here, pretty good article on getting started with Prometheus and K8s monitoring.

Hello :)

I followed the guide with kubeadm (see here ) to install a K8s cluster on fedora nodes and all is perfect ;)

But cause you haven't any googleized load-balancer, you had to configure yourself an ingress ( see here for a yaml file that I used to do it).

IRC is a very pervasive and simple protocol - which might be reason enough to build an application around a dedicated IRC service back end.

Check out the XMPP protocol, however, and it's extensions. Think IRC, but with an open XML transport - optionally over https for easy secure channel operation and browser<->webservice integration.

XMPP started life in telephony to augment or even replace SIP - and is often still referred to as the Jabber protocol, both inside and outside that context.

Many modern chat platforms use it, like HipChat, GTalk/ Google Talk, and Whatsapp. XMPP's use cases are not just for multi-user chat ( https://xmpp.org/extensions/xep-0045.html ) but also for machine-to-machine messaging ( like NATS or AMPQ -like deployment scenarios).

I went through a similar interview 5 months ago for Google SRE.

A few people have mentioned the thought process which is essential. Few recommendations for the material:

https://www.amazon.co.uk/DevOps-Troubleshooting-Linux-Server-Practices/dp/0321832043 - to learn about several scenario-based questions
https://www.amazon.co.uk/Debugging-Indispensable-Software-Hardware-Problems/dp/0814474578/ref=sr_1_3?dchild=1&keywords=debugging+book&qid=1625775550&sr=8-3 - this will help you develop the right thought process for the interview.

I spent 1 month preparing for the interview. Read the first book twice and the second book once.

Let me know if you have any questions.

This post does a very good job of explaining the psychology of the culture vs. just the technical side of. Thanks for the thought put into this.

I'm being promoted into a position of authority but the one or two people who are likely benefiting from the chaos are also likely the ones that I'd have the most difficulty getting buy-in to replace. They're the type of people who know how to kiss the rights asses and it's only those in the trenches who know what they are really like. Obviously, this is a problem I'm going to have to fix to be successful though and I look forward to the challenge.

I've read parts of Tom L's books (has anybody read them cover to cover?) but will go back and reread them and pick up a copy of his new cloud book. Also need to pickup a copy of The Art of War...I didn't realize there were things from that book which could be used in the context of my career, thanks for bringing that to my attention.

I was a previous datadog customer and loved their services but their pricing was expensive for us, and we did not need all the features.

Take a look at site24x7.com I have been absolutely blown away by the array of features that they offer.
There are agents for data collection, external port monitors, webhooks for alerting / pagerduty, slack.

We even use it in conjunction with commando.io to handle process restarts, system reboots if problems are detected with external monitoring that our internal watchdog processes don't catch.

There are smtp monitors, ping, tcp, http(s), ssl certificate expiration notifications - its extremely comprehensive.

rhel atomic and coreos are sharing lots of solutions I would say that they share some of the solutions with each other.

https://coreos.com/os/docs/latest/install-debugging-tools.html

https://www.google.com/amp/developers.redhat.com/blog/2015/04/21/introducing-the-atomic-command/amp/

Warning: I have zero knowledge on any Microsoft technology.

BUT: http://azure.microsoft.com/en-us/documentation/articles/cloud-services-python-how-to-use-service-management/

Apparently you just need to use the API, invoke them from your own Python scripts on your Linux environment and then you can manage your instances/vms/whatever they call in the MS world using regular orchestration tools like Ansible.

I think most cloud technologies/providers nowadays are basically a big abstract layer which you don't need to care about.

Getting an ELK stack up and running acceptably in production is non-trivial, and requires a lot of research. If you're looking to determine whether ELK can replace Splunk quickly, it's probably worth the money (hey, if you can afford Splunk...) to get some support directly from elastic.co about sizing.

Or you could try their 'as a service' offering where they manage the hosting of it for you, https://www.elastic.co/found. You can develop on a small cluster quickly, see if it fits, and if it does, $$$ to scale it up, or less $$$, more time and hassle to build it yourself.

We use Concourse CI which defines pipelines in yaml config files and runs everything in a container, so everything can be declarative. (IIRC, Jenkins Pipeline wasn't available when tools were being evaluated.)

I had a design for a pipeline builder framework which would automatically generate pipeline config to build/test/promote/deploy anything from individual microservice Docker containers up to entire products made up of groups of services/components.

We ended up building a Python library to abstract the Concourse job/task/resource model, which made it much more concise and easier than hand-editing a yaml file. It's still config-driven in the sense that a Python script defines the pipeline, and a separate job monitors the script for changes and automatically pushes the changes to Concourse.

The challenge for scaling to the enterprise-level ended up being non-technical: groups were reluctant to switch from their existing solutions, even if they weren't far along at all.

The goal should always be magic: Tests and code get checked in, gets deployed if everything looks good, rejected if not.

I agree with the overall point of learning and continuous improvement, but I think a lot of common sense and research indicates that Mean Time to Restore is a very important metric to measure and improve. And if I had to choose, I would definitely pick Mean Time to Restore over Mean Time to Retrospective. If you can measure both, great.

As an example, Time to Restore is one of four metrics included in Software Delivery and Operational Performance which predicts organizational performance, as shown in the State of DevOps reports and the related Accelerate book.

like this?

https://www.digitalocean.com/community/tutorials/how-to-set-up-nginx-load-balancing

Set your DNS to point to your LB have your LB serve up the connections. You can also you haproxy, squid, or any other flavor of loadbalancer.

If you're looking to learn step by step, these 2 guides take the cake.

Kelsey Hightowers guide is my kubernetes bible:

https://github.com/kelseyhightower/kubernetes-the-hard-way

CoreOS guide is also damn awesome:

https://coreos.com/kubernetes/docs/latest/getting-started.html

I've been using prometheus/grafana for a while, they are (in my opinion) the standard stack for Kubernetes clusters, so running in containers is not a concern, Prometheus gets its data making requests, if you are not using k8s you might need to expose a service that serves metrics, as for speed i'm not sure if it is faster, but it does use less resources on the hosts it monitors, which is always good for a tool that you are using because you care about performance on your instances.

Here you have a list of exporters, exporters are libraries to expose an endpoint like the one Prometheus expects, most of them are quite simple to implement

https://prometheus.io/docs/instrumenting/exporters/

I used to be a huge supporter of Sensu, but it never really took off and the barrier to entry was not worth it. When they came out with the paid version I thought they would get better, but for some reason the main thing was now you can run it with Java instead of ruby which made me ask why not push that to open-source version.

I wish them luck, but if anyone asks me for advice on monitoring I have 3 answers:

Are you using containers? Do you have the time to re-think how you want to do monitoring? Go with Prometheus https://prometheus.io/ it is amazing.
Want something better than Nagios? Icinga2 is a solid choice.
If you have Nagios, maybe try some automation and take a look at how Etsy run their Nagios, you might be surprised.

Devops would probably be treated as a meta tag on serverfault.

It seems very unlikely that you have a question that fits within the serverfault scope that only had the tag devops
There is a lot of miss-use and confusion around the term devops, and so it wouldn't really add much to the question.

Tags get added when people use them. Do you have a question for serverfault that needs the devops tag added? If you don't have enough rep yet, and the question is a good fit on the site, and actually needs the devops tag applied I'll add it to the question for you.

https://www.site24x7.com/ is what i have been using for a few years now. A lot cheaper then many of the alternatives and never hear much about them. Has quite a few options as far as monitoring goes (including SSL expiry)

Kubernetes is way overkill for almost everybody. Unless you are a huge corp with a need for an in-house cloud platform, it will be overkill. Kubernetes is closer to OpenStack than to AWS ECS, which is what OP was considering.

Kubernetes is also notoriously difficult to install and maintain. There is currently 43 different ways to install Kubernets and it is highly unclear which ones will still be around in a year. Once it is up an running, it is a fine product, but be ready to allocate large resources to admin it.

If you really want Kubernetes, buy it as a service from Google Cloud.

I've recently joined an ops team for a startup, with an aim to take on devops responsibilities over time.

over the last week I've been introduced to:

-kubernetes-ibm Cloud-in-house applications-Concourse CI

I know the basics, where services live, how we manage service config and env config templates using git, and how our deployments get triggered using concourse and am starting to look at helm charts.

I've been introduced to how we manage secrets, some of our logging , and internal documentation on where to learn more.

It's unreasonable to expect someone to have these skills in any significant depth when they specifically aren't referenced in your experience in some form during the hiring process.

If your immediate manager is out/in hospital, i'd be surprised if another manager isn't taking on some of his responsibilities. Ask about that. Otherwise, go to whoever is next in the chain. If you are in a team, talk to them for advice. Perhaps they can budget time for this if it's communicated.

This isn't anything to be embarrassed about, just focus on good communication.

"I want to be successful.

A lot of the tooling in this role is new to me, and I'd like help getting the lay of the land and planning for my success, can you point me in the direction of someone that can help with that while <x> is out of office?"

What does Helm solve that manifests don't? Start there and build a slideshow, a presentation, or something like that using it. If Helm doesn't solve any problems they have, why should they learn it?

For what it's worth, I'm moving off of Helm for personal projects and at work to Pulumi because it allows using a single tool as both a terraform replacement and Kubernetes manager. It solves the problem of having to teach multiple tools to engineers trying to deploy a system.

“people learn and use Jenkins because people already use Jenkins”

Accurate in my experience. Like Python for Machine Learning, it's the accrued the most venerable library of tools for the domain. That does not equate to good, but it can equate to necessary for many, many teams.

Since we like to containerize everything that isn't serverless, including all our tools, I've had success porting things to Drone. I much prefer writing standardized tools to do all our CI/CD over the sprawl of a hundred Jenkinsfiles constantly accruing idiom drift.

I´d say go with the official AWS certification learning prep material:

~~https://aws.amazon.com/de/certification/certification-prep/~~

https://aws.amazon.com/certification/certification-prep/?nc1=h_ls

or did you already?

edit: changed link to be in english

Thanks for posting this. Some of these tools were new to me.

I'd also recommend Dash (https://kapeli.com/dash) which is a "API Documentation Browser and Code Snippet Manager" and also supports numerous editors including Sublime Text (although I use it through Atom).

You get some lightweight testing out of the box using declarative pipelines if the job is configured for Multibranch Pipelines: alter the Jenkinsfile and open a PR. If there's a syntax error or some other build error in the changes, it will report to your commit a failure and if you have branch protection turned on cannot be merged. Of course it will only test what was run in the `when`blocks, so if you skip a stage for pull requests then that code won't be tested.

Agreed that there isn't any useful info about how it is supposed to be done, and along the way, you're gonna find a plugin that you want to use, only to find out after 2 hours of configuring, that it does not support the 2.0 pipeline.

Some good documentation for me, surprisingly wasn't online, but rather the pipeline syntax snippet generator button for writing Jenkinsfiles. Those are necessary as it is what gets the plugins to work, as the rest of it is simply Groovy code.

I would say: - brush up on Groovy - use the pipeline syntax generator - understanding this: https://jenkins.io/doc/pipeline/steps/workflow-durable-task-step/#code-sh-code-shell-script - and using these variables https://jenkins.trustyou.com/env-vars.html/

This will allow you to go quite far. Feel free to PM to if you have questions as I've painstakingly written quite a lot of Jenkinsfiles.

(i wanna move away from Jenkins too, it feels like this plugin is held together by plugins, without plugins, Jenkins on its own is sorta useless)

It seems to me that you might be trying to solve two problems at once here:

>it doesn't seem to have any concept of configuration as code for GCP resources like Data Store, Pubsub etc.

It strikes me that the problem you're trying to solve here is best addressed with terraform

>I'd like a tool that [our] different teams could use somewhat independently to CI and then CD their respective microservices in a nearly fully automated fashion.

There are a lot of details that this post doesn't go into, but it seems that:

You want as many batteries included as possible
You want a foundational framework that would enable you to manage infrastructure as code with something like Terraform
You want tooling that is capable of deploying to your GKE cluster(s)

Again, this is without a lot of detail specific to your environment that might be relevant, but it seems to me that Gitlab CI would be a convenient way to tick these boxes. In order to make it consumable for different teams within your organization, you'd probably want to come up with a "standard" (or as close to it as possible) .gitlab-ci.yml for teams to base theirs off.

The other responses are great too but I like this one the best.

To add a few things to the list: Jenkins, Docker, git and branching strategies, distributed lock service like Etcd, stateless applications, etc. If you have time, check out protobufs and gRPC. protos are great for defining system interfaces/contracts. I do work in Vagrant boxes running on VirtualBox but I don't get too fancy nor use Vagrant for anything more than local development/prototyping.

As for SW dev practices, I recommend reading Clean Code and/or Code Complete.

If you think you might apply to engineering-heavy companies, you should do some exercises at Leetcode. For "devops", the easy questions are fine but obviously the more you know about data structures and algos the better.

I'd recommend to use a ready made Puppet module and masterless puppet setup as described here:

https://www.digitalocean.com/community/tutorials/how-to-set-up-a-masterless-puppet-environment-on-ubuntu-14-04

Then use a ready made module to install Zabbis: https://forge.puppet.com/puppet/zabbix seems active and popular.

Try setting up a end-to-end CI/CD pipeline for an application. For your purposes, it can be a small simple app like this python app that runs in a container. You don't have to deploy using a container, but you could for bonus points.

Start by setting up an ec2 instance, and launching it from the box. Then set up a Jenkins instance on another box, and have it launch the app on a commit to the develop branch (or any branch) of your app. The last step should be configuring the instance your app is on to be deployed behind a load balancer, and possible even have a DNS entry associated with the load balancer's IP.

The end goal here is for you to push your code to GitHub from your local machine and have that app be deployed to the web without any extra steps from you.

Interesting, thanks for sharing. I employ a similar setup but prefer VS code which does this as well: https://code.visualstudio.com/docs/remote/containers

For backups, you should use a Wordpress backup plugin like updraftplus and backups should not be stored on your VPS. They should be stored remotely, such as on an Amazon S3 bucket, Dropbox or your local computer. If you lose the VM for any reason then you will lose the backups as well if they exist together.

> How can I automate backups of the webfiles (/var/www/html)?

You should not be hand editing files here and anything modified by WordPress should be backed up by a backup plugin like above. Learn to set up a clean Wordpress site and restore your backups to that (you should keep a clean version of the deployed version with your backups).

You should then be able to restore a backup to your staging environment if you wish. Although this seems backwards - most people develop in a version control repo like git and deploy to an environment (staging, then when they are happy production).

You can find the configured IP address in the registry using a registry editing tool like the one included with later versions of chntpw, called reged.

~~The IP should be in the same folder as the path on this page:~~

http://serverfault.com/questions/545032/where-to-find-wins-ip-address-in-registry

Scratch that, should be in a similar location though.

Edit: here

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces{GUID}

Hi, I think you can read the extensive doc that gitlab made about the subject. You should be able to find the answer you'r looking for there.

https://about.gitlab.com/company/culture/all-remote/guide/

Atom editor + https://atom.io/packages/search?q=ansible

Sublime Text + https://packagecontrol.io/search/ansible

YAML is pretty simple, I don't think it's worth doing much more than the above. I'm sure there are some emacs/vim options as well.

Appreciate the question! Servers are hosted in Germany -- Hetzner is an awesome hosting provider with some unbelievable pricing and I'm pretty close to them on price (if you were to use a Hetzner Cloud instance). Of course you can't just rock up and order like 50 machines so I have to do a tiny bit more management -- basically multiplexing the dedicated servers, it's so easy a caveman (who could set up a k8s cluster) could do it (tm)!

For GitLab I use the Docker executor since it has the most features available and is easiest to manage. All the runners are in their own VMs though, so it's basically docker running on a VM on a real dedicated machine.

If you don’t need complex logging system for all your apps and need to solve only webserver logs system - try https://goaccess.io I can’t say for high volume usage, but for average newsmaker sites it’s suitable simple analysis tool based on logs

Macbook Pro

iterm2 - Setup a Quake console drop down terminal, I use this all day every day. Switch virtual desktops, hit my hotkey (I use cmd-~), my terminal follows me around to each VD. Learn the hot keys for splitting windows. My drop down is usually split into two terminal sessions side by side.

http://www.karam.io/2018/Turning-iTerm-in-to-a-Quake-style-terminal-on-Mac-OS/

Spectacle - Hot keys for window management

In other words: Your best bet is to instrument the code and publish a /metrics endpoint.

You can find Prometheus libraries for many languages at: clientlibs

Package building instructions that mention neither debian/control nor debian/rules nor debian/changelog can hardly result in good packages. And parts of the article are plainly wrong. The canonical resource for learning to package for debian is the Debian New Maintainers' Guide.

The best way to learn kubernetes is from the official kubernetes documentation . The explanation is clear and newbie friendly

https://kubernetes.io/docs/home/

You can learn kubernetes with minikube and try to deploy jenkins inside it

Stateful containers (ones that need to maintain persistent data through restarts) are still somewhat of a challenge to manage. Effectively running your own Dockerized MySQL means leveraging storage options correctly, making sure the right EBS volumes are mounted to the right container instance, etc.

Kubernetes is tackling this with StatefulSets (https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/), but AWS is pushing the use of their Elastic File System for this pattern (https://aws.amazon.com/blogs/compute/using-amazon-efs-to-persist-data-from-amazon-ecs-containers/).

The easy way is to just use RDS instead. Then, you're not actually Dockerizing MySQL because you're not really managing a MySQL cluster manually. Instead, your other, stateless containers simply connect to RDS using variables defined in the environment like RDS_HOST, RDS_USERNAME, RDS_PASSWORD.

My shop uses liquibase to keep the database under change control. It's not perfect (some schema changes need to be done with an online schema change tool instead of LB so that we don't lock up tables for hours and bring the site down) but it's way better than cutting and pasting.

LSF has been replaced by Filebeat for a little while now.

Filebeat can push logs directly into elasticsearch if you don't need logstash to do parsing.

https://aws.amazon.com/certification/certified-solutions-architect-associate/

Amazon offers classes and certs for Aws solutions architect. Not saying go get a cert, but the classes or material may help you get up to speed on the Amazon products.

https://www.vagrantup.com/docs/providers/

You don't need to know virtualbox or vmware, vagrant will abstract it away. They recommend vmware though I always used virtualbox and never had any performance issues other than guest hardware limitations

I second the recommendation for Vagrant. I've been using it for local dev environments for the past few months, and it's insanely simple to use for getting a base VM up and running. Then you just provision it however else you find appropriate (I'm not familiar with Docker Swarm, but Ansible is a good route).

You can also pack everything together and make your own private base box that includes the general configuration of your software, which will reduce the ops team load as well.

https://www.vagrantup.com/docs/boxes/base.html

THIS! Here's the link on Amazon

“Don’t use chef” is good, but unhelpful advice.

Still no sign of ansible, but OpsWorks supports puppet now: https://aws.amazon.com/about-aws/whats-new/2017/11/announcing-aws-opsworks-for-puppet-enterprise/

ELK is what comes to mind if you are in a budget. Elastisearch, Kibana and LogStash.

https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-elk-stack-on-ubuntu-14-04

archbee.io - has mermaid diagrams and a native diagrams block with drag and drop and the editor is great for writing documentation.

but if you need just to describe bugs, problems, and the solution, any editor would work. I would just use the simple path (even big companies just use word for some of these things)

Configure existing ES cluster so that it can snapshot (snapshot is ES term for backup) to S3 or Google Cloud Storage.

Create a new cluster in AWS, then restore from S3. You can also do incrementally to minimize downtime: i.e. one first large sync, then a last update to catch up with latest changes since the first sync

Documentation for current ES version: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html

I was using traildash for this

https://github.com/AppliedTrust/traildash

But now it looks deprecated. AWS itself recommends some solution with help of Elastic Search and Kibana

https://aws.amazon.com/blogs/aws/cloudwatch-logs-subscription-consumer-elasticsearch-kibana-dashboards/

http://rundeck.org is not bad. It can be ran as a single vm acting as a master/slave, but can be scaled to any number of slaves. You can have any machine act as a slave via ssh keys and then rundeck will execute whatever you want on them and will track the result of that. It also has error handling (I.e. If the job fails, take "X" action). It's as simple as you want to make it but can grow to what you want/need it to be.

This is what we are using: https://nixos.org/nix/

It's like those native package managers but we wont have the problem of packages breaking each other and we can ask for EXACTLY the config we want and thats what we get

Ahh, okie. I was a little confused.

I would focus less on vendor certifications, those come and go, and IMO don't hold a lot of value.

I would also recommend focusing on looking for internship positions, or finding an open source project or similar organization to volunteer for. What kind of project depends on what you're interested in.

Shameless plug, but Prometheus is always looking for more contributors. :-)

https://prometheus.io

It’s not exactly straightforward to get going (assuming no knowledge) but it’s very well built and has demonstrated operational resilience for us over the last ~1 year we’ve been using it.

Works exceptionally well for what you want.

Almost every SaaS log management tools will allow you to do this very easily.

Logentries uses the concept of a log set and log for aggregation and search.

You could create a log set for all of the web server logs. Your colleague could create a separate log set for all the SysAdmin logs. Our agents can do this for you automatically. The free version will handle aggregation and search of data.

These videos will walk through the features and capabilities of Logentries

If you are interested in a live demo send me a direct message and I'll connect you with one of our engineers.

New Relic and Data Dog monitor your server's resources 24/7 and allow you to set up alerts.

It would seem to make a hell of a lot more sense to use those tools and be alerted when an issue appears, rather than manually have someone check to make sure it is working.

Show your legal and marketing team this: https://www.datadoghq.com/docker-adoption/

People always look forward to their reports because they give detailed analysis and useful insight on real data.

Having been on the receiving end of a marketing whitewash I can sympathize with your plight, hopefully your marketing department can learn that removing technical content from articles intended for a technical audience does nothing but make people not take it seriously. Right now it just reads that one vendor really understands docker trends, and one does not.

Having different monitoring vendors doing this kind of work is actually useful for people and I hope they just let you dig in deeper and come out with an updated version with more meat, good luck!

K8S is explicitly designed with no maintenance model. It's designed with rip/replace in mind; if you leave a k8s cluster up for 6+ months you're literally doing it wrong.

Or right from the official site: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not

"Does not provide nor adopt any comprehensive machine configuration, maintenance, management, or self-healing systems."

Even hosted to solve the rebuild, your still need CICD to get the containers on the new cluster ... Because I'm certainly not taking on that technical debt.

No... You don't play with K8S unless you plan full CICD.

For 1)

https://kubernetes.io/docs/concepts/configuration/assign-pod-node/ There's an easier way though. Just set your request cpu and memory to be as high as one full node, and no pods will be scheduled on the same node.

Running in docker or not doesn't solve the horizontal scaling problem, but it does solve the resiliency. One node dies, another one takes over with no human interaction. Mix that with proxysql and mysql orchestrator (which you would do anyway if you're not running in kubernetes), and you have something quite nice.

it's called Kubernetes Federation. There's no magic formula for this, making a mysql cross region work with or without docker/kubernetes is as hard. My feeling is that, if you need to go cross region, it would probably make sense to start sharing your database. Until then, multi zone is fine (and easy)

Thats a great shout, thanks. To be honest, I don't have any readiness and liveness probes deployed, only healthecks for the Services.

Looking at this doc it seems like readiness probe is what I need. But I'll deploy liveness probes as well, they are super useful.

What are /r/devops' favorite Products & Services? From 3.5 billion Reddit comments

The most popular Products mentioned in /r/devops:

The most popular Services mentioned in /r/devops:

Kubernetes

elasticsearch

GitLab

Jenkins

Amazon Web Services

Prometheus

Grafana

Git Bash

CircleCI

Drone.io

DigitalOcean

Concourse

Datadog

Vagrant

Let's Encrypt

The most popular Android Apps mentioned in /r/devops:

Sync for reddit

Termux

Cloud Pros- AWS Certified Arch

The most popular VPNs mentioned in /r/devops:

The most popular reviews in /r/devops:

What are /r/devops' favorite Products & Services?
From 3.5 billion Reddit comments