Phoenix Project is a solid high-level overview of the concepts, and leads directly into the author's next book, The DevOps Handbook, which really digs into the details.
Ok, this was painful to read, as are many of the comments around making deployments harder to do.
1) Please read DevOps Handbook and preach the learnings from your reading from the rooftops of your company.
2) Building off of that, it sounds like y'all need some better telemetry and the ability to do gradual deployments/dark-ship features so that you can do an easy cutover. "Big Bang" deployments like this will always, at some point, cause catastrophic failures like you've seen - this is a cultural/company problem, not a YOU problem.
3) Why on earth, with a deployment strategy/application setup like this, would your sales team EVER be demoing against prod? Use a dedicated demo environment.
4) Until you're able to get a more mature approach to handling deployments, definitely give more advance notice. Week out, day out, day of, 1 hour before, 30 minutes before, etc...
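To make the gradual deployment / dark-ship idea from point 2 a bit more concrete, here is a minimal sketch of a percentage-based feature flag. The function names (`rollout_bucket`, `is_enabled`) and the hashing scheme are my own assumptions for illustration, not something taken from the book or from any particular flag library.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature (hypothetical helper)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """Return True if the feature is on for this user at the given rollout percentage."""
    # The bucket never changes for a given user and feature, so raising
    # `percent` only ever adds users; nobody flips back and forth.
    return rollout_bucket(user_id, feature) < percent
```

The new code ships to prod switched off (`percent=0`), then you raise the percentage while watching your telemetry, and set it back to 0 if something looks wrong. The cutover becomes a config change instead of a big-bang deploy.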
It's kind of jumping directly into the deep end, but I'd recommend The DevOps Handbook as it covers every imaginable modern software development process. If you're not familiar with a lot of the stuff in it, don't sweat it. Learn what you can from what sounds somewhat understandable.
Love the downvotes without comments... the assertion above is taken almost verbatim from a talk I went to where Gene Kim was presenting a couple of years ago.
> give a vague impression that they all do the same thing.
Lots of tooling does overlap but each one has one area it excels at - some excel at the same area.
So, you have done a good job so far, it seems like most of your stuff is automated to a good degree and you have identified where your weaknesses are.
You should tackle one thing at a time: identify your largest bottleneck or problem and work to solve that first. In the same vein, only introduce one new tool at a time. Each takes some time to learn and to implement correctly. Trying to do too much at once will just cause problems.
You have already identified the weaknesses so focus on solving these, starting with what you think is causing the most issues.
> - One server per environment is obviously not super scalable
Look into HA setups. How you do this and how much work it is depends on your application. Typically there are two parts to applications: work and state. Work (such as processing requests) is easy to scale if it contains no state. Just add another server to the environment and load balance between them. For this you need a load balancer (HAProxy or Nginx work well, though there are many others to choose from) and to move any state off the node you want to scale.
There are many forms of state. Most will be stored in a database, but you should also pay attention to session state, which is sometimes stored in memory on the node. If you have anything like this you will need to do some work to move it into shared storage (such as your existing database, Redis, Memcached, etc.).
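Here is a rough sketch of why moving session state off the node matters. Everything in it is hypothetical: `SessionStore` is an in-memory stand-in for something like Redis or your database, and `RoundRobinBalancer` only imitates what HAProxy or Nginx would do for you.

```python
from itertools import cycle


class SessionStore:
    """In-memory stand-in for a shared store such as Redis or a database (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        # Return a copy so nodes can only change state by writing it back.
        return dict(self._data.get(session_id, {}))

    def put(self, session_id, session):
        self._data[session_id] = dict(session)


class AppNode:
    """A stateless worker: it keeps nothing between requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id):
        session = self.store.get(session_id)
        session["hits"] = session.get("hits", 0) + 1
        self.store.put(session_id, session)
        return self.name, session["hits"]


class RoundRobinBalancer:
    """Sends each request to the next node in turn."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def route(self, session_id):
        return next(self._nodes).handle(session_id)
```

Because the hit counter lives in the shared store, it keeps counting correctly even though consecutive requests for the same session land on different nodes. If each node kept the counter in its own memory, every node would see a different value.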
> - No sense of automatic provisioning, we do that "by hand" and write the IPs to a config file per environment
There are loads of tools to help with this.
Terraform for provisioning infrastructure.
Ansible, Chef, SaltStack or Puppet for provisioning nodes (I recommend starting with Ansible, though any of them will work).
There is nothing wrong with using bash scripts to glue things together or even do provisioning while you learn to use these tools. I would not shy away from them, but do recognize the benefits each tool provides over just bash scripts. Take your time to learn them and stick with what you know and what works for you while you do. Introduce them a little bit at a time rather than trying to convert your entire infrastructure to use them in one go.
> - Small amounts of downtime per deploy, even if tests pass
This is easiest if you have an HA setup. You can do it without one, but it involves just as much work and basically follows the same steps as creating an HA setup. In short, with multiple nodes you can upgrade them one at a time until everything has been upgraded. There are always some nodes running either the old or the new version, so everything will continue to work.
You can either update nodes in place, or create new ones (if you have automated their provisioning) and delete the old ones when the new ones are up and working (see immutable infrastructure for this pattern, also canary deploys and blue/green deploys for different strategies).
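The one-node-at-a-time upgrade can be sketched roughly like this. It is a toy model under my own assumptions: `Node`, `in_rotation` and `health_check` are made-up names, and the version assignment stands in for whatever your real install step is.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    in_rotation: bool = True  # whether the load balancer sends traffic here


def rolling_upgrade(nodes, new_version, health_check):
    """Upgrade nodes one at a time; return the name of the node that failed, or None."""
    for node in nodes:
        node.in_rotation = False      # drain: stop sending traffic to this node
        node.version = new_version    # placeholder for the real install step
        if not health_check(node):
            return node.name          # leave it drained and stop the rollout
        node.in_rotation = True       # healthy: put it back behind the load balancer
    return None
```

If a node fails its health check, the rollout stops with that node out of rotation, and the nodes that have not been touched yet keep serving the old version, so users never hit the broken one.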
> - If tests fail, manual intervention required (no rollback or anything) - though we do usually catch problems somewhere before production
Tests should be run before you deploy. These should run on a build server, or ideally a CI system. Ideally these should not only run before all deployments, but also for all commits to your code base. This way you can spot things failing much sooner and thus fix them when they are cheaper to fix. You also likely want to expand on the number of tests you do and what they cover (though this is always true).
Rollbacks should also be as easy as deploying the old version of the code. They should be no more complex than deploying any other version of your code.
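As a small illustration of "a rollback is just another deploy", here is a sketch with a made-up `Deployer` class. The `deploy` body is a placeholder for your real deploy step; the only point is that `rollback` goes through exactly the same code path.

```python
class Deployer:
    """Hypothetical deploy wrapper that records what was released, in order."""

    def __init__(self):
        self.history = []

    @property
    def current(self):
        return self.history[-1] if self.history else None

    def deploy(self, version):
        # Placeholder for the real deploy step (copy artifact, restart, etc.).
        self.history.append(version)
        return version

    def rollback(self):
        """Roll back by deploying the previously released version."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        # No special rollback machinery: it is an ordinary deploy of an old version.
        return self.deploy(self.history[-2])
```

This only works if old versions stay deployable, i.e. you keep the built artifacts around and your deploy does not depend on anything that only exists for the newest version.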
> - Bash scripts to do all this get pretty hairy and stay that way
Nothing wrong with some bash scripts. Work to keep them in order and replace them with better tooling as you learn/discover it.
I have mentioned a few tools here, but there are many more depending on exactly the problems you need to solve. Tackle each problem one at a time and do your research around the areas you have identified. Learn the tools you think will be helpful before you try to put them in production (i.e. do some small-scale trials to see if they are fit for purpose). Then slowly roll them out to your infrastructure, using them to control more and more things as you gain confidence in them.
For everything you have said there is no one solution and as long as you incrementally improve things towards the goal you have you will be adding a lot of value to your business.
For now you need to decide on which is the biggest problem you face and focus your efforts on solving that - or at least making it less of a problem for now so you can focus on the next biggest problem. Quite often you will resolve the same problems in different, hopefully better, ways as you learn more and as your overall infrastructure, developmental practices and knowledge improves.
Also, The Twelve-Factor App is worth a read, as are Google's SRE book and The DevOps Handbook. The Phoenix Project is also a good read.
These are more about the philosophy of DevOps, so they are worth a read but won't solve your immediate issues. Reading around different topics is always a good idea, especially about what others have done to solve the problems you are facing. It will give you different perspectives and links to good tools you can use to solve the problems you face.
Not OP, but interested to read.
Doing a quick search, found this on amazon: The DevOps Handbook (October 2016)
Is that the correct book?
I just read about this in the DevOps Handbook. OP, if you have a copy, take a look at Chapter 19: Enable and Inject Learning into Daily Work.
It talks a lot about creating a culture of blameless postmortems and stuff, but here's an excerpt about Etsy's Morgue you might find interesting:
> This desire to conduct as many blameless post-mortem meetings as necessary at Etsy led to some problems—over the course of four years, Etsy accumulated a large number of post-mortem meeting notes in wiki pages, which became increasingly difficult to search, save, and collaborate from.
> To help with this issue, they developed a tool called Morgue to easily record aspects of each accident, such as the incident MTTR and severity, better address time zones (which became relevant as more Etsy employees were working remotely), and include other data, such as rich text in Markdown format, embedded images, tags, and history.
> Morgue was designed to make it easy for the team to record:
> - Whether the problem was due to a scheduled or an unscheduled incident
> - The post-mortem owner
> - Relevant IRC chat logs (especially important for 3 a.m. issues when accurate note-taking may not happen)
> - Relevant JIRA tickets for corrective actions and their due dates (information particularly important to management)
> - Links to customer forum posts (where customers complain about issues)
> After developing and using Morgue, the number of recorded post-mortems at Etsy increased significantly compared to when they used wiki pages, especially for P2, P3, and P4 incidents (i.e., lower severity problems). This result reinforced the hypothesis that if they made it easier to document post-mortems through tools such as Morgue, more people would record and detail the outcomes of their post-mortem meetings, enabling more organizational learning.
Well, for DevOps you have to know programming languages, but not necessarily everything related to front-end.
Kubernetes and Docker are a must, you can't escape them ;D Then add whichever cloud you like. I'm focusing on AWS as my whole organization is on it, but Google and Microsoft are popular as well.
I'm reading a decent book right now: The DevOps Handbook by Gene Kim
https://www.amazon.co.uk/Devops-Handbook-World-Class-Reliability-Organizations/dp/1942788002
Highly recommended. If you can grab it for cheap, it's a good book before bed :D
You need to step back and talk to the developers, each and every one of them. Take note of their comments and let them speak their mind, reassuring them that the comments will reach management filtered and anonymized.
Questions to ask are things like: What do you spend time on that you think is wasted (i.e. are there pain points in the development workflow)? Do you like your tools? Do you need resources to get more proficient with your tools? When talking about tools, I group together all the software that a developer comes in contact with, from the ticketing/tracking solution to the IDE, the build tools, etc.
In my case the big pain points can be solved with very little monetary investment. Every system has tech debt; you need to reduce it to the minimum. DevOps is all about the developer experience: you need to be accessible and let the whole team tell you in confidence about all of their pains on the job. If someone does not like something, they usually have a poor understanding of it.
Also read this book: https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002 It does not mention tech at all and goes on to describe the management structure of various high-performance orgs, like an auto factory: how people look at a flow chart of how work is done and then fine-tune the processes and the tooling to get the job done with minimum labor, cost and interpersonal friction, with the ultimate goal of making a good and reliable org that can sustain high output with great results.
Let me add a comment too, even though the folks above have already given you plenty of information.
In general, DevOps does not officially exist as a role. It is a mindset that fits mainly in teams doing Scrum or Kanban, where the maintainers of a product are also the owners of its deployment, so in theory they are the ones who write the CI/CD and so on.
Now, not only in Greece but worldwide, having a team that works this way is quite hard, because apart from something new coming out every few months, it is not a given that everyone will have the knowledge to maintain the deployment side of an application as well.
So the role of DevOps engineer was created, where you essentially have the knowledge of an old-school sysadmin plus knowledge of newer practices, such as how to build a CI/CD pipeline, keeping all of your infrastructure in the form of code (Infrastructure as Code), etc.
Now for the important part: if you have no experience, a certificate might help. In general, try to have a good GitHub profile and don't be afraid to apply even without experience, because as far as I know there are still very few people in Greece working in this area.
General tips: you really are a bit of a support line for the other developers on a team. It is good to have an "automate everything" mindset and to try to pass it on to the rest of the team (this is where social skills matter quite a bit).
Three very good books, even though I'm not a fan of tech books in general: https://sre.google/books/ https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002
P.S. I hope you like writing YAML files 😅 and sorry for the wall of text.
There are quite a few good books on the topic:
And there are a few great online trainings on:
Amazon, Google and Microsoft also have great materials specific to their technologies.
Also, there are definitely a few good YouTube channels and subreddits, but I'm not familiar with those.
Glad to see there is a recommended reading list at the top that includes a few of the books I was looking at. That being said, I still have a few questions, as I've seen these three recommended quite a bit:
https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002
https://www.amazon.com/Accelerate-Software-Performing-Technology-Organizations/dp/1942788339
Looks like Gene Kim is involved in all of them and Jez Humble in two of them...
Anyway, regarding The Phoenix Project and The DevOps Handbook, is it one, the other or both? Does Accelerate pretty much cover the same material? Is there one book that would cover all bases?
Have you read any of the DevOps Reports or The DevOps Handbook?
For example, The DevOps Handbook has case studies from Google, Amazon, Facebook, Etsy and Netflix. I suppose they know what they are doing.
I could also say that in my career, trunk-based development has been one of the key factors in moving towards DevOps, and I'm talking about projects with 200+ devs. But for you I'm just one voice on the internet. Better that you read The DevOps Handbook; reading The Phoenix Project won't hurt either.
The DevOps Handbook has some good stuff on these topics, in addition to what others here have said.
This is the Holy Bible of DevOps. Well, it's the Old Testament. The New Testament is https://www.amazon.com/DevOps-Handbook-World-Class-Reliability-Organizations/dp/1942788002