Honestly, just write more code. Practice, practice, practice.
IMO, the best way to do this is to contribute to SRE-related open source projects. I started contributing lots to one 6 years ago or so, and my coding skills have greatly improved.
Shameless plug for the project. There are tons of open issues where you could add things, fix things, and get good feedback on your code.
"High performing" is very vague, but I can recommend Prezi, the company I'm currently working for.
Remote position (Budapest)
Remote position (Berlin)
Each component should be actively monitored. Servers/Services going down are covered individually. Same goes for load balancers, etc.
You also want synthetic probes to cover your end-to-end needs. This will help catch the unknown-unknown problems.
The thing you want to avoid is alerting on things that users don't care about. Users don't care if the traffic rate drops to zero. Users only care about their requests. There's a subtle difference there.
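To make the synthetic probe idea concrete, here is a minimal sketch in Python. It makes one request the way a user would and judges the result from the user's side (status, content, latency). The URL, the expected text, and the 2 second latency budget are all placeholders I made up, not anything from a real setup.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    ok: bool
    status: Optional[int]
    latency_s: float
    reason: str


def http_get(url, timeout_s):
    """Fetch url and return (status_code, body_bytes)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status, resp.read()


def probe(url, expect_text, max_latency_s=2.0, fetch=http_get):
    """Make one end-to-end request and decide whether a user would be happy."""
    start = time.monotonic()
    try:
        status, body = fetch(url, max_latency_s)
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return ProbeResult(False, exc.code, time.monotonic() - start, "http error")
    except OSError as exc:
        # DNS failure, refused connection, TLS error, timeout, and so on.
        return ProbeResult(False, None, time.monotonic() - start, f"unreachable: {exc}")
    latency = time.monotonic() - start
    if status != 200:
        return ProbeResult(False, status, latency, "unexpected status")
    if expect_text.encode() not in body:
        # A 200 with the wrong page is still a failure for the user.
        return ProbeResult(False, status, latency, "unexpected content")
    if latency > max_latency_s:
        return ProbeResult(False, status, latency, "too slow")
    return ProbeResult(True, status, latency, "ok")
```

You would run something like `probe("https://example.com/login", "Sign in")` on a schedule from outside your own network and alert after a few consecutive failures. Because the probe generates its own requests, it keeps telling you something even when real traffic is at zero.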
Evaluated both VictorOps and Opsgenie. Both of them were so-so.
Like u/michaeld0 mentioned, the on-call options were not as flexible as I would have liked.
We're currently evaluating Squadcast and it looks quite promising. We found their on-call rotations to be very flexible, and they also seem to have a few readily available incident postmortem templates, which we found useful.
I agree with the overall point of learning and continuous improvement, but I think a lot of common sense and research indicates that Mean Time to Restore is a very important metric to measure and improve. And if I had to choose, I would definitely pick Mean Time to Restore over Mean Time to Retrospective. If you can measure both, great.
As an example, Time to Restore is one of four metrics included in Software Delivery and Operational Performance which predicts organizational performance, as shown in the State of DevOps reports and the related Accelerate book.
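If you want to start tracking this yourself, here is a rough sketch of computing Mean Time to Restore from your own incident log. The (detected, restored) pairs below are made-up sample data, and this is just a plain average, not how the State of DevOps reports collect the metric.

```python
from datetime import datetime, timedelta


def restore_durations(incidents):
    """Return how long each incident took, given (detected_at, restored_at) pairs."""
    return [restored - detected for detected, restored in incidents]


def mean_time_to_restore(incidents):
    """Average restore duration across all incidents."""
    durations = restore_durations(incidents)
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log: (detected_at, restored_at)
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),    # 30 min
    (datetime(2024, 1, 17, 22, 15), datetime(2024, 1, 17, 23, 45)), # 90 min
    (datetime(2024, 2, 2, 8, 0), datetime(2024, 2, 2, 9, 0)),       # 60 min
]

mttr = mean_time_to_restore(incidents)  # timedelta of 1 hour for this sample
```

A mean hides outliers, so it is probably worth looking at the individual durations (or a median) next to it.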
It might not be the ideal solution but I recently switched from Confluence to Joplin - the iOS app seems pretty solid so far, and the desktop app is great. I'm paying 2 euros a month for their cloud service but you could also just store your files in Dropbox/Google Drive/OneDrive etc.
Automate the Boring Stuff is my fave book to recommend to complete novices https://www.amazon.com/dp/1593279922/
Keep Golang on your radar too once you get good with Python! It's an easy transition and gives you a bit more flexibility.
Something along the lines of the GitLab postmortem of their database outage?
https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
So from the perspective of someone who used Backstage extensively: this is an AWESOME tool! But only if you're at a scale at which you spin up one fresh microservice a day, and you're pretty big at that point. I'm building (full disclosure here) humanitec.com, which is a way to build an Internal Developer Platform really lean and really fast. Happy to give you a sneak peek: calendly.com/gruenberg. Also: sorry for advertising so bluntly, but this was too spot on :)
Haha, yeah, I had a similar problem running operations at my last company. They wanted to implement a true SRE model, and I said sure, just give me the head count and I would be glad to have a system admin/developer on every one of your teams. Or, let's get an existing engineer some sys admin training and they can be your SRE. Of course, yeah, that will probably cut into your productivity of product development.....soo......you still want to do "SRE"?
Quickly they realized that Google's "SRE" model, pedantically implemented with dev/sys-admin skill crossovers, doesn't make sense for every organization.
I agree with your document approach, although no one ever seems to want to fill them out, and no one can ever keep track of where they are. If solving for it today, I would probably set up a beautiful questionnaire using https://www.typeform.com/. It's surprising how beauty and simplicity can inspire good behavior from engineers. No one likes a fucking Jira form, bleh!
I think you are asking all the right questions though. Good luck!
macOS has switched to zsh as the default shell. They stopped shipping new versions of bash a long time ago because of the license change to GPLv3.
I highly recommend the Homebrew package manager. A good first step is to tweak your PATH so that the GNU versions of base tools like awk and grep come first. macOS defaults to the BSD variants of these tools.
Honestly, once you get the system set up, doing SRE work from macOS isn't terrible.
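For the PATH tweak mentioned above, here is a small sketch that builds the `export PATH=...` line you could paste into your `~/.zshrc`. It assumes Homebrew lives under `/opt/homebrew` (typical on Apple Silicon; Intel Macs usually use `/usr/local`) and that you have installed the `coreutils`, `grep`, `gawk`, and `gnu-sed` formulae, whose unprefixed GNU binaries sit in a `libexec/gnubin` directory. Check the paths on your own machine before trusting it.

```python
import os

# Assumption: Homebrew prefix on Apple Silicon; adjust to /usr/local on Intel Macs.
BREW_PREFIX = "/opt/homebrew"

# Assumption: these formulae are installed via `brew install`.
GNUBIN_DIRS = [
    f"{BREW_PREFIX}/opt/{formula}/libexec/gnubin"
    for formula in ("coreutils", "grep", "gawk", "gnu-sed")
]


def gnu_first(path, gnubin_dirs=GNUBIN_DIRS):
    """Return PATH with the gnubin directories moved to the front, no duplicates."""
    rest = [p for p in path.split(os.pathsep) if p and p not in gnubin_dirs]
    return os.pathsep.join(list(gnubin_dirs) + rest)


def export_line(path=None):
    """Build the shell line to add to ~/.zshrc, based on the current PATH."""
    current = os.environ.get("PATH", "") if path is None else path
    return f'export PATH="{gnu_first(current)}"'
```

Putting the GNU directories first means a plain `grep` or `awk` resolves to the GNU build, while the BSD versions stay available under `/usr/bin` if a script needs them.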
Splunk made some cool acquisitions and launched this recently. It’s all very high tech and a bit more for cutting-edge SRE (fully programmable, OpenTelemetry based). Of the ones you mentioned, you can’t go wrong with any of them if you just want a place to look at your data. Datadog is probably the most popular, but my shop switched to Splunk/SignalFx because we wanted next-gen features and we are bought into OpenTelemetry.
https://www.splunk.com/en_us/blog/conf-splunklive/introducing-the-splunk-observability-suite.html
Hosted Grafana is also doing some cool stuff, although it’s a little further out until it’s fully developed.
There are slightly differing opinions on this, but they're all kinda side-of-the-same-coin.
My old work used custom Zabbix alerts to kick off Python and Ansible scripts. Zabbix can also do remote commands. I was not the one to set it up so I unfortunately can't give too many details, but I know it wasn't super complicated.
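I can't reproduce that setup, but a rough sketch of the general shape might look like the script below: the monitoring system calls it with a host and a trigger name, and it maps the trigger to an Ansible playbook. The trigger names, playbook paths, and inventory file are hypothetical, and how Zabbix passes the arguments (for example via macros like `{HOST.NAME}` and `{TRIGGER.NAME}`) is an assumption about your configuration.

```python
import subprocess
import sys

# Hypothetical mapping from trigger name to remediation playbook.
PLAYBOOKS = {
    "nginx down": "playbooks/restart_nginx.yml",
    "disk almost full": "playbooks/clean_tmp.yml",
}


def build_command(host, trigger, inventory="inventory.ini"):
    """Return the ansible-playbook command for this alert, or None if unmapped."""
    playbook = PLAYBOOKS.get(trigger.strip().lower())
    if playbook is None:
        return None
    # --limit restricts the run to the host that raised the alert.
    return ["ansible-playbook", "-i", inventory, "--limit", host, playbook]


def main(argv):
    """Entry point: argv is [script, host, trigger name]."""
    if len(argv) != 3:
        print("usage: remediate.py <host> <trigger name>", file=sys.stderr)
        return 2
    cmd = build_command(argv[1], argv[2])
    if cmd is None:
        print(f"no playbook mapped for {argv[2]!r}; leaving it to a human")
        return 0
    return subprocess.run(cmd, check=False).returncode
```

Saved as a script, you would finish it with `sys.exit(main(sys.argv))` under a `__main__` guard. Keeping the mapping explicit means an unknown alert does nothing automatic, which seems like the safer default for auto-remediation.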
This might be of some use to you. It came to me highly recommended, but I haven't gotten a chance to read it yet.
Practice of System and Network Administration, The: Volume 1: DevOps and other Best Practices for Enterprise IT https://www.amazon.com/dp/B01MFCSNQZ/ref=cm_sw_r_apan_glt_SN7SNH8YT21XY3QTBGYZ
In my opinion, this may not bring you close to a traditional SRE role...although if you continue with a devops mindset in the DB world, you could rebrand yourself as a “Database Reliability Engineer”. I am not kidding, this is a real role at some companies, and there is also a book on the topic, Database Reliability Engineering, co-authored by Charity Majors.
I do see it as an opportunity to pivot into reliability engineering with a specific focus (databases in this case).