# How they SRE

![Alt](banner.png "banner")

> A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

## Introduction

Inspired by [Howtheytest](https://github.com/abhivaikar/howtheytest) by [Abhijeet Vaikar](https://github.com/abhivaikar), __How They SRE__ is a curated knowledge repository of the SRE best practices, tools, techniques, and culture adopted by leading technology and tech-savvy organizations.

Many organizations regularly share their best practices, tools, and techniques, and offer insight into their engineering culture, on public platforms such as engineering blogs, conferences, and meetups. The content in this repository is curated from these sources.

### Topics

* Site Reliability Engineering
* Hiring and Building SRE teams
* SRE Culture
* DevOps
* Monitoring & Observability
* Alerting
* Incident Management & Incident Response
* Post-Mortem
* On-Call
* Testing in Production
* Chaos Engineering
* Automation
* Performance

## Organizations
Airbnb #### Blog Posts * [Detecting Vulnerabilities With Vulnture](https://medium.com/airbnb-engineering/detecting-vulnerabilities-with-vulnture-f5f23387f6ec) * [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f) * [When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb](https://medium.com/airbnb-engineering/when-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)
Algolia #### Blog Posts * [May 30 SSL incident](https://www.algolia.com/blog/may-30-ssl-incident/) * [A Journey Into SRE](https://www.algolia.com/blog/a-journey-into-sre/)
Asana #### Blog Posts * [How Asana ships stable web application releases](https://blog.asana.com/2021/01/asana-engineering-ships-web-application-releases/) * [Analysis of recent downtime & what we’re doing to prevent future incidents](https://blog.asana.com/2019/09/downtime-what-were-doing-to-prevent-future-downtime/) * [Developer environment: Achieving reliability by making it fast to reset](https://blog.asana.com/2017/07/developer-environment-making-it-reliable-by-making-it-fast-to-reset/)
ASOS #### Blog Posts * [Cyber Security @ ASOS.com](https://medium.com/asos-techblog/cyber-security-asos-com-7d1d1f346e57) * [Security Operations 24x7](https://medium.com/asos-techblog/security-operations-24-x-7-2e90c8e5e7e) * [The skills we look for in Cyber Security Incident Response](https://medium.com/asos-techblog/the-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)
Atlassian #### Blog Posts * [Best practices for change management in the age of DevOps](https://www.atlassian.com/engineering/best-practices-for-change-management-in-the-age-of-devops) * [Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code](https://www.atlassian.com/engineering/automated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code) * [How to export Kubernetes events for observability and alerting](https://www.atlassian.com/engineering/how-to-export-kubernetes-events-for-observability-and-alerting)
Baidu #### Videos * [Anomaly Detection on Golden Signals](https://www.usenix.org/conference/srecon19asia/presentation/chen-yu) * [NetRadar: Monitoring the Datacenter Network](https://www.usenix.org/conference/srecon19asia/presentation/chen-yun)
Basecamp #### Blog Posts * [Inside a CODE RED: Network Edition](https://m.signalvnoise.com/inside-a-code-red-network-edition/) * [Three Basecamp outages. One week. What happened?](https://m.signalvnoise.com/three-basecamp-outages-one-week-what-happened/) * [Basecamp 2 and Basecamp 3 search outage report](https://m.signalvnoise.com/basecamp-2-and-basecamp-3-search-outage-report/) * [Reducing Incident Escalations at Basecamp](https://m.signalvnoise.com/reducing-incident-escalations-at-basecamp/)
Bloomberg #### Videos * [Capacity Planning and Performance Enhancement with Page Reference Sampling](https://www.usenix.org/conference/srecon20americas/presentation/chen) * [Why SREs can't afford to NOT do Chaos Engineering](https://www.usenix.org/conference/srecon20americas/presentation/pawlikowski) * [Tracing Real-Time Distributed Systems](https://www.usenix.org/conference/srecon19emea/presentation/yakimov) * [The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation](https://www.usenix.org/conference/srecon19asia/presentation/sorensen) * [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https://www.usenix.org/conference/srecon19americas/presentation/chen)
Booking.com #### Videos * [SLOs for Data-Intensive Services](https://www.usenix.org/conference/srecon19emea/presentation/fouquet) * [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https://www.usenix.org/conference/srecon19americas/presentation/iacoboaia)
Capital One #### Blog Posts * [Automate AWS Infrastructure with Boto 3: AWS Health Check](https://medium.com/capital-one-tech/automate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075) * [Active-Active Shared-Nothing Database Architecture](https://medium.com/capital-one-tech/active-active-shared-nothing-database-architecture-304957ffb89) * [The 3 R’s of SREs: Resiliency, Recovery & Reliability](https://medium.com/capital-one-tech/the-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b) * [5 Steps to Getting Your App Chaos Ready](https://medium.com/capital-one-tech/5-steps-to-getting-your-app-chaos-ready-capital-one-a5b7b3cb8e09) * [4 Real-World Scenarios That Read Like Chaos Engineering Experiments](https://medium.com/capital-one-tech/4-real-world-scenarios-that-read-like-chaos-engineering-experiments-8dbf40c5f247) * [Embrace the Chaos … Engineering](https://medium.com/capital-one-tech/embrace-the-chaos-engineering-203fd6fc6ff7) * [3 Lessons Learned From Implementing Chaos Engineering at Enterprise](https://medium.com/capital-one-tech/3-lessons-learned-from-implementing-chaos-engineering-at-enterprise-28eb3ffecc57) * [A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy](https://medium.com/capital-one-tech/seamless-blue-green-deployment-using-aws-codedeploy-4c36c0bbeef4) * [Secure Docker Containers Require Secure Applications](https://medium.com/capital-one-tech/secure-docker-containers-require-secure-applications-75eb358abef9) * [4 Steps for Pairing the Cloud and DevOps to Improve Resiliency](https://medium.com/capital-one-tech/4-steps-for-pairing-cloud-and-devops-to-improve-resiliency-c72fe2e52b05) * [Container Ready Applications with Twelve-Factor App and Microservices Architecture](https://medium.com/capital-one-tech/container-ready-applications-with-twelve-factor-app-and-microservices-architecture-16af683a767f) * [Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on 
AWS](https://medium.com/capital-one-tech/deploying-with-confidence-strategies-for-canary-deployments-on-aws-7cab3798823e) * [Architecting for Resiliency](https://medium.com/capital-one-tech/architecting-for-resiliency-9ec663db5c94) * [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https://medium.com/capital-one-tech/continuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d) * [The Mon-ifesto Part 1: Metrics](https://medium.com/capital-one-tech/the-mon-ifesto-part-1-metrics-808f6c944765) #### Major incidents & analysis reports * [Information on the Capital One Cyber Incident](https://www.capitalone.com/facts2019/) * [A Case Study of the Capital One Data Breach](http://web.mit.edu/smadnick/www/wp/2020-16.pdf) #### Videos * [Banking on Continuous Delivery - Capital One](https://www.youtube.com/watch?v=_DnYSQEUTfo) * [Continuous Chaos in DevOps - Capital One](https://www.youtube.com/watch?v=U_Uh5RMCwPI) * [DevOps at Capital One: Focusing on Pipeline and Measurement](https://www.youtube.com/watch?v=6Q0mtVnnthQ) * [Automating the Management of the Operational Health of Cloud Accounts at Scale](https://www.usenix.org/conference/srecon19americas/presentation/walls)
DBS #### Blog Posts * [Site Reliability Engineering at DBS Bank](https://medium.com/dbs-tech-blog/site-reliability-engineering-at-dbs-bank-32c02228ccf4) #### Videos * [SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS](https://www.youtube.com/watch?v=URwkaRbOLxI&feature=emb_title)
Dropbox #### Blog Posts * [Monitoring server applications with Vortex](https://dropbox.tech/infrastructure/monitoring-server-applications-with-vortex) * [Athena: Our automated build health management system](https://dropbox.tech/infrastructure/athena-our-automated-build-health-management-system) #### Videos * [Service Discovery Challenges at Scale](https://www.usenix.org/conference/srecon19americas/presentation/nigmatullin)
Facebook #### Videos * [A Customer Service Approach to SRE](https://www.usenix.org/conference/srecon19emea/presentation/looney) * [How (Not) to Scale a Project: A Post-Mortem](https://www.usenix.org/conference/srecon19asia/presentation/bagnoli) * [Releasing the World's Largest Python Site Every 7 Minutes](https://www.usenix.org/conference/srecon19asia/presentation/wong-shuhong) * [Using ML to Automate Dynamic Error Categorization](https://www.usenix.org/conference/srecon19asia/presentation/davoli)
Fastly #### Videos * [SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager](https://www.usenix.org/conference/srecon19americas/presentation/wohlner) * [Resilience Engineering Mythbusting](https://www.usenix.org/conference/srecon19americas/presentation/gallego)
eBay #### Blog Posts * [Resiliency and Disaster Recovery with Kafka](https://tech.ebayinc.com/engineering/resiliency-and-disaster-recovery-with-kafka/) * [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https://tech.ebayinc.com/engineering/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/) * [SRE Case Study: Mysterious Traffic Imbalance](https://tech.ebayinc.com/engineering/sre-case-study-mysterious-traffic-imbalance/) * [Zero Downtime, Instant Deployment and Rollback](https://tech.ebayinc.com/engineering/zero-downtime-instant-deployment-and-rollback/) #### Videos * [Madaari: Ordering for the Monkeys](https://www.usenix.org/conference/srecon19americas/presentation/raina)
Etsy #### Blog Posts * [Etsy’s Debriefing Facilitation Guide for Blameless Postmortems](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/) * [Opsweekly: Measuring on-call experience with alert classification](https://codeascraft.com/2014/06/19/opsweekly-measuring-on-call-experience-with-alert-classification/) * [Demystifying Site Outages](https://blog.etsy.com/news/2012/demystifying-site-outages/) * [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/) #### Videos * [Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe](https://www.youtube.com/watch?v=LdOe18KhtT4) * [Migrating a Monolith to the Cloud](https://www.usenix.org/conference/srecon19americas/presentation/govande)
Expedia #### Blog Posts * [The Cost of 100% Reliability](https://medium.com/expedia-group-tech/the-cost-of-100-reliability-ecb2901f23a4) * [Creating Monitoring Dashboards](https://medium.com/expedia-group-tech/creating-monitoring-dashboards-1f3fbe0ae1ac) * [Using Bash for DevOps](https://medium.com/expedia-group-tech/using-bash-for-devops-7046eed1aa63)
GitHub #### Blog Posts * [Deployment reliability at GitHub](https://github.blog/2021-02-03-deployment-reliability-at-github/) * [Improving how we deploy GitHub](https://github.blog/2021-01-25-improving-how-we-deploy-github/) * [Building On-Call Culture at GitHub](https://github.blog/2021-01-06-building-on-call-culture-at-github/) * [Reducing flaky builds by 18x](https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/) * [The evolving role of operations in DevOps](https://github.blog/2020-12-03-the-evolving-role-of-operations-in-devops/) * [Getting started with DevOps automation](https://github.blog/2020-10-29-getting-started-with-devops-automation/) * [MySQL High Availability at GitHub](https://github.blog/2018-06-20-mysql-high-availability-at-github/) #### Major incidents & analysis reports * [GitHub Availability Report: January 2021](https://github.blog/2021-02-02-github-availability-report-january-2021/) * [GitHub Availability Report: December 2020](https://github.blog/2021-01-06-github-availability-report-december-2020/) * [GitHub Availability Report: November 2020](https://github.blog/2020-12-02-availability-report-november-2020/) * [GitHub Availability Report: August 2020](https://github.blog/2020-09-02-github-availability-report-august-2020/) * [GitHub Availability Report: July 2020](https://github.blog/2020-08-05-github-availability-report-july-2020/) * [Introducing the GitHub Availability Report](https://github.blog/2020-07-08-introducing-the-github-availability-report/) * [February service disruptions post-incident analysis](https://github.blog/2020-03-26-february-service-disruptions-post-incident-analysis/) * [October 21 post-incident analysis](https://github.blog/2018-10-30-oct21-post-incident-analysis/) * [February 28th DDoS Incident Report](https://github.blog/2018-03-01-ddos-incident-report/) * [Incident Report: Inadvertent Private Repository Disclosure](https://github.blog/2016-10-28-incident-report-inadvertent-private-repository-disclosure/) 
#### Videos * [One on One SRE](https://www.usenix.org/conference/srecon19americas/presentation/tobey)
Google #### Blog Posts * [SRE Practices & Processes](https://sre.google/resources/#practicesandprocesses) * [Three months, 30x demand: How we scaled Google Meet during COVID-19](https://cloud.google.com/blog/products/g-suite/keeping-google-meet-ahead-of-usage-demand-during-covid-19) * [SRE Classroom: Distributed PubSub](https://sre.google/resources/practices-and-processes/distributed-pubsub/) #### Books * [Building Secure & Reliable Systems](https://static.googleusercontent.com/media/sre.google/en//static/pdf/building_secure_and_reliable_systems.pdf) * [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/) * [The Site Reliability Workbook](https://sre.google/workbook/table-of-contents/) #### Videos * [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/uTEL8Ff1Zvk) * [‘Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/y2ILKr8kCJU) * [‘Pragmatic Automation’ with Max Luebbe of GCP](https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s) * [Must Watch! 
- Google SRE YouTube Playlist](https://www.youtube.com/playlist?list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj) * [Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit](https://www.usenix.org/conference/srecon20americas/presentation/stanke) * [Implementing Distributed Consensus](https://www.usenix.org/conference/srecon20americas/presentation/ludtke) * [The SRE I Aspire to Be](https://www.usenix.org/conference/srecon19emea/presentation/aknin) * [SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19emea/presentation/perry) * [Zero Touch Prod: Towards Safer and More Secure Production Environments](https://www.usenix.org/conference/srecon19emea/presentation/czapinski) * [All of Our ML Ideas Are Bad (and We Should Feel Bad)](https://www.usenix.org/conference/srecon19emea/presentation/underwood) * [The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It](https://www.usenix.org/conference/srecon19emea/presentation/desai) * [Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program](https://www.usenix.org/conference/srecon19emea/presentation/petoff) * [Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way](https://www.usenix.org/conference/srecon19emea/presentation/gleason) * [Practical Instrumentation for Observability](https://www.usenix.org/conference/srecon19asia/presentation/krabbe) * [What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services](https://www.usenix.org/conference/srecon19asia/presentation/sato) * [Unified Reporting of Service Reliability](https://www.usenix.org/conference/srecon19asia/presentation/zhang) * [How to Trade off Server Utilization and Tail Latency](https://www.usenix.org/conference/srecon19asia/presentation/plenz) * [Keeping the Balance: Internet-Scale Loadbalancing 
Demystified](https://www.usenix.org/conference/srecon19americas/presentation/nolan-loadbalancing) * [From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services](https://www.usenix.org/conference/srecon19americas/presentation/virji) * [Mindfulness in SRE: Monitoring and Alerting for One's Self](https://www.usenix.org/conference/srecon19americas/presentation/lutz) * [Pragmatic Automation](https://www.usenix.org/conference/srecon19americas/presentation/luebbe) * [Sublinear Scaling in Practice: The 1k SRE Project](https://www.usenix.org/conference/srecon19americas/presentation/rath) * [Strategies to Edit Production Data](https://www.usenix.org/conference/srecon19americas/presentation/qiu) * [The Curse of SRE Autonomy and How to Manage It](https://www.usenix.org/conference/srecon19americas/presentation/bondi) * [Scaling SRE Organizations: The Journey from 1 to Many Teams](https://www.usenix.org/conference/srecon19americas/presentation/franco) * [SRE Classroom - How to Design a Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19americas/presentation/thomas) * [Using PRDs and User Journeys to Design User-Friendly Tools](https://www.usenix.org/conference/srecon19americas/presentation/stockman)
Gojek #### Blog Posts * [Why We Swear by the RCA](https://blog.gojekengineering.com/why-we-swear-by-the-rca-f535fd5abbcb)
Grab #### Blog Posts * [Our Journey to Continuous Delivery at Grab (Part 1)](https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab) * [Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering](https://engineering.grab.com/beyond-retries-part-3) * [Orchestrating Chaos using Grab's Experimentation Platform](https://engineering.grab.com/chaos-engineering)
Grammarly #### Blog Posts * [Security Operations in an AWS Environment](https://www.grammarly.com/blog/engineering/security-infrastructure-aws/)
Indeed #### Blog Posts * [Being Just Reliable Enough](https://engineering.indeedblog.com/blog/2019/10/being-just-reliable-enough/) * [Automating Indeed’s Release Process](https://engineering.indeedblog.com/blog/2017/03/automating-release-process/) * [‘Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com](https://www.usenix.org/conference/srecon17americas/program/presentation/appan) #### Videos * [Are We Getting Better Yet? Progress Toward Safer Operations](https://www.usenix.org/conference/srecon20americas/presentation/elman)
Heroku #### Blog Posts * [Incident Response at Heroku](https://blog.heroku.com/incident-response-at-heroku-2020)
LinkedIn #### Blog Posts * [Open source update: School of SRE](https://engineering.linkedin.com/blog/2021/open-source-update--school-of-sre) * [Production testing with dark canaries](https://engineering.linkedin.com/blog/2020/production-testing-with-dark-canaries) * [Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform](https://engineering.linkedin.com/blog/2019/06/smart-alerts-in-thirdeye--linkedins-real-time-monitoring-platfor) * [Iris mobile: An open source, mobile interface for incident management](https://engineering.linkedin.com/blog/2019/05/iris-mobile--an-open-source--mobile-interface-for-incident-manag) * [LinkedOut: A Request-Level Failure Injection Framework](https://engineering.linkedin.com/blog/2018/05/linkedout--a-request-level-failure-injection-framework) * [Eliminating toil with fully automated load testing](https://engineering.linkedin.com/blog/2019/eliminating-toil-with-fully-automated-load-testing) * [The Makeup of Successful Geographically-Distributed SRE Teams: Part 1](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p) * [The Makeup of Successful Geographically-Distributed SRE Teams: Part 2](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0) * [Project STAR*: Streamlining Our On-Call Process](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process) * [Automating Your Oncall: Open Sourcing Fossor and Ascii Etch](https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch) * [Resilience Engineering at LinkedIn with Project Waterbear](https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear) * [Hiring SREs at LinkedIn](https://engineering.linkedin.com/blog/2017/07/hiring-sres-at-linkedin) * [Open Sourcing Iris and Oncall](https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall) * [Building 
the SRE Culture at LinkedIn](https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin) * [Failure is Not an Option](https://engineering.linkedin.com/blog/2017/01/failure-is-not-an-option) * [MTTD and MTTR Are Key](https://engineering.linkedin.com/blog/2016/12/mttd-and-mttr-are-key) * [What Gets Measured Gets Fixed](https://engineering.linkedin.com/blog/2016/12/what-gets-measured-gets-fixed) * [Hiring SREs at LinkedIn](https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin) #### Videos * [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https://www.youtube.com/watch?v=ZemNg9GYvOA) * [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https://www.usenix.org/conference/srecon20americas/presentation/doherty) * [Weathering the Storm: How Early Warnings Save the Farm](https://www.usenix.org/conference/srecon19emea/presentation/sherwin) * [Unconference: Unsolved Problems in SRE](https://www.usenix.org/conference/srecon19emea/presentation/andersen) * [Leading without Managing: Becoming an SRE Technical Leader](https://www.usenix.org/conference/srecon19asia/presentation/palino-leading) * [Why Does (My) Monitoring Suck?](https://www.usenix.org/conference/srecon19asia/presentation/palino-monitoring) * [Traffic Forecasting and Stress Testing Infrastructure](https://www.usenix.org/conference/srecon19asia/presentation/sulakhe) * [Collective Mindfulness for Better Decisions in SRE](https://www.usenix.org/conference/srecon19asia/presentation/andersen-mindfulness) * [TCP—Architecture, Enhancements, and Tuning](https://www.usenix.org/conference/srecon19asia/presentation/dhakal) * [Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up](https://www.usenix.org/conference/srecon19asia/presentation/lamba) * [Understanding Business Metrics Can Make You a Better SRE](https://www.usenix.org/conference/srecon19asia/presentation/suley) * [Code-Yellow: 
Helping Operations Top-Heavy Teams the Smart Way](https://www.usenix.org/conference/srecon19americas/presentation/kehoe) * [Differences in SRE Implementations across Companies](https://www.usenix.org/conference/srecon19americas/presentation/andersen)
Mercari #### Blog Posts * [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https://engineering.mercari.com/en/blog/entry/20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry/) * [How do we share troubleshooting skills](https://engineering.mercari.com/en/blog/entry/2020-01-28-143339/) * [Datadog Dashboard at Scale w / Terraform](https://engineering.mercari.com/en/blog/entry/2019-12-09-122134/)
Microsoft #### Videos * [‘SLI & Reliability Deep-Dive’ with David N. Blank-Edelman of Microsoft](https://www.youtube.com/watch?v=1iMo3SkdQqQ) * [‘Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft](https://www.youtube.com/watch?v=U3ubcoNzx9k) * [Sustainable Software Engineering & SREs](https://www.usenix.org/conference/srecon20americas/presentation/johnson) * [Study on Human Factors and Team Culture to Improve Pager Fatigue](https://www.usenix.org/conference/srecon20americas/presentation/barteneva) * [Prioritizing Trust While Creating Applications](https://www.usenix.org/conference/srecon19emea/presentation/davis) * [Building Resilience: How to Learn More from Incidents](https://www.usenix.org/conference/srecon19emea/presentation/stenning) * [A Tale of Two Postmortems: A Human Factors View](https://www.usenix.org/conference/srecon19asia/presentation/lund-postmortem) * [Availability—Thinking beyond 9s](https://www.usenix.org/conference/srecon19asia/presentation/srinivasamurthy) * [Ironies of Automation: A Comedy in Three Parts](https://www.usenix.org/conference/srecon19asia/presentation/lund-comedy) * [The Ops in Serverless](https://www.usenix.org/conference/srecon19americas/presentation/davis)
MIRO #### Blog Posts * [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e) * [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture](https://medium.com/miro-engineering/managing-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699) * [Reliable load testing with regards to unexpected nuances](https://medium.com/miro-engineering/reliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)
Monzo #### Blog Posts * [Autoscaling Monzo: How we optimise our platform to be just the right size](https://monzo.com/blog/2020/10/19/autoscaling-monzo) * [How we’ve evolved on-call at Monzo](https://monzo.com/blog/how-weve-evolved-on-call-at-monzo) * [How we respond to incidents](https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents) * [How we monitor Monzo](https://monzo.com/blog/2018/07/27/how-we-monitor-monzo) #### Videos * [Eventually Consistent Service Discovery](https://www.usenix.org/conference/srecon19emea/presentation/patel)
Netflix #### Blog Posts * [Building Netflix’s Distributed Tracing Infrastructure](https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304) * [Edgar: Solving Mysteries Faster with Observability](https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f) * [Telltale: Netflix Application Monitoring Simplified](https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba) * [Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix](https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb) * [Introducing Dispatch](https://netflixtechblog.com/introducing-dispatch-da4b8a2a8072) * [Applying Netflix DevOps Patterns to Windows](https://netflixtechblog.com/applying-netflix-devops-patterns-to-windows-2a57f2dbbf79) * [ChAP: Chaos Automation Platform](https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f) * [Starting the Avalanche](https://netflixtechblog.com/starting-the-avalanche-640e69b14a06) * [Netflix Chaos Monkey Upgraded](https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d) * [Chaos Engineering Upgraded](https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa) * [From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform](https://netflixtechblog.com/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4) #### Major incidents & analysis reports * [Post-mortem of October 22, 2012 AWS degradation](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5) #### Videos * [When /bin/sh Attacks: Revisiting "Automate All the Things"](https://www.usenix.org/conference/srecon20americas/presentation/reed) * [How Did Things Go Right? 
Learning More from Incidents](https://www.usenix.org/conference/srecon19americas/presentation/kitchens) * [Monitoring and Tracing @Netflix Streaming Data Infrastructure](https://www.youtube.com/watch?v=DlWYNoLmma8) * [Real user performance monitoring at Netflix scale ‐ Martin Spier](https://www.youtube.com/watch?v=4RG2DUK03_0) * [AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is](https://www.youtube.com/watch?v=rgfww8tLM0A) * [AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)](https://www.youtube.com/watch?v=LaKGx0dAUlo) * [Netflix: Multi-Regional Resiliency and Amazon Route 53](https://www.youtube.com/watch?v=WDDkLOT8SCk) * [Designing Services for Resilience: Netflix Lessons](https://www.youtube.com/watch?v=RWyZkNzvC-c) * [South Bay SRE Meetup - Netflix Cloud Performance Team](https://www.youtube.com/watch?v=uQ0flQOtQEA) * [AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)](https://www.youtube.com/watch?v=T_D1G42G0dE) * [How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows](https://www.youtube.com/watch?v=8tsIqfvizpU) * [Mastering Chaos - A Netflix Guide to Microservices](https://www.youtube.com/watch?v=CZ3wIuvmHeM) * [AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere​ Global Architecture (ARC204)](https://www.youtube.com/watch?v=leqUbSY55hY) * [SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs](https://www.youtube.com/watch?v=koGaH4ffXaU) * [From Sys Admin to Netflix SRE](https://www.youtube.com/watch?v=lZI51YzIgVE) * [Application Resilience Engineering and Operations at Netflix with Hystrix](https://www.youtube.com/watch?v=RzlluokGi1w) * [Injecting Failure at Netflix](https://www.youtube.com/watch?v=ioXV28GtXeo) * [LISA13 - How Netflix Embraces Failure to Improve Resilience and Maximize Availability](https://www.youtube.com/watch?v=3D0zS3kPNUU)
PayPal #### Videos * [SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal](https://www.youtube.com/watch?v=XAIj567wBsU&feature=emb_title) * [SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal](https://www.usenix.org/conference/srecon19asia/presentation/sunder-vr) * [Detecting Service Degradation and Failures at Scale through Distributed Log Processing](https://www.usenix.org/conference/srecon19asia/presentation/narayanan) * [Operating Elasticsearch with Ease at Scale](https://www.usenix.org/conference/srecon19asia/presentation/sankaravadivel) * [Ensuring Site Reliability through Security Controls](https://www.usenix.org/conference/srecon19asia/presentation/janakiraman)
Pinterest #### Blog Posts * [Simplifying web deploys](https://medium.com/pinterest-engineering/simplifying-web-deploys-19244fe13737) * [Upgrading Pinterest operational metrics](https://medium.com/pinterest-engineering/upgrading-pinterest-operational-metrics-8718d058079a) * [Distributed tracing at Pinterest with new open source tools](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b) * [Auto scaling Pinterest](https://medium.com/pinterest-engineering/auto-scaling-pinterest-df1d2beb4d64) #### Videos * [Building Actionable Code Ownership](https://www.usenix.org/conference/srecon20americas/presentation/mukherji) * [Evolution of Observability Tools at Pinterest](https://www.usenix.org/conference/srecon19emea/presentation/abbas) * [Automating OS/Platform Upgrades for Service Owners](https://www.usenix.org/conference/srecon19asia/presentation/menezes)
Postman #### Blog Posts * [Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana](https://medium.com/better-practices/chaos-d3ef238ec328)
Scribd #### Blog Posts * [Learning from incidents: getting Sidekiq ready to serve a billion jobs](https://tech.scribd.com/blog/2020/sidekiq-incident-learnings.html) * [A testimonial for using PagerDuty at Scribd](https://tech.scribd.com/blog/2020/pagerduty-at-scribd.html) * [Assigning pager duty to developers](https://tech.scribd.com/blog/2019/managing-pagerduty-rotations.html)
Shopify #### Blog Posts * [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events) * [Capacity Planning at Scale](https://shopify.engineering/capacity-planning-shopify) * [Using DNS Traffic Management to Add Resiliency to Shopify’s Services](https://shopify.engineering/using-dns-traffic-management-add-resiliency-shopify-services) * [Four Steps to Creating Effective Game Day Tests](https://shopify.engineering/four-steps-creating-effective-game-day-tests) * [Implementing ChatOps into our Incident Management Procedure](https://shopify.engineering/implementing-chatops-into-our-incident-management-procedure) * [StatsD at Shopify](https://shopify.engineering/17488320-statsd-at-shopify) #### Videos * [Network Monitor: A Tale of ACKnowledging an Observability Gap](https://www.usenix.org/conference/srecon19emea/presentation/gedge) * [Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures](https://www.usenix.org/conference/srecon19emea/presentation/arthorne) * [Advanced Napkin Math: Estimating System Performance from First Principles](https://www.usenix.org/conference/srecon19emea/presentation/eskildsen)
Slack #### Blog Posts * [Slack’s Outage on January 4th 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/) * [A Terrible, Horrible, No-Good, Very Bad Day at Slack](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/) * [Deploys at Slack](https://slack.engineering/deploys-at-slack/) * [Disasterpiece Theater: Slack’s process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering/) #### Videos * [Slack at the Edge](https://www.usenix.org/conference/srecon19asia/presentation/pemberton) * [What Breaks Our Systems: A Taxonomy of Black Swans](https://www.usenix.org/conference/srecon19americas/presentation/nolan-taxonomy)
Soundcloud #### Blog Posts * [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos) * [Hands-Off Deployment with Canary](https://developers.soundcloud.com/blog/hands-off-deployment-with-canary) * [Prometheus has come of age – a reflection on the development of an open-source project](https://developers.soundcloud.com/blog/prometheus-has-come-of-age-a-reflection-on-the-development-of-an-open-source-project) * [Prometheus: Monitoring at SoundCloud](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud)
Spotify #### Blog Posts * [Techbytes: What The Industry Misses About Incidents and What You Can Do](https://engineering.atspotify.com/2020/02/26/techbytes-what-the-industry-misses-about-incidents-and-what-you-can-do/) * [Automated Incident Response Infrastructure in GCP](https://engineering.atspotify.com/2019/04/04/whacking-a-million-moles-automated-incident-response-infrastructure-in-gcp/) #### Videos * [Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance](https://www.usenix.org/conference/srecon19americas/presentation/root)
Squarespace #### Blog Posts * [Under the Hood: Ensuring Site Reliability](https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability) #### Videos * [Pushing through Friction](https://www.usenix.org/conference/srecon19emea/presentation/na) * [How to SRE When Everything's Already on Fire](https://www.usenix.org/conference/srecon19emea/presentation/hidalgo) * [Case Study: Implementing SLOs for a New Service](https://www.usenix.org/conference/srecon19americas/presentation/lawson) * [Creating a Code Review Culture](https://www.usenix.org/conference/srecon19americas/presentation/turner)
StackOverflow #### Blog Posts * [A deeper dive into our May 2019 security incident](https://stackoverflow.blog/2021/01/25/a-deeper-dive-into-our-may-2019-security-incident/) * [Failing over without falling over](https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/) #### Videos * [Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline](https://www.usenix.org/conference/srecon20americas/presentation/limoncelli)
Stripe #### Blog Posts * [Fast and flexible observability with canonical log lines](https://stripe.com/blog/canonical-log-lines) * [Introducing Veneur: high performance and global aggregation for Datadog](https://stripe.com/blog/engineering/page/3) #### Videos * [How Stripe Invests in Technical Infrastructure](https://www.usenix.org/conference/srecon19emea/presentation/larson) * [The AWS Billing Machine and Optimizing Cloud Costs](https://www.usenix.org/conference/srecon19asia/presentation/lopopolo)
Target #### Blog Posts * [Ɔhaos Ǝnginǝǝring @ Target - Part 2](https://tech.target.com/2019/05/09/chaos-engineering-at-Target.html) * [Ɔhaos Ǝnginǝǝring @ Target - Part 1](https://tech.target.com/2019/02/05/chaos-engineering-at-Target.html) * [GoAlert - Your Future Open Source, On-Call Notification Product](https://tech.target.com/2019/02/25/introducing-goalert.html) * [On Infrastructure at Scale: A Cascading Failure of Distributed Systems](https://tech.target.com/2019/01/14/cascading-failure-of-distributed-systems.html) * [Distributed Troubleshooting](https://tech.target.com/2017/04/05/distributed-troubleshooting.html) * [Outage Resolution Through Automation](https://tech.target.com/2014/12/29/outage-resolution-through-automation.html)
Trivago #### Blog Posts * [How To Get Fooled By Metrics](https://tech.trivago.com/2020/12/04/how-to-get-fooled-by-metrics/)
Uber #### Blog Posts * [Disaster Recovery for Multi-Region Kafka at Uber](https://eng.uber.com/kafka/) * [Engineering Failover Handling in Uber’s Mobile Networking Infrastructure](https://eng.uber.com/eng-failover-handling/) * [Optimizing Observability with Jaeger, M3, and XYS at Uber](https://eng.uber.com/optimizing-observability/) #### Videos * [A Tale of Two Rotations: Building a Humane & Effective On-Call](https://www.usenix.org/conference/srecon19emea/presentation/lee) * [Testing in Production at Scale](https://www.usenix.org/conference/srecon19americas/presentation/gud) * [‘A History of SRE at Uber’ with Rick Boone of Uber](https://www.youtube.com/watch?v=qJnS-EfIIIE)
Wikimedia Foundation #### Videos * [Testing Encyclopedias in Production](https://www.usenix.org/conference/srecon20americas/presentation/mouzeli) * [What Happens When You Type en.wikipedia.org?](https://www.usenix.org/conference/srecon19emea/presentation/mouzeli)
Zerodha #### Blog Posts * [Infrastructure monitoring with Prometheus at Zerodha](https://zerodha.tech/blog/infra-monitoring-at-zerodha/)
SRECon Mix Playlist #### Videos * [Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE](https://www.usenix.org/conference/srecon20americas/presentation/charagondla) * [Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps](https://www.usenix.org/conference/srecon20americas/presentation/yitzhaki) * [Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure](https://www.usenix.org/conference/srecon20americas/presentation/argent) * [Alaska Airlines - Capacity Prediction in External Services](https://www.usenix.org/conference/srecon19americas/presentation/kraus) * [BuzzFeed - Optimizing for Learning](https://www.usenix.org/conference/srecon19americas/presentation/mcdonald) * [BT - Challenges of Starting an SRE Team from Scratch in an Enterprise](https://www.usenix.org/conference/srecon20americas/presentation/narvas) * [Cloudflare - Support Operations Engineering: Scaling Developer Products to the Millions](https://www.usenix.org/conference/srecon19emea/presentation/ali) * [Hudson River Trading - Fixing On-Call When Nobody Thinks It's (Too) Broken](https://www.usenix.org/conference/srecon19americas/presentation/lykke) * [IBM - Why Automating Everything Adds to Your Toil](https://www.usenix.org/conference/srecon19emea/presentation/thorne) * [Genesys - The Smallest Possible SRE Team](https://www.usenix.org/conference/srecon20americas/presentation/thomas) * [G-Research - My Life as a Solo SRE](https://www.usenix.org/conference/srecon19emea/presentation/murphy) * [Grafana Labs - SRE in the Third Age](https://www.usenix.org/conference/srecon19emea/presentation/rabenstein) * [Kenna Security - Building a Scalable Monitoring System](https://www.usenix.org/conference/srecon19emea/presentation/struve) * [Lightstep - Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better](https://www.usenix.org/conference/srecon20americas/presentation/spoonhower) * [MessageBird - 
Autopsy of a MySQL Automation Disaster](https://www.usenix.org/conference/srecon19emea/presentation/gagne) * [Netlify - Perks and Pitfalls of Building a Remote First Team](https://www.usenix.org/conference/srecon19emea/presentation/neal) * [ReactiveOps - Zero to SRE](https://www.usenix.org/conference/srecon19americas/presentation/schlesinger) * [Salesforce - Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19](https://www.usenix.org/conference/srecon20americas/presentation/collins) * [Sprax - From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations](https://www.usenix.org/conference/srecon19emea/presentation/huxtable) * [The New York Times - SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events](https://www.usenix.org/conference/srecon19emea/presentation/wan) * [Twitter - Hiring Great SREs](https://www.usenix.org/conference/srecon19emea/presentation/rutkin) * [United States Digital Service - Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value](https://www.usenix.org/conference/srecon19americas/presentation/wieczorek) * [Unity Technologies - Being Reasonable about SRE](https://www.usenix.org/conference/srecon19emea/presentation/urbanec) * [Udemy - How to Do SRE When You Have No SRE](https://www.usenix.org/conference/srecon19emea/presentation/ocallaghan) * [Vanguard - Cloudy with a Chance of Chaos](https://www.usenix.org/conference/srecon20americas/presentation/yakomin) * [WeWork - Learning from Learnings: Anatomy of Three Incidents](https://www.usenix.org/conference/srecon19americas/presentation/shoup) * [Yelp - What I Wish I Knew before Going On-Call](https://www.usenix.org/conference/srecon19emea/presentation/shu) * [Zendesk - Latency and Availability Error Budgets Done Right at Scale](https://www.usenix.org/conference/srecon20americas/presentation/moyer)
--- ## Resources ### Books * [97 Things Every SRE Should Know](https://www.oreilly.com/library/view/97-things-every/9781492081487/) * [SLO Adoption and Usage in Site Reliability Engineering](https://www.oreilly.com/library/view/slo-adoption-and/9781492075370/) * [Practical Site Reliability Engineering](https://www.oreilly.com/library/view/practical-site-reliability/9781788839563/) * [Implementing Service Level Objectives](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/) * [Chaos Engineering](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/) * [Seeking SRE](https://www.oreilly.com/library/view/seeking-sre/9781491978856/) * [Security Chaos Engineering](https://www.oreilly.com/library/view/security-chaos-engineering/9781492080350/) * [Chaos Engineering Observability](https://www.oreilly.com/library/view/chaos-engineering-observability/9781492051046/) * [Training Site Reliability Engineers](https://www.oreilly.com/library/view/training-site-reliability/9781492076018/) * [Database Reliability Engineering](https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/) * [What Is SRE?](https://www.oreilly.com/library/view/what-is-sre/9781492054429/) * [Database Reliability Engineering: What, Why, and How?](https://www.oreilly.com/library/view/database-reliability-engineering/9781492030942/) * [Observability Engineering](https://www.oreilly.com/library/view/observability-engineering/9781492076438/) ### Events * [SRECon Past Events](https://www.usenix.org/srecon#past) * [ChaosConf](https://www.chaosconf.io/) ### Others * [Awesome SRE](https://github.com/dastergon/awesome-sre) * [Awesome Site Reliability Engineering Tools](https://github.com/SquadcastHub/awesome-sre-tools) * [Google SRE Page](https://sre.google/) * [Microsoft SRE Page](https://docs.microsoft.com/en-us/azure/site-reliability-engineering/) * [SRE Weekly Newsletter](https://sreweekly.com/) * [Chaos Engineering 
Newsletter](https://chaosengineering.news/) * [DevOps Weekly Newsletter](http://devopsweekly.com) ## Credits * Banner image [Cartoon vector created by vectorjuice - www.freepik.com](https://www.freepik.com/vectors/cartoon) ## Contribute Contributions welcome! Read the [contribution guidelines](contributing.md) first. ## License [![CC0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](https://creativecommons.org/publicdomain/zero/1.0) To the extent possible under law, Unmesh Gundecha has waived all copyright and related or neighboring rights to this work.