# How they SRE

> A curated collection of publicly available resources on how technology or tech-savvy organizations around the world practice Site Reliability Engineering (SRE)

## Introduction

Inspired by [Howtheytest](https://github.com/abhivaikar/howtheytest) by [Abhijeet Vaikar](https://github.com/abhivaikar), __How They SRE__ is a curated knowledge repository of the best practices, tools, techniques, and culture of SRE adopted by leading technology or tech-savvy organizations.

Many organizations regularly share their best practices, tools, and techniques, and offer insight into their engineering culture, on public platforms such as engineering blogs, conferences, and meetups. This repository curates content from those sources.

### Topics

* Site Reliability Engineering
* Hiring and Building SRE teams
* SRE Culture
* DevOps
* Monitoring & Observability
* Alerting
* Incident Management & Incident Response
* Post-Mortem
* On-Call
* Testing in Production
* Chaos Engineering
* Automation
* Performance

## Organizations

<details>
<summary>Airbnb</summary>

#### Blog Posts
* [Detecting Vulnerabilities With Vulnture](https://medium.com/airbnb-engineering/detecting-vulnerabilities-with-vulnture-f5f23387f6ec)
* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
* [When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb](https://medium.com/airbnb-engineering/when-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)

</details>

<details>
<summary>Algolia</summary>

#### Blog Posts
* [May 30 SSL incident](https://www.algolia.com/blog/may-30-ssl-incident/)
* [A Journey Into SRE](https://www.algolia.com/blog/a-journey-into-sre/)

</details>

<details>
<summary>Asana</summary>

#### Blog Posts
* [How Asana ships stable web application releases](https://blog.asana.com/2021/01/asana-engineering-ships-web-application-releases/)
* [Analysis of recent downtime & what we’re doing to prevent future incidents](https://blog.asana.com/2019/09/downtime-what-were-doing-to-prevent-future-downtime/)
* [Developer environment: Achieving reliability by making it fast to reset](https://blog.asana.com/2017/07/developer-environment-making-it-reliable-by-making-it-fast-to-reset/)

</details>

<details>
<summary>ASOS</summary>

#### Blog Posts
* [Cyber Security @ ASOS.com](https://medium.com/asos-techblog/cyber-security-asos-com-7d1d1f346e57)
* [Security Operations 24x7](https://medium.com/asos-techblog/security-operations-24-x-7-2e90c8e5e7e)
* [The skills we look for in Cyber Security Incident Response](https://medium.com/asos-techblog/the-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)

</details>

<details>
<summary>Atlassian</summary>

#### Blog Posts
* [Best practices for change management in the age of DevOps](https://www.atlassian.com/engineering/best-practices-for-change-management-in-the-age-of-devops)
* [Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code](https://www.atlassian.com/engineering/automated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code)
* [How to export Kubernetes events for observability and alerting](https://www.atlassian.com/engineering/how-to-export-kubernetes-events-for-observability-and-alerting)

</details>

<details>
<summary>Baidu</summary>

#### Videos
* [Anomaly Detection on Golden Signals](https://www.usenix.org/conference/srecon19asia/presentation/chen-yu)
* [NetRadar: Monitoring the Datacenter Network](https://www.usenix.org/conference/srecon19asia/presentation/chen-yun)

</details>

<details>
<summary>Basecamp</summary>

#### Blog Posts
* [Inside a CODE RED: Network Edition](https://m.signalvnoise.com/inside-a-code-red-network-edition/)
* [Three Basecamp outages. One week. What happened?](https://m.signalvnoise.com/three-basecamp-outages-one-week-what-happened/)
* [Basecamp 2 and Basecamp 3 search outage report](https://m.signalvnoise.com/basecamp-2-and-basecamp-3-search-outage-report/)
* [Reducing Incident Escalations at Basecamp](https://m.signalvnoise.com/reducing-incident-escalations-at-basecamp/)

</details>

<details>
<summary>Bloomberg</summary>

#### Videos
* [Capacity Planning and Performance Enhancement with Page Reference Sampling](https://www.usenix.org/conference/srecon20americas/presentation/chen)
* [Why SREs can't afford to NOT do Chaos Engineering](https://www.usenix.org/conference/srecon20americas/presentation/pawlikowski)
* [Tracing Real-Time Distributed Systems](https://www.usenix.org/conference/srecon19emea/presentation/yakimov)
* [The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation](https://www.usenix.org/conference/srecon19asia/presentation/sorensen)
* [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https://www.usenix.org/conference/srecon19americas/presentation/chen)

</details>

<details>
<summary>Booking.com</summary>

#### Videos
* [SLOs for Data-Intensive Services](https://www.usenix.org/conference/srecon19emea/presentation/fouquet)
* [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https://www.usenix.org/conference/srecon19americas/presentation/iacoboaia)

</details>

<details>
<summary>Capital One</summary>

#### Blog Posts
* [Automate AWS Infrastructure with Boto 3: AWS Health Check](https://medium.com/capital-one-tech/automate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075)
* [Active-Active Shared-Nothing Database Architecture](https://medium.com/capital-one-tech/active-active-shared-nothing-database-architecture-304957ffb89)
* [The 3 R’s of SREs: Resiliency, Recovery & Reliability](https://medium.com/capital-one-tech/the-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b)
* [5 Steps to Getting Your App Chaos Ready](https://medium.com/capital-one-tech/5-steps-to-getting-your-app-chaos-ready-capital-one-a5b7b3cb8e09)
* [4 Real-World Scenarios That Read Like Chaos Engineering Experiments](https://medium.com/capital-one-tech/4-real-world-scenarios-that-read-like-chaos-engineering-experiments-8dbf40c5f247)
* [Embrace the Chaos … Engineering](https://medium.com/capital-one-tech/embrace-the-chaos-engineering-203fd6fc6ff7)
* [3 Lessons Learned From Implementing Chaos Engineering at Enterprise](https://medium.com/capital-one-tech/3-lessons-learned-from-implementing-chaos-engineering-at-enterprise-28eb3ffecc57)
* [A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy](https://medium.com/capital-one-tech/seamless-blue-green-deployment-using-aws-codedeploy-4c36c0bbeef4)
* [Secure Docker Containers Require Secure Applications](https://medium.com/capital-one-tech/secure-docker-containers-require-secure-applications-75eb358abef9)
* [4 Steps for Pairing the Cloud and DevOps to Improve Resiliency](https://medium.com/capital-one-tech/4-steps-for-pairing-cloud-and-devops-to-improve-resiliency-c72fe2e52b05)
* [Container Ready Applications with Twelve-Factor App and Microservices Architecture](https://medium.com/capital-one-tech/container-ready-applications-with-twelve-factor-app-and-microservices-architecture-16af683a767f)
* [Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS](https://medium.com/capital-one-tech/deploying-with-confidence-strategies-for-canary-deployments-on-aws-7cab3798823e)
* [Architecting for Resiliency](https://medium.com/capital-one-tech/architecting-for-resiliency-9ec663db5c94)
* [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https://medium.com/capital-one-tech/continuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d)
* [The Mon-ifesto Part 1: Metrics](https://medium.com/capital-one-tech/the-mon-ifesto-part-1-metrics-808f6c944765)

#### Major incidents & analysis reports
* [Information on the Capital One Cyber Incident](https://www.capitalone.com/facts2019/)
* [A Case Study of the Capital One Data Breach](http://web.mit.edu/smadnick/www/wp/2020-16.pdf)

#### Videos
* [Banking on Continuous Delivery - Capital One](https://www.youtube.com/watch?v=_DnYSQEUTfo)
* [Continuous Chaos in DevOps - Capital One](https://www.youtube.com/watch?v=U_Uh5RMCwPI)
* [DevOps at Capital One: Focusing on Pipeline and Measurement](https://www.youtube.com/watch?v=6Q0mtVnnthQ)
* [Automating the Management of the Operational Health of Cloud Accounts at Scale](https://www.usenix.org/conference/srecon19americas/presentation/walls)

</details>

<details>
<summary>DBS</summary>

#### Blog Posts
* [Site Reliability Engineering at DBS Bank](https://medium.com/dbs-tech-blog/site-reliability-engineering-at-dbs-bank-32c02228ccf4)

#### Videos
* [SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS](https://www.youtube.com/watch?v=URwkaRbOLxI&feature=emb_title)

</details>

<details>
<summary>Dropbox</summary>

#### Blog Posts
* [Monitoring server applications with Vortex](https://dropbox.tech/infrastructure/monitoring-server-applications-with-vortex)
* [Athena: Our automated build health management system](https://dropbox.tech/infrastructure/athena-our-automated-build-health-management-system)

#### Videos
* [Service Discovery Challenges at Scale](https://www.usenix.org/conference/srecon19americas/presentation/nigmatullin)

</details>

<details>
<summary>Facebook</summary>

#### Videos
* [A Customer Service Approach to SRE](https://www.usenix.org/conference/srecon19emea/presentation/looney)
* [How (Not) to Scale a Project: A Post-Mortem](https://www.usenix.org/conference/srecon19asia/presentation/bagnoli)
* [Releasing the World's Largest Python Site Every 7 Minutes](https://www.usenix.org/conference/srecon19asia/presentation/wong-shuhong)
* [Using ML to Automate Dynamic Error Categorization](https://www.usenix.org/conference/srecon19asia/presentation/davoli)

</details>

<details>
<summary>Fastly</summary>

#### Videos
* [SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager](https://www.usenix.org/conference/srecon19americas/presentation/wohlner)
* [Resilience Engineering Mythbusting](https://www.usenix.org/conference/srecon19americas/presentation/gallego)

</details>

<details>
<summary>eBay</summary>

#### Blog Posts
* [Resiliency and Disaster Recovery with Kafka](https://tech.ebayinc.com/engineering/resiliency-and-disaster-recovery-with-kafka/)
* [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https://tech.ebayinc.com/engineering/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/)
* [SRE Case Study: Mysterious Traffic Imbalance](https://tech.ebayinc.com/engineering/sre-case-study-mysterious-traffic-imbalance/)
* [Zero Downtime, Instant Deployment and Rollback](https://tech.ebayinc.com/engineering/zero-downtime-instant-deployment-and-rollback/)

#### Videos
* [Madaari: Ordering for the Monkeys](https://www.usenix.org/conference/srecon19americas/presentation/raina)

</details>

<details>
<summary>Etsy</summary>

#### Blog Posts
* [Etsy’s Debriefing Facilitation Guide for Blameless Postmortems](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)
* [Opsweekly: Measuring on-call experience with alert classification](https://codeascraft.com/2014/06/19/opsweekly-measuring-on-call-experience-with-alert-classification/)
* [Demystifying Site Outages](https://blog.etsy.com/news/2012/demystifying-site-outages/)
* [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)

#### Videos
* [Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe](https://www.youtube.com/watch?v=LdOe18KhtT4)
* [Migrating a Monolith to the Cloud](https://www.usenix.org/conference/srecon19americas/presentation/govande)

</details>

<details>
<summary>Expedia</summary>

#### Blog Posts
* [The Cost of 100% Reliability](https://medium.com/expedia-group-tech/the-cost-of-100-reliability-ecb2901f23a4)
* [Creating Monitoring Dashboards](https://medium.com/expedia-group-tech/creating-monitoring-dashboards-1f3fbe0ae1ac)
* [Using Bash for DevOps](https://medium.com/expedia-group-tech/using-bash-for-devops-7046eed1aa63)

</details>

<details>
<summary>GitHub</summary>

#### Blog Posts
* [Deployment reliability at GitHub](https://github.blog/2021-02-03-deployment-reliability-at-github/)
* [Improving how we deploy GitHub](https://github.blog/2021-01-25-improving-how-we-deploy-github/)
* [Building On-Call Culture at GitHub](https://github.blog/2021-01-06-building-on-call-culture-at-github/)
* [Reducing flaky builds by 18x](https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/)
* [The evolving role of operations in DevOps](https://github.blog/2020-12-03-the-evolving-role-of-operations-in-devops/)
* [Getting started with DevOps automation](https://github.blog/2020-10-29-getting-started-with-devops-automation/)
* [MySQL High Availability at GitHub](https://github.blog/2018-06-20-mysql-high-availability-at-github/)

#### Major incidents & analysis reports
* [GitHub Availability Report: January 2021](https://github.blog/2021-02-02-github-availability-report-january-2021/)
* [GitHub Availability Report: December 2020](https://github.blog/2021-01-06-github-availability-report-december-2020/)
* [GitHub Availability Report: November 2020](https://github.blog/2020-12-02-availability-report-november-2020/)
* [GitHub Availability Report: August 2020](https://github.blog/2020-09-02-github-availability-report-august-2020/)
* [GitHub Availability Report: July 2020](https://github.blog/2020-08-05-github-availability-report-july-2020/)
* [Introducing the GitHub Availability Report](https://github.blog/2020-07-08-introducing-the-github-availability-report/)
* [February service disruptions post-incident analysis](https://github.blog/2020-03-26-february-service-disruptions-post-incident-analysis/)
* [October 21 post-incident analysis](https://github.blog/2018-10-30-oct21-post-incident-analysis/)
* [February 28th DDoS Incident Report](https://github.blog/2018-03-01-ddos-incident-report/)
* [Incident Report: Inadvertent Private Repository Disclosure](https://github.blog/2016-10-28-incident-report-inadvertent-private-repository-disclosure/)

#### Videos
* [One on One SRE](https://www.usenix.org/conference/srecon19americas/presentation/tobey)

</details>

<details>
<summary>Google</summary>

#### Blog Posts
* [SRE Practices & Processes](https://sre.google/resources/#practicesandprocesses)
* [Three months, 30x demand: How we scaled Google Meet during COVID-19](https://cloud.google.com/blog/products/g-suite/keeping-google-meet-ahead-of-usage-demand-during-covid-19)
* [SRE Classroom: Distributed PubSub](https://sre.google/resources/practices-and-processes/distributed-pubsub/)

#### Books
* [Building Secure & Reliable Systems](https://static.googleusercontent.com/media/sre.google/en//static/pdf/building_secure_and_reliable_systems.pdf)
* [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/)
* [The Site Reliability Workbook](https://sre.google/workbook/table-of-contents/)

#### Videos
* [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/uTEL8Ff1Zvk)
* [Risk and Error Budgets with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/y2ILKr8kCJU)
* [Pragmatic Automation with Max Luebbe of GCP](https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s)
* [Must Watch! - Google SRE YouTube Playlist](https://www.youtube.com/playlist?list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj)
* [Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit](https://www.usenix.org/conference/srecon20americas/presentation/stanke)
* [Implementing Distributed Consensus](https://www.usenix.org/conference/srecon20americas/presentation/ludtke)
* [The SRE I Aspire to Be](https://www.usenix.org/conference/srecon19emea/presentation/aknin)
* [SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19emea/presentation/perry)
* [Zero Touch Prod: Towards Safer and More Secure Production Environments](https://www.usenix.org/conference/srecon19emea/presentation/czapinski)
* [All of Our ML Ideas Are Bad (and We Should Feel Bad)](https://www.usenix.org/conference/srecon19emea/presentation/underwood)
* [The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It](https://www.usenix.org/conference/srecon19emea/presentation/desai)
* [Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program](https://www.usenix.org/conference/srecon19emea/presentation/petoff)
* [Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way](https://www.usenix.org/conference/srecon19emea/presentation/gleason)
* [Practical Instrumentation for Observability](https://www.usenix.org/conference/srecon19asia/presentation/krabbe)
* [What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services](https://www.usenix.org/conference/srecon19asia/presentation/sato)
* [Unified Reporting of Service Reliability](https://www.usenix.org/conference/srecon19asia/presentation/zhang)
* [How to Trade off Server Utilization and Tail Latency](https://www.usenix.org/conference/srecon19asia/presentation/plenz)
* [Keeping the Balance: Internet-Scale Loadbalancing Demystified](https://www.usenix.org/conference/srecon19americas/presentation/nolan-loadbalancing)
* [From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services](https://www.usenix.org/conference/srecon19americas/presentation/virji)
* [Mindfulness in SRE: Monitoring and Alerting for One's Self](https://www.usenix.org/conference/srecon19americas/presentation/lutz)
* [Pragmatic Automation](https://www.usenix.org/conference/srecon19americas/presentation/luebbe)
* [Sublinear Scaling in Practice: The 1k SRE Project](https://www.usenix.org/conference/srecon19americas/presentation/rath)
* [Strategies to Edit Production Data](https://www.usenix.org/conference/srecon19americas/presentation/qiu)
* [The Curse of SRE Autonomy and How to Manage It](https://www.usenix.org/conference/srecon19americas/presentation/bondi)
* [Scaling SRE Organizations: The Journey from 1 to Many Teams](https://www.usenix.org/conference/srecon19americas/presentation/franco)
* [SRE Classroom - How to Design a Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19americas/presentation/thomas)
* [Using PRDs and User Journeys to Design User-Friendly Tools](https://www.usenix.org/conference/srecon19americas/presentation/stockman)

</details>

<details>
<summary>Gojek</summary>

#### Blog Posts
* [Why We Swear by the RCA](https://blog.gojekengineering.com/why-we-swear-by-the-rca-f535fd5abbcb)

</details>

<details>
<summary>Grab</summary>

#### Blog Posts
* [Our Journey to Continuous Delivery at Grab (Part 1)](https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab)
* [Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering](https://engineering.grab.com/beyond-retries-part-3)
* [Orchestrating Chaos using Grab's Experimentation Platform](https://engineering.grab.com/chaos-engineering)

</details>

<details>
<summary>Grammarly</summary>

#### Blog Posts
* [Security Operations in an AWS Environment](https://www.grammarly.com/blog/engineering/security-infrastructure-aws/)

</details>

<details>
<summary>Indeed</summary>

#### Blog Posts
* [Being Just Reliable Enough](https://engineering.indeedblog.com/blog/2019/10/being-just-reliable-enough/)
* [Automating Indeed’s Release Process](https://engineering.indeedblog.com/blog/2017/03/automating-release-process/)
* [Sloth, a Tool for Inducing Network Failures with Preetha Appan of Indeed.com](https://www.usenix.org/conference/srecon17americas/program/presentation/appan)

#### Videos
* [Are We Getting Better Yet? Progress Toward Safer Operations](https://www.usenix.org/conference/srecon20americas/presentation/elman)

</details>

<details>
<summary>Heroku</summary>

#### Blog Posts
* [Incident Response at Heroku](https://blog.heroku.com/incident-response-at-heroku-2020)

</details>

<details>
<summary>LinkedIn</summary>

#### Blog Posts
* [Open source update: School of SRE](https://engineering.linkedin.com/blog/2021/open-source-update--school-of-sre)
* [Production testing with dark canaries](https://engineering.linkedin.com/blog/2020/production-testing-with-dark-canaries)
* [Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform](https://engineering.linkedin.com/blog/2019/06/smart-alerts-in-thirdeye--linkedins-real-time-monitoring-platfor)
* [Iris mobile: An open source, mobile interface for incident management](https://engineering.linkedin.com/blog/2019/05/iris-mobile--an-open-source--mobile-interface-for-incident-manag)
* [LinkedOut: A Request-Level Failure Injection Framework](https://engineering.linkedin.com/blog/2018/05/linkedout--a-request-level-failure-injection-framework)
* [Eliminating toil with fully automated load testing](https://engineering.linkedin.com/blog/2019/eliminating-toil-with-fully-automated-load-testing)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 1](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 2](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0)
* [Project STAR*: Streamlining Our On-Call Process](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process)
* [Automating Your Oncall: Open Sourcing Fossor and Ascii Etch](https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch)
* [Resilience Engineering at LinkedIn with Project Waterbear](https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear)
* [Hiring SREs at LinkedIn](https://engineering.linkedin.com/blog/2017/07/hiring-sres-at-linkedin)
* [Open Sourcing Iris and Oncall](https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall)
* [Building the SRE Culture at LinkedIn](https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin)
* [Failure is Not an Option](https://engineering.linkedin.com/blog/2017/01/failure-is-not-an-option)
* [MTTD and MTTR Are Key](https://engineering.linkedin.com/blog/2016/12/mttd-and-mttr-are-key)
* [What Gets Measured Gets Fixed](https://engineering.linkedin.com/blog/2016/12/what-gets-measured-gets-fixed)
* [Hiring SREs at LinkedIn](https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin)

#### Videos
* [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https://www.youtube.com/watch?v=ZemNg9GYvOA)
* [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https://www.usenix.org/conference/srecon20americas/presentation/doherty)
* [Weathering the Storm: How Early Warnings Save the Farm](https://www.usenix.org/conference/srecon19emea/presentation/sherwin)
* [Unconference: Unsolved Problems in SRE](https://www.usenix.org/conference/srecon19emea/presentation/andersen)
* [Leading without Managing: Becoming an SRE Technical Leader](https://www.usenix.org/conference/srecon19asia/presentation/palino-leading)
* [Why Does (My) Monitoring Suck?](https://www.usenix.org/conference/srecon19asia/presentation/palino-monitoring)
* [Traffic Forecasting and Stress Testing Infrastructure](https://www.usenix.org/conference/srecon19asia/presentation/sulakhe)
* [Collective Mindfulness for Better Decisions in SRE](https://www.usenix.org/conference/srecon19asia/presentation/andersen-mindfulness)
* [TCP—Architecture, Enhancements, and Tuning](https://www.usenix.org/conference/srecon19asia/presentation/dhakal)
* [Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up](https://www.usenix.org/conference/srecon19asia/presentation/lamba)
* [Understanding Business Metrics Can Make You a Better SRE](https://www.usenix.org/conference/srecon19asia/presentation/suley)
* [Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way](https://www.usenix.org/conference/srecon19americas/presentation/kehoe)
* [Differences in SRE Implementations across Companies](https://www.usenix.org/conference/srecon19americas/presentation/andersen)

</details>

<details>
<summary>Mercari</summary>

#### Blog Posts
* [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https://engineering.mercari.com/en/blog/entry/20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry/)
* [How do we share troubleshooting skills](https://engineering.mercari.com/en/blog/entry/2020-01-28-143339/)
* [Datadog Dashboard at Scale w / Terraform](https://engineering.mercari.com/en/blog/entry/2019-12-09-122134/)

</details>

<details>
<summary>Microsoft</summary>

#### Videos
* [SLI & Reliability Deep-Dive with David N. Blank-Edelman of Microsoft](https://www.youtube.com/watch?v=1iMo3SkdQqQ)
* [Ironies of Automation: A Comedy in Three Parts with Tanner Lund of Microsoft](https://www.youtube.com/watch?v=U3ubcoNzx9k)
* [Sustainable Software Engineering & SREs](https://www.usenix.org/conference/srecon20americas/presentation/johnson)
* [Study on Human Factors and Team Culture to Improve Pager Fatigue](https://www.usenix.org/conference/srecon20americas/presentation/barteneva)
* [Prioritizing Trust While Creating Applications](https://www.usenix.org/conference/srecon19emea/presentation/davis)
* [Building Resilience: How to Learn More from Incidents](https://www.usenix.org/conference/srecon19emea/presentation/stenning)
* [A Tale of Two Postmortems: A Human Factors View](https://www.usenix.org/conference/srecon19asia/presentation/lund-postmortem)
* [Availability—Thinking beyond 9s](https://www.usenix.org/conference/srecon19asia/presentation/srinivasamurthy)
* [Ironies of Automation: A Comedy in Three Parts](https://www.usenix.org/conference/srecon19asia/presentation/lund-comedy)
* [The Ops in Serverless](https://www.usenix.org/conference/srecon19americas/presentation/davis)

</details>

<details>
<summary>MIRO</summary>

#### Blog Posts
* [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e)
* [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture](https://medium.com/miro-engineering/managing-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699)
* [Reliable load testing with regards to unexpected nuances](https://medium.com/miro-engineering/reliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)

</details>

<details>
<summary>Monzo</summary>

#### Blog Posts
* [Autoscaling Monzo: How we optimise our platform to be just the right size](https://monzo.com/blog/2020/10/19/autoscaling-monzo)
* [How we’ve evolved on-call at Monzo](https://monzo.com/blog/how-weve-evolved-on-call-at-monzo)
* [How we respond to incidents](https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents)
* [How we monitor Monzo](https://monzo.com/blog/2018/07/27/how-we-monitor-monzo)

#### Videos
* [Eventually Consistent Service Discovery](https://www.usenix.org/conference/srecon19emea/presentation/patel)

</details>
|
||
|
||
<details>
|
||
<summary>Netflix</summary>
|
||
|
||
#### Blog Posts
|
||
* [Building Netflix’s Distributed Tracing Infrastructure](https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304)
|
||
* [Edgar: Solving Mysteries Faster with Observability](https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f)
|
||
* [Telltale: Netflix Application Monitoring Simplified](https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba)
|
||
* [Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix](https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb)
|
||
* [Introducing Dispatch](https://netflixtechblog.com/introducing-dispatch-da4b8a2a8072)
|
||
* [Applying Netflix DevOps Patterns to Windows](https://netflixtechblog.com/applying-netflix-devops-patterns-to-windows-2a57f2dbbf79)
|
||
* [ChAP: Chaos Automation Platform](https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f)
|
||
* [Starting the Avalanche](https://netflixtechblog.com/starting-the-avalanche-640e69b14a06)
|
||
* [Netflix Chaos Monkey Upgraded](https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d)
|
||
* [Chaos Engineering Upgraded](https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa)
|
||
* [From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform](https://netflixtechblog.com/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4)

#### Major incidents & analysis reports
* [Post-mortem of October 22, 2012 AWS degradation](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)

#### Videos
* [When /bin/sh Attacks: Revisiting "Automate All the Things"](https://www.usenix.org/conference/srecon20americas/presentation/reed)
* [How Did Things Go Right? Learning More from Incidents](https://www.usenix.org/conference/srecon19americas/presentation/kitchens)
* [Monitoring and Tracing @Netflix Streaming Data Infrastructure](https://www.youtube.com/watch?v=DlWYNoLmma8)
* [Real user performance monitoring at Netflix scale ‐ Martin Spier](https://www.youtube.com/watch?v=4RG2DUK03_0)
* [AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is](https://www.youtube.com/watch?v=rgfww8tLM0A)
* [AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)](https://www.youtube.com/watch?v=LaKGx0dAUlo)
* [Netflix: Multi-Regional Resiliency and Amazon Route 53](https://www.youtube.com/watch?v=WDDkLOT8SCk)
* [Designing Services for Resilience: Netflix Lessons](https://www.youtube.com/watch?v=RWyZkNzvC-c)
* [South Bay SRE Meetup - Netflix Cloud Performance Team](https://www.youtube.com/watch?v=uQ0flQOtQEA)
* [AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)](https://www.youtube.com/watch?v=T_D1G42G0dE)
* [How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows](https://www.youtube.com/watch?v=8tsIqfvizpU)
* [Mastering Chaos - A Netflix Guide to Microservices](https://www.youtube.com/watch?v=CZ3wIuvmHeM)
* [AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)](https://www.youtube.com/watch?v=leqUbSY55hY)
* [SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs](https://www.youtube.com/watch?v=koGaH4ffXaU)
* [From Sys Admin to Netflix SRE](https://www.youtube.com/watch?v=lZI51YzIgVE)
* [Application Resilience Engineering and Operations at Netflix with Hystrix](https://www.youtube.com/watch?v=RzlluokGi1w)
* [Injecting Failure at Netflix](https://www.youtube.com/watch?v=ioXV28GtXeo)
* [LISA13 - How Netflix Embraces Failure to Improve Resilience and Maximize Availability](https://www.youtube.com/watch?v=3D0zS3kPNUU)

</details>

<details>
<summary>PayPal</summary>

#### Videos
* [SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal](https://www.youtube.com/watch?v=XAIj567wBsU&feature=emb_title)
* [SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal](https://www.usenix.org/conference/srecon19asia/presentation/sunder-vr)
* [Detecting Service Degradation and Failures at Scale through Distributed Log Processing](https://www.usenix.org/conference/srecon19asia/presentation/narayanan)
* [Operating Elasticsearch with Ease at Scale](https://www.usenix.org/conference/srecon19asia/presentation/sankaravadivel)
* [Ensuring Site Reliability through Security Controls](https://www.usenix.org/conference/srecon19asia/presentation/janakiraman)

</details>

<details>
<summary>Pinterest</summary>

#### Blog Posts
* [Simplifying web deploys](https://medium.com/pinterest-engineering/simplifying-web-deploys-19244fe13737)
* [Upgrading Pinterest operational metrics](https://medium.com/pinterest-engineering/upgrading-pinterest-operational-metrics-8718d058079a)
* [Distributed tracing at Pinterest with new open source tools](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)
* [Auto scaling Pinterest](https://medium.com/pinterest-engineering/auto-scaling-pinterest-df1d2beb4d64)

#### Videos
* [Building Actionable Code Ownership](https://www.usenix.org/conference/srecon20americas/presentation/mukherji)
* [Evolution of Observability Tools at Pinterest](https://www.usenix.org/conference/srecon19emea/presentation/abbas)
* [Automating OS/Platform Upgrades for Service Owners](https://www.usenix.org/conference/srecon19asia/presentation/menezes)

</details>

<details>
<summary>Postman</summary>

#### Blog Posts
* [Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana](https://medium.com/better-practices/chaos-d3ef238ec328)

</details>

<details>
<summary>Scribd</summary>

#### Blog Posts
* [Learning from incidents: getting Sidekiq ready to serve a billion jobs](https://tech.scribd.com/blog/2020/sidekiq-incident-learnings.html)
* [A testimonial for using PagerDuty at Scribd](https://tech.scribd.com/blog/2020/pagerduty-at-scribd.html)
* [Assigning pager duty to developers](https://tech.scribd.com/blog/2019/managing-pagerduty-rotations.html)

</details>

<details>
<summary>Shopify</summary>

#### Blog Posts
* [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events)
* [Capacity Planning at Scale](https://shopify.engineering/capacity-planning-shopify)
* [Using DNS Traffic Management to Add Resiliency to Shopify’s Services](https://shopify.engineering/using-dns-traffic-management-add-resiliency-shopify-services)
* [Four Steps to Creating Effective Game Day Tests](https://shopify.engineering/four-steps-creating-effective-game-day-tests)
* [Implementing ChatOps into our Incident Management Procedure](https://shopify.engineering/implementing-chatops-into-our-incident-management-procedure)
* [StatsD at Shopify](https://shopify.engineering/17488320-statsd-at-shopify)

#### Videos
* [Network Monitor: A Tale of ACKnowledging an Observability Gap](https://www.usenix.org/conference/srecon19emea/presentation/gedge)
* [Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures](https://www.usenix.org/conference/srecon19emea/presentation/arthorne)
* [Advanced Napkin Math: Estimating System Performance from First Principles](https://www.usenix.org/conference/srecon19emea/presentation/eskildsen)

</details>

<details>
<summary>Slack</summary>

#### Blog Posts
* [Slack’s Outage on January 4th 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/)
* [A Terrible, Horrible, No-Good, Very Bad Day at Slack](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
* [Deploys at Slack](https://slack.engineering/deploys-at-slack/)
* [Disasterpiece Theater: Slack’s process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering/)

#### Videos
* [Slack at the Edge](https://www.usenix.org/conference/srecon19asia/presentation/pemberton)
* [What Breaks Our Systems: A Taxonomy of Black Swans](https://www.usenix.org/conference/srecon19americas/presentation/nolan-taxonomy)

</details>

<details>
<summary>SoundCloud</summary>

#### Blog Posts
* [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
* [Hands-Off Deployment with Canary](https://developers.soundcloud.com/blog/hands-off-deployment-with-canary)
* [Prometheus has come of age – a reflection on the development of an open-source project](https://developers.soundcloud.com/blog/prometheus-has-come-of-age-a-reflection-on-the-development-of-an-open-source-project)
* [Prometheus: Monitoring at SoundCloud](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud)

</details>

<details>
<summary>Spotify</summary>

#### Blog Posts
* [Techbytes: What The Industry Misses About Incidents and What You Can Do](https://engineering.atspotify.com/2020/02/26/techbytes-what-the-industry-misses-about-incidents-and-what-you-can-do/)
* [Automated Incident Response Infrastructure in GCP](https://engineering.atspotify.com/2019/04/04/whacking-a-million-moles-automated-incident-response-infrastructure-in-gcp/)

#### Videos
* [Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance](https://www.usenix.org/conference/srecon19americas/presentation/root)

</details>

<details>
<summary>Squarespace</summary>

#### Blog Posts
* [Under the Hood: Ensuring Site Reliability](https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability)

#### Videos
* [Pushing through Friction](https://www.usenix.org/conference/srecon19emea/presentation/na)
* [How to SRE When Everything's Already on Fire](https://www.usenix.org/conference/srecon19emea/presentation/hidalgo)
* [Case Study: Implementing SLOs for a New Service](https://www.usenix.org/conference/srecon19americas/presentation/lawson)
* [Creating a Code Review Culture](https://www.usenix.org/conference/srecon19americas/presentation/turner)

</details>

<details>
<summary>StackOverflow</summary>

#### Blog Posts
* [A deeper dive into our May 2019 security incident](https://stackoverflow.blog/2021/01/25/a-deeper-dive-into-our-may-2019-security-incident/)
* [Failing over without falling over](https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/)

#### Videos
* [Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline](https://www.usenix.org/conference/srecon20americas/presentation/limoncelli)

</details>

<details>
<summary>Stripe</summary>

#### Blog Posts
* [Fast and flexible observability with canonical log lines](https://stripe.com/blog/canonical-log-lines)
* [Introducing Veneur: high performance and global aggregation for Datadog](https://stripe.com/blog/engineering/page/3)

#### Videos
* [How Stripe Invests in Technical Infrastructure](https://www.usenix.org/conference/srecon19emea/presentation/larson)
* [The AWS Billing Machine and Optimizing Cloud Costs](https://www.usenix.org/conference/srecon19asia/presentation/lopopolo)

</details>

<details>
<summary>Target</summary>

#### Blog Posts
* [Ɔhaos Ǝnginǝǝring @ Target - Part 2](https://tech.target.com/2019/05/09/chaos-engineering-at-Target.html)
* [Ɔhaos Ǝnginǝǝring @ Target - Part 1](https://tech.target.com/2019/02/05/chaos-engineering-at-Target.html)
* [GoAlert - Your Future Open Source, On-Call Notification Product](https://tech.target.com/2019/02/25/introducing-goalert.html)
* [On Infrastructure at Scale: A Cascading Failure of Distributed Systems](https://tech.target.com/2019/01/14/cascading-failure-of-distributed-systems.html)
* [Distributed Troubleshooting](https://tech.target.com/2017/04/05/distributed-troubleshooting.html)
* [Outage Resolution Through Automation](https://tech.target.com/2014/12/29/outage-resolution-through-automation.html)

</details>

<details>
<summary>Trivago</summary>

#### Blog Posts
* [How To Get Fooled By Metrics](https://tech.trivago.com/2020/12/04/how-to-get-fooled-by-metrics/)

</details>

<details>
<summary>Uber</summary>

#### Blog Posts
* [Disaster Recovery for Multi-Region Kafka at Uber](https://eng.uber.com/kafka/)
* [Engineering Failover Handling in Uber’s Mobile Networking Infrastructure](https://eng.uber.com/eng-failover-handling/)
* [Optimizing Observability with Jaeger, M3, and XYS at Uber](https://eng.uber.com/optimizing-observability/)

#### Videos
* [A Tale of Two Rotations: Building a Humane & Effective On-Call](https://www.usenix.org/conference/srecon19emea/presentation/lee)
* [Testing in Production at Scale](https://www.usenix.org/conference/srecon19americas/presentation/gud)
* [‘A History of SRE at Uber’ with Rick Boone of Uber](https://www.youtube.com/watch?v=qJnS-EfIIIE)

</details>

<details>
<summary>Wikimedia Foundation</summary>

#### Videos
* [Testing Encyclopedias in Production](https://www.usenix.org/conference/srecon20americas/presentation/mouzeli)
* [What Happens When You Type en.wikipedia.org?](https://www.usenix.org/conference/srecon19emea/presentation/mouzeli)

</details>

<details>
<summary>Zerodha</summary>

#### Blog Posts
* [Infrastructure monitoring with Prometheus at Zerodha](https://zerodha.tech/blog/infra-monitoring-at-zerodha/)

</details>

<details>
<summary>SRECon Mix Playlist</summary>

#### Videos
* [Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE](https://www.usenix.org/conference/srecon20americas/presentation/charagondla)
* [Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps](https://www.usenix.org/conference/srecon20americas/presentation/yitzhaki)
* [Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure](https://www.usenix.org/conference/srecon20americas/presentation/argent)
* [Alaska Airlines - Capacity Prediction in External Services](https://www.usenix.org/conference/srecon19americas/presentation/kraus)
* [BuzzFeed - Optimizing for Learning](https://www.usenix.org/conference/srecon19americas/presentation/mcdonald)
* [BT - Challenges of Starting an SRE Team from Scratch in an Enterprise](https://www.usenix.org/conference/srecon20americas/presentation/narvas)
* [Cloudflare - Support Operations Engineering: Scaling Developer Products to the Millions](https://www.usenix.org/conference/srecon19emea/presentation/ali)
* [Hudson River Trading - Fixing On-Call When Nobody Thinks It's (Too) Broken](https://www.usenix.org/conference/srecon19americas/presentation/lykke)
* [IBM - Why Automating Everything Adds to Your Toil](https://www.usenix.org/conference/srecon19emea/presentation/thorne)
* [Genesys - The Smallest Possible SRE Team](https://www.usenix.org/conference/srecon20americas/presentation/thomas)
* [G-Research - My Life as a Solo SRE](https://www.usenix.org/conference/srecon19emea/presentation/murphy)
* [Grafana Labs - SRE in the Third Age](https://www.usenix.org/conference/srecon19emea/presentation/rabenstein)
* [Kenna Security - Building a Scalable Monitoring System](https://www.usenix.org/conference/srecon19emea/presentation/struve)
* [Lightstep - Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better](https://www.usenix.org/conference/srecon20americas/presentation/spoonhower)
* [MessageBird - Autopsy of a MySQL Automation Disaster](https://www.usenix.org/conference/srecon19emea/presentation/gagne)
* [Netlify - Perks and Pitfalls of Building a Remote First Team](https://www.usenix.org/conference/srecon19emea/presentation/neal)
* [ReactiveOps - Zero to SRE](https://www.usenix.org/conference/srecon19americas/presentation/schlesinger)
* [Salesforce - Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19](https://www.usenix.org/conference/srecon20americas/presentation/collins)
* [Sprax - From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations](https://www.usenix.org/conference/srecon19emea/presentation/huxtable)
* [The New York Times - SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events](https://www.usenix.org/conference/srecon19emea/presentation/wan)
* [Twitter - Hiring Great SREs](https://www.usenix.org/conference/srecon19emea/presentation/rutkin)
* [United States Digital Service - Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value](https://www.usenix.org/conference/srecon19americas/presentation/wieczorek)
* [Unity Technologies - Being Reasonable about SRE](https://www.usenix.org/conference/srecon19emea/presentation/urbanec)
* [Udemy - How to Do SRE When You Have No SRE](https://www.usenix.org/conference/srecon19emea/presentation/ocallaghan)
* [Vanguard - Cloudy with a Chance of Chaos](https://www.usenix.org/conference/srecon20americas/presentation/yakomin)
* [WeWork - Learning from Learnings: Anatomy of Three Incidents](https://www.usenix.org/conference/srecon19americas/presentation/shoup)
* [Yelp - What I Wish I Knew before Going On-Call](https://www.usenix.org/conference/srecon19emea/presentation/shu)
* [Zendesk - Latency and Availability Error Budgets Done Right at Scale](https://www.usenix.org/conference/srecon20americas/presentation/moyer)

</details>

---
## Resources
### Books
* [97 Things Every SRE Should Know](https://www.oreilly.com/library/view/97-things-every/9781492081487/)
* [SLO Adoption and Usage in Site Reliability Engineering](https://www.oreilly.com/library/view/slo-adoption-and/9781492075370/)
* [Practical Site Reliability Engineering](https://www.oreilly.com/library/view/practical-site-reliability/9781788839563/)
* [Implementing Service Level Objectives](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/)
* [Chaos Engineering](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)
* [Seeking SRE](https://www.oreilly.com/library/view/seeking-sre/9781491978856/)
* [Security Chaos Engineering](https://www.oreilly.com/library/view/security-chaos-engineering/9781492080350/)
* [Chaos Engineering Observability](https://www.oreilly.com/library/view/chaos-engineering-observability/9781492051046/)
* [Training Site Reliability Engineers](https://www.oreilly.com/library/view/training-site-reliability/9781492076018/)
* [Database Reliability Engineering](https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/)
* [What Is SRE?](https://www.oreilly.com/library/view/what-is-sre/9781492054429/)
* [Database Reliability Engineering: What, Why, and How?](https://www.oreilly.com/library/view/database-reliability-engineering/9781492030942/)
* [Observability Engineering](https://www.oreilly.com/library/view/observability-engineering/9781492076438/)

### Events
* [SRECon Past Events](https://www.usenix.org/srecon#past)
* [ChaosConf](https://www.chaosconf.io/)

### Others
* [Awesome SRE](https://github.com/dastergon/awesome-sre)
* [Awesome Site Reliability Engineering Tools](https://github.com/SquadcastHub/awesome-sre-tools)
* [Google SRE Page](https://sre.google/)
* [Microsoft SRE Page](https://docs.microsoft.com/en-us/azure/site-reliability-engineering/)
* [SRE Weekly Newsletter](https://sreweekly.com/)
* [Chaos Engineering Newsletter](https://chaosengineering.news/)
* [DevOps Weekly Newsletter](http://devopsweekly.com)

## Credits
* Banner image [Cartoon vector created by vectorjuice - www.freepik.com](https://www.freepik.com/vectors/cartoon)

## Contribute
Contributions welcome! Read the [contribution guidelines](contributing.md) first.

## License
[](https://creativecommons.org/publicdomain/zero/1.0)

To the extent possible under law, Unmesh Gundecha has waived all copyright and
related or neighboring rights to this work.