Initial Commit

Unmesh Gundecha
2021-02-14 22:03:29 +08:00
commit 68bee0a687
12 changed files with 1047 additions and 0 deletions

2
.gitattributes vendored Normal file

@@ -0,0 +1,2 @@
* text=auto
readme.md merge=union

16
.github/workflows/link_check.yml vendored Normal file

@@ -0,0 +1,16 @@
name: Check Markdown links
on:
  push:
    branches:
      - master
  schedule:
    # Run every Monday
    - cron: "0 0 * * 1"
jobs:
  markdown-link-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - uses: gaurav-nelson/github-action-markdown-link-check@v1
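
As a rough illustration of what a Markdown link check involves, the sketch below pulls inline links out of Markdown text so that each URL could then be requested. It is a simplified assumption, not the implementation of the `github-action-markdown-link-check` action; `extract_links`, `LINK_PATTERN`, and the sample text are hypothetical names used only for this example.

```python
import re

# Matches inline Markdown links of the form [text](url).
# Simplification: reference-style links and URLs containing spaces
# or parentheses are not handled, so images with a title are skipped.
LINK_PATTERN = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def extract_links(markdown: str) -> list[tuple[str, str]]:
    """Return (text, url) pairs for every inline link in the given Markdown."""
    return LINK_PATTERN.findall(markdown)


sample = (
    "* [A Journey Into SRE](https://www.algolia.com/blog/a-journey-into-sre/)\n"
    "* [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/)\n"
)

for text, url in extract_links(sample):
    print(f"{text} -> {url}")
```

A real checker would go on to request each URL and report the ones that fail; that step is left out here so the sketch runs without network access.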

16
.github/workflows/workflow.yml vendored Normal file

@@ -0,0 +1,16 @@
name: CI
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: markdown syntax
        uses: nosborn/github-action-markdown-cli@v1.1.1
        with:
          files: README.md
          config_file: ".markdownlint.json"

5
.markdownlint.json Normal file

@@ -0,0 +1,5 @@
{
  "default": true,
  "line-length": false,
  "no-duplicate-header": false
}
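
In a markdownlint configuration, `"default": true` turns every rule on, and a rule set to `false` is switched off. The sketch below reads the same JSON and lists the rules that are disabled. It only illustrates how the file can be interpreted and is not how markdownlint itself loads configuration; `disabled_rules` and `CONFIG_TEXT` are hypothetical names.

```python
import json

# The same settings as .markdownlint.json, inlined so the example is self-contained.
CONFIG_TEXT = """
{
  "default": true,
  "line-length": false,
  "no-duplicate-header": false
}
"""


def disabled_rules(config: dict) -> list[str]:
    """List rule names explicitly switched off in a markdownlint-style config."""
    return sorted(
        name
        for name, value in config.items()
        if name != "default" and value is False
    )


config = json.loads(CONFIG_TEXT)
print("all rules on by default:", config.get("default", True))
print("disabled:", disabled_rules(config))
```

With this configuration, the line-length and duplicate-heading checks are skipped, which suits a README made up of long link lines and repeated "Blog Posts" and "Videos" headings.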

8
.yo-rc.json Normal file

@@ -0,0 +1,8 @@
{
  "generator-awesome-list": {
    "promptValues": {
      "username": "Unmesh Gundecha",
      "email": "upgundecha@gmail.com"
    }
  }
}

121
LICENSE Normal file

@@ -0,0 +1,121 @@
Creative Commons Legal Code
CC0 1.0 Universal
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
HEREUNDER.
Statement of Purpose
The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator
and subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").
Certain owners wish to permanently relinquish those rights to a Work for
the purpose of contributing to a commons of creative, cultural and
scientific works ("Commons") that the public can reliably and without fear
of later claims of infringement build upon, modify, incorporate in other
works, reuse and redistribute as freely as possible in any form whatsoever
and for any purposes, including without limitation commercial purposes.
These owners may contribute to the Commons to promote the ideal of a free
culture and the further production of creative, cultural and scientific
works, or to gain reputation or greater distribution for their Work in
part through the use and efforts of others.
For these and/or other purposes and motivations, and without any
expectation of additional consideration or compensation, the person
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
is an owner of Copyright and Related Rights in the Work, voluntarily
elects to apply CC0 to the Work and publicly distribute the Work under its
terms, with knowledge of his or her Copyright and Related Rights in the
Work and the meaning and intended legal effect of CC0 on those rights.
1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not
limited to, the following:
i. the right to reproduce, adapt, distribute, perform, display,
communicate, and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or
likeness depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data
in a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation
thereof, including any amended or successor version of such
directive); and
vii. other similar, equivalent or corresponding rights throughout the
world based on applicable law or treaty, and any national
implementations thereof.
2. Waiver. To the greatest extent permitted by, but not in contravention
of, applicable law, Affirmer hereby overtly, fully, permanently,
irrevocably and unconditionally waives, abandons, and surrenders all of
Affirmer's Copyright and Related Rights and associated claims and causes
of action, whether now known or unknown (including existing as well as
future claims and causes of action), in the Work (i) in all territories
worldwide, (ii) for the maximum duration provided by applicable law or
treaty (including future time extensions), (iii) in any current or future
medium and for any number of copies, and (iv) for any purpose whatsoever,
including without limitation commercial, advertising or promotional
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
member of the public at large and to the detriment of Affirmer's heirs and
successors, fully intending that such Waiver shall not be subject to
revocation, rescission, cancellation, termination, or any other legal or
equitable action to disrupt the quiet enjoyment of the Work by the public
as contemplated by Affirmer's express Statement of Purpose.
3. Public License Fallback. Should any part of the Waiver for any reason
be judged legally invalid or ineffective under applicable law, then the
Waiver shall be preserved to the maximum extent permitted taking into
account Affirmer's express Statement of Purpose. In addition, to the
extent the Waiver is so judged Affirmer hereby grants to each affected
person a royalty-free, non transferable, non sublicensable, non exclusive,
irrevocable and unconditional license to exercise Affirmer's Copyright and
Related Rights in the Work (i) in all territories worldwide, (ii) for the
maximum duration provided by applicable law or treaty (including future
time extensions), (iii) in any current or future medium and for any number
of copies, and (iv) for any purpose whatsoever, including without
limitation commercial, advertising or promotional purposes (the
"License"). The License shall be deemed effective as of the date CC0 was
applied by Affirmer to the Work. Should any part of the License for any
reason be judged legally invalid or ineffective under applicable law, such
partial invalidity or ineffectiveness shall not invalidate the remainder
of the License, and in such case Affirmer hereby affirms that he or she
will not (i) exercise any of his or her remaining Copyright and Related
Rights in the Work or (ii) assert any associated claims and causes of
action with respect to the Work, in either case contrary to Affirmer's
express Statement of Purpose.
4. Limitations and Disclaimers.
a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or
warranties of any kind concerning the Work, express, implied,
statutory or otherwise, including without limitation warranties of
title, merchantability, fitness for a particular purpose, non
infringement, or the absence of latent or other defects, accuracy, or
the present or absence of errors, whether or not discoverable, all to
the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without
limitation any person's Copyright and Related Rights in the Work.
Further, Affirmer disclaims responsibility for obtaining any necessary
consents, permissions or other rights required for any use of the
Work.
d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to
this CC0 or use of the Work.

751
README.md Normal file

@@ -0,0 +1,751 @@
# How they SRE
![Alt](banner.png "banner")
> A curated collection of publicly available resources on how technology or tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
## Introduction
Inspired by [Howtheytest](https://github.com/abhivaikar/howtheytest) by [Abhijeet Vaikar](https://github.com/abhivaikar), __How They SRE__ is a curated knowledge repository of best practices, tools, techniques, and culture of SRE adopted by the leading technology or tech-savvy organizations.
Many organizations regularly share their best practices, tools, and techniques, and offer insight into their engineering culture, on public platforms such as engineering blogs, conferences, and meetups. The content in this repository is curated from these sources.
### Topics
* Site Reliability Engineering
* Hiring and Building SRE teams
* SRE Culture
* DevOps
* Monitoring & Observability
* Alerting
* Incident Management & Incident Response
* Post-Mortem
* On-Call
* Testing in Production
* Chaos Engineering
* Automation
* Performance
## Organizations
<details>
<summary>Airbnb</summary>
#### Blog Posts
* [Detecting Vulnerabilities With Vulnture](https://medium.com/airbnb-engineering/detecting-vulnerabilities-with-vulnture-f5f23387f6ec)
* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
* [When The Cloud Gets Dark — How Amazon's Outage Affected Airbnb](https://medium.com/airbnb-engineering/when-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)
</details>
<details>
<summary>Algolia</summary>
#### Blog Posts
* [May 30 SSL incident](https://www.algolia.com/blog/may-30-ssl-incident/)
* [A Journey Into SRE](https://www.algolia.com/blog/a-journey-into-sre/)
</details>
<details>
<summary>Asana</summary>
#### Blog Posts
* [How Asana ships stable web application releases](https://blog.asana.com/2021/01/asana-engineering-ships-web-application-releases/)
* [Analysis of recent downtime & what we're doing to prevent future incidents](https://blog.asana.com/2019/09/downtime-what-were-doing-to-prevent-future-downtime/)
* [Developer environment: Achieving reliability by making it fast to reset](https://blog.asana.com/2017/07/developer-environment-making-it-reliable-by-making-it-fast-to-reset/)
</details>
<details>
<summary>ASOS</summary>
#### Blog Posts
* [Cyber Security @ ASOS.com](https://medium.com/asos-techblog/cyber-security-asos-com-7d1d1f346e57)
* [Security Operations 24x7](https://medium.com/asos-techblog/security-operations-24-x-7-2e90c8e5e7e)
* [The skills we look for in Cyber Security Incident Response](https://medium.com/asos-techblog/the-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)
</details>
<details>
<summary>Atlassian</summary>
#### Blog Posts
* [Best practices for change management in the age of DevOps](https://www.atlassian.com/engineering/best-practices-for-change-management-in-the-age-of-devops)
* [Automated testing: 5 lessons from Atlassian's Kubernetes team on testing infrastructure as code](https://www.atlassian.com/engineering/automated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code)
* [How to export Kubernetes events for observability and alerting](https://www.atlassian.com/engineering/how-to-export-kubernetes-events-for-observability-and-alerting)
</details>
<details>
<summary>Baidu</summary>
#### Videos
* [Anomaly Detection on Golden Signals](https://www.usenix.org/conference/srecon19asia/presentation/chen-yu)
* [NetRadar: Monitoring the Datacenter Network](https://www.usenix.org/conference/srecon19asia/presentation/chen-yun)
</details>
<details>
<summary>Basecamp</summary>
#### Blog Posts
* [Inside a CODE RED: Network Edition](https://m.signalvnoise.com/inside-a-code-red-network-edition/)
* [Three Basecamp outages. One week. What happened?](https://m.signalvnoise.com/three-basecamp-outages-one-week-what-happened/)
* [Basecamp 2 and Basecamp 3 search outage report](https://m.signalvnoise.com/basecamp-2-and-basecamp-3-search-outage-report/)
* [Reducing Incident Escalations at Basecamp](https://m.signalvnoise.com/reducing-incident-escalations-at-basecamp/)
</details>
<details>
<summary>Bloomberg</summary>
#### Videos
* [Capacity Planning and Performance Enhancement with Page Reference Sampling](https://www.usenix.org/conference/srecon20americas/presentation/chen)
* [Why SREs can't afford to NOT do Chaos Engineering](https://www.usenix.org/conference/srecon20americas/presentation/pawlikowski)
* [Tracing Real-Time Distributed Systems](https://www.usenix.org/conference/srecon19emea/presentation/yakimov)
* [The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation](https://www.usenix.org/conference/srecon19asia/presentation/sorensen)
* [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https://www.usenix.org/conference/srecon19americas/presentation/chen)
</details>
<details>
<summary>Booking.com</summary>
#### Videos
* [SLOs for Data-Intensive Services](https://www.usenix.org/conference/srecon19emea/presentation/fouquet)
* [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https://www.usenix.org/conference/srecon19americas/presentation/iacoboaia)
</details>
<details>
<summary>Capital One</summary>
#### Blog Posts
* [Automate AWS Infrastructure with Boto 3: AWS Health Check](https://medium.com/capital-one-tech/automate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075)
* [Active-Active Shared-Nothing Database Architecture](https://medium.com/capital-one-tech/active-active-shared-nothing-database-architecture-304957ffb89)
* [The 3 Rs of SREs: Resiliency, Recovery & Reliability](https://medium.com/capital-one-tech/the-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b)
* [5 Steps to Getting Your App Chaos Ready](https://medium.com/capital-one-tech/5-steps-to-getting-your-app-chaos-ready-capital-one-a5b7b3cb8e09)
* [4 Real-World Scenarios That Read Like Chaos Engineering Experiments](https://medium.com/capital-one-tech/4-real-world-scenarios-that-read-like-chaos-engineering-experiments-8dbf40c5f247)
* [Embrace the Chaos … Engineering](https://medium.com/capital-one-tech/embrace-the-chaos-engineering-203fd6fc6ff7)
* [3 Lessons Learned From Implementing Chaos Engineering at Enterprise](https://medium.com/capital-one-tech/3-lessons-learned-from-implementing-chaos-engineering-at-enterprise-28eb3ffecc57)
* [A Deep Dive Into Seamless Blue/Green Deployment Using AWS CodeDeploy](https://medium.com/capital-one-tech/seamless-blue-green-deployment-using-aws-codedeploy-4c36c0bbeef4)
* [Secure Docker Containers Require Secure Applications](https://medium.com/capital-one-tech/secure-docker-containers-require-secure-applications-75eb358abef9)
* [4 Steps for Pairing the Cloud and DevOps to Improve Resiliency](https://medium.com/capital-one-tech/4-steps-for-pairing-cloud-and-devops-to-improve-resiliency-c72fe2e52b05)
* [Container Ready Applications with Twelve-Factor App and Microservices Architecture](https://medium.com/capital-one-tech/container-ready-applications-with-twelve-factor-app-and-microservices-architecture-16af683a767f)
* [Deploying with Confidence — Minimize Risk, Maximize Resiliency With Canary Deployments on AWS](https://medium.com/capital-one-tech/deploying-with-confidence-strategies-for-canary-deployments-on-aws-7cab3798823e)
* [Architecting for Resiliency](https://medium.com/capital-one-tech/architecting-for-resiliency-9ec663db5c94)
* [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https://medium.com/capital-one-tech/continuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d)
* [The Mon-ifesto Part 1: Metrics](https://medium.com/capital-one-tech/the-mon-ifesto-part-1-metrics-808f6c944765)
#### Major incidents & analysis reports
* [Information on the Capital One Cyber Incident](https://www.capitalone.com/facts2019/)
* [A Case Study of the Capital One Data Breach](http://web.mit.edu/smadnick/www/wp/2020-16.pdf)
#### Videos
* [Banking on Continuous Delivery - Capital One](https://www.youtube.com/watch?v=_DnYSQEUTfo)
* [Continuous Chaos in DevOps - Capital One](https://www.youtube.com/watch?v=U_Uh5RMCwPI)
* [DevOps at Capital One: Focusing on Pipeline and Measurement](https://www.youtube.com/watch?v=6Q0mtVnnthQ)
* [Automating the Management of the Operational Health of Cloud Accounts at Scale](https://www.usenix.org/conference/srecon19americas/presentation/walls)
</details>
<details>
<summary>DBS</summary>
#### Blog Posts
* [Site Reliability Engineering at DBS Bank](https://medium.com/dbs-tech-blog/site-reliability-engineering-at-dbs-bank-32c02228ccf4)
#### Videos
* [SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS](https://www.youtube.com/watch?v=URwkaRbOLxI&feature=emb_title)
</details>
<details>
<summary>Dropbox</summary>
#### Blog Posts
* [Monitoring server applications with Vortex](https://dropbox.tech/infrastructure/monitoring-server-applications-with-vortex)
* [Athena: Our automated build health management system](https://dropbox.tech/infrastructure/athena-our-automated-build-health-management-system)
#### Videos
* [Service Discovery Challenges at Scale](https://www.usenix.org/conference/srecon19americas/presentation/nigmatullin)
</details>
<details>
<summary>Facebook</summary>
#### Videos
* [A Customer Service Approach to SRE](https://www.usenix.org/conference/srecon19emea/presentation/looney)
* [How (Not) to Scale a Project: A Post-Mortem](https://www.usenix.org/conference/srecon19asia/presentation/bagnoli)
* [Releasing the World's Largest Python Site Every 7 Minutes](https://www.usenix.org/conference/srecon19asia/presentation/wong-shuhong)
* [Using ML to Automate Dynamic Error Categorization](https://www.usenix.org/conference/srecon19asia/presentation/davoli)
</details>
<details>
<summary>Fastly</summary>
#### Videos
* [SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager](https://www.usenix.org/conference/srecon19americas/presentation/wohlner)
* [Resilience Engineering Mythbusting](https://www.usenix.org/conference/srecon19americas/presentation/gallego)
</details>
<details>
<summary>eBay</summary>
#### Blog Posts
* [Resiliency and Disaster Recovery with Kafka](https://tech.ebayinc.com/engineering/resiliency-and-disaster-recovery-with-kafka/)
* [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https://tech.ebayinc.com/engineering/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/)
* [SRE Case Study: Mysterious Traffic Imbalance](https://tech.ebayinc.com/engineering/sre-case-study-mysterious-traffic-imbalance/)
* [Zero Downtime, Instant Deployment and Rollback](https://tech.ebayinc.com/engineering/zero-downtime-instant-deployment-and-rollback/)
#### Videos
* [Madaari: Ordering for the Monkeys](https://www.usenix.org/conference/srecon19americas/presentation/raina)
</details>
<details>
<summary>Etsy</summary>
#### Blog Posts
* [Etsy's Debriefing Facilitation Guide for Blameless Postmortems](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)
* [Opsweekly: Measuring on-call experience with alert classification](https://codeascraft.com/2014/06/19/opsweekly-measuring-on-call-experience-with-alert-classification/)
* [Demystifying Site Outages](https://blog.etsy.com/news/2012/demystifying-site-outages/)
* [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)
#### Videos
* [Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe](https://www.youtube.com/watch?v=LdOe18KhtT4)
* [Migrating a Monolith to the Cloud](https://www.usenix.org/conference/srecon19americas/presentation/govande)
</details>
<details>
<summary>Expedia</summary>
#### Blog Posts
* [The Cost of 100% Reliability](https://medium.com/expedia-group-tech/the-cost-of-100-reliability-ecb2901f23a4)
* [Creating Monitoring Dashboards](https://medium.com/expedia-group-tech/creating-monitoring-dashboards-1f3fbe0ae1ac)
* [Using Bash for DevOps](https://medium.com/expedia-group-tech/using-bash-for-devops-7046eed1aa63)
</details>
<details>
<summary>GitHub</summary>
#### Blog Posts
* [Deployment reliability at GitHub](https://github.blog/2021-02-03-deployment-reliability-at-github/)
* [Improving how we deploy GitHub](https://github.blog/2021-01-25-improving-how-we-deploy-github/)
* [Building On-Call Culture at GitHub](https://github.blog/2021-01-06-building-on-call-culture-at-github/)
* [Reducing flaky builds by 18x](https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/)
* [The evolving role of operations in DevOps](https://github.blog/2020-12-03-the-evolving-role-of-operations-in-devops/)
* [Getting started with DevOps automation](https://github.blog/2020-10-29-getting-started-with-devops-automation/)
* [MySQL High Availability at GitHub](https://github.blog/2018-06-20-mysql-high-availability-at-github/)
#### Major incidents & analysis reports
* [GitHub Availability Report: January 2021](https://github.blog/2021-02-02-github-availability-report-january-2021/)
* [GitHub Availability Report: December 2020](https://github.blog/2021-01-06-github-availability-report-december-2020/)
* [GitHub Availability Report: November 2020](https://github.blog/2020-12-02-availability-report-november-2020/)
* [GitHub Availability Report: August 2020](https://github.blog/2020-09-02-github-availability-report-august-2020/)
* [GitHub Availability Report: July 2020](https://github.blog/2020-08-05-github-availability-report-july-2020/)
* [Introducing the GitHub Availability Report](https://github.blog/2020-07-08-introducing-the-github-availability-report/)
* [February service disruptions post-incident analysis](https://github.blog/2020-03-26-february-service-disruptions-post-incident-analysis/)
* [October 21 post-incident analysis](https://github.blog/2018-10-30-oct21-post-incident-analysis/)
* [February 28th DDoS Incident Report](https://github.blog/2018-03-01-ddos-incident-report/)
* [Incident Report: Inadvertent Private Repository Disclosure](https://github.blog/2016-10-28-incident-report-inadvertent-private-repository-disclosure/)
#### Videos
* [One on One SRE](https://www.usenix.org/conference/srecon19americas/presentation/tobey)
</details>
<details>
<summary>Google</summary>
#### Blog Posts
* [SRE Practices & Processes](https://sre.google/resources/#practicesandprocesses)
* [Three months, 30x demand: How we scaled Google Meet during COVID-19](https://cloud.google.com/blog/products/g-suite/keeping-google-meet-ahead-of-usage-demand-during-covid-19)
* [SRE Classroom: Distributed PubSub](https://sre.google/resources/practices-and-processes/distributed-pubsub/)
#### Books
* [Building Secure & Reliable Systems](https://static.googleusercontent.com/media/sre.google/en//static/pdf/building_secure_and_reliable_systems.pdf)
* [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/)
* [The Site Reliability Workbook](https://sre.google/workbook/table-of-contents/)
#### Videos
* [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/uTEL8Ff1Zvk)
* [Risk and Error Budgets with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/y2ILKr8kCJU)
* [Pragmatic Automation with Max Luebbe of GCP](https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s)
* [Must Watch! - Google SRE YouTube Playlist](https://www.youtube.com/playlist?list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj)
* [Squish Level Objectives: How SRE can Help Align Technical Work to User Benefit](https://www.usenix.org/conference/srecon20americas/presentation/stanke)
* [Implementing Distributed Consensus](https://www.usenix.org/conference/srecon20americas/presentation/ludtke)
* [The SRE I Aspire to Be](https://www.usenix.org/conference/srecon19emea/presentation/aknin)
* [SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19emea/presentation/perry)
* [Zero Touch Prod: Towards Safer and More Secure Production Environments](https://www.usenix.org/conference/srecon19emea/presentation/czapinski)
* [All of Our ML Ideas Are Bad (and We Should Feel Bad)](https://www.usenix.org/conference/srecon19emea/presentation/underwood)
* [The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It](https://www.usenix.org/conference/srecon19emea/presentation/desai)
* [Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program](https://www.usenix.org/conference/srecon19emea/presentation/petoff)
* [Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way](https://www.usenix.org/conference/srecon19emea/presentation/gleason)
* [Practical Instrumentation for Observability](https://www.usenix.org/conference/srecon19asia/presentation/krabbe)
* [What Is ML Ops: Solutions and Best Practices for DevOps of Production ML Services](https://www.usenix.org/conference/srecon19asia/presentation/sato)
* [Unified Reporting of Service Reliability](https://www.usenix.org/conference/srecon19asia/presentation/zhang)
* [How to Trade off Server Utilization and Tail Latency](https://www.usenix.org/conference/srecon19asia/presentation/plenz)
* [Keeping the Balance: Internet-Scale Loadbalancing Demystified](https://www.usenix.org/conference/srecon19americas/presentation/nolan-loadbalancing)
* [From Black Box to a Known Quantity: How to Build Predictable, Reliable ML-based Services](https://www.usenix.org/conference/srecon19americas/presentation/virji)
* [Mindfulness in SRE: Monitoring and Alerting for One's Self](https://www.usenix.org/conference/srecon19americas/presentation/lutz)
* [Pragmatic Automation](https://www.usenix.org/conference/srecon19americas/presentation/luebbe)
* [Sublinear Scaling in Practice: The 1k SRE Project](https://www.usenix.org/conference/srecon19americas/presentation/rath)
* [Strategies to Edit Production Data](https://www.usenix.org/conference/srecon19americas/presentation/qiu)
* [The Curse of SRE Autonomy and How to Manage It](https://www.usenix.org/conference/srecon19americas/presentation/bondi)
* [Scaling SRE Organizations: The Journey from 1 to Many Teams](https://www.usenix.org/conference/srecon19americas/presentation/franco)
* [SRE Classroom - How to Design a Distributed System in 3 Hours](https://www.usenix.org/conference/srecon19americas/presentation/thomas)
* [Using PRDs and User Journeys to Design User-Friendly Tools](https://www.usenix.org/conference/srecon19americas/presentation/stockman)
</details>
<details>
<summary>Gojek</summary>
#### Blog Posts
* [Why We Swear by the RCA](https://blog.gojekengineering.com/why-we-swear-by-the-rca-f535fd5abbcb)
</details>
<details>
<summary>Grab</summary>
#### Blog Posts
* [Our Journey to Continuous Delivery at Grab (Part 1)](https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab)
* [Designing Resilient Systems Beyond Retries (Part 3): Architecture Patterns and Chaos Engineering](https://engineering.grab.com/beyond-retries-part-3)
* [Orchestrating Chaos using Grab's Experimentation Platform](https://engineering.grab.com/chaos-engineering)
</details>
<details>
<summary>Grammarly</summary>
#### Blog Posts
* [Security Operations in an AWS Environment](https://www.grammarly.com/blog/engineering/security-infrastructure-aws/)
</details>
<details>
<summary>Indeed</summary>
#### Blog Posts
* [Being Just Reliable Enough](https://engineering.indeedblog.com/blog/2019/10/being-just-reliable-enough/)
* [Automating Indeed's Release Process](https://engineering.indeedblog.com/blog/2017/03/automating-release-process/)
* [Sloth, a Tool for Inducing Network Failures with Preetha Appan of Indeed.com](https://www.usenix.org/conference/srecon17americas/program/presentation/appan)
#### Videos
* [Are We Getting Better Yet? Progress Toward Safer Operations](https://www.usenix.org/conference/srecon20americas/presentation/elman)
</details>
<details>
<summary>Heroku</summary>
#### Blog Posts
* [Incident Response at Heroku](https://blog.heroku.com/incident-response-at-heroku-2020)
</details>
<details>
<summary>LinkedIn</summary>
#### Blog Posts
* [Open source update: School of SRE](https://engineering.linkedin.com/blog/2021/open-source-update--school-of-sre)
* [Production testing with dark canaries](https://engineering.linkedin.com/blog/2020/production-testing-with-dark-canaries)
* [Smart alerts in ThirdEye, LinkedIn's real-time monitoring platform](https://engineering.linkedin.com/blog/2019/06/smart-alerts-in-thirdeye--linkedins-real-time-monitoring-platfor)
* [Iris mobile: An open source, mobile interface for incident management](https://engineering.linkedin.com/blog/2019/05/iris-mobile--an-open-source--mobile-interface-for-incident-manag)
* [LinkedOut: A Request-Level Failure Injection Framework](https://engineering.linkedin.com/blog/2018/05/linkedout--a-request-level-failure-injection-framework)
* [Eliminating toil with fully automated load testing](https://engineering.linkedin.com/blog/2019/eliminating-toil-with-fully-automated-load-testing)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 1](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p)
* [The Makeup of Successful Geographically-Distributed SRE Teams: Part 2](https://engineering.linkedin.com/blog/2018/03/the-makeup-of-successful-geographically-distributed-sre-teams--p0)
* [Project STAR*: Streamlining Our On-Call Process](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process)
* [Automating Your Oncall: Open Sourcing Fossor and Ascii Etch](https://engineering.linkedin.com/blog/2017/12/open-sourcing-fossor-and-ascii-etch)
* [Resilience Engineering at LinkedIn with Project Waterbear](https://engineering.linkedin.com/blog/2017/11/resilience-engineering-at-linkedin-with-project-waterbear)
* [Hiring SREs at LinkedIn](https://engineering.linkedin.com/blog/2017/07/hiring-sres-at-linkedin)
* [Open Sourcing Iris and Oncall](https://engineering.linkedin.com/blog/2017/06/open-sourcing-iris-and-oncall)
* [Building the SRE Culture at LinkedIn](https://engineering.linkedin.com/blog/2017/05/building-the-sre-culture-at-linkedin)
* [Failure is Not an Option](https://engineering.linkedin.com/blog/2017/01/failure-is-not-an-option)
* [MTTD and MTTR Are Key](https://engineering.linkedin.com/blog/2016/12/mttd-and-mttr-are-key)
* [What Gets Measured Gets Fixed](https://engineering.linkedin.com/blog/2016/12/what-gets-measured-gets-fixed)
* [Hiring SREs at LinkedIn](https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin)
#### Videos
* [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https://www.youtube.com/watch?v=ZemNg9GYvOA)
* [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https://www.usenix.org/conference/srecon20americas/presentation/doherty)
* [Weathering the Storm: How Early Warnings Save the Farm](https://www.usenix.org/conference/srecon19emea/presentation/sherwin)
* [Unconference: Unsolved Problems in SRE](https://www.usenix.org/conference/srecon19emea/presentation/andersen)
* [Leading without Managing: Becoming an SRE Technical Leader](https://www.usenix.org/conference/srecon19asia/presentation/palino-leading)
* [Why Does (My) Monitoring Suck?](https://www.usenix.org/conference/srecon19asia/presentation/palino-monitoring)
* [Traffic Forecasting and Stress Testing Infrastructure](https://www.usenix.org/conference/srecon19asia/presentation/sulakhe)
* [Collective Mindfulness for Better Decisions in SRE](https://www.usenix.org/conference/srecon19asia/presentation/andersen-mindfulness)
* [TCP—Architecture, Enhancements, and Tuning](https://www.usenix.org/conference/srecon19asia/presentation/dhakal)
* [Over 600 Million Members and Hundreds of Micro Services: How We Scaled Our Monitoring System to Keep up](https://www.usenix.org/conference/srecon19asia/presentation/lamba)
* [Understanding Business Metrics Can Make You a Better SRE](https://www.usenix.org/conference/srecon19asia/presentation/suley)
* [Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way](https://www.usenix.org/conference/srecon19americas/presentation/kehoe)
* [Differences in SRE Implementations across Companies](https://www.usenix.org/conference/srecon19americas/presentation/andersen)
</details>
<details>
<summary>Mercari</summary>
## Mercari
#### Blog Posts
* [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https://engineering.mercari.com/en/blog/entry/20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry/)
* [How do we share troubleshooting skills](https://engineering.mercari.com/en/blog/entry/2020-01-28-143339/)
* [Datadog Dashboard at Scale w/ Terraform](https://engineering.mercari.com/en/blog/entry/2019-12-09-122134/)
</details>
<details>
<summary>Microsoft</summary>
#### Videos
* [SLI & Reliability Deep-Dive with David N. Blank-Edelman of Microsoft](https://www.youtube.com/watch?v=1iMo3SkdQqQ)
* [Ironies of Automation: A Comedy in Three Parts with Tanner Lund of Microsoft](https://www.youtube.com/watch?v=U3ubcoNzx9k)
* [Sustainable Software Engineering & SREs](https://www.usenix.org/conference/srecon20americas/presentation/johnson)
* [Study on Human Factors and Team Culture to Improve Pager Fatigue](https://www.usenix.org/conference/srecon20americas/presentation/barteneva)
* [Prioritizing Trust While Creating Applications](https://www.usenix.org/conference/srecon19emea/presentation/davis)
* [Building Resilience: How to Learn More from Incidents](https://www.usenix.org/conference/srecon19emea/presentation/stenning)
* [A Tale of Two Postmortems: A Human Factors View](https://www.usenix.org/conference/srecon19asia/presentation/lund-postmortem)
* [Availability—Thinking beyond 9s](https://www.usenix.org/conference/srecon19asia/presentation/srinivasamurthy)
* [Ironies of Automation: A Comedy in Three Parts](https://www.usenix.org/conference/srecon19asia/presentation/lund-comedy)
* [The Ops in Serverless](https://www.usenix.org/conference/srecon19americas/presentation/davis)
</details>
<details>
<summary>MIRO</summary>
## MIRO
#### Blog Posts
* [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e)
* [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture](https://medium.com/miro-engineering/managing-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699)
* [Reliable load testing with regards to unexpected nuances](https://medium.com/miro-engineering/reliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)
</details>
<details>
<summary>Monzo</summary>
#### Blog Posts
* [Autoscaling Monzo: How we optimise our platform to be just the right size](https://monzo.com/blog/2020/10/19/autoscaling-monzo)
* [How we've evolved on-call at Monzo](https://monzo.com/blog/how-weve-evolved-on-call-at-monzo)
* [How we respond to incidents](https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents)
* [How we monitor Monzo](https://monzo.com/blog/2018/07/27/how-we-monitor-monzo)
#### Videos
* [Eventually Consistent Service Discovery](https://www.usenix.org/conference/srecon19emea/presentation/patel)
</details>
<details>
<summary>Netflix</summary>
#### Blog Posts
* [Building Netflix's Distributed Tracing Infrastructure](https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304)
* [Edgar: Solving Mysteries Faster with Observability](https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f)
* [Telltale: Netflix Application Monitoring Simplified](https://netflixtechblog.com/telltale-netflix-application-monitoring-simplified-5c08bfa780ba)
* [Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix](https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb)
* [Introducing Dispatch](https://netflixtechblog.com/introducing-dispatch-da4b8a2a8072)
* [Applying Netflix DevOps Patterns to Windows](https://netflixtechblog.com/applying-netflix-devops-patterns-to-windows-2a57f2dbbf79)
* [ChAP: Chaos Automation Platform](https://netflixtechblog.com/chap-chaos-automation-platform-53e6d528371f)
* [Starting the Avalanche](https://netflixtechblog.com/starting-the-avalanche-640e69b14a06)
* [Netflix Chaos Monkey Upgraded](https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d)
* [Chaos Engineering Upgraded](https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa)
* [From Chaos to Control — Testing the resiliency of Netflix's Content Discovery Platform](https://netflixtechblog.com/from-chaos-to-control-testing-the-resiliency-of-netflixs-content-discovery-platform-ce5566aef0a4)
#### Major incidents & analysis reports
* [Post-mortem of October 22, 2012 AWS degradation](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
#### Videos
* [When /bin/sh Attacks: Revisiting "Automate All the Things"](https://www.usenix.org/conference/srecon20americas/presentation/reed)
* [How Did Things Go Right? Learning More from Incidents](https://www.usenix.org/conference/srecon19americas/presentation/kitchens)
* [Monitoring and Tracing @Netflix Streaming Data Infrastructure](https://www.youtube.com/watch?v=DlWYNoLmma8)
* [Real user performance monitoring at Netflix scale - Martin Spier](https://www.youtube.com/watch?v=4RG2DUK03_0)
* [AWS re:Invent 2017 - Nora Jones Describes Why We Need More Chaos - Chaos Engineering, That Is](https://www.youtube.com/watch?v=rgfww8tLM0A)
* [AWS re:Invent 2017: Performing Chaos at Netflix Scale (DEV334)](https://www.youtube.com/watch?v=LaKGx0dAUlo)
* [Netflix: Multi-Regional Resiliency and Amazon Route 53](https://www.youtube.com/watch?v=WDDkLOT8SCk)
* [Designing Services for Resilience: Netflix Lessons](https://www.youtube.com/watch?v=RWyZkNzvC-c)
* [South Bay SRE Meetup - Netflix Cloud Performance Team](https://www.youtube.com/watch?v=uQ0flQOtQEA)
* [AWS re:Invent 2017: A Day in the Life of a Netflix Engineer III (ARC209)](https://www.youtube.com/watch?v=T_D1G42G0dE)
* [How Netflix Uses Kinesis Streams to Monitor Applications and Analyze Billions of Traffic Flows](https://www.youtube.com/watch?v=8tsIqfvizpU)
* [Mastering Chaos - A Netflix Guide to Microservices](https://www.youtube.com/watch?v=CZ3wIuvmHeM)
* [AWS re:Invent 2016: From Resilience to Ubiquity - #NetflixEverywhere Global Architecture (ARC204)](https://www.youtube.com/watch?v=leqUbSY55hY)
* [SREcon 2016 - Netflix: 190 Countries and 5 CORE SREs](https://www.youtube.com/watch?v=koGaH4ffXaU)
* [From Sys Admin to Netflix SRE](https://www.youtube.com/watch?v=lZI51YzIgVE)
* [Application Resilience Engineering and Operations at Netflix with Hystrix](https://www.youtube.com/watch?v=RzlluokGi1w)
* [Injecting Failure at Netflix](https://www.youtube.com/watch?v=ioXV28GtXeo)
* [LISA13 - How Netflix Embraces Failure to Improve Resilience and Maximize Availability](https://www.youtube.com/watch?v=3D0zS3kPNUU)
</details>
<details>
<summary>PayPal</summary>
#### Videos
* [SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal](https://www.youtube.com/watch?v=XAIj567wBsU&feature=emb_title)
* [SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal](https://www.usenix.org/conference/srecon19asia/presentation/sunder-vr)
* [Detecting Service Degradation and Failures at Scale through Distributed Log Processing](https://www.usenix.org/conference/srecon19asia/presentation/narayanan)
* [Operating Elasticsearch with Ease at Scale](https://www.usenix.org/conference/srecon19asia/presentation/sankaravadivel)
* [Ensuring Site Reliability through Security Controls](https://www.usenix.org/conference/srecon19asia/presentation/janakiraman)
</details>
<details>
<summary>Pinterest</summary>
#### Blog Posts
* [Simplifying web deploys](https://medium.com/pinterest-engineering/simplifying-web-deploys-19244fe13737)
* [Upgrading Pinterest operational metrics](https://medium.com/pinterest-engineering/upgrading-pinterest-operational-metrics-8718d058079a)
* [Distributed tracing at Pinterest with new open source tools](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)
* [Auto scaling Pinterest](https://medium.com/pinterest-engineering/auto-scaling-pinterest-df1d2beb4d64)
#### Videos
* [Building Actionable Code Ownership](https://www.usenix.org/conference/srecon20americas/presentation/mukherji)
* [Evolution of Observability Tools at Pinterest](https://www.usenix.org/conference/srecon19emea/presentation/abbas)
* [Automating OS/Platform Upgrades for Service Owners](https://www.usenix.org/conference/srecon19asia/presentation/menezes)
</details>
<details>
<summary>Postman</summary>
#### Blog Posts
* [Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana](https://medium.com/better-practices/chaos-d3ef238ec328)
</details>
<details>
<summary>Scribd</summary>
#### Blog Posts
* [Learning from incidents: getting Sidekiq ready to serve a billion jobs](https://tech.scribd.com/blog/2020/sidekiq-incident-learnings.html)
* [A testimonial for using PagerDuty at Scribd](https://tech.scribd.com/blog/2020/pagerduty-at-scribd.html)
* [Assigning pager duty to developers](https://tech.scribd.com/blog/2019/managing-pagerduty-rotations.html)
</details>
<details>
<summary>Shopify</summary>
#### Blog Posts
* [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events)
* [Capacity Planning at Scale](https://shopify.engineering/capacity-planning-shopify)
* [Using DNS Traffic Management to Add Resiliency to Shopify's Services](https://shopify.engineering/using-dns-traffic-management-add-resiliency-shopify-services)
* [Four Steps to Creating Effective Game Day Tests](https://shopify.engineering/four-steps-creating-effective-game-day-tests)
* [Implementing ChatOps into our Incident Management Procedure](https://shopify.engineering/implementing-chatops-into-our-incident-management-procedure)
* [StatsD at Shopify](https://shopify.engineering/17488320-statsd-at-shopify)
#### Videos
* [Network Monitor: A Tale of ACKnowledging an Observability Gap](https://www.usenix.org/conference/srecon19emea/presentation/gedge)
* [Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures](https://www.usenix.org/conference/srecon19emea/presentation/arthorne)
* [Advanced Napkin Math: Estimating System Performance from First Principles](https://www.usenix.org/conference/srecon19emea/presentation/eskildsen)
</details>
<details>
<summary>Slack</summary>
#### Blog Posts
* [Slack's Outage on January 4th 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/)
* [A Terrible, Horrible, No-Good, Very Bad Day at Slack](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
* [Deploys at Slack](https://slack.engineering/deploys-at-slack/)
* [Disasterpiece Theater: Slack's process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering/)
#### Videos
* [Slack at the Edge](https://www.usenix.org/conference/srecon19asia/presentation/pemberton)
* [What Breaks Our Systems: A Taxonomy of Black Swans](https://www.usenix.org/conference/srecon19americas/presentation/nolan-taxonomy)
</details>
<details>
<summary>SoundCloud</summary>
## SoundCloud
#### Blog Posts
* [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
* [Hands-Off Deployment with Canary](https://developers.soundcloud.com/blog/hands-off-deployment-with-canary)
* [Prometheus has come of age - a reflection on the development of an open-source project](https://developers.soundcloud.com/blog/prometheus-has-come-of-age-a-reflection-on-the-development-of-an-open-source-project)
* [Prometheus: Monitoring at SoundCloud](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud)
</details>
<details>
<summary>Spotify</summary>
#### Blog Posts
* [Techbytes: What The Industry Misses About Incidents and What You Can Do](https://engineering.atspotify.com/2020/02/26/techbytes-what-the-industry-misses-about-incidents-and-what-you-can-do/)
* [Automated Incident Response Infrastructure in GCP](https://engineering.atspotify.com/2019/04/04/whacking-a-million-moles-automated-incident-response-infrastructure-in-gcp/)
#### Videos
* [Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance](https://www.usenix.org/conference/srecon19americas/presentation/root)
</details>
<details>
<summary>Squarespace</summary>
#### Blog Posts
* [Under the Hood: Ensuring Site Reliability](https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability)
#### Videos
* [Pushing through Friction](https://www.usenix.org/conference/srecon19emea/presentation/na)
* [How to SRE When Everything's Already on Fire](https://www.usenix.org/conference/srecon19emea/presentation/hidalgo)
* [Case Study: Implementing SLOs for a New Service](https://www.usenix.org/conference/srecon19americas/presentation/lawson)
* [Creating a Code Review Culture](https://www.usenix.org/conference/srecon19americas/presentation/turner)
</details>
<details>
<summary>StackOverflow</summary>
#### Blog Posts
* [A deeper dive into our May 2019 security incident](https://stackoverflow.blog/2021/01/25/a-deeper-dive-into-our-may-2019-security-incident/)
* [Failing over without falling over](https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/)
#### Videos
* [Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline](https://www.usenix.org/conference/srecon20americas/presentation/limoncelli)
</details>
<details>
<summary>Stripe</summary>
#### Blog Posts
* [Fast and flexible observability with canonical log lines](https://stripe.com/blog/canonical-log-lines)
* [Introducing Veneur: high performance and global aggregation for Datadog](https://stripe.com/blog/engineering/page/3)
#### Videos
* [How Stripe Invests in Technical Infrastructure](https://www.usenix.org/conference/srecon19emea/presentation/larson)
* [The AWS Billing Machine and Optimizing Cloud Costs](https://www.usenix.org/conference/srecon19asia/presentation/lopopolo)
</details>
<details>
<summary>Target</summary>
#### Blog Posts
* [Ɔhaos Ǝnginǝǝring @ Target - Part 2](https://tech.target.com/2019/05/09/chaos-engineering-at-Target.html)
* [Ɔhaos Ǝnginǝǝring @ Target - Part 1](https://tech.target.com/2019/02/05/chaos-engineering-at-Target.html)
* [GoAlert - Your Future Open Source, On-Call Notification Product](https://tech.target.com/2019/02/25/introducing-goalert.html)
* [On Infrastructure at Scale: A Cascading Failure of Distributed Systems](https://tech.target.com/2019/01/14/cascading-failure-of-distributed-systems.html)
* [Distributed Troubleshooting](https://tech.target.com/2017/04/05/distributed-troubleshooting.html)
* [Outage Resolution Through Automation](https://tech.target.com/2014/12/29/outage-resolution-through-automation.html)
</details>
<details>
<summary>Trivago</summary>
## Trivago
#### Blog Posts
* [How To Get Fooled By Metrics](https://tech.trivago.com/2020/12/04/how-to-get-fooled-by-metrics/)
</details>
<details>
<summary>Uber</summary>
#### Blog Posts
* [Disaster Recovery for Multi-Region Kafka at Uber](https://eng.uber.com/kafka/)
* [Engineering Failover Handling in Uber's Mobile Networking Infrastructure](https://eng.uber.com/eng-failover-handling/)
* [Optimizing Observability with Jaeger, M3, and XYS at Uber](https://eng.uber.com/optimizing-observability/)
#### Videos
* [A Tale of Two Rotations: Building a Humane & Effective On-Call](https://www.usenix.org/conference/srecon19emea/presentation/lee)
* [Testing in Production at Scale](https://www.usenix.org/conference/srecon19americas/presentation/gud)
* [A History of SRE at Uber with Rick Boone of Uber](https://www.youtube.com/watch?v=qJnS-EfIIIE)
</details>
<details>
<summary>Wikimedia Foundation</summary>
#### Videos
* [Testing Encyclopedias in Production](https://www.usenix.org/conference/srecon20americas/presentation/mouzeli)
* [What Happens When You Type en.wikipedia.org?](https://www.usenix.org/conference/srecon19emea/presentation/mouzeli)
</details>
<details>
<summary>Zerodha</summary>
#### Blog Posts
* [Infrastructure monitoring with Prometheus at Zerodha](https://zerodha.tech/blog/infra-monitoring-at-zerodha/)
</details>
<details>
<summary>SRECon Mix Playlist</summary>
#### Videos
* [Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE](https://www.usenix.org/conference/srecon20americas/presentation/charagondla)
* [Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps](https://www.usenix.org/conference/srecon20americas/presentation/yitzhaki)
* [Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure](https://www.usenix.org/conference/srecon20americas/presentation/argent)
* [Alaska Airlines - Capacity Prediction in External Services](https://www.usenix.org/conference/srecon19americas/presentation/kraus)
* [BuzzFeed - Optimizing for Learning](https://www.usenix.org/conference/srecon19americas/presentation/mcdonald)
* [BT - Challenges of Starting an SRE Team from Scratch in an Enterprise](https://www.usenix.org/conference/srecon20americas/presentation/narvas)
* [Cloudflare - Support Operations Engineering: Scaling Developer Products to the Millions](https://www.usenix.org/conference/srecon19emea/presentation/ali)
* [Hudson River Trading - Fixing On-Call When Nobody Thinks It's (Too) Broken](https://www.usenix.org/conference/srecon19americas/presentation/lykke)
* [IBM - Why Automating Everything Adds to Your Toil](https://www.usenix.org/conference/srecon19emea/presentation/thorne)
* [Genesys - The Smallest Possible SRE Team](https://www.usenix.org/conference/srecon20americas/presentation/thomas)
* [G-Research - My Life as a Solo SRE](https://www.usenix.org/conference/srecon19emea/presentation/murphy)
* [Grafana Labs - SRE in the Third Age](https://www.usenix.org/conference/srecon19emea/presentation/rabenstein)
* [Kenna Security - Building a Scalable Monitoring System](https://www.usenix.org/conference/srecon19emea/presentation/struve)
* [Lightstep - Building Service Ownership Using Documentation, Telemetry, and a Chance to Make Things Better](https://www.usenix.org/conference/srecon20americas/presentation/spoonhower)
* [MessageBird - Autopsy of a MySQL Automation Disaster](https://www.usenix.org/conference/srecon19emea/presentation/gagne)
* [Netlify - Perks and Pitfalls of Building a Remote First Team](https://www.usenix.org/conference/srecon19emea/presentation/neal)
* [ReactiveOps - Zero to SRE](https://www.usenix.org/conference/srecon19americas/presentation/schlesinger)
* [Salesforce - Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19](https://www.usenix.org/conference/srecon20americas/presentation/collins)
* [Sprax - From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations](https://www.usenix.org/conference/srecon19emea/presentation/huxtable)
* [The New York Times - SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events](https://www.usenix.org/conference/srecon19emea/presentation/wan)
* [Twitter - Hiring Great SREs](https://www.usenix.org/conference/srecon19emea/presentation/rutkin)
* [United States Digital Service - Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value](https://www.usenix.org/conference/srecon19americas/presentation/wieczorek)
* [Unity Technologies - Being Reasonable about SRE](https://www.usenix.org/conference/srecon19emea/presentation/urbanec)
* [Udemy - How to Do SRE When You Have No SRE](https://www.usenix.org/conference/srecon19emea/presentation/ocallaghan)
* [Vanguard - Cloudy with a Chance of Chaos](https://www.usenix.org/conference/srecon20americas/presentation/yakomin)
* [WeWork - Learning from Learnings: Anatomy of Three Incidents](https://www.usenix.org/conference/srecon19americas/presentation/shoup)
* [Yelp - What I Wish I Knew before Going On-Call](https://www.usenix.org/conference/srecon19emea/presentation/shu)
* [Zendesk - Latency and Availability Error Budgets Done Right at Scale](https://www.usenix.org/conference/srecon20americas/presentation/moyer)
</details>
---
## Resources
### Books
* [97 Things Every SRE Should Know](https://www.oreilly.com/library/view/97-things-every/9781492081487/)
* [SLO Adoption and Usage in Site Reliability Engineering](https://www.oreilly.com/library/view/slo-adoption-and/9781492075370/)
* [Practical Site Reliability Engineering](https://www.oreilly.com/library/view/practical-site-reliability/9781788839563/)
* [Implementing Service Level Objectives](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/)
* [Chaos Engineering](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)
* [Seeking SRE](https://www.oreilly.com/library/view/seeking-sre/9781491978856/)
* [Security Chaos Engineering](https://www.oreilly.com/library/view/security-chaos-engineering/9781492080350/)
* [Chaos Engineering Observability](https://www.oreilly.com/library/view/chaos-engineering-observability/9781492051046/)
* [Training Site Reliability Engineers](https://www.oreilly.com/library/view/training-site-reliability/9781492076018/)
* [Database Reliability Engineering](https://www.oreilly.com/library/view/database-reliability-engineering/9781491925935/)
* [What Is SRE?](https://www.oreilly.com/library/view/what-is-sre/9781492054429/)
* [Database Reliability Engineering: What, Why, and How?](https://www.oreilly.com/library/view/database-reliability-engineering/9781492030942/)
* [Observability Engineering](https://www.oreilly.com/library/view/observability-engineering/9781492076438/)
### Events
* [SRECon Past Events](https://www.usenix.org/srecon#past)
* [ChaosConf](https://www.chaosconf.io/)
### Others
* [Awesome SRE](https://github.com/dastergon/awesome-sre)
* [Awesome Site Reliability Engineering Tools](https://github.com/SquadcastHub/awesome-sre-tools)
* [Google SRE Page](https://sre.google/)
* [Microsoft SRE Page](https://docs.microsoft.com/en-us/azure/site-reliability-engineering/)
* [SRE Weekly Newsletter](https://sreweekly.com/)
* [Chaos Engineering Newsletter](https://chaosengineering.news/)
* [DevOps Weekly Newsletter](http://devopsweekly.com)
## Credits
* Banner image [Cartoon vector created by vectorjuice - www.freepik.com](https://www.freepik.com/vectors/cartoon)
## Contribute
Contributions welcome! Read the [contribution guidelines](contributing.md) first.
## License
[![CC0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](https://creativecommons.org/publicdomain/zero/1.0)
To the extent possible under law, Unmesh Gundecha has waived all copyright and
related or neighboring rights to this work.

1
_config.yml Normal file

@@ -0,0 +1 @@
markdown: CommonMarkGhPages

BIN
banner.png Normal file

Binary file not shown.


74
code-of-conduct.md Normal file

@@ -0,0 +1,74 @@
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, gender identity and expression, level of experience,
nationality, personal appearance, race, religion, or sexual identity and
orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at upgundecha@gmail.com. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at [http://contributor-covenant.org/version/1/4][version]
[homepage]: http://contributor-covenant.org
[version]: http://contributor-covenant.org/version/1/4/

26
contributing.md Normal file

@@ -0,0 +1,26 @@
# Contribution Guidelines
Please note that this project is released with a
[Contributor Code of Conduct](code-of-conduct.md). By participating in this
project you agree to abide by its terms.
---
Please feel free to fork the repo and add blog posts, videos, incident reports, or any other useful resources.
You may find the Markdown Link Generator extension for Google Chrome useful for generating links in Markdown inline link format. It helps you create links to resources with minimal steps and context switching.
Ensure your pull request adheres to the following guidelines:
- Make sure the high-level list is alphabetically ordered.
Thank you for your suggestions!
## Updating your PR
A lot of times, making a PR adhere to the standards above can be difficult.
If the maintainers notice anything that we'd like changed, we'll ask you to
edit your PR before we merge it. There's no need to open a new PR, just edit
the existing one. If you're not sure how to do that,
[here is a guide](https://github.com/RichardLitt/knowledge/blob/master/github/amending-a-commit-guide.md)
on the different ways you can update your PR so that we can merge it.

27
index.html Normal file

@@ -0,0 +1,27 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta
name="viewport"
content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0"
/>
<title>awesome-engineering</title>
<!-- Stylesheet -->
<link
rel="stylesheet"
href="https://unpkg.com/@egoist/docup@1/dist/docup.min.css"
/>
</head>
<body>
<!-- Script -->
<script src="https://unpkg.com/@egoist/docup@1/dist/docup.min.js"></script>
<!-- Start app -->
<script>
docup.init({
// ..options
})
</script>
</body>
</html>