mirror of
https://github.com/upgundecha/howtheysre
synced 2026-01-03 15:58:03 +00:00
Fix markdownlint errors
267
README.md
@@ -1,12 +1,14 @@
# How they SRE
> A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
## Introduction
__How They SRE__ is a curated knowledge repository of best practices, tools, techniques, and culture of SRE adopted by the leading technology or tech-savvy organizations.
__How They SRE__ is a curated knowledge repository of best practices, tools, techniques, and culture of SRE adopted by the leading technology or tech-savvy organizations.
Many organizations regularly come forward and share their best practices, tools, techniques and offer an insight into engineering culture on various public platforms like engineering blogs, conferences & meetups. The content is curated from these avenues and shared in this repository.
@@ -21,7 +23,7 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
* Monitoring & Observability
* Alerting
* Incident Response & Post-Mortem
* On-Call
* On-Call
* Testing in Production
* Chaos Engineering
* Automation
@@ -32,7 +34,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Airbnb</summary>
#### Blog Posts
### Blog Posts
* [Detecting Vulnerabilities With Vulnture](https://medium.com/airbnb-engineering/detecting-vulnerabilities-with-vulnture-f5f23387f6ec)
* [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
* [When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb](https://medium.com/airbnb-engineering/when-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)
@@ -42,7 +45,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Algolia</summary>
#### Blog Posts
### Blog Posts
* [May 30 SSL incident](https://www.algolia.com/blog/may-30-ssl-incident/)
* [A Journey Into SRE](https://www.algolia.com/blog/a-journey-into-sre/)
@@ -51,16 +55,19 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Asana</summary>
#### Blog Posts
### Blog Posts
* [How Asana ships stable web application releases](https://blog.asana.com/2021/01/asana-engineering-ships-web-application-releases/)
* [Analysis of recent downtime & what we’re doing to prevent future incidents](https://blog.asana.com/2019/09/downtime-what-were-doing-to-prevent-future-downtime/)
* [Developer environment: Achieving reliability by making it fast to reset](https://blog.asana.com/2017/07/developer-environment-making-it-reliable-by-making-it-fast-to-reset/)
</details>
<details>
<summary>ASOS</summary>
#### Blog Posts
### Blog Posts
* [Cyber Security @ ASOS.com](https://medium.com/asos-techblog/cyber-security-asos-com-7d1d1f346e57)
* [Security Operations 24x7](https://medium.com/asos-techblog/security-operations-24-x-7-2e90c8e5e7e)
* [The skills we look for in Cyber Security Incident Response](https://medium.com/asos-techblog/the-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)
@@ -70,7 +77,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Atlassian</summary>
#### Blog Posts
### Blog Posts
* [Best practices for change management in the age of DevOps](https://www.atlassian.com/engineering/best-practices-for-change-management-in-the-age-of-devops)
* [Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code](https://www.atlassian.com/engineering/automated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code)
* [How to export Kubernetes events for observability and alerting](https://www.atlassian.com/engineering/how-to-export-kubernetes-events-for-observability-and-alerting)
@@ -81,7 +89,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>BackMarket</summary>
#### Blog Posts
### Blog Posts
* [How Back Market SREs prepared for Black Friday](https://medium.com/back-market-engineering/how-back-market-sres-prepared-for-black-friday-5f017f343408)
</details>
@@ -89,7 +98,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Baidu</summary>
#### Videos
### Videos
* [Anomaly Detection on Golden Signals](https://www.usenix.org/conference/srecon19asia/presentation/chen-yu)
* [NetRadar: Monitoring the Datacenter Network](https://www.usenix.org/conference/srecon19asia/presentation/chen-yun)
@@ -98,13 +108,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Basecamp</summary>
#### Blog Posts
### Blog Posts
* [Inside a CODE RED: Network Edition](https://m.signalvnoise.com/inside-a-code-red-network-edition/)
* [Three Basecamp outages. One week. What happened?](https://m.signalvnoise.com/three-basecamp-outages-one-week-what-happened/)
* [Basecamp 2 and Basecamp 3 search outage report](https://m.signalvnoise.com/basecamp-2-and-basecamp-3-search-outage-report/)
* [Reducing Incident Escalations at Basecamp](https://m.signalvnoise.com/reducing-incident-escalations-at-basecamp/)
#### Books
### Books
* [Shape Up](https://basecamp.com/shapeup/webbook)
</details>
@@ -112,31 +124,37 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Bloomberg</summary>
#### Videos
### Videos
* [Capacity Planning and Performance Enhancement with Page Reference Sampling](https://www.usenix.org/conference/srecon20americas/presentation/chen)
* [Why SREs can't afford to NOT do Chaos Engineering](https://www.usenix.org/conference/srecon20americas/presentation/pawlikowski)
* [Tracing Real-Time Distributed Systems](https://www.usenix.org/conference/srecon19emea/presentation/yakimov)
* [The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation](https://www.usenix.org/conference/srecon19asia/presentation/sorensen)
* [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https://www.usenix.org/conference/srecon19americas/presentation/chen)
</details>
<details>
<summary>Booking.com</summary>
#### Blog Posts
### Blog Posts
* [How Reliability and Product Teams Collaborate at Booking.com](https://medium.com/booking-com-infrastructure/how-reliability-and-product-teams-collaborate-at-booking-com-f6c317cc0aeb)
* [Incidents, fixes, and the day after](https://medium.com/booking-com-infrastructure/incidents-fixes-and-the-day-after-c5d9aeae28c3)
* [Troubleshooting: A journey into the unknown](https://medium.com/booking-com-infrastructure/troubleshooting-a-journey-into-the-unknown-e31b524fa86)
#### Videos
### Videos
* [SLOs for Data-Intensive Services](https://www.usenix.org/conference/srecon19emea/presentation/fouquet)
* [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https://www.usenix.org/conference/srecon19americas/presentation/iacoboaia)
</details>
<details>
<summary>Capital One</summary>
#### Blog Posts
### Blog Posts
* [Automate AWS Infrastructure with Boto 3: AWS Health Check](https://medium.com/capital-one-tech/automate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075)
* [Active-Active Shared-Nothing Database Architecture](https://medium.com/capital-one-tech/active-active-shared-nothing-database-architecture-304957ffb89)
* [The 3 R’s of SREs: Resiliency, Recovery & Reliability](https://medium.com/capital-one-tech/the-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b)
@@ -153,11 +171,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
* [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https://medium.com/capital-one-tech/continuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d)
* [The Mon-ifesto Part 1: Metrics](https://medium.com/capital-one-tech/the-mon-ifesto-part-1-metrics-808f6c944765)
#### Major incidents & analysis reports
### Major incidents & analysis reports
* [Information on the Capital One Cyber Incident](https://www.capitalone.com/facts2019/)
* [A Case Study of the Capital One Data Breach](http://web.mit.edu/smadnick/www/wp/2020-16.pdf)
#### Videos
### Videos
* [Banking on Continuous Delivery - Capital One](https://www.youtube.com/watch?v=_DnYSQEUTfo)
* [Continuous Chaos in DevOps - Capital One](https://www.youtube.com/watch?v=U_Uh5RMCwPI)
* [DevOps at Capital One: Focusing on Pipeline and Measurement](https://www.youtube.com/watch?v=6Q0mtVnnthQ)
@@ -168,11 +188,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>DBS</summary>
#### Blog Posts
### Blog Posts
* [Site Reliability Engineering at DBS Bank](https://medium.com/dbs-tech-blog/site-reliability-engineering-at-dbs-bank-32c02228ccf4)
* [Automating Configuration Management at Scale](https://medium.com/dbs-tech-blog/automating-configuration-management-at-scale-5c7927f83df3)
#### Videos
### Videos
* [SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS](https://www.youtube.com/watch?v=URwkaRbOLxI&feature=emb_title)
</details>
@@ -180,7 +202,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>DeepSource</summary>
#### Blog Posts
### Blog Posts
* [Redis diskless replication: What, how, why and the caveats](https://deepsource.io/blog/redis-diskless-replication/)
* [How to setup Vault with Kubernetes](https://deepsource.io/blog/setup-vault-kubernetes/)
* [Breaking down zero downtime deployments in Kubernetes](https://deepsource.io/blog/zero-downtime-deployment/)
@@ -190,11 +213,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Dropbox</summary>
#### Blog Posts
### Blog Posts
* [Monitoring server applications with Vortex](https://dropbox.tech/infrastructure/monitoring-server-applications-with-vortex)
* [Athena: Our automated build health management system](https://dropbox.tech/infrastructure/athena-our-automated-build-health-management-system)
#### Videos
### Videos
* [Service Discovery Challenges at Scale](https://www.usenix.org/conference/srecon19americas/presentation/nigmatullin)
</details>
@@ -202,7 +227,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Facebook</summary>
#### Videos
### Videos
* [A Customer Service Approach to SRE](https://www.usenix.org/conference/srecon19emea/presentation/looney)
* [How (Not) to Scale a Project: A Post-Mortem](https://www.usenix.org/conference/srecon19asia/presentation/bagnoli)
* [Releasing the World's Largest Python Site Every 7 Minutes](https://www.usenix.org/conference/srecon19asia/presentation/wong-shuhong)
@@ -213,7 +239,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Fastly</summary>
#### Videos
### Videos
* [SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager](https://www.usenix.org/conference/srecon19americas/presentation/wohlner)
* [Resilience Engineering Mythbusting](https://www.usenix.org/conference/srecon19americas/presentation/gallego)
@@ -222,13 +249,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>eBay</summary>
#### Blog Posts
### Blog Posts
* [Resiliency and Disaster Recovery with Kafka](https://tech.ebayinc.com/engineering/resiliency-and-disaster-recovery-with-kafka/)
* [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https://tech.ebayinc.com/engineering/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/)
* [SRE Case Study: Mysterious Traffic Imbalance](https://tech.ebayinc.com/engineering/sre-case-study-mysterious-traffic-imbalance/)
* [Zero Downtime, Instant Deployment and Rollback](https://tech.ebayinc.com/engineering/zero-downtime-instant-deployment-and-rollback/)
### Video
* [Madaari: Ordering for the Monkeys](https://www.usenix.org/conference/srecon19americas/presentation/raina)
</details>
@@ -236,14 +265,16 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Etsy</summary>
#### Blog Posts
### Blog Posts
* [Etsy’s Debriefing Facilitation Guide for Blameless Postmortems](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)
* [Opsweekly: Measuring on-call experience with alert classification](https://codeascraft.com/2014/06/19/opsweekly-measuring-on-call-experience-with-alert-classification/)
* [Demystifying Site Outages](https://blog.etsy.com/news/2012/demystifying-site-outages/)
* [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)
* [Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/)
#### Videos
### Videos
* [Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe](https://www.youtube.com/watch?v=LdOe18KhtT4)
* [Migrating a Monolith to the Cloud](https://www.usenix.org/conference/srecon19americas/presentation/govande)
@@ -252,7 +283,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Expedia</summary>
#### Blog Posts
### Blog Posts
* [The Cost of 100% Reliability](https://medium.com/expedia-group-tech/the-cost-of-100-reliability-ecb2901f23a4)
* [Creating Monitoring Dashboards](https://medium.com/expedia-group-tech/creating-monitoring-dashboards-1f3fbe0ae1ac)
* [Using Bash for DevOps](https://medium.com/expedia-group-tech/using-bash-for-devops-7046eed1aa63)
@@ -262,7 +294,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>GitHub</summary>
#### Blog Posts
### Blog Posts
* [Deployment reliability at GitHub](https://github.blog/2021-02-03-deployment-reliability-at-github/)
* [Improving how we deploy GitHub](https://github.blog/2021-01-25-improving-how-we-deploy-github/)
* [Building On-Call Culture at GitHub](https://github.blog/2021-01-06-building-on-call-culture-at-github/)
@@ -271,7 +304,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
* [Getting started with DevOps automation](https://github.blog/2020-10-29-getting-started-with-devops-automation/)
* [MySQL High Availability at GitHub](https://github.blog/2018-06-20-mysql-high-availability-at-github/)
#### Major incidents & analysis reports
### Major incidents & analysis reports
* [GitHub Availability Report: January 2021](https://github.blog/2021-02-02-github-availability-report-january-2021/)
* [GitHub Availability Report: December 2020](https://github.blog/2021-01-06-github-availability-report-december-2020/)
* [GitHub Availability Report: November 2020](https://github.blog/2020-12-02-availability-report-november-2020/)
@@ -283,7 +317,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
* [February 28th DDoS Incident Report](https://github.blog/2018-03-01-ddos-incident-report/)
* [Incident Report: Inadvertent Private Repository Disclosure](https://github.blog/2016-10-28-incident-report-inadvertent-private-repository-disclosure/)
#### Videos
### Videos
* [One on One SRE](https://www.usenix.org/conference/srecon19americas/presentation/tobey)
</details>
@@ -291,7 +326,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>GitLab</summary>
#### Blog Posts
### Blog Posts
* [This SRE attempted to roll out an HAProxy config change. You won't believe what happened next...](https://about.gitlab.com/blog/2021/01/14/this-sre-attempted-to-roll-out-an-haproxy-change/)
* [My week shadowing a GitLab Site Reliability Engineer](https://about.gitlab.com/blog/2019/12/16/sre-shadow/)
* [Update: Elasticsearch lessons learnt for Advanced Global Search](https://about.gitlab.com/blog/2020/04/28/elasticsearch-update/)
@@ -307,7 +343,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>GoCardless</summary>
#### Blog Posts
### Blog Posts
* [Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial](https://medium.com/gocardless-tech/deploying-software-at-gocardless-open-sourcing-our-getting-started-tutorial-ab857aa91c9e)
* [How we compress Pub/Sub messages and more, saving a load of money](https://medium.com/gocardless-tech/how-we-compress-pub-sub-messages-and-more-saving-a-load-of-money-694b64c3458a)
* [Fear-free PostgreSQL migrations for Rails](https://gocardless.com/blog/fear-free-postgresql-migrations-for-rails/)
@@ -316,7 +353,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
* [Zero-downtime Postgres migrations - the hard parts](https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/)
* [In search of performance - how we shaved 200ms off every POST request](https://gocardless.com/blog/in-search-of-performance-how-we-shaved-200ms-off-every-post-request/)
#### Major incidents & analysis reports
### Major incidents & analysis reports
* [Incident review: Service outage on 25 October 2020, Vault TLS expiry](https://gocardless.com/blog/incident-review-service-outage-on-25-october-2020/)
* [Incident review: API and Dashboard outage on 10 October 2017](https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/)
@@ -325,18 +363,21 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Google</summary>
#### Blog Posts
### Blog Posts
* [SRE Practices & Processes](https://sre.google/resources/#practicesandprocesses)
* [Three months, 30x demand: How we scaled Google Meet during COVID-19](https://cloud.google.com/blog/products/g-suite/keeping-google-meet-ahead-of-usage-demand-during-covid-19)
* [SRE Classroom: Distributed PubSub](https://sre.google/resources/practices-and-processes/distributed-pubsub/)
#### Books
### Books
* [Building Secure & Reliable Systems](https://static.googleusercontent.com/media/sre.google/en//static/pdf/building_secure_and_reliable_systems.pdf)
* [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/)
* [The Site Reliability Workbook](https://sre.google/workbook/table-of-contents/)
* [Training Site Reliability Engineers](https://static.googleusercontent.com/media/sre.google/en//static/pdf/training-sre.pdf)
#### Videos
### Videos
* [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/uTEL8Ff1Zvk)
* [Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/y2ILKr8kCJU)
* [Pragmatic Automation’ with Max Luebbe of GCP](https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s)
@@ -370,7 +411,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Gojek</summary>
#### Blog Posts
### Blog Posts
* [Why We Swear by the RCA](https://blog.gojekengineering.com/why-we-swear-by-the-rca-f535fd5abbcb)
</details>
@@ -378,7 +420,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Grab</summary>
#### Blog Posts
### Blog Posts
* [Our Journey to Continuous Delivery at Grab (Part 1)](https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab)
* [Designing Resilient Systems: Circuit Breakers or Retries? (Part 1)](https://engineering.grab.com/designing-resilient-systems-part-1)
* [Designing Resilient Systems: Circuit Breakers or Retries? (Part 2)](https://engineering.grab.com/designing-resilient-systems-part-2)
@@ -392,7 +435,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Grammarly</summary>
#### Blog Posts
### Blog Posts
* [Security Operations in an AWS Environment](https://www.grammarly.com/blog/engineering/security-infrastructure-aws/)
</details>
@@ -400,7 +444,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Heroku</summary>
#### Blog Posts
### Blog Posts
* [Incident Response at Heroku](https://blog.heroku.com/incident-response-at-heroku-2020)
</details>
@@ -408,12 +453,14 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Indeed</summary>
#### Blog Posts
### Blog Posts
* [Being Just Reliable Enough](https://engineering.indeedblog.com/blog/2019/10/being-just-reliable-enough/)
* [Automating Indeed’s Release Process](https://engineering.indeedblog.com/blog/2017/03/automating-release-process/)
* [Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com](https://www.usenix.org/conference/srecon17americas/program/presentation/appan)
#### Videos
### Videos
* [Are We Getting Better Yet? Progress Toward Safer Operations](https://www.usenix.org/conference/srecon20americas/presentation/elman)
</details>
@@ -421,7 +468,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Khan Academy</summary>
#### Blog Posts
### Blog Posts
* [How Khan Academy Successfully Handled 2.5x Traffic in a Week](https://blog.khanacademy.org/how-khan-academy-successfully-handled-2-5x-traffic-in-a-week/)
* [Evolving our content infrastructure](https://blog.khanacademy.org/evolving-our-content-infrastructure/)
@@ -430,7 +478,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>LinkedIn</summary>
#### Blog Posts
### Blog Posts
* [Insights into a Product SRE team at LinkedIn](https://www.linkedin.com/pulse/insights-product-sre-team-linkedin-zaina-afoulki/?trackingId=mxKJgZ3kp8l2WI9D4UZv7Q%3D%3D)
* [Open source update: School of SRE](https://engineering.linkedin.com/blog/2021/open-source-update--school-of-sre)
* [Fixing Linux filesystem performance regressions](https://engineering.linkedin.com/blog/2020/fixing-linux-filesystem-performance-regressions)
@@ -452,7 +501,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
* [What Gets Measured Gets Fixed](https://engineering.linkedin.com/blog/2016/12/what-gets-measured-gets-fixed)
* [Hiring SREs at LinkedIn](https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin)
#### Videos
### Videos
* [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https://www.youtube.com/watch?v=ZemNg9GYvOA)
* [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https://www.usenix.org/conference/srecon20americas/presentation/doherty)
* [Weathering the Storm: How Early Warnings Save the Farm](https://www.usenix.org/conference/srecon19emea/presentation/sherwin)
@@ -472,17 +522,19 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Mercari</summary>
## Mercari
#### Blog Posts
### Blog Posts
* [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https://engineering.mercari.com/en/blog/entry/20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry/)
* [How do we share troubleshooting skills](https://engineering.mercari.com/en/blog/entry/2020-01-28-143339/)
* [Datadog Dashboard at Scale w / Terraform](https://engineering.mercari.com/en/blog/entry/2019-12-09-122134/)
</details>
<details>
<summary>Microsoft</summary>
#### Videos
### Videos
* [SLI & Reliability Deep-Dive’ with David N. Blank-Edelman of Microsoft](https://www.youtube.com/watch?v=1iMo3SkdQqQ)
* [Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft](https://www.youtube.com/watch?v=U3ubcoNzx9k)
* [Sustainable Software Engineering & SREs](https://www.usenix.org/conference/srecon20americas/presentation/johnson)
@@ -493,13 +545,14 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
* [Availability—Thinking beyond 9s](https://www.usenix.org/conference/srecon19asia/presentation/srinivasamurthy)
* [Ironies of Automation: A Comedy in Three Parts](https://www.usenix.org/conference/srecon19asia/presentation/lund-comedy)
* [The Ops in Serverless](https://www.usenix.org/conference/srecon19americas/presentation/davis)
</details>
<details>
<summary>MIRO</summary>
## MIRO
#### Blog Posts
### Blog Posts
* [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e)
* [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture](https://medium.com/miro-engineering/managing-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699)
* [Reliable load testing with regards to unexpected nuances](https://medium.com/miro-engineering/reliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)
@@ -509,13 +562,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Monzo</summary>
#### Blog Posts
### Blog Posts
* [Autoscaling Monzo: How we optimise our platform to be just the right size](https://monzo.com/blog/2020/10/19/autoscaling-monzo)
* [How we’ve evolved on-call at Monzo](https://monzo.com/blog/how-weve-evolved-on-call-at-monzo)
* [How we respond to incidents](https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents)
* [How we monitor Monzo](https://monzo.com/blog/2018/07/27/how-we-monitor-monzo)
#### Videos
### Videos
* [Eventually Consistent Service Discovery](https://www.usenix.org/conference/srecon19emea/presentation/patel)
</details>
@@ -523,7 +578,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Netflix</summary>
#### Blog Posts
### Blog Posts
* [Building Netflix’s Distributed Tracing Infrastructure](https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304)
* [Lessons from Building Observability Tools at Netflix](https://netflixtechblog.com/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17)
* [Edgar: Solving Mysteries Faster with Observability](https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f)
@@ -542,10 +598,12 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
* [Announcing Security Monkey — AWS Security Configuration Monitoring and Analysis](https://netflixtechblog.com/announcing-security-monkey-aws-security-configuration-monitoring-and-analysis-1f2bfb001708)
* [Lessons Netflix Learned from the AWS Outage](https://netflixtechblog.com/lessons-netflix-learned-from-the-aws-outage-deefe5fd0c04)
#### Major incidents & analysis reports
### Major incidents & analysis reports
* [Post-mortem of October 22, 2012 AWS degradation](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
#### Videos
### Videos
* [AWS re:Invent 2019: A day in the life of a Netflix engineer (NFX202)](https://www.youtube.com/watch?v=0QS1TWLooo0)
* [When /bin/sh Attacks: Revisiting "Automate All the Things"](https://www.usenix.org/conference/srecon20americas/presentation/reed)
* [How Did Things Go Right? Learning More from Incidents](https://www.usenix.org/conference/srecon19americas/presentation/kitchens)
@@ -571,7 +629,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>PayPal</summary>
#### Videos
### Videos
* [SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal](https://www.youtube.com/watch?v=XAIj567wBsU&feature=emb_title)
* [SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal](https://www.usenix.org/conference/srecon19asia/presentation/sunder-vr)
* [Detecting Service Degradation and Failures at Scale through Distributed Log Processing](https://www.usenix.org/conference/srecon19asia/presentation/narayanan)
@@ -583,13 +642,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
<details>
<summary>Pinterest</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Simplifying web deploys](https://medium.com/pinterest-engineering/simplifying-web-deploys-19244fe13737)
|
||||
* [Upgrading Pinterest operational metrics](https://medium.com/pinterest-engineering/upgrading-pinterest-operational-metrics-8718d058079a)
|
||||
* [Distributed tracing at Pinterest with new open source tools](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)
|
||||
* [Auto scaling Pinterest](https://medium.com/pinterest-engineering/auto-scaling-pinterest-df1d2beb4d64)
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [Building Actionable Code Ownership](https://www.usenix.org/conference/srecon20americas/presentation/mukherji)
|
||||
* [Evolution of Observability Tools at Pinterest](https://www.usenix.org/conference/srecon19emea/presentation/abbas)
|
||||
* [Automating OS/Platform Upgrades for Service Owners](https://www.usenix.org/conference/srecon19asia/presentation/menezes)
|
||||
@@ -599,7 +660,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Postman</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana](https://medium.com/better-practices/chaos-d3ef238ec328)
|
||||
|
||||
</details>
|
||||
@@ -607,7 +669,7 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Slalom Build</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Beginners Guide to DevOps: How to Make It into the Industry](https://medium.com/slalom-build/beginners-guid-to-devops-how-to-make-it-into-the-industry-c1652d59807)
|
||||
* [GitHub Actions: Beyond CI/CD](https://medium.com/slalom-build/github-actions-beyond-ci-cd-cb3ddc6abaa)
|
||||
@@ -626,7 +688,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Scribd</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Learning from incidents: getting Sidekiq ready to serve a billion jobs](https://tech.scribd.com/blog/2020/sidekiq-incident-learnings.html)
|
||||
* [A testimonial for using PagerDuty at Scribd](https://tech.scribd.com/blog/2020/pagerduty-at-scribd.html)
|
||||
* [Assigning pager duty to developers](https://tech.scribd.com/blog/2019/managing-pagerduty-rotations.html)
|
||||
@@ -636,7 +699,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Shopify</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events)
|
||||
* [Capacity Planning at Scale](https://shopify.engineering/capacity-planning-shopify)
|
||||
* [Using DNS Traffic Management to Add Resiliency to Shopify’s Services](https://shopify.engineering/using-dns-traffic-management-add-resiliency-shopify-services)
|
||||
@@ -644,7 +708,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
* [Implementing ChatOps into our Incident Management Procedure](https://shopify.engineering/implementing-chatops-into-our-incident-management-procedure)
|
||||
* [StatsD at Shopify](https://shopify.engineering/17488320-statsd-at-shopify)
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [Network Monitor: A Tale of ACKnowledging an Observability Gap](https://www.usenix.org/conference/srecon19emea/presentation/gedge)
|
||||
* [Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures](https://www.usenix.org/conference/srecon19emea/presentation/arthorne)
|
||||
* [Advanced Napkin Math: Estimating System Performance from First Principles](https://www.usenix.org/conference/srecon19emea/presentation/eskildsen)
|
||||
@@ -654,12 +719,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Slack</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Slack’s Outage on January 4th 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/)
|
||||
* [A Terrible, Horrible, No-Good, Very Bad Day at Slack](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/)
|
||||
* [Deploys at Slack](https://slack.engineering/deploys-at-slack/)
|
||||
* [Disasterpiece Theater: Slack’s process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering/)
|
||||
#### Videos
|
||||
|
||||
### Videos
|
||||
|
||||
* [Slack at the Edge](https://www.usenix.org/conference/srecon19asia/presentation/pemberton)
|
||||
* [What Breaks Our Systems: A Taxonomy of Black Swans](https://www.usenix.org/conference/srecon19americas/presentation/nolan-taxonomy)
|
||||
|
||||
@@ -668,22 +736,25 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Soundcloud</summary>
|
||||
|
||||
## Soundcloud
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
|
||||
* [Hands-Off Deployment with Canary](https://developers.soundcloud.com/blog/hands-off-deployment-with-canary)
|
||||
* [Prometheus has come of age – a reflection on the development of an open-source project](https://developers.soundcloud.com/blog/prometheus-has-come-of-age-a-reflection-on-the-development-of-an-open-source-project)
|
||||
* [Prometheus: Monitoring at SoundCloud](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud)
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Spotify</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Techbytes: What The Industry Misses About Incidents and What You Can Do](https://engineering.atspotify.com/2020/02/26/techbytes-what-the-industry-misses-about-incidents-and-what-you-can-do/)
|
||||
* [Automated Incident Response Infrastructure in GCP](https://engineering.atspotify.com/2019/04/04/whacking-a-million-moles-automated-incident-response-infrastructure-in-gcp/)
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance](https://www.usenix.org/conference/srecon19americas/presentation/root)
|
||||
|
||||
</details>
|
||||
@@ -691,10 +762,12 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Squarespace</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Under the Hood: Ensuring Site Reliability](https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability)
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [Pushing through Friction](https://www.usenix.org/conference/srecon19emea/presentation/na)
|
||||
* [How to SRE When Everything's Already on Fire](https://www.usenix.org/conference/srecon19emea/presentation/hidalgo)
|
||||
* [Case Study: Implementing SLOs for a New Service](https://www.usenix.org/conference/srecon19americas/presentation/lawson)
|
||||
@@ -705,11 +778,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Stack Overflow</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [A deeper dive into our May 2019 security incident](https://stackoverflow.blog/2021/01/25/a-deeper-dive-into-our-may-2019-security-incident/)
|
||||
* [Guest Post - Failing over without falling over](https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/)
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline](https://www.usenix.org/conference/srecon20americas/presentation/limoncelli)
|
||||
|
||||
</details>
|
||||
@@ -717,11 +792,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Stripe</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Fast and flexible observability with canonical log lines](https://stripe.com/blog/canonical-log-lines)
|
||||
* [Introducing Veneur: high performance and global aggregation for Datadog](https://stripe.com/blog/engineering/page/3)
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [How Stripe Invests in Technical Infrastructure](https://www.usenix.org/conference/srecon19emea/presentation/larson)
|
||||
* [The AWS Billing Machine and Optimizing Cloud Costs](https://www.usenix.org/conference/srecon19asia/presentation/lopopolo)
|
||||
|
||||
@@ -730,7 +807,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Target</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Ɔhaos Ǝnginǝǝring @ Target - Part 2](https://tech.target.com/2019/05/09/chaos-engineering-at-Target.html)
|
||||
* [Ɔhaos Ǝnginǝǝring @ Target - Part 1](https://tech.target.com/2019/02/05/chaos-engineering-at-Target.html)
|
||||
* [GoAlert - Your Future Open Source, On-Call Notification Product](https://tech.target.com/2019/02/25/introducing-goalert.html)
|
||||
@@ -743,7 +821,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Trivago</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [How To Get Fooled By Metrics](https://tech.trivago.com/2020/12/04/how-to-get-fooled-by-metrics/)
|
||||
|
||||
</details>
|
||||
@@ -751,34 +830,38 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Uber</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Disaster Recovery for Multi-Region Kafka at Uber](https://eng.uber.com/kafka/)
|
||||
* [Engineering Failover Handling in Uber’s Mobile Networking Infrastructure](https://eng.uber.com/eng-failover-handling/)
|
||||
* [Optimizing Observability with Jaeger, M3, and XYS at Uber](https://eng.uber.com/optimizing-observability/)
|
||||
|
||||
### Videos
|
||||
|
||||
#### Videos
|
||||
* [A Tale of Two Rotations: Building a Humane & Effective On-Call](https://www.usenix.org/conference/srecon19emea/presentation/lee)
|
||||
* [Testing in Production at Scale](https://www.usenix.org/conference/srecon19americas/presentation/gud)
|
||||
* [A History of SRE at Uber’ with Rick Boone of Uber](https://www.youtube.com/watch?v=qJnS-EfIIIE)
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>VGW</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [The SRE Incident Response game](https://medium.com/@bruce_25864/the-sre-incident-response-game-db242fff391c)
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [Level Up Your Incident Response With Gameplay](https://youtu.be/c2-52EP8_7c)
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Wikimedia Foundation</summary>
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [Testing Encyclopedias in Production](https://www.usenix.org/conference/srecon20americas/presentation/mouzeli)
|
||||
* [What Happens When You Type en.wikipedia.org?](https://www.usenix.org/conference/srecon19emea/presentation/mouzeli)
|
||||
|
||||
@@ -787,7 +870,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>Zerodha</summary>
|
||||
|
||||
#### Blog Posts
|
||||
### Blog Posts
|
||||
|
||||
* [Infrastructure monitoring with Prometheus at Zerodha](https://zerodha.tech/blog/infra-monitoring-at-zerodha/)
|
||||
|
||||
</details>
|
||||
@@ -795,7 +879,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
<details>
|
||||
<summary>SRECon Mix Playlist</summary>
|
||||
|
||||
#### Videos
|
||||
### Videos
|
||||
|
||||
* [Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE](https://www.usenix.org/conference/srecon20americas/presentation/charagondla)
|
||||
* [Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps](https://www.usenix.org/conference/srecon20americas/presentation/yitzhaki)
|
||||
* [Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure](https://www.usenix.org/conference/srecon20americas/presentation/argent)
|
||||
@@ -824,6 +909,7 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
* [WeWork - Learning from Learnings: Anatomy of Three Incidents](https://www.usenix.org/conference/srecon19americas/presentation/shoup)
|
||||
* [Yelp - What I Wish I Knew before Going On-Call](https://www.usenix.org/conference/srecon19emea/presentation/shu)
|
||||
* [Zendesk - Latency and Availability Error Budgets Done Right at Scale](https://www.usenix.org/conference/srecon20americas/presentation/moyer)
|
||||
|
||||
</details>
|
||||
|
||||
---
|
||||
@@ -875,10 +961,9 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
|
||||
|
||||
## Other How They... repos
|
||||
|
||||
* [HowTheyTest](https://github.com/abhivaikar/howtheytest)
|
||||
* [HowTheyDevOps](https://github.com/bregman-arie/howtheydevops)
|
||||
* [HowTheyAWS](https://github.com/upgundecha/howtheyaws)
|
||||
|
||||
* [Howtheytest](https://github.com/abhivaikar/howtheytest)
|
||||
* [Howtheydevops](https://github.com/bregman-arie/howtheydevops)
|
||||
* [Howtheyaws](https://github.com/upgundecha/howtheyaws)
|
||||
|
||||
## Contribute
|
||||
|
||||
@@ -893,4 +978,4 @@ related or neighboring rights to this work.
|
||||
|
||||
---
|
||||
|
||||
If you decide to use this anywhere please give a credit to [@upgundecha](https://www.twitter.com/upgundecha) on twitter, also If you like my work, check out other projects on my Github.
|
||||
If you decide to use this anywhere please give a credit to [@upgundecha](https://www.twitter.com/upgundecha) on twitter, also If you like my work, check out other projects on my Github.
|
||||
|
||||