From c7b301eb40a754628e3b442acf30444cb4b341f0 Mon Sep 17 00:00:00 2001
From: Unmesh Gundecha
Date: Sun, 21 Feb 2021 00:05:37 +0800
Subject: [PATCH] Fix markdownlint errors
---
 .github/FUNDING.yml |   2 -
 .markdownlint.json  |   3 +-
 README.md           | 267 +++++++++++++++++++++++++++++---------------
 contributing.md     |   1 -
 4 files changed, 178 insertions(+), 95 deletions(-)
 delete mode 100644 .github/FUNDING.yml
diff --git a/.github/FUNDING.yml b/.github/FUNDING.yml
deleted file mode 100644
index 7dfebd1..0000000
--- a/.github/FUNDING.yml
+++ /dev/null
@@ -1,2 +0,0 @@
-# These are supported funding model platforms
-custom: https://www.buymeacoffee.com/upgundecha
diff --git a/.markdownlint.json b/.markdownlint.json
index 454e8b5..416488d 100644
--- a/.markdownlint.json
+++ b/.markdownlint.json
@@ -1,5 +1,6 @@
 {
   "default": true,
   "line-length": false,
-  "no-duplicate-header": false
+  "no-duplicate-header": false,
+  "no-inline-html": false
 }
diff --git a/README.md b/README.md
index 7d03f2f..e3e8713 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,14 @@
 # How they SRE
+[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square) ![Check Markdown links](https://github.com/upgundecha/howtheysre/workflows/Check%20Markdown%20links/badge.svg)
+
 ![Alt](banner.png "banner")
 > A curated collection of publicly available resources on how technology and tech-savvy organizations around the world practice Site Reliability Engineering (SRE)
 ## Introduction
-__How They SRE__ is a curated knowledge repository of best practices, tools, techniques, and culture of SRE adopted by the leading technology or tech-savvy organizations. 
+__How They SRE__ is a curated knowledge repository of best practices, tools, techniques, and culture of SRE adopted by the leading technology or tech-savvy organizations.
 Many organizations regularly come forward and share their best practices, tools, techniques and offer an insight into engineering culture on various public platforms like engineering blogs, conferences & meetups. The content is curated from these avenues and shared in this repository.
@@ -21,7 +23,7 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * Monitoring & Observability
 * Alerting
 * Incident Response & Post-Mortem
-* On-Call 
+* On-Call
 * Testing in Production
 * Chaos Engineering
 * Automation
@@ -32,7 +34,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Airbnb
-#### Blog Posts
+### Blog Posts
+
 * [Detecting Vulnerabilities With Vulnture](https://medium.com/airbnb-engineering/detecting-vulnerabilities-with-vulnture-f5f23387f6ec)
 * [Alerting Framework at Airbnb](https://medium.com/airbnb-engineering/alerting-framework-at-airbnb-35ba48df894f)
 * [When The Cloud Gets Dark — How Amazon’s Outage Affected Airbnb](https://medium.com/airbnb-engineering/when-the-cloud-gets-dark-how-amazons-outage-affected-airbnb-66eaf8c0f162)
@@ -42,7 +45,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Algolia
-#### Blog Posts
+### Blog Posts
+
 * [May 30 SSL incident](https://www.algolia.com/blog/may-30-ssl-incident/)
 * [A Journey Into SRE](https://www.algolia.com/blog/a-journey-into-sre/)
@@ -51,16 +55,19 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Asana
-#### Blog Posts
+### Blog Posts
+
 * [How Asana ships stable web application releases](https://blog.asana.com/2021/01/asana-engineering-ships-web-application-releases/)
 * [Analysis of recent downtime & what we’re doing to prevent future incidents](https://blog.asana.com/2019/09/downtime-what-were-doing-to-prevent-future-downtime/)
 * [Developer environment: Achieving reliability by making it fast to reset](https://blog.asana.com/2017/07/developer-environment-making-it-reliable-by-making-it-fast-to-reset/)
+
 ASOS
-#### Blog Posts
+### Blog Posts
+
 * [Cyber Security @ ASOS.com](https://medium.com/asos-techblog/cyber-security-asos-com-7d1d1f346e57)
 * [Security Operations 24x7](https://medium.com/asos-techblog/security-operations-24-x-7-2e90c8e5e7e)
 * [The skills we look for in Cyber Security Incident Response](https://medium.com/asos-techblog/the-skills-we-look-for-in-cyber-security-incident-response-12b327927e38)
@@ -70,7 +77,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Atlassian
-#### Blog Posts
+### Blog Posts
+
 * [Best practices for change management in the age of DevOps](https://www.atlassian.com/engineering/best-practices-for-change-management-in-the-age-of-devops)
 * [Automated testing: 5 lessons from Atlassian’s Kubernetes team on testing infrastructure as code](https://www.atlassian.com/engineering/automated-testing-5-lessons-from-atlassians-kubernetes-team-on-testing-infrastructure-as-code)
 * [How to export Kubernetes events for observability and alerting](https://www.atlassian.com/engineering/how-to-export-kubernetes-events-for-observability-and-alerting)
@@ -81,7 +89,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 BackMarket
-#### Blog Posts
+### Blog Posts
+
 * [How Back Market SREs prepared for Black Friday](https://medium.com/back-market-engineering/how-back-market-sres-prepared-for-black-friday-5f017f343408)
@@ -89,7 +98,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Baidu
-#### Videos
+### Videos
+
 * [Anomaly Detection on Golden Signals](https://www.usenix.org/conference/srecon19asia/presentation/chen-yu)
 * [NetRadar: Monitoring the Datacenter Network](https://www.usenix.org/conference/srecon19asia/presentation/chen-yun)
@@ -98,13 +108,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Basecamp
-#### Blog Posts
+### Blog Posts
+
 * [Inside a CODE RED: Network Edition](https://m.signalvnoise.com/inside-a-code-red-network-edition/)
 * [Three Basecamp outages. One week. What happened?](https://m.signalvnoise.com/three-basecamp-outages-one-week-what-happened/)
 * [Basecamp 2 and Basecamp 3 search outage report](https://m.signalvnoise.com/basecamp-2-and-basecamp-3-search-outage-report/)
 * [Reducing Incident Escalations at Basecamp](https://m.signalvnoise.com/reducing-incident-escalations-at-basecamp/)
-#### Books
+### Books
+
 * [Shape Up](https://basecamp.com/shapeup/webbook)
@@ -112,31 +124,37 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Bloomberg
-#### Videos
+### Videos
+
 * [Capacity Planning and Performance Enhancement with Page Reference Sampling](https://www.usenix.org/conference/srecon20americas/presentation/chen)
 * [Why SREs can't afford to NOT do Chaos Engineering](https://www.usenix.org/conference/srecon20americas/presentation/pawlikowski)
 * [Tracing Real-Time Distributed Systems](https://www.usenix.org/conference/srecon19emea/presentation/yakimov)
 * [The Bloomberg Story: Building SRE Teams in an "Immeasurable" Organisation](https://www.usenix.org/conference/srecon19asia/presentation/sorensen)
 * [Visibility into Loggers (and Other Low Level Services)—Seeing the Trees from the Forest](https://www.usenix.org/conference/srecon19americas/presentation/chen)
+
 Booking.com
-#### Blog Posts
+### Blog Posts
+
 * [How Reliability and Product Teams Collaborate at Booking.com](https://medium.com/booking-com-infrastructure/how-reliability-and-product-teams-collaborate-at-booking-com-f6c317cc0aeb)
 * [Incidents, fixes, and the day after](https://medium.com/booking-com-infrastructure/incidents-fixes-and-the-day-after-c5d9aeae28c3)
 * [Troubleshooting: A journey into the unknown](https://medium.com/booking-com-infrastructure/troubleshooting-a-journey-into-the-unknown-e31b524fa86)
-#### Videos
+### Videos
+
 * [SLOs for Data-Intensive Services](https://www.usenix.org/conference/srecon19emea/presentation/fouquet)
 * [Benefits of Taking the Less Traveled Road with Containers Infrastructure](https://www.usenix.org/conference/srecon19americas/presentation/iacoboaia)
+
 Capital One
-#### Blog Posts
+### Blog Posts
+
 * [Automate AWS Infrastructure with Boto 3: AWS Health Check](https://medium.com/capital-one-tech/automate-aws-infrastructure-with-boto-3-aws-health-checks-e51338ba075)
 * [Active-Active Shared-Nothing Database Architecture](https://medium.com/capital-one-tech/active-active-shared-nothing-database-architecture-304957ffb89)
 * [The 3 R’s of SREs: Resiliency, Recovery & Reliability](https://medium.com/capital-one-tech/the-3-rs-of-sres-resiliency-recovery-reliability-5f2f5360a91b)
@@ -153,11 +171,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * [Continuous Chaos — Introducing Chaos Engineering into DevOps Practices](https://medium.com/capital-one-tech/continuous-chaos-introducing-chaos-engineering-into-devops-practices-75757e1cca6d)
 * [The Mon-ifesto Part 1: Metrics](https://medium.com/capital-one-tech/the-mon-ifesto-part-1-metrics-808f6c944765)
-#### Major incidents & analysis reports
+### Major incidents & analysis reports
+
 * [Information on the Capital One Cyber Incident](https://www.capitalone.com/facts2019/)
 * [A Case Study of the Capital One Data Breach](http://web.mit.edu/smadnick/www/wp/2020-16.pdf)
-#### Videos
+### Videos
+
 * [Banking on Continuous Delivery - Capital One](https://www.youtube.com/watch?v=_DnYSQEUTfo)
 * [Continuous Chaos in DevOps - Capital One](https://www.youtube.com/watch?v=U_Uh5RMCwPI)
 * [DevOps at Capital One: Focusing on Pipeline and Measurement](https://www.youtube.com/watch?v=6Q0mtVnnthQ)
@@ -168,11 +188,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 DBS
-#### Blog Posts
+### Blog Posts
+
 * [Site Reliability Engineering at DBS Bank](https://medium.com/dbs-tech-blog/site-reliability-engineering-at-dbs-bank-32c02228ccf4)
 * [Automating Configuration Management at Scale](https://medium.com/dbs-tech-blog/automating-configuration-management-at-scale-5c7927f83df3)
-#### Videos
+### Videos
+
 * [SREcon Conversations Asia/Pacific with Koon Seng Lim, DBS](https://www.youtube.com/watch?v=URwkaRbOLxI&feature=emb_title)
@@ -180,7 +202,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 DeepSource
-#### Blog Posts
+### Blog Posts
+
 * [Redis diskless replication: What, how, why and the caveats](https://deepsource.io/blog/redis-diskless-replication/)
 * [How to setup Vault with Kubernetes](https://deepsource.io/blog/setup-vault-kubernetes/)
 * [Breaking down zero downtime deployments in Kubernetes](https://deepsource.io/blog/zero-downtime-deployment/)
@@ -190,11 +213,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Dropbox
-#### Blog Posts
+### Blog Posts
+
 * [Monitoring server applications with Vortex](https://dropbox.tech/infrastructure/monitoring-server-applications-with-vortex)
 * [Athena: Our automated build health management system](https://dropbox.tech/infrastructure/athena-our-automated-build-health-management-system)
-#### Videos
+### Videos
+
 * [Service Discovery Challenges at Scale](https://www.usenix.org/conference/srecon19americas/presentation/nigmatullin)
@@ -202,7 +227,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Facebook
-#### Videos
+### Videos
+
 * [A Customer Service Approach to SRE](https://www.usenix.org/conference/srecon19emea/presentation/looney)
 * [How (Not) to Scale a Project: A Post-Mortem](https://www.usenix.org/conference/srecon19asia/presentation/bagnoli)
 * [Releasing the World's Largest Python Site Every 7 Minutes](https://www.usenix.org/conference/srecon19asia/presentation/wong-shuhong)
@@ -213,7 +239,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Fastly
-#### Videos
+### Videos
+
 * [SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager](https://www.usenix.org/conference/srecon19americas/presentation/wohlner)
 * [Resilience Engineering Mythbusting](https://www.usenix.org/conference/srecon19americas/presentation/gallego)
@@ -222,13 +249,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 eBay
-#### Blog Posts
+### Blog Posts
+
 * [Resiliency and Disaster Recovery with Kafka](https://tech.ebayinc.com/engineering/resiliency-and-disaster-recovery-with-kafka/)
 * [SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue](https://tech.ebayinc.com/engineering/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/)
 * [SRE Case Study: Mysterious Traffic Imbalance](https://tech.ebayinc.com/engineering/sre-case-study-mysterious-traffic-imbalance/)
 * [Zero Downtime, Instant Deployment and Rollback](https://tech.ebayinc.com/engineering/zero-downtime-instant-deployment-and-rollback/)
 ### Video
+
 * [Madaari: Ordering for the Monkeys](https://www.usenix.org/conference/srecon19americas/presentation/raina)
@@ -236,14 +265,16 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Etsy
-#### Blog Posts
+### Blog Posts
+
 * [Etsy’s Debriefing Facilitation Guide for Blameless Postmortems](https://codeascraft.com/2016/11/17/debriefing-facilitation-guide/)
 * [Opsweekly: Measuring on-call experience with alert classification](https://codeascraft.com/2014/06/19/opsweekly-measuring-on-call-experience-with-alert-classification/)
 * [Demystifying Site Outages](https://blog.etsy.com/news/2012/demystifying-site-outages/)
 * [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)
 * [Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/)
-#### Videos
+### Videos
+
 * [Velocity 09: John Allspaw and Paul Hammond, "10+ Deploys Pe](https://www.youtube.com/watch?v=LdOe18KhtT4)
 * [Migrating a Monolith to the Cloud](https://www.usenix.org/conference/srecon19americas/presentation/govande)
@@ -252,7 +283,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Expedia
-#### Blog Posts
+### Blog Posts
+
 * [The Cost of 100% Reliability](https://medium.com/expedia-group-tech/the-cost-of-100-reliability-ecb2901f23a4)
 * [Creating Monitoring Dashboards](https://medium.com/expedia-group-tech/creating-monitoring-dashboards-1f3fbe0ae1ac)
 * [Using Bash for DevOps](https://medium.com/expedia-group-tech/using-bash-for-devops-7046eed1aa63)
@@ -262,7 +294,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 GitHub
-#### Blog Posts
+### Blog Posts
+
 * [Deployment reliability at GitHub](https://github.blog/2021-02-03-deployment-reliability-at-github/)
 * [Improving how we deploy GitHub](https://github.blog/2021-01-25-improving-how-we-deploy-github/)
 * [Building On-Call Culture at GitHub](https://github.blog/2021-01-06-building-on-call-culture-at-github/)
@@ -271,7 +304,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * [Getting started with DevOps automation](https://github.blog/2020-10-29-getting-started-with-devops-automation/)
 * [MySQL High Availability at GitHub](https://github.blog/2018-06-20-mysql-high-availability-at-github/)
-#### Major incidents & analysis reports
+### Major incidents & analysis reports
+
 * [GitHub Availability Report: January 2021](https://github.blog/2021-02-02-github-availability-report-january-2021/)
 * [GitHub Availability Report: December 2020](https://github.blog/2021-01-06-github-availability-report-december-2020/)
 * [GitHub Availability Report: November 2020](https://github.blog/2020-12-02-availability-report-november-2020/)
@@ -283,7 +317,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * [February 28th DDoS Incident Report](https://github.blog/2018-03-01-ddos-incident-report/)
 * [Incident Report: Inadvertent Private Repository Disclosure](https://github.blog/2016-10-28-incident-report-inadvertent-private-repository-disclosure/)
-#### Videos
+### Videos
+
 * [One on One SRE](https://www.usenix.org/conference/srecon19americas/presentation/tobey)
@@ -291,7 +326,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 GitLab
-#### Blog Posts
+### Blog Posts
+
 * [This SRE attempted to roll out an HAProxy config change. You won't believe what happened next...](https://about.gitlab.com/blog/2021/01/14/this-sre-attempted-to-roll-out-an-haproxy-change/)
 * [My week shadowing a GitLab Site Reliability Engineer](https://about.gitlab.com/blog/2019/12/16/sre-shadow/)
 * [Update: Elasticsearch lessons learnt for Advanced Global Search](https://about.gitlab.com/blog/2020/04/28/elasticsearch-update/)
@@ -307,7 +343,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 GoCardless
-#### Blog Posts
+### Blog Posts
+
 * [Deploying Software at GoCardless: Open-Sourcing our “Getting Started” Tutorial](https://medium.com/gocardless-tech/deploying-software-at-gocardless-open-sourcing-our-getting-started-tutorial-ab857aa91c9e)
 * [How we compress Pub/Sub messages and more, saving a load of money](https://medium.com/gocardless-tech/how-we-compress-pub-sub-messages-and-more-saving-a-load-of-money-694b64c3458a)
 * [Fear-free PostgreSQL migrations for Rails](https://gocardless.com/blog/fear-free-postgresql-migrations-for-rails/)
@@ -316,7 +353,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * [Zero-downtime Postgres migrations - the hard parts](https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/)
 * [In search of performance - how we shaved 200ms off every POST request](https://gocardless.com/blog/in-search-of-performance-how-we-shaved-200ms-off-every-post-request/)
-#### Major incidents & analysis reports
+### Major incidents & analysis reports
+
 * [Incident review: Service outage on 25 October 2020, Vault TLS expiry](https://gocardless.com/blog/incident-review-service-outage-on-25-october-2020/)
 * [Incident review: API and Dashboard outage on 10 October 2017](https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/)
@@ -325,18 +363,21 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Google
-#### Blog Posts
+### Blog Posts
+
 * [SRE Practices & Processes](https://sre.google/resources/#practicesandprocesses)
 * [Three months, 30x demand: How we scaled Google Meet during COVID-19](https://cloud.google.com/blog/products/g-suite/keeping-google-meet-ahead-of-usage-demand-during-covid-19)
 * [SRE Classroom: Distributed PubSub](https://sre.google/resources/practices-and-processes/distributed-pubsub/)
-#### Books
+### Books
+
 * [Building Secure & Reliable Systems](https://static.googleusercontent.com/media/sre.google/en//static/pdf/building_secure_and_reliable_systems.pdf)
 * [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/)
 * [The Site Reliability Workbook](https://sre.google/workbook/table-of-contents/)
 * [Training Site Reliability Engineers](https://static.googleusercontent.com/media/sre.google/en//static/pdf/training-sre.pdf)
-#### Videos
+### Videos
+
 * [What's the Difference Between DevOps and SRE? with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/uTEL8Ff1Zvk)
 * [Risk and Error Budgets’ with Seth Vargo and Liz Fong-Jones of Google](https://youtu.be/y2ILKr8kCJU)
 * [Pragmatic Automation’ with Max Luebbe of GCP](https://www.youtube.com/watch?v=oDcjAcFTFC0&t=0m56s)
@@ -370,7 +411,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Gojek
-#### Blog Posts
+### Blog Posts
+
 * [Why We Swear by the RCA](https://blog.gojekengineering.com/why-we-swear-by-the-rca-f535fd5abbcb)
@@ -378,7 +420,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Grab
-#### Blog Posts
+### Blog Posts
+
 * [Our Journey to Continuous Delivery at Grab (Part 1)](https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab)
 * [Designing Resilient Systems: Circuit Breakers or Retries? (Part 1)](https://engineering.grab.com/designing-resilient-systems-part-1)
 * [Designing Resilient Systems: Circuit Breakers or Retries? (Part 2)](https://engineering.grab.com/designing-resilient-systems-part-2)
@@ -392,7 +435,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Grammarly
-#### Blog Posts
+### Blog Posts
+
 * [Security Operations in an AWS Environment](https://www.grammarly.com/blog/engineering/security-infrastructure-aws/)
@@ -400,7 +444,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Heroku
-#### Blog Posts
+### Blog Posts
+
 * [Incident Response at Heroku](https://blog.heroku.com/incident-response-at-heroku-2020)
@@ -408,12 +453,14 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Indeed
-#### Blog Posts
+### Blog Posts
+
 * [Being Just Reliable Enough](https://engineering.indeedblog.com/blog/2019/10/being-just-reliable-enough/)
 * [Automating Indeed’s Release Process](https://engineering.indeedblog.com/blog/2017/03/automating-release-process/)
 * [Sloth, a Tool for Inducing Network Failures’ with Preetha Appan of Indeed.com](https://www.usenix.org/conference/srecon17americas/program/presentation/appan)
-#### Videos
+### Videos
+
 * [Are We Getting Better Yet? Progress Toward Safer Operations](https://www.usenix.org/conference/srecon20americas/presentation/elman)
@@ -421,7 +468,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Khan Academy
-#### Blog Posts
+### Blog Posts
+
 * [How Khan Academy Successfully Handled 2.5x Traffic in a Week](https://blog.khanacademy.org/how-khan-academy-successfully-handled-2-5x-traffic-in-a-week/)
 * [Evolving our content infrastructure](https://blog.khanacademy.org/evolving-our-content-infrastructure/)
@@ -430,7 +478,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 LinkedIn
-#### Blog Posts
+### Blog Posts
+
 * [Insights into a Product SRE team at LinkedIn](https://www.linkedin.com/pulse/insights-product-sre-team-linkedin-zaina-afoulki/?trackingId=mxKJgZ3kp8l2WI9D4UZv7Q%3D%3D)
 * [Open source update: School of SRE](https://engineering.linkedin.com/blog/2021/open-source-update--school-of-sre)
 * [Fixing Linux filesystem performance regressions](https://engineering.linkedin.com/blog/2020/fixing-linux-filesystem-performance-regressions)
@@ -452,7 +501,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * [What Gets Measured Gets Fixed](https://engineering.linkedin.com/blog/2016/12/what-gets-measured-gets-fixed)
 * [Hiring SREs at LinkedIn](https://engineering.linkedin.com/engineering-culture/hiring-sres-linkedin)
-#### Videos
+### Videos
+
 * [Growing the Site Reliability Team at LinkedIn: Hiring is Hard -- Greg Leffler](https://www.youtube.com/watch?v=ZemNg9GYvOA)
 * [9 Years of Failure: How Racing Crappy Cars Made Me a Better SRE](https://www.usenix.org/conference/srecon20americas/presentation/doherty)
 * [Weathering the Storm: How Early Warnings Save the Farm](https://www.usenix.org/conference/srecon19emea/presentation/sherwin)
@@ -472,17 +522,19 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Mercari
-## Mercari
-#### Blog Posts
+### Blog Posts
+
 * [DevSecOps: What Is It and Why Is It Gaining Momentum in the Industry?](https://engineering.mercari.com/en/blog/entry/20201214-devsecops-what-is-it-and-why-is-it-gaining-momentum-in-the-industry/)
 * [How do we share troubleshooting skills](https://engineering.mercari.com/en/blog/entry/2020-01-28-143339/)
 * [Datadog Dashboard at Scale w / Terraform](https://engineering.mercari.com/en/blog/entry/2019-12-09-122134/)
+
 Microsoft
-#### Videos
+### Videos
+
 * [SLI & Reliability Deep-Dive’ with David N. Blank-Edelman of Microsoft](https://www.youtube.com/watch?v=1iMo3SkdQqQ)
 * [Ironies of Automation: A Comedy in Three Parts’ with Tanner Lund of Microsoft](https://www.youtube.com/watch?v=U3ubcoNzx9k)
 * [Sustainable Software Engineering & SREs](https://www.usenix.org/conference/srecon20americas/presentation/johnson)
@@ -493,13 +545,14 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * [Availability—Thinking beyond 9s](https://www.usenix.org/conference/srecon19asia/presentation/srinivasamurthy)
 * [Ironies of Automation: A Comedy in Three Parts](https://www.usenix.org/conference/srecon19asia/presentation/lund-comedy)
 * [The Ops in Serverless](https://www.usenix.org/conference/srecon19americas/presentation/davis)
+
 MIRO
-## MIRO
-#### Blog Posts
+### Blog Posts
+
 * [Prometheus High Availability and Fault Tolerance strategy, long term storage with VictoriaMetrics](https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e)
 * [Managing hundreds of servers for load testing: Autoscaling, custom monitoring, DevOps culture](https://medium.com/miro-engineering/managing-hundreds-of-servers-for-load-testing-autoscaling-custom-monitoring-devops-culture-390fd1c7e699)
 * [Reliable load testing with regards to unexpected nuances](https://medium.com/miro-engineering/reliable-load-testing-with-regards-to-unexpected-nuances-6f38c82196a5)
@@ -509,13 +562,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Monzo
-#### Blog Posts
+### Blog Posts
+
 * [Autoscaling Monzo: How we optimise our platform to be just the right size](https://monzo.com/blog/2020/10/19/autoscaling-monzo)
 * [How we’ve evolved on-call at Monzo](https://monzo.com/blog/how-weve-evolved-on-call-at-monzo)
 * [How we respond to incidents](https://monzo.com/blog/2019/07/08/how-we-respond-to-incidents)
 * [How we monitor Monzo](https://monzo.com/blog/2018/07/27/how-we-monitor-monzo)
-#### Videos
+### Videos
+
 * [Eventually Consistent Service Discovery](https://www.usenix.org/conference/srecon19emea/presentation/patel)
@@ -523,7 +578,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Netflix
-#### Blog Posts
+### Blog Posts
+
 * [Building Netflix’s Distributed Tracing Infrastructure](https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304)
 * [Lessons from Building Observability Tools at Netflix](https://netflixtechblog.com/lessons-from-building-observability-tools-at-netflix-7cfafed6ab17)
 * [Edgar: Solving Mysteries Faster with Observability](https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f)
@@ -542,10 +598,12 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * [Announcing Security Monkey — AWS Security Configuration Monitoring and Analysis](https://netflixtechblog.com/announcing-security-monkey-aws-security-configuration-monitoring-and-analysis-1f2bfb001708)
 * [Lessons Netflix Learned from the AWS Outage](https://netflixtechblog.com/lessons-netflix-learned-from-the-aws-outage-deefe5fd0c04)
-#### Major incidents & analysis reports
+### Major incidents & analysis reports
+
 * [Post-mortem of October 22, 2012 AWS degradation](https://netflixtechblog.com/post-mortem-of-october-22-2012-aws-degradation-efcee3ab40d5)
-#### Videos
+### Videos
+
 * [AWS re:Invent 2019: A day in the life of a Netflix engineer (NFX202)](https://www.youtube.com/watch?v=0QS1TWLooo0)
 * [When /bin/sh Attacks: Revisiting "Automate All the Things"](https://www.usenix.org/conference/srecon20americas/presentation/reed)
 * [How Did Things Go Right? Learning More from Incidents](https://www.usenix.org/conference/srecon19americas/presentation/kitchens)
@@ -571,7 +629,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 PayPal
-#### Videos
+### Videos
+
 * [SREcon Conversations Asia/Pacific with Karthikeyan Selvaraj and Rajesh Ramachandran, PayPal](https://www.youtube.com/watch?v=XAIj567wBsU&feature=emb_title)
 * [SRE Then vs SRE Now: A Balancing Act between Reflexes and Intuitive Instincts at PayPal](https://www.usenix.org/conference/srecon19asia/presentation/sunder-vr)
 * [Detecting Service Degradation and Failures at Scale through Distributed Log Processing](https://www.usenix.org/conference/srecon19asia/presentation/narayanan)
@@ -583,13 +642,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Pinterest
-#### Blog Posts
+### Blog Posts
+
 * [Simplifying web deploys](https://medium.com/pinterest-engineering/simplifying-web-deploys-19244fe13737)
 * [Upgrading Pinterest operational metrics](https://medium.com/pinterest-engineering/upgrading-pinterest-operational-metrics-8718d058079a)
 * [Distributed tracing at Pinterest with new open source tools](https://medium.com/pinterest-engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)
 * [Auto scaling Pinterest](https://medium.com/pinterest-engineering/auto-scaling-pinterest-df1d2beb4d64)
-#### Videos
+### Videos
+
 * [Building Actionable Code Ownership](https://www.usenix.org/conference/srecon20americas/presentation/mukherji)
 * [Evolution of Observability Tools at Pinterest](https://www.usenix.org/conference/srecon19emea/presentation/abbas)
 * [Automating OS/Platform Upgrades for Service Owners](https://www.usenix.org/conference/srecon19asia/presentation/menezes)
@@ -599,7 +660,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Postman
-#### Blog Posts
+### Blog Posts
+
 * [Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana](https://medium.com/better-practices/chaos-d3ef238ec328)
@@ -607,7 +669,7 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
Slalom Build -#### Blog Posts +### Blog Posts * [Beginners Guide to DevOps: How to Make It into the Industry](https://medium.com/slalom-build/beginners-guid-to-devops-how-to-make-it-into-the-industry-c1652d59807) * [GitHub Actions: Beyond CI/CD](https://medium.com/slalom-build/github-actions-beyond-ci-cd-cb3ddc6abaa) @@ -626,7 +688,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
Scribd -#### Blog Posts +### Blog Posts + * [Learning from incidents: getting Sidekiq ready to serve a billion jobs](https://tech.scribd.com/blog/2020/sidekiq-incident-learnings.html) * [A testimonial for using PagerDuty at Scribd](https://tech.scribd.com/blog/2020/pagerduty-at-scribd.html) * [Assigning pager duty to developers](https://tech.scribd.com/blog/2019/managing-pagerduty-rotations.html) @@ -636,7 +699,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
Shopify -#### Blog Posts +### Blog Posts + * [Resiliency Planning for High-Traffic Events](https://shopify.engineering/resiliency-planning-for-high-traffic-events) * [Capacity Planning at Scale](https://shopify.engineering/capacity-planning-shopify) * [Using DNS Traffic Management to Add Resiliency to Shopify’s Services](https://shopify.engineering/using-dns-traffic-management-add-resiliency-shopify-services) @@ -644,7 +708,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools * [Implementing ChatOps into our Incident Management Procedure](https://shopify.engineering/implementing-chatops-into-our-incident-management-procedure) * [StatsD at Shopify](https://shopify.engineering/17488320-statsd-at-shopify) -#### Videos +### Videos + * [Network Monitor: A Tale of ACKnowledging an Observability Gap](https://www.usenix.org/conference/srecon19emea/presentation/gedge) * [Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures](https://www.usenix.org/conference/srecon19emea/presentation/arthorne) * [Advanced Napkin Math: Estimating System Performance from First Principles](https://www.usenix.org/conference/srecon19emea/presentation/eskildsen) @@ -654,12 +719,15 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
Slack -#### Blog Posts +### Blog Posts + * [Slack’s Outage on January 4th 2021](https://slack.engineering/slacks-outage-on-january-4th-2021/) * [A Terrible, Horrible, No-Good, Very Bad Day at Slack](https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-slack/) * [Deploys at Slack](https://slack.engineering/deploys-at-slack/) * [Disasterpiece Theater: Slack’s process for approachable Chaos Engineering](https://slack.engineering/disasterpiece-theater-slacks-process-for-approachable-chaos-engineering/) -#### Videos + +### Videos + * [Slack at the Edge](https://www.usenix.org/conference/srecon19asia/presentation/pemberton) * [What Breaks Our Systems: A Taxonomy of Black Swans](https://www.usenix.org/conference/srecon19americas/presentation/nolan-taxonomy) @@ -668,22 +736,25 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Soundcloud
-## Soundcloud
-#### Blog Posts
+### Blog Posts
+
 * [Alerting on SLOs like Pros](https://developers.soundcloud.com/blog/alerting-on-slos)
 * [Hands-Off Deployment with Canary](https://developers.soundcloud.com/blog/hands-off-deployment-with-canary)
 * [Prometheus has come of age – a reflection on the development of an open-source project](https://developers.soundcloud.com/blog/prometheus-has-come-of-age-a-reflection-on-the-development-of-an-open-source-project)
 * [Prometheus: Monitoring at SoundCloud](https://developers.soundcloud.com/blog/prometheus-monitoring-at-soundcloud)
+
 Spotify
-#### Blog Posts
+### Blog Posts
+
 * [Techbytes: What The Industry Misses About Incidents and What You Can Do](https://engineering.atspotify.com/2020/02/26/techbytes-what-the-industry-misses-about-incidents-and-what-you-can-do/)
 * [Automated Incident Response Infrastructure in GCP](https://engineering.atspotify.com/2019/04/04/whacking-a-million-moles-automated-incident-response-infrastructure-in-gcp/)
-#### Videos
+### Videos
+
 * [Tracing, Fast and Slow: Digging into and Improving Your Web Service's Performance](https://www.usenix.org/conference/srecon19americas/presentation/root)
@@ -691,10 +762,12 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Squarespace
-#### Blog Posts
+### Blog Posts
+
 * [Under the Hood: Ensuring Site Reliability](https://engineering.squarespace.com/blog/2017/under-the-hood-ensuring-site-reliability)
-#### Videos
+### Videos
+
 * [Pushing through Friction](https://www.usenix.org/conference/srecon19emea/presentation/na)
 * [How to SRE When Everything's Already on Fire](https://www.usenix.org/conference/srecon19emea/presentation/hidalgo)
 * [Case Study: Implementing SLOs for a New Service](https://www.usenix.org/conference/srecon19americas/presentation/lawson)
@@ -705,11 +778,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Stack Overflow
-#### Blog Posts
+### Blog Posts
+
 * [A deeper dive into our May 2019 security incident](https://stackoverflow.blog/2021/01/25/a-deeper-dive-into-our-may-2019-security-incident/)
 * [Guest Post - Failing over without falling over](https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/)
-#### Videos
+### Videos
+
 * [Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline](https://www.usenix.org/conference/srecon20americas/presentation/limoncelli)
@@ -717,11 +792,13 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Stripe
-#### Blog Posts
+### Blog Posts
+
 * [Fast and flexible observability with canonical log lines](https://stripe.com/blog/canonical-log-lines)
 * [Introducing Veneur: high performance and global aggregation for Datadog](https://stripe.com/blog/engineering/page/3)
-#### Videos
+### Videos
+
 * [How Stripe Invests in Technical Infrastructure](https://www.usenix.org/conference/srecon19emea/presentation/larson)
 * [The AWS Billing Machine and Optimizing Cloud Costs](https://www.usenix.org/conference/srecon19asia/presentation/lopopolo)
@@ -730,7 +807,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Target
-#### Blog Posts
+### Blog Posts
+
 * [Ɔhaos Ǝnginǝǝring @ Target - Part 2](https://tech.target.com/2019/05/09/chaos-engineering-at-Target.html)
 * [Ɔhaos Ǝnginǝǝring @ Target - Part 1](https://tech.target.com/2019/02/05/chaos-engineering-at-Target.html)
 * [GoAlert - Your Future Open Source, On-Call Notification Product](https://tech.target.com/2019/02/25/introducing-goalert.html)
@@ -743,7 +821,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Trivago
-#### Blog Posts
+### Blog Posts
+
 * [How To Get Fooled By Metrics](https://tech.trivago.com/2020/12/04/how-to-get-fooled-by-metrics/)
@@ -751,34 +830,38 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Uber
-#### Blog Posts
+### Blog Posts
+
 * [Disaster Recovery for Multi-Region Kafka at Uber](https://eng.uber.com/kafka/)
 * [Engineering Failover Handling in Uber’s Mobile Networking Infrastructure](https://eng.uber.com/eng-failover-handling/)
 * [Optimizing Observability with Jaeger, M3, and XYS at Uber](https://eng.uber.com/optimizing-observability/)
+### Videos
-#### Videos
 * [A Tale of Two Rotations: Building a Humane & Effective On-Call](https://www.usenix.org/conference/srecon19emea/presentation/lee)
 * [Testing in Production at Scale](https://www.usenix.org/conference/srecon19americas/presentation/gud)
 * [A History of SRE at Uber’ with Rick Boone of Uber](https://www.youtube.com/watch?v=qJnS-EfIIIE)
-
 VGW
-#### Blog Posts
+### Blog Posts
+
 * [The SRE Incident Response game](https://medium.com/@bruce_25864/the-sre-incident-response-game-db242fff391c)
-#### Videos
+### Videos
+
 * [Level Up Your Incident Response With Gameplay](https://youtu.be/c2-52EP8_7c)
+
 Wikimedia Foundation
-#### Videos
+### Videos
+
 * [Testing Encyclopedias in Production](https://www.usenix.org/conference/srecon20americas/presentation/mouzeli)
 * [What Happens When You Type en.wikipedia.org?](https://www.usenix.org/conference/srecon19emea/presentation/mouzeli)
@@ -787,7 +870,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 Zerodha
-#### Blog Posts
+### Blog Posts
+
 * [Infrastructure monitoring with Prometheus at Zerodha](https://zerodha.tech/blog/infra-monitoring-at-zerodha/)
@@ -795,7 +879,8 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 SRECon Mix Playlist
-#### Videos
+### Videos
+
 * [Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE](https://www.usenix.org/conference/srecon20americas/presentation/charagondla)
 * [Amdocs - SREs at Telecom and Media Industry: Bridging between Legacy and Cloud Native Apps](https://www.usenix.org/conference/srecon20americas/presentation/yitzhaki)
 * [Amazon - Confessions of a Systems Engineer: Learning from My 20+ Years of Failure](https://www.usenix.org/conference/srecon20americas/presentation/argent)
@@ -824,6 +909,7 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 * [WeWork - Learning from Learnings: Anatomy of Three Incidents](https://www.usenix.org/conference/srecon19americas/presentation/shoup)
 * [Yelp - What I Wish I Knew before Going On-Call](https://www.usenix.org/conference/srecon19emea/presentation/shu)
 * [Zendesk - Latency and Availability Error Budgets Done Right at Scale](https://www.usenix.org/conference/srecon20americas/presentation/moyer)
+
 ---
@@ -875,10 +961,9 @@ _Note to readers: This list refers to some of the articles, posts, videos, tools
 ## Other How They... repos
-* [HowTheyTest](https://github.com/abhivaikar/howtheytest)
-* [HowTheyDevOps](https://github.com/bregman-arie/howtheydevops)
-* [HowTheyAWS](https://github.com/upgundecha/howtheyaws)
-
+* [Howtheytest](https://github.com/abhivaikar/howtheytest)
+* [Howtheydevops](https://github.com/bregman-arie/howtheydevops)
+* [Howtheyaws](https://github.com/upgundecha/howtheyaws)
 ## Contribute
@@ -893,4 +978,4 @@ related or neighboring rights to this work.
 ---
-If you decide to use this anywhere please give a credit to [@upgundecha](https://www.twitter.com/upgundecha) on twitter, also If you like my work, check out other projects on my Github.
+If you decide to use this anywhere please give a credit to [@upgundecha](https://www.twitter.com/upgundecha) on twitter, also If you like my work, check out other projects on my Github.
diff --git a/contributing.md b/contributing.md
index 6bc0e4f..9f4c7d1 100644
--- a/contributing.md
+++ b/contributing.md
@@ -15,7 +15,6 @@ Ensure your pull request adheres to the following guidelines:
 Thank you for your suggestions!
-
 ## Updating your PR
 A lot of times, making a PR adhere to the standards above can be difficult.