Deployed 4239ecf with MkDocs version: 1.2.3

This commit is contained in:
github-actions
2024-07-28 12:08:43 +00:00
parent f44a0152c4
commit a6af87660e
61 changed files with 1686 additions and 1410 deletions

@@ -1063,10 +1063,10 @@
<li class="md-nav__item">
<a href="#fault-tolerance-failure-metrics" class="md-nav__link">
Fault Tolerance - Failure Metrics
Fault Tolerance: Failure Metrics
</a>
<nav class="md-nav" aria-label="Fault Tolerance - Failure Metrics">
<nav class="md-nav" aria-label="Fault Tolerance: Failure Metrics">
<ul class="md-nav__list">
<li class="md-nav__item">
@@ -1083,7 +1083,7 @@
<li class="md-nav__item">
<a href="#fault-tolerance-fault-isolation-terms" class="md-nav__link">
Fault Tolerance - Fault Isolation Terms
Fault Tolerance: Fault Isolation Terms
</a>
</li>
@@ -2184,10 +2184,10 @@
<li class="md-nav__item">
<a href="#fault-tolerance-failure-metrics" class="md-nav__link">
Fault Tolerance - Failure Metrics
Fault Tolerance: Failure Metrics
</a>
<nav class="md-nav" aria-label="Fault Tolerance - Failure Metrics">
<nav class="md-nav" aria-label="Fault Tolerance: Failure Metrics">
<ul class="md-nav__list">
<li class="md-nav__item">
@@ -2204,7 +2204,7 @@
<li class="md-nav__item">
<a href="#fault-tolerance-fault-isolation-terms" class="md-nav__link">
Fault Tolerance - Fault Isolation Terms
Fault Tolerance: Fault Isolation Terms
</a>
</li>
@@ -2260,10 +2260,10 @@
<p>Failures are unavoidable in any system and will happen all the time, so we need to build systems that can tolerate failures or recover from them.</p>
<ul>
<li>In systems, failure is the norm rather than the exception.</li>
<li>"Anything that can go wrong will go wrong” -- Murphys Law</li>
<li>“Complex systems contain changing mixtures of failures latent within them” -- How Complex Systems Fail.</li>
<li>“Anything that can go wrong will go wrong”&mdash;Murphy's Law</li>
<li>“Complex systems contain changing mixtures of failures latent within them”&mdash;How Complex Systems Fail.</li>
</ul>
<h3 id="fault-tolerance-failure-metrics">Fault Tolerance - Failure Metrics</h3>
<h3 id="fault-tolerance-failure-metrics">Fault Tolerance: Failure Metrics</h3>
<p>The following failure metrics are commonly measured and tracked for any system.</p>
<p><strong>Mean time to repair (MTTR):</strong> The average time to repair and restore a failed system. </p>
<p><strong>Mean time between failures (MTBF):</strong> The average operational time between one device failure or system breakdown and the next. </p>
@@ -2275,32 +2275,31 @@
<p><strong>Failure rate:</strong> Another reliability metric, which measures the frequency with which a component or system fails. It is expressed as a number of failures over a unit of time.</p>
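<p>As a rough illustration of how MTTR, MTBF, and failure rate relate to each other, the sketch below derives them from a list of incidents in one observation window. The <code>Incident</code> type, the hour-based timestamps, and the choice to divide by operational time when computing the failure rate (which makes it the reciprocal of MTBF) are assumptions made for this example, not definitions taken from the text above.</p>

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One outage inside the observation window (hypothetical record type)."""
    start_hour: float  # hour at which the failure began
    end_hour: float    # hour at which service was restored


def failure_metrics(incidents, window_hours):
    """Return (mttr, mtbf, failure_rate) for a single observation window.

    Assumes incidents do not overlap and all fall inside the window.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute the metrics")
    downtime = sum(i.end_hour - i.start_hour for i in incidents)
    uptime = window_hours - downtime
    count = len(incidents)
    mttr = downtime / count        # average hours spent repairing per failure
    mtbf = uptime / count          # average operational hours per failure
    failure_rate = count / uptime  # failures per operational hour (assumed basis)
    return mttr, mtbf, failure_rate


# A 30-day (720-hour) window with three outages lasting 2, 1, and 3 hours.
incidents = [Incident(100, 102), Incident(400, 401), Incident(600, 603)]
mttr, mtbf, failure_rate = failure_metrics(incidents, window_hours=720)
```

<p>With these inputs, total downtime is 6 hours, so MTTR is 2 hours and MTBF is 238 hours.</p>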
<h4 id="refer">Refer</h4>
<ul>
<li>https://www.splunk.com/en_us/data-insider/what-is-mean-time-to-repair.html</li>
<li><a href="https://www.splunk.com/en_us/data-insider/what-is-mean-time-to-repair.html">https://www.splunk.com/en_us/data-insider/what-is-mean-time-to-repair.html</a></li>
</ul>
<h3 id="fault-tolerance-fault-isolation-terms">Fault Tolerance - Fault Isolation Terms</h3>
<h3 id="fault-tolerance-fault-isolation-terms">Fault Tolerance: Fault Isolation Terms</h3>
<p>Systems should have a short circuit: a way to cut off a failing component so that its failure does not spread. For example, in our content sharing system, if “Notifications” is not working, the site should handle that failure gracefully by removing the functionality instead of taking the whole site down.</p>
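<p>A minimal sketch of this kind of graceful degradation might look like the following. The function names, the page structure, and the set of exceptions treated as "Notifications is down" are all hypothetical and chosen only for illustration.</p>

```python
def fetch_notifications(user_id):
    """Hypothetical stand-in for a call to the Notifications service.

    It always fails here, to simulate the service being down.
    """
    raise TimeoutError("notifications backend unavailable")


def render_home_page(user_id, notification_source=fetch_notifications):
    """Build the page, dropping the notifications widget if its backend fails."""
    page = {"user": user_id, "feed": ["post-1", "post-2"]}
    try:
        page["notifications"] = notification_source(user_id)
    except (TimeoutError, ConnectionError):
        # Degrade: render the page without the widget instead of failing it.
        page["notifications"] = None
    return page


page = render_home_page("u1")
```

<p>The feed is still rendered when the Notifications call fails; only that one feature is removed from the page.</p>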
<p>Swimlane is one of the commonly used fault isolation methodologies. A swimlane places a barrier between a service and other services so that a failure on either side won't affect the other. Say we roll out a new feature, Advertisement, in our content sharing app.
We can have two architectures
<img alt="Swimlane" src="../images/swimlane-1.jpg" /></p>
<p>If Ads are generated on the fly synchronously during each Newsfeed request, the faults in the Ads feature get propagated to the Newsfeed feature. Instead if we swimlane the “Generation of Ads” service and use a shared storage to populate Newsfeed App, Ads failures wont cascade to Newsfeed, and worst case if Ads dont meet SLA , we can have Newsfeed without Ads.</p>
<p>Let's take another example, we have come up with a new model for our Content sharing App. Here we roll out an enterprise content sharing App where enterprises pay for the service and the content should never be shared outside the enterprise. </p>
We can have two architectures:</p>
<p><img alt="Swimlane" src="../images/swimlane-1.jpg" /></p>
<p>If Ads are generated on the fly synchronously during each Newsfeed request, faults in the Ads feature propagate to the Newsfeed feature. If we instead swimlane the “Generation of Ads” service and use shared storage to populate the Newsfeed App, Ads failures won't cascade to Newsfeed, and in the worst case, if Ads don't meet their SLA, we can serve Newsfeed without Ads.</p>
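<p>The swimlaned variant can be sketched as below, assuming an in-memory dictionary stands in for the shared storage. <code>AdsStore</code>, <code>generate_ads</code>, and <code>build_newsfeed</code> are hypothetical names. The point of the sketch is that the Newsfeed path only reads from storage and never makes a synchronous call into the Ads lane.</p>

```python
class AdsStore:
    """In-memory stand-in for the storage shared by the two lanes."""

    def __init__(self):
        self._ads = {}

    def put(self, user_id, ads):
        self._ads[user_id] = list(ads)

    def get(self, user_id):
        # Missing entries mean "no ads available", not an error.
        return self._ads.get(user_id, [])


def generate_ads(store, user_id, healthy=True):
    """Ads lane: runs on its own schedule, outside any Newsfeed request."""
    if not healthy:
        raise RuntimeError("ads generation failed")
    store.put(user_id, [f"ad-for-{user_id}"])


def build_newsfeed(store, user_id):
    """Newsfeed lane: reads precomputed ads and never calls the Ads service."""
    return {"posts": ["post-1", "post-2"], "ads": store.get(user_id)}


store = AdsStore()
generate_ads(store, "alice")
try:
    generate_ads(store, "bob", healthy=False)  # the failure stays in the Ads lane
except RuntimeError:
    pass

alice_feed = build_newsfeed(store, "alice")
bob_feed = build_newsfeed(store, "bob")
```

<p>Here the failed Ads run for the second user leaves that user's Newsfeed intact, just without Ads. In a real deployment the storage itself is still a shared dependency and would need its own availability guarantees.</p>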
<p>Let's take another example: we have come up with a new model for our content sharing App. Here, we roll out an enterprise content sharing App where enterprises pay for the service and content should never be shared outside the enterprise.</p>
<p><img alt="Swimlane-principles" src="../images/swimlane-2.jpg" /></p>
<h3 id="swimlane-principles">Swimlane Principles</h3>
<p><strong>Principle 1:</strong> Nothing is shared (also known as “share as little as possible”). The less that is shared within a swim lane, the more fault isolative the swim lane becomes. (as shown in Enterprise use-case)</p>
<p><strong>Principle 2:</strong> Nothing crosses a swim lane boundary. Synchronous (defined by expecting a request—not the transfer protocol) communication never crosses a swim lane boundary; if it does, the boundary is drawn incorrectly. (as shown in Ads feature)</p>
<p><strong>Principle 1:</strong> Nothing is shared (also known as “share as little as possible”). The less that is shared within a swimlane, the more fault isolative the swimlane becomes. (as shown in Enterprise use-case)</p>
<p><strong>Principle 2:</strong> Nothing crosses a swimlane boundary. Synchronous (defined by expecting a request—not the transfer protocol) communication never crosses a swimlane boundary; if it does, the boundary is drawn incorrectly. (as shown in Ads feature)</p>
<h3 id="swimlane-approaches">Swimlane Approaches</h3>
<p><strong>Approach 1:</strong> Swim lane the money-maker. Never allow your cash register to be compromised by other systems. (Tier 1 vs Tier 2 in enterprise use case)</p>
<p><strong>Approach 2:</strong> Swim lane the biggest sources of incidents. Identify the recurring causes of pain and isolate them. (if Ads feature is in code yellow, swim laning it is the best option)</p>
<p><strong>Approach 3:</strong> Swim lane natural barriers. Customer boundaries make good swim lanes. (Public vs Enterprise customers)</p>
<p><strong>Approach 1:</strong> Swimlane the money-maker. Never allow your cash register to be compromised by other systems. (Tier 1 vs Tier 2 in enterprise use case)</p>
<p><strong>Approach 2:</strong> Swimlane the biggest sources of incidents. Identify the recurring causes of pain and isolate them. (If Ads feature is in code yellow, swimlaning it is the best option.)</p>
<p><strong>Approach 3:</strong> Swimlane natural barriers. Customer boundaries make good swimlanes. (Public vs Enterprise customers)</p>
<h4 id="refer_1">Refer</h4>
<ul>
<li>https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch21.html#ch21</li>
<li><a href="https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch21.html#ch21">https://learning.oreilly.com/library/view/the-art-of/9780134031408/ch21.html#ch21</a></li>
</ul>
<h3 id="applications-in-sre-role">Applications in SRE role</h3>
<ol>
<li>Work with the DC tech or cloud team to distribute infrastructure such that its immune to switch or power failures by creating fault zones within a Data Center
https://docs.microsoft.com/en-us/azure/virtual-machines/manage-availability#use-availability-zones-to-protect-from-datacenter-level-failures</li>
<li>Work with the partners and design interaction between services such that one service breakdown is not amplified in a cascading fashion to all upstreams</li>
<li>Work with the DC tech or cloud team to distribute infrastructure across fault zones within a data center so that it is resilient to switch or power failures (<a href="https://docs.microsoft.com/en-us/azure/virtual-machines/manage-availability#use-availability-zones-to-protect-from-datacenter-level-failures">https://docs.microsoft.com/en-us/azure/virtual-machines/manage-availability#use-availability-zones-to-protect-from-datacenter-level-failures</a>).</li>
<li>Work with partner teams to design the interactions between services so that the breakdown of one service is not amplified in a cascading fashion to all of its upstreams.</li>
</ol>
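<p>To make the fault-zone idea in the first item concrete, the sketch below spreads replicas across zones in round-robin order and checks whether enough replicas would remain if the most heavily loaded zone were lost. The zone and replica names are made up, and real placement is normally handled by the cloud provider or scheduler rather than by hand-written code like this.</p>

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(replicas, zones):
    """Assign each replica to a fault zone in round-robin order."""
    if not zones:
        raise ValueError("at least one fault zone is required")
    return dict(zip(replicas, cycle(zones)))


def survives_zone_loss(placement, min_alive):
    """True if losing the most populated zone still leaves min_alive replicas."""
    per_zone = Counter(placement.values())
    worst_loss = max(per_zone.values(), default=0)
    return len(placement) - worst_loss >= min_alive


replicas = ["web-1", "web-2", "web-3", "web-4"]
placement = spread_across_zones(replicas, ["zone-a", "zone-b"])
single_zone = spread_across_zones(replicas, ["zone-a"])
```

<p>With four replicas over two zones, losing either zone leaves two replicas running, whereas placing all four in a single zone leaves none.</p>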