Deployed 4239ecf with MkDocs version: 1.2.3

github-actions
2024-07-28 12:08:43 +00:00
parent f44a0152c4
commit a6af87660e
61 changed files with 1686 additions and 1410 deletions

View File

@@ -2109,11 +2109,11 @@
<p>Earlier we discussed different ways to collect key metric data points
from a service and its underlying infrastructure. This data gives us a
better understanding of how the service is performing. One of the main
objectives of monitoring is to detect any service degradations early
objectives of monitoring is to detect any service degradations early
(reduce Mean Time To Detect) and notify stakeholders so that the issues
are either avoided or can be fixed early, thus reducing Mean Time To
Recover (MTTR). For example, if you are notified when resource usage by
a service exceeds 90 percent, you can take preventive measures to avoid
a service exceeds 90%, you can take preventive measures to avoid
any service breakdown due to a shortage of resources. On the other hand,
when a service goes down due to an issue, early detection and
notification of such incidents can help you quickly fix the issue.</p>
@@ -2124,12 +2124,12 @@ notification of such incidents can help you quickly fix the issue.</p>
set up alerts on one or a combination of metrics to actively monitor the
service health. These alerts have a set of defined rules or conditions,
and when the rule is broken, you are notified. These rules can be as
simple as notifying when the metric value exceeds n to as complex as a
week over week (WoW) comparison of standard deviation over a period of
simple as notifying when the metric value exceeds <em>n</em> to as complex as a
week-over-week (WoW) comparison of standard deviation over a period of
time. Monitoring tools notify you about an active alert, and most of
these tools support instant messaging (IM) platforms, SMS, email, or
phone calls. Figure 8 shows a sample alert notification received on
Slack for memory usage exceeding 90 percent of total RAM space on the
Slack for memory usage exceeding 90% of total RAM space on the
host.</p>
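The two kinds of rules described above, a fixed threshold and a week-over-week comparison of standard deviation, can be sketched as follows. This is a minimal illustration, not the rule syntax of any particular monitoring tool: the function names, the `max_ratio` parameter, and the sample values are all assumptions made for the example.

```python
from statistics import pstdev


def threshold_alert(value, limit):
    """Simple rule: fire when the metric value exceeds a fixed limit."""
    return value > limit


def wow_stddev_alert(this_week, last_week, max_ratio=1.5):
    """Week-over-week rule: fire when this week's standard deviation is
    more than max_ratio times last week's (max_ratio is a hypothetical knob)."""
    previous = pstdev(last_week)
    current = pstdev(this_week)
    if previous == 0:
        # Last week was perfectly flat, so any variation counts as a change.
        return current > 0
    return current / previous > max_ratio


# Hypothetical samples: memory usage in percent, and latency readings per week.
memory_used_percent = 93.0
memory_alert = threshold_alert(memory_used_percent, 90.0)

last_week_latency = [10, 10, 11, 9]
this_week_latency = [10, 30, 2, 25]
latency_alert = wow_stddev_alert(this_week_latency, last_week_latency)
```

In a real deployment, the notification step (Slack, SMS, email, or a phone call) would be triggered when either function returns `True`.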

View File

@@ -2110,22 +2110,22 @@
practices in mind.</p>
<ul>
<li>
<p><strong>Use the right metric type</strong> -- Most of the libraries available
<p><strong>Use the right metric type</strong>&mdash;Most of the libraries available
today offer various metric types. Choose the appropriate metric
type for monitoring your system. Following are the types of
metrics and their purposes.</p>
<ul>
<li>
<p><strong>Gauge --</strong> <em>Gauge</em> is a constant type of metric. After the
<p><strong>Gauge</strong>&mdash;<em>Gauge</em> is a metric that holds a single value. After the
metric is initialized, the metric value does not change unless
you intentionally update it.</p>
</li>
<li>
<p><strong>Timer --</strong> <em>Timer</em> measures the time taken to complete a
<p><strong>Timer</strong>&mdash;<em>Timer</em> measures the time taken to complete a
task.</p>
</li>
<li>
<p><strong>Counter --</strong> <em>Counter</em> counts the number of occurrences of a
<p><strong>Counter</strong>&mdash;<em>Counter</em> counts the number of occurrences of a
particular event.</p>
</li>
</ul>
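To make the difference between the three metric types concrete, here is a minimal sketch in plain Python. The `Gauge`, `Counter`, and `Timer` classes below are hypothetical stand-ins written for this example; they are not the API of any specific metrics library, which will usually offer richer versions of the same ideas.

```python
import time


class Gauge:
    """Holds a single value that changes only when it is explicitly set."""

    def __init__(self, value=0.0):
        self.value = value

    def set(self, value):
        self.value = value


class Counter:
    """Counts occurrences of an event."""

    def __init__(self):
        self.count = 0

    def increment(self, by=1):
        self.count += by


class Timer:
    """Records how long a block of work takes, in seconds."""

    def __init__(self):
        self.durations = []

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.durations.append(time.perf_counter() - self._start)
        return False  # do not swallow exceptions raised inside the block


# Hypothetical usage: queue depth as a gauge, requests as a counter,
# and request handling time as a timer.
queue_depth = Gauge()
queue_depth.set(42)

requests_served = Counter()
requests_served.increment()
requests_served.increment(2)

request_timer = Timer()
with request_timer:
    sum(range(1000))  # stand-in for real work
```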
@@ -2135,19 +2135,19 @@ practices in mind.</p>
Types</a>.</p>
<ul>
<li>
<p><strong>Avoid over-monitoring</strong> -- Monitoring can be a significant
engineering endeavor<strong><em>.</em></strong> Therefore, be sure not to spend too
<p><strong>Avoid over-monitoring</strong>&mdash;Monitoring can be a significant
engineering endeavor. Therefore, be sure not to spend too
much time and resources on monitoring services, yet make sure all
important metrics are captured.</p>
</li>
<li>
<p><strong>Prevent alert fatigue</strong> -- Set alerts for metrics that are
<p><strong>Prevent alert fatigue</strong>&mdash;Set alerts for metrics that are
important and actionable. If you receive too many non-critical
alerts, you might start ignoring alert notifications over time. As
a result, critical alerts might get overlooked.</p>
</li>
<li>
<p><strong>Have a runbook for alerts</strong> -- For every alert, make sure you have
<p><strong>Have a runbook for alerts</strong>&mdash;For every alert, make sure you have
a document explaining what actions and checks need to be performed
when the alert fires. This enables any engineer on the team to
handle the alert and take necessary actions, without any help from

View File

@@ -2112,28 +2112,28 @@ understand various subsystem statistics (CPU, memory, network, and so
on). Let's look at some of the tools that are predominantly used.</p>
<ul>
<li>
<p><code>ps/top</code>-- The process status command (ps) displays information
<p><strong><code>ps/top</code></strong>: The process status command (<code>ps</code>) displays information
about all the currently running processes in a Linux system. The
top command is similar to the ps command, but it periodically
top command is similar to the <code>ps</code> command, but it periodically
updates the information displayed until the program is terminated.
An advanced version of top, called htop, has a more user-friendly
An advanced version of top, called <code>htop</code>, has a more user-friendly
interface and some additional features. These command-line
utilities come with options to modify the operation and output of
the command. Following are some important options supported by the
ps command.</p>
<code>ps</code> command.</p>
<ul>
<li>
<p><code>-p &lt;pid1, pid2,...&gt;</code> -- Displays information about processes
<p><code>-p &lt;pid1, pid2,...&gt;</code>: Displays information about processes
that match the specified process IDs. Similarly, you can use
<code>-u &lt;uid&gt;</code> and <code>-g &lt;gid&gt;</code> to display information about
processes belonging to a specific user or group.</p>
</li>
<li>
<p><code>-a</code> -- Displays information about other users' processes, as well
<p><code>-a</code>: Displays information about other users' processes, as well
as one's own.</p>
</li>
<li>
<p><code>-x</code> -- When displaying processes matched by other options,
<p><code>-x</code>: When displaying processes matched by other options,
includes processes that do not have a controlling terminal.</p>
</li>
</ul>
@@ -2145,21 +2145,21 @@ on). Let's look at some of the tools that are predominantly used.</p>
</p>
<ul>
<li>
<p><code>ss</code> -- The socket statistics command (ss) displays information
<p><strong><code>ss</code></strong>: The socket statistics command (<code>ss</code>) displays information
about network sockets on the system. This tool is the successor of
<a href="https://man7.org/linux/man-pages/man8/netstat.8.html">netstat</a>,
which is deprecated. Following are some command-line options
supported by the ss command:</p>
supported by the <code>ss</code> command:</p>
<ul>
<li>
<p><code>-t</code> -- Displays the TCP socket. Similarly, <code>-u</code> displays UDP
<p><code>-t</code>: Displays TCP sockets. Similarly, <code>-u</code> displays UDP
sockets, <code>-x</code> is for UNIX domain sockets, and so on.</p>
</li>
<li>
<p><code>-l</code> -- Displays only listening sockets.</p>
<p><code>-l</code>: Displays only listening sockets.</p>
</li>
<li>
<p><code>-n</code> -- Instructs the command to not resolve service names.
<p><code>-n</code>: Instructs the command to not resolve service names.
Instead displays the port numbers.</p>
</li>
</ul>
@@ -2168,7 +2168,7 @@ on). Let's look at some of the tools that are predominantly used.</p>
<p><img alt="List of listening sockets on a system" src="../images/image8.png" /> <p align="center"> Figure
3: List of listening sockets on a system </p></p>
<ul>
<li><code>free</code> -- The free command displays memory usage statistics on the
<li><strong><code>free</code></strong>: The <code>free</code> command displays memory usage statistics on the
host like available memory, used memory, and free memory. Most often,
this command is used with the <code>-h</code> command-line option, which
displays the statistics in a human-readable format.</li>
@@ -2177,7 +2177,7 @@ on). Let's look at some of the tools that are predominantly used.</p>
<p align="center"> Figure 4: Memory statistics on a host in human-readable form </p>
<ul>
<li><code>df --</code> The df command displays disk space usage statistics. The
<li><strong><code>df</code></strong>: The <code>df</code> command displays disk space usage statistics. The
<code>-i</code> command-line option is also often used to display
<a href="https://en.wikipedia.org/wiki/Inode">inode</a> usage
statistics. The <code>-h</code> command-line option is used for displaying
@@ -2189,13 +2189,13 @@ on). Let's look at some of the tools that are predominantly used.</p>
<ul>
<li>
<p><code>sar</code> -- The sar utility monitors various subsystems, such as CPU
<p><strong><code>sar</code></strong>: The <code>sar</code> utility monitors various subsystems, such as CPU
and memory, in real time. This data can be stored in a file
specified with the <code>-o</code> option. This tool helps to identify
anomalies.</p>
</li>
<li>
<p><code>iftop</code> -- The interface top command (<code>iftop</code>) displays bandwidth
<p><strong><code>iftop</code></strong>: The interface top command (<code>iftop</code>) displays bandwidth
utilization by a host on an interface. This command is often used
to identify bandwidth usage by active connections. The <code>-i</code> option
specifies which network interface to watch.</p>
@@ -2209,31 +2209,31 @@ active connection on the host </p>
</p>
<ul>
<li>
<p><code>tcpdump</code> -- The tcpdump command is a network monitoring tool that
<p><strong><code>tcpdump</code></strong>: The <code>tcpdump</code> command is a network monitoring tool that
captures network packets flowing over the network and displays a
description of the captured packets. The following options are
available:</p>
<ul>
<li>
<p><code>-i &lt;interface&gt;</code> -- Interface to listen on</p>
<p><code>-i &lt;interface&gt;</code>: Interface to listen on</p>
</li>
<li>
<p><code>host &lt;IP/hostname&gt;</code> -- Filters traffic going to or from the
<p><code>host &lt;IP/hostname&gt;</code>: Filters traffic going to or from the
specified host</p>
</li>
<li>
<p><code>src/dst</code> -- Displays one-way traffic from the source (src) or to
<p><code>src/dst</code>: Displays one-way traffic from the source (src) or to
the destination (dst)</p>
</li>
<li>
<p><code>port &lt;port number&gt;</code> -- Filters traffic to or from a particular
<p><code>port &lt;port number&gt;</code>: Filters traffic to or from a particular
port</p>
</li>
</ul>
</li>
</ul>
<p><img alt="tcpdump of packets on an interface" src="../images/image10.png" /> </p>
<p align="center"> Figure 7: *tcpdump* of packets on *docker0*
<p align="center"> Figure 7: <code>tcpdump</code> of packets on <code>docker0</code>
interface on a host </p>

View File

@@ -2151,12 +2151,12 @@
<h1 id="conclusion">Conclusion</h1>
<p>A robust monitoring and alerting system is necessary for maintaining and
troubleshooting a system. A dashboard with key metrics can give you an
overview of service performance, all in one place. Well-defined alerts
overview of service performance, all in one place. Well-defined alerts
(with realistic thresholds and notifications) further enable you to
quickly identify any anomalies in the service infrastructure and in
resource saturation. By taking necessary actions, you can avoid any
service degradations and decrease MTTD for service breakdowns.</p>
<p>In addition to in-house monitoring, monitoring real user experience can
<p>In addition to in-house monitoring, monitoring real-user experience can
help you to understand service performance as perceived by the users.
Many modules are involved in serving the user, and most of them are out
of your control. Therefore, you need to have real-user monitoring in

View File

@@ -2208,7 +2208,7 @@ a system, analyzing the data to derive meaningful information, and
displaying the data to the users. In simple terms, you measure various
metrics regularly to understand the state of the system, including but
not limited to, user requests, latency, and error rate. <em>What gets
measured, gets fixed</em>---if you can measure something, you can reason
measured, gets fixed</em>&mdash;if you can measure something, you can reason
about it, understand it, discuss it, and act upon it with confidence.</p>
<h2 id="four-golden-signals-of-monitoring">Four golden signals of monitoring</h2>
<p>When setting up monitoring for a system, you need to decide what to
@@ -2237,7 +2237,7 @@ if you can measure only four metrics of your service, focus on these
four. Let's look at each of the four golden signals.</p>
<ul>
<li>
<p><strong>Traffic</strong> -- <em>Traffic</em> gives a better understanding of the service
<p><strong>Traffic</strong>&mdash;<em>Traffic</em> gives a better understanding of the service
demand. Often referred to as <em>service QPS</em> (queries per second),
traffic is a measure of requests served by the service. This
signal helps you to decide when a service needs to be scaled up to
@@ -2245,7 +2245,7 @@ four. Let's look at each of the four golden signals.</p>
cost-effective.</p>
</li>
<li>
<p><strong>Latency</strong> -- <em>Latency</em> is the measure of time taken by the service
<p><strong>Latency</strong>&mdash;<em>Latency</em> is the measure of time taken by the service
to process the incoming request and send the response. Measuring
service latency helps in the early detection of slow degradation
of the service. Distinguishing between the latency of successful
@@ -2258,7 +2258,7 @@ four. Let's look at each of the four golden signals.</p>
overall latency might result in misleading calculations.</p>
</li>
<li>
<p><strong>Error (rate)</strong> -- <em>Error</em> is the measure of failed client
<p><strong>Error (rate)</strong>&mdash;<em>Error</em> is the measure of failed client
requests. These failures can be easily identified based on the
response codes (<a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses">HTTP 5XX
error</a>).
@@ -2274,7 +2274,7 @@ four. Let's look at each of the four golden signals.</p>
in place to capture errors in addition to the response codes.</p>
</li>
<li>
<p><strong>Saturation</strong> -- <em>Saturation</em> is a measure of the resource
<p><strong>Saturation</strong>&mdash;<em>Saturation</em> is a measure of the resource
utilization by a service. This signal tells you the state of
service resources and how full they are. These resources include
memory, compute, network I/O, and so on. Service performance
@@ -2306,19 +2306,19 @@ can build intelligent applications to address specific needs. Some of
the key use cases follow:</p>
<ul>
<li>
<p><strong>Reduction in time to resolve issues</strong> -- With a good monitoring
<p><strong>Reduction in time to resolve issues</strong>&mdash;With a good monitoring
infrastructure in place, you can identify issues quickly and
resolve them, which reduces the impact caused by the issues.</p>
</li>
<li>
<p><strong>Business decisions</strong> -- Data collected over a period of time can
<p><strong>Business decisions</strong>&mdash;Data collected over a period of time can
help you make business decisions such as determining the product
release cycle, which features to invest in, and geographical areas
to focus on. Decisions based on long-term data can improve the
overall product experience.</p>
</li>
<li>
<p><strong>Resource planning</strong> -- By analyzing historical data, you can
<p><strong>Resource planning</strong>&mdash;By analyzing historical data, you can
forecast service compute-resource demands, and you can properly
allocate resources. This allows financially effective decisions,
with no compromise in end-user experience.</p>
@@ -2328,44 +2328,44 @@ the key use cases follow:</p>
terminologies.</p>
<ul>
<li>
<p><strong>Metric</strong> -- A metric is a quantitative measure of a particular
system attribute---for example, memory or CPU</p>
<p><strong>Metric</strong>&mdash;A metric is a quantitative measure of a particular
system attribute&mdash;for example, memory or CPU</p>
</li>
<li>
<p><strong>Node or host</strong> -- A physical server, virtual machine, or container
<p><strong>Node or host</strong>&mdash;A physical server, virtual machine, or container
where an application is running</p>
</li>
<li>
<p><strong>QPS</strong> -- <em>Queries Per Second</em>, a measure of traffic served by the
<p><strong>QPS</strong>&mdash;<em>Queries Per Second</em>, a measure of traffic served by the
service per second</p>
</li>
<li>
<p><strong>Latency</strong> -- The time interval between user action and the
response from the server---for example, time spent after sending a
<p><strong>Latency</strong>&mdash;The time interval between user action and the
response from the server&mdash;for example, time spent after sending a
query to a database before the first response bit is received</p>
</li>
<li>
<p><strong>Error</strong> <strong>rate</strong> -- Number of errors observed over a particular
<p><strong>Error rate</strong>&mdash;Number of errors observed over a particular
time period (usually a second)</p>
</li>
<li>
<p><strong>Graph</strong> -- In monitoring, a graph is a representation of one or
<p><strong>Graph</strong>&mdash;In monitoring, a graph is a representation of one or
more values of metrics collected over time</p>
</li>
<li>
<p><strong>Dashboard</strong> -- A dashboard is a collection of graphs that provide
<p><strong>Dashboard</strong>&mdash;A dashboard is a collection of graphs that provide
an overview of system health</p>
</li>
<li>
<p><strong>Incident</strong> -- An incident is an event that disrupts the normal
<p><strong>Incident</strong>&mdash;An incident is an event that disrupts the normal
operations of a system</p>
</li>
<li>
<p><strong>MTTD</strong> -- <em>Mean Time To Detect</em> is the time interval between the
<p><strong>MTTD</strong>&mdash;<em>Mean Time To Detect</em> is the time interval between the
beginning of a service failure and the detection of such failure</p>
</li>
<li>
<p><strong>MTTR</strong> -- Mean Time To Resolve is the time spent to fix a service
<p><strong>MTTR</strong>&mdash;Mean Time To Resolve is the time spent to fix a service
failure and bring the service back to its normal state</p>
</li>
</ul>
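Several of the terms above are simple ratios or averages, so a short worked example may help. The sketch below uses made-up request counts and incident timestamps. It also assumes that time to resolve is measured from detection to recovery; some teams measure it from the start of the failure instead.

```python
from datetime import datetime, timedelta


def qps(request_count, window_seconds):
    """Queries per second over a measurement window."""
    return request_count / window_seconds


def error_rate(error_count, window_seconds):
    """Errors per second over a measurement window."""
    return error_count / window_seconds


def mean_interval(pairs):
    """Average of (end - start) over a list of (start, end) datetime pairs."""
    total = sum((end - start for start, end in pairs), timedelta())
    return total / len(pairs)


# Hypothetical incidents: (failure started, failure detected, service recovered).
incidents = [
    (datetime(2024, 7, 1, 10, 0), datetime(2024, 7, 1, 10, 4), datetime(2024, 7, 1, 10, 34)),
    (datetime(2024, 7, 9, 14, 0), datetime(2024, 7, 9, 14, 6), datetime(2024, 7, 9, 14, 26)),
]

# MTTD: failure start to detection. MTTR (as assumed here): detection to recovery.
mttd = mean_interval([(started, detected) for started, detected, _ in incidents])
mttr = mean_interval([(detected, recovered) for _, detected, recovered in incidents])

traffic = qps(1200, 60)      # 1200 requests in a 60-second window
errors = error_rate(30, 60)  # 30 failed requests in the same window
```

With these sample values, the detection delays are 4 and 6 minutes and the recovery times are 30 and 20 minutes, so the averages are 5 and 25 minutes.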
@@ -2382,7 +2382,7 @@ notifying concerned parties during any abnormal behavior. Let's look at
each of these infrastructure components:</p>
<ul>
<li>
<p><strong>Host metrics agent --</strong> A <em>host metrics agent</em> is a process
<p><strong>Host metrics agent</strong>&mdash;A <em>host metrics agent</em> is a process
running on the host that collects performance statistics for host
subsystems such as memory, CPU, and network. These metrics are
regularly relayed to a metrics collector for storage and
@@ -2392,7 +2392,7 @@ each of these infrastructure components:</p>
and <a href="https://www.elastic.co/beats/metricbeat">metricbeat</a>.</p>
</li>
<li>
<p><strong>Metric aggregator --</strong> A <em>metric aggregator</em> is a process running
<p><strong>Metric aggregator</strong>&mdash;A <em>metric aggregator</em> is a process running
on the host. Applications running on the host collect service
metrics using
<a href="https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)">instrumentation</a>.
@@ -2403,7 +2403,7 @@ each of these infrastructure components:</p>
<a href="https://github.com/statsd/statsd">StatsD</a>.</p>
</li>
<li>
<p><strong>Metrics collector --</strong> A <em>metrics collector</em> process collects all
<p><strong>Metrics collector</strong>&mdash;A <em>metrics collector</em> process collects all
the metrics from the metric aggregators running on multiple hosts.
The collector decodes this data and stores it in the
database. Metric collection and storage might be taken care of by
@@ -2413,13 +2413,13 @@ each of these infrastructure components:</p>
daemons</a>.</p>
</li>
<li>
<p><strong>Storage --</strong> A time-series database stores all of these metrics.
<p><strong>Storage</strong>&mdash;A time-series database stores all of these metrics.
Examples are <a href="http://opentsdb.net/">OpenTSDB</a>,
<a href="https://graphite.readthedocs.io/en/stable/whisper.html">Whisper</a>,
and <a href="https://www.influxdata.com/">InfluxDB</a>.</p>
</li>
<li>
<p><strong>Metrics server --</strong> A <em>metrics server</em> can be as basic as a web
<p><strong>Metrics server</strong>&mdash;A <em>metrics server</em> can be as basic as a web
server that graphically renders metric data. In addition, the
metrics server provides aggregation functionalities and APIs for
fetching metric data programmatically. Some examples are
@@ -2427,7 +2427,7 @@ each of these infrastructure components:</p>
<a href="https://github.com/graphite-project/graphite-web">Graphite-Web</a>.</p>
</li>
<li>
<p><strong>Alert manager --</strong> The <em>alert manager</em> regularly polls metric data
<p><strong>Alert manager</strong>&mdash;The <em>alert manager</em> regularly polls metric data
available and, if there are any anomalies detected, notifies you.
Each alert has a set of rules for identifying such anomalies.
Today many metrics servers such as

View File

@@ -2107,7 +2107,7 @@
<h2 id="_1"></h2>
<h1 id="observability">Observability</h1>
<p>Engineers often use observability when referring to building reliable
systems. <em>Observability</em> is a term derived from control theory, It is a
systems. <em>Observability</em> is a term derived from control theory. It is a
measure of how well internal states of a system can be inferred from
knowledge of its external outputs. Service infrastructures used on a
daily basis are becoming more and more complex; proactive monitoring
@@ -2177,7 +2177,7 @@ value and take further action.</p>
Logstash, Kibana), which provides centralized log processing. Beats is a
collection of lightweight data shippers that can ship logs, audit data,
network data, and so on over the network. In this use case specifically,
we are using filebeat as a log shipper. Filebeat watches service log
we are using Filebeat as a log shipper. Filebeat watches service log
files and ships the log data to Logstash. Logstash parses these logs and
transforms the data, preparing it for storage in Elasticsearch. Transformed
log data is stored on Elasticsearch and indexed for fast retrieval.

View File

@@ -2111,13 +2111,13 @@ addition, a number of companies such as
<a href="https://www.datadoghq.com/">Datadog</a> offer
monitoring-as-a-service. In this section, we are not covering
monitoring-as-a-service in depth.</p>
<p>In recent years, more and more people have access to the internet. Many
<p>In recent years, more and more people have access to the Internet. Many
services are offered online to cater to the increasing user base. As a
result, web pages are becoming larger, with increased client-side
scripts. Users want these services to be fast and error-free. From the
service point of view, when the response body is composed, an HTTP 200
OK response is sent, and everything looks okay. But there might be
errors during transmission or on the client side. As previously
errors during transmission or on the client side. As previously
mentioned, monitoring services from within the service infrastructure
gives good visibility into service health, but this is not enough. You
need to monitor user experience, specifically the availability of
@@ -2131,7 +2131,7 @@ service is globally accessible. Other third-party monitoring solutions
for real user monitoring (RUM) provide performance statistics such as
service uptime and response time, from different geographical locations.
This allows you to monitor the user experience from these locations,
which might have different internet backbones, different operating
which might have different Internet backbones, different operating
systems, and different browsers and browser versions. <a href="https://pages.catchpoint.com/overview-video">Catchpoint
Global Monitoring
Network</a> is a