mirror of
https://github.com/linkedin/school-of-sre
synced 2026-01-21 07:58:03 +00:00
Deployed 4239ecf with MkDocs version: 1.2.3
@@ -2109,11 +2109,11 @@
 <p>Earlier we discussed different ways to collect key metric data points
 from a service and its underlying infrastructure. This data gives us a
 better understanding of how the service is performing. One of the main
-objectives of monitoring is to detect any service degradations early
+objectives of monitoring is to detect any service degradations early
 (reduce Mean Time To Detect) and notify stakeholders so that the issues
 are either avoided or can be fixed early, thus reducing Mean Time To
 Recover (MTTR). For example, if you are notified when resource usage by
-a service exceeds 90 percent, you can take preventive measures to avoid
+a service exceeds 90%, you can take preventive measures to avoid
 any service breakdown due to a shortage of resources. On the other hand,
 when a service goes down due to an issue, early detection and
 notification of such incidents can help you quickly fix the issue.</p>
@@ -2124,12 +2124,12 @@ notification of such incidents can help you quickly fix the issue.</p>
 set up alerts on one or a combination of metrics to actively monitor the
 service health. These alerts have a set of defined rules or conditions,
 and when the rule is broken, you are notified. These rules can be as
-simple as notifying when the metric value exceeds n to as complex as a
-week over week (WoW) comparison of standard deviation over a period of
+simple as notifying when the metric value exceeds <em>n</em> to as complex as a
+week-over-week (WoW) comparison of standard deviation over a period of
 time. Monitoring tools notify you about an active alert, and most of
 these tools support instant messaging (IM) platforms, SMS, email, or
 phone calls. Figure 8 shows a sample alert notification received on
-Slack for memory usage exceeding 90 percent of total RAM space on the
+Slack for memory usage exceeding 90% of total RAM space on the
 host.</p>
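The simplest kind of rule described in the hunk above (notify when a metric value exceeds n) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ThresholdRule`, `memory_used_percent`, and the byte counts are hypothetical names and values, not part of any monitoring tool mentioned in the course.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """Hypothetical alert rule: fires when a metric value exceeds a limit."""
    metric: str
    limit: float  # same unit as the metric, here percent

    def is_firing(self, value: float) -> bool:
        # The rule is "broken" only when the value is strictly above the limit.
        return value > self.limit


def memory_used_percent(total_bytes: int, available_bytes: int) -> float:
    """Share of RAM in use, as a percentage of total RAM."""
    return 100.0 * (total_bytes - available_bytes) / total_bytes


# Assumed sample values: a 16 GiB host with 1 GiB still available.
rule = ThresholdRule(metric="memory_used_percent", limit=90.0)
usage = memory_used_percent(total_bytes=16 * 1024**3, available_bytes=1 * 1024**3)

if rule.is_firing(usage):
    # A real tool would send this to IM, SMS, email, or a phone call.
    print(f"ALERT {rule.metric}: {usage:.1f}% exceeds {rule.limit:.0f}%")
```

A week-over-week comparison would replace `is_firing` with a check against values stored from the previous week; the notification path would stay the same.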
@@ -2110,22 +2110,22 @@
 practices in mind.</p>
 <ul>
 <li>
-<p><strong>Use the right metric type</strong> -- Most of the libraries available
+<p><strong>Use the right metric type</strong>—Most of the libraries available
 today offer various metric types. Choose the appropriate metric
 type for monitoring your system. Following are the types of
 metrics and their purposes.</p>
 <ul>
 <li>
-<p><strong>Gauge --</strong> <em>Gauge</em> is a constant type of metric. After the
+<p><strong>Gauge</strong>—<em>Gauge</em> is a constant type of metric. After the
 metric is initialized, the metric value does not change unless
 you intentionally update it.</p>
 </li>
 <li>
-<p><strong>Timer --</strong> <em>Timer</em> measures the time taken to complete a
+<p><strong>Timer</strong>—<em>Timer</em> measures the time taken to complete a
 task.</p>
 </li>
 <li>
-<p><strong>Counter --</strong> <em>Counter</em> counts the number of occurrences of a
+<p><strong>Counter</strong>—<em>Counter</em> counts the number of occurrences of a
 particular event.</p>
 </li>
 </ul>
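The three metric types named in the hunk above can be sketched as small Python classes. This is a rough illustration, not the API of any particular metrics library; the class and variable names are assumptions made for the example.

```python
import time


class Gauge:
    """Holds a value that changes only when it is explicitly set."""

    def __init__(self, value: float = 0.0) -> None:
        self.value = value

    def set(self, value: float) -> None:
        self.value = value


class Counter:
    """Counts the number of occurrences of an event."""

    def __init__(self) -> None:
        self.count = 0

    def increment(self, by: int = 1) -> None:
        self.count += by


class Timer:
    """Context manager that records how long a task took, in seconds."""

    def __init__(self) -> None:
        self.elapsed = 0.0

    def __enter__(self) -> "Timer":
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc_info) -> None:
        # Returning None lets any exception from the timed block propagate.
        self.elapsed = time.perf_counter() - self._start


# Hypothetical usage: queue depth is a gauge, served requests a counter,
# and the time spent handling one request a timer.
queue_depth = Gauge()
queue_depth.set(42)

requests_served = Counter()
for _ in range(3):
    requests_served.increment()

with Timer() as request_timer:
    sum(range(10_000))  # stand-in for real work
```

Picking the type by what the value means (current level, running total, or duration) keeps later aggregation straightforward.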
@@ -2135,19 +2135,19 @@ practices in mind.</p>
 Types</a>.</p>
 <ul>
 <li>
-<p><strong>Avoid over-monitoring</strong> -- Monitoring can be a significant
-engineering endeavor<strong><em>.</em></strong> Therefore, be sure not to spend too
+<p><strong>Avoid over-monitoring</strong>—Monitoring can be a significant
+engineering endeavor. Therefore, be sure not to spend too
 much time and resources on monitoring services, yet make sure all
 important metrics are captured.</p>
 </li>
 <li>
-<p><strong>Prevent alert fatigue</strong> -- Set alerts for metrics that are
+<p><strong>Prevent alert fatigue</strong>—Set alerts for metrics that are
 important and actionable. If you receive too many non-critical
 alerts, you might start ignoring alert notifications over time. As
 a result, critical alerts might get overlooked.</p>
 </li>
 <li>
-<p><strong>Have a runbook for alerts</strong> -- For every alert, make sure you have
+<p><strong>Have a runbook for alerts</strong>—For every alert, make sure you have
 a document explaining what actions and checks need to be performed
 when the alert fires. This enables any engineer on the team to
 handle the alert and take necessary actions, without any help from
@@ -2112,28 +2112,28 @@ understand various subsystem statistics (CPU, memory, network, and so
 on). Let's look at some of the tools that are predominantly used.</p>
 <ul>
 <li>
-<p><code>ps/top</code>-- The process status command (ps) displays information
+<p><strong><code>ps/top</code></strong>: The process status command (<code>ps</code>) displays information
 about all the currently running processes in a Linux system. The
-top command is similar to the ps command, but it periodically
+top command is similar to the <code>ps</code> command, but it periodically
 updates the information displayed until the program is terminated.
-An advanced version of top, called htop, has a more user-friendly
+An advanced version of top, called <code>htop</code>, has a more user-friendly
 interface and some additional features. These command-line
 utilities come with options to modify the operation and output of
 the command. Following are some important options supported by the
-ps command.</p>
+<code>ps</code> command.</p>
 <ul>
 <li>
-<p><code>-p <pid1, pid2,...></code> -- Displays information about processes
+<p><code>-p <pid1, pid2,...></code>: Displays information about processes
 that match the specified process IDs. Similarly, you can use
 <code>-u <uid></code> and <code>-g <gid></code> to display information about
 processes belonging to a specific user or group.</p>
 </li>
 <li>
-<p><code>-a</code> -- Displays information about other users' processes, as well
+<p><code>-a</code>: Displays information about other users' processes, as well
 as one's own.</p>
 </li>
 <li>
-<p><code>-x</code> -- When displaying processes matched by other options,
+<p><code>-x</code>: When displaying processes matched by other options,
 includes processes that do not have a controlling terminal.</p>
 </li>
 </ul>
@@ -2145,21 +2145,21 @@ on). Let's look at some of the tools that are predominantly used.</p>
 </p>
 <ul>
 <li>
-<p><code>ss</code> -- The socket statistics command (ss) displays information
+<p><strong><code>ss</code></strong>: The socket statistics command (<code>ss</code>) displays information
 about network sockets on the system. This tool is the successor of
 <a href="https://man7.org/linux/man-pages/man8/netstat.8.html">netstat</a>,
 which is deprecated. Following are some command-line options
-supported by the ss command:</p>
+supported by the <code>ss</code> command:</p>
 <ul>
 <li>
-<p><code>-t</code> -- Displays the TCP socket. Similarly, <code>-u</code> displays UDP
+<p><code>-t</code>: Displays the TCP socket. Similarly, <code>-u</code> displays UDP
 sockets, <code>-x</code> is for UNIX domain sockets, and so on.</p>
 </li>
 <li>
-<p><code>-l</code> -- Displays only listening sockets.</p>
+<p><code>-l</code>: Displays only listening sockets.</p>
 </li>
 <li>
-<p><code>-n</code> -- Instructs the command to not resolve service names.
+<p><code>-n</code>: Instructs the command to not resolve service names.
 Instead displays the port numbers.</p>
 </li>
 </ul>
@@ -2168,7 +2168,7 @@ on). Let's look at some of the tools that are predominantly used.</p>
 <p><img alt="List of listening sockets on a system" src="../images/image8.png" /> <p align="center"> Figure
 3: List of listening sockets on a system </p></p>
 <ul>
-<li><code>free</code> -- The free command displays memory usage statistics on the
+<li><strong><code>free</code></strong>: The <code>free</code> command displays memory usage statistics on the
 host like available memory, used memory, and free memory. Most often,
 this command is used with the <code>-h</code> command-line option, which
 displays the statistics in a human-readable format.</li>
@@ -2177,7 +2177,7 @@ on). Let's look at some of the tools that are predominantly used.</p>
 <p align="center"> Figure 4: Memory statistics on a host in human-readable form </p>
 
 <ul>
-<li><code>df --</code> The df command displays disk space usage statistics. The
+<li><strong><code>df</code></strong>: The <code>df</code> command displays disk space usage statistics. The
 <code>-i</code> command-line option is also often used to display
 <a href="https://en.wikipedia.org/wiki/Inode">inode</a> usage
 statistics. The <code>-h</code> command-line option is used for displaying
@@ -2189,13 +2189,13 @@ on). Let's look at some of the tools that are predominantly used.</p>
 
 <ul>
 <li>
-<p><code>sar</code> -- The sar utility monitors various subsystems, such as CPU
+<p><strong><code>sar</code></strong>: The <code>sar</code> utility monitors various subsystems, such as CPU
 and memory, in real time. This data can be stored in a file
 specified with the <code>-o</code> option. This tool helps to identify
 anomalies.</p>
 </li>
 <li>
-<p><code>iftop</code> -- The interface top command (<code>iftop</code>) displays bandwidth
+<p><strong><code>iftop</code></strong>: The interface top command (<code>iftop</code>) displays bandwidth
 utilization by a host on an interface. This command is often used
 to identify bandwidth usage by active connections. The <code>-i</code> option
 specifies which network interface to watch.</p>
@@ -2209,31 +2209,31 @@ active connection on the host </p>
 </p>
 <ul>
 <li>
-<p><code>tcpdump</code> -- The tcpdump command is a network monitoring tool that
+<p><strong><code>tcpdump</code></strong>: The <code>tcpdump</code> command is a network monitoring tool that
 captures network packets flowing over the network and displays a
 description of the captured packets. The following options are
 available:</p>
 <ul>
 <li>
-<p><code>-i <interface></code> -- Interface to listen on</p>
+<p><code>-i <interface></code>: Interface to listen on</p>
 </li>
 <li>
-<p><code>host <IP/hostname></code> -- Filters traffic going to or from the
+<p><code>host <IP/hostname></code>: Filters traffic going to or from the
 specified host</p>
 </li>
 <li>
-<p><code>src/dst</code> -- Displays one-way traffic from the source (src) or to
+<p><code>src/dst</code>: Displays one-way traffic from the source (src) or to
 the destination (dst)</p>
 </li>
 <li>
-<p><code>port <port number></code> -- Filters traffic to or from a particular
+<p><code>port <port number></code>: Filters traffic to or from a particular
 port</p>
 </li>
 </ul>
 </li>
 </ul>
 <p><img alt="tcpdump of packets on an interface" src="../images/image10.png" /> </p>
-<p align="center"> Figure 7: *tcpdump* of packets on *docker0*
+<p align="center"> Figure 7: <code>tcpdump</code> of packets on <code>docker0</code>
 interface on a host </p>
@@ -2151,12 +2151,12 @@
 <h1 id="conclusion">Conclusion</h1>
 <p>A robust monitoring and alerting system is necessary for maintaining and
 troubleshooting a system. A dashboard with key metrics can give you an
-overview of service performance, all in one place. Well-defined alerts
+overview of service performance, all in one place. Well-defined alerts
 (with realistic thresholds and notifications) further enable you to
 quickly identify any anomalies in the service infrastructure and in
 resource saturation. By taking necessary actions, you can avoid any
 service degradations and decrease MTTD for service breakdowns.</p>
-<p>In addition to in-house monitoring, monitoring real user experience can
+<p>In addition to in-house monitoring, monitoring real-user experience can
 help you to understand service performance as perceived by the users.
 Many modules are involved in serving the user, and most of them are out
 of your control. Therefore, you need to have real-user monitoring in
@@ -2208,7 +2208,7 @@ a system, analyzing the data to derive meaningful information, and
 displaying the data to the users. In simple terms, you measure various
 metrics regularly to understand the state of the system, including but
 not limited to, user requests, latency, and error rate. <em>What gets
-measured, gets fixed</em>---if you can measure something, you can reason
+measured, gets fixed</em>—if you can measure something, you can reason
 about it, understand it, discuss it, and act upon it with confidence.</p>
 <h2 id="four-golden-signals-of-monitoring">Four golden signals of monitoring</h2>
 <p>When setting up monitoring for a system, you need to decide what to
@@ -2237,7 +2237,7 @@ if you can measure only four metrics of your service, focus on these
 four. Let's look at each of the four golden signals.</p>
 <ul>
 <li>
-<p><strong>Traffic</strong> -- <em>Traffic</em> gives a better understanding of the service
+<p><strong>Traffic</strong>—<em>Traffic</em> gives a better understanding of the service
 demand. Often referred to as <em>service QPS</em> (queries per second),
 traffic is a measure of requests served by the service. This
 signal helps you to decide when a service needs to be scaled up to
@@ -2245,7 +2245,7 @@ four. Let's look at each of the four golden signals.</p>
 cost-effective.</p>
 </li>
 <li>
-<p><strong>Latency</strong> -- <em>Latency</em> is the measure of time taken by the service
+<p><strong>Latency</strong>—<em>Latency</em> is the measure of time taken by the service
 to process the incoming request and send the response. Measuring
 service latency helps in the early detection of slow degradation
 of the service. Distinguishing between the latency of successful
@@ -2258,7 +2258,7 @@ four. Let's look at each of the four golden signals.</p>
 overall latency might result in misleading calculations.</p>
 </li>
 <li>
-<p><strong>Error (rate)</strong> -- <em>Error</em> is the measure of failed client
+<p><strong>Error (rate)</strong>—<em>Error</em> is the measure of failed client
 requests. These failures can be easily identified based on the
 response codes (<a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses">HTTP 5XX
 error</a>).
@@ -2274,7 +2274,7 @@ four. Let's look at each of the four golden signals.</p>
 in place to capture errors in addition to the response codes.</p>
 </li>
 <li>
-<p><strong>Saturation</strong> -- <em>Saturation</em> is a measure of the resource
+<p><strong>Saturation</strong>—<em>Saturation</em> is a measure of the resource
 utilization by a service. This signal tells you the state of
 service resources and how full they are. These resources include
 memory, compute, network I/O, and so on. Service performance
@@ -2306,19 +2306,19 @@ can build intelligent applications to address specific needs. Some of
 the key use cases follow:</p>
 <ul>
 <li>
-<p><strong>Reduction in time to resolve issues</strong> -- With a good monitoring
+<p><strong>Reduction in time to resolve issues</strong>—With a good monitoring
 infrastructure in place, you can identify issues quickly and
 resolve them, which reduces the impact caused by the issues.</p>
 </li>
 <li>
-<p><strong>Business decisions</strong> -- Data collected over a period of time can
+<p><strong>Business decisions</strong>—Data collected over a period of time can
 help you make business decisions such as determining the product
 release cycle, which features to invest in, and geographical areas
 to focus on. Decisions based on long-term data can improve the
 overall product experience.</p>
 </li>
 <li>
-<p><strong>Resource planning</strong> -- By analyzing historical data, you can
+<p><strong>Resource planning</strong>—By analyzing historical data, you can
 forecast service compute-resource demands, and you can properly
 allocate resources. This allows financially effective decisions,
 with no compromise in end-user experience.</p>
@@ -2328,44 +2328,44 @@ the key use cases follow:</p>
 terminologies.</p>
 <ul>
 <li>
-<p><strong>Metric</strong> -- A metric is a quantitative measure of a particular
-system attribute---for example, memory or CPU</p>
+<p><strong>Metric</strong>—A metric is a quantitative measure of a particular
+system attribute—for example, memory or CPU</p>
 </li>
 <li>
-<p><strong>Node or host</strong> -- A physical server, virtual machine, or container
+<p><strong>Node or host</strong>—A physical server, virtual machine, or container
 where an application is running</p>
 </li>
 <li>
-<p><strong>QPS</strong> -- <em>Queries Per Second</em>, a measure of traffic served by the
+<p><strong>QPS</strong>—<em>Queries Per Second</em>, a measure of traffic served by the
 service per second</p>
 </li>
 <li>
-<p><strong>Latency</strong> -- The time interval between user action and the
-response from the server---for example, time spent after sending a
+<p><strong>Latency</strong>—The time interval between user action and the
+response from the server—for example, time spent after sending a
 query to a database before the first response bit is received</p>
 </li>
 <li>
-<p><strong>Error</strong> <strong>rate</strong> -- Number of errors observed over a particular
+<p><strong>Error</strong> <strong>rate</strong>—Number of errors observed over a particular
 time period (usually a second)</p>
 </li>
 <li>
-<p><strong>Graph</strong> -- In monitoring, a graph is a representation of one or
+<p><strong>Graph</strong>—In monitoring, a graph is a representation of one or
 more values of metrics collected over time</p>
 </li>
 <li>
-<p><strong>Dashboard</strong> -- A dashboard is a collection of graphs that provide
+<p><strong>Dashboard</strong>—A dashboard is a collection of graphs that provide
 an overview of system health</p>
 </li>
 <li>
-<p><strong>Incident</strong> -- An incident is an event that disrupts the normal
+<p><strong>Incident</strong>—An incident is an event that disrupts the normal
 operations of a system</p>
 </li>
 <li>
-<p><strong>MTTD</strong> -- <em>Mean Time To Detect</em> is the time interval between the
+<p><strong>MTTD</strong>—<em>Mean Time To Detect</em> is the time interval between the
 beginning of a service failure and the detection of such failure</p>
 </li>
 <li>
-<p><strong>MTTR</strong> -- Mean Time To Resolve is the time spent to fix a service
+<p><strong>MTTR</strong>—Mean Time To Resolve is the time spent to fix a service
 failure and bring the service back to its normal state</p>
 </li>
 </ul>
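The MTTD and MTTR definitions in the hunk above reduce to simple averages over incident timestamps. The sketch below is one possible reading, not an official formula: the `Incident` record and the sample times are hypothetical, and MTTR is assumed to run from detection to resolution (some teams measure it from the start of the failure instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with the three timestamps needed here."""
    started: datetime   # when the service failure began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when the service was back to its normal state


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time from the beginning of a failure to its detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time spent fixing a failure (assumed: detection to resolution)."""
    return mean_delta([i.resolved - i.detected for i in incidents])


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4),
             datetime(2024, 1, 5, 10, 34)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 2),
             datetime(2024, 1, 9, 14, 12)),
]

print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

Tracking both numbers over time shows whether alerting changes shorten detection and whether runbooks shorten the fix.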
@@ -2382,7 +2382,7 @@ notifying concerned parties during any abnormal behavior. Let's look at
 each of these infrastructure components:</p>
 <ul>
 <li>
-<p><strong>Host metrics agent --</strong> A <em>host metrics agent</em> is a process
+<p><strong>Host metrics agent</strong>—A <em>host metrics agent</em> is a process
 running on the host that collects performance statistics for host
 subsystems such as memory, CPU, and network. These metrics are
 regularly relayed to a metrics collector for storage and
@@ -2392,7 +2392,7 @@ each of these infrastructure components:</p>
 and <a href="https://www.elastic.co/beats/metricbeat">metricbeat</a>.</p>
 </li>
 <li>
-<p><strong>Metric aggregator --</strong> A <em>metric aggregator</em> is a process running
+<p><strong>Metric aggregator</strong>—A <em>metric aggregator</em> is a process running
 on the host. Applications running on the host collect service
 metrics using
 <a href="https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)">instrumentation</a>.
@@ -2403,7 +2403,7 @@ each of these infrastructure components:</p>
 <a href="https://github.com/statsd/statsd">StatsD</a>.</p>
 </li>
 <li>
-<p><strong>Metrics collector --</strong> A <em>metrics collector</em> process collects all
+<p><strong>Metrics collector</strong>—A <em>metrics collector</em> process collects all
 the metrics from the metric aggregators running on multiple hosts.
 The collector takes care of decoding and stores this data on the
 database. Metric collection and storage might be taken care of by
@@ -2413,13 +2413,13 @@ each of these infrastructure components:</p>
 daemons</a>.</p>
 </li>
 <li>
-<p><strong>Storage --</strong> A time-series database stores all of these metrics.
+<p><strong>Storage</strong>—A time-series database stores all of these metrics.
 Examples are <a href="http://opentsdb.net/">OpenTSDB</a>,
 <a href="https://graphite.readthedocs.io/en/stable/whisper.html">Whisper</a>,
 and <a href="https://www.influxdata.com/">InfluxDB</a>.</p>
 </li>
 <li>
-<p><strong>Metrics server --</strong> A <em>metrics server</em> can be as basic as a web
+<p><strong>Metrics server</strong>—A <em>metrics server</em> can be as basic as a web
 server that graphically renders metric data. In addition, the
 metrics server provides aggregation functionalities and APIs for
 fetching metric data programmatically. Some examples are
@@ -2427,7 +2427,7 @@ each of these infrastructure components:</p>
 <a href="https://github.com/graphite-project/graphite-web">Graphite-Web</a>.</p>
 </li>
 <li>
-<p><strong>Alert manager --</strong> The <em>alert manager</em> regularly polls metric data
+<p><strong>Alert manager</strong>—The <em>alert manager</em> regularly polls metric data
 available and, if there are any anomalies detected, notifies you.
 Each alert has a set of rules for identifying such anomalies.
 Today many metrics servers such as
@@ -2107,7 +2107,7 @@
 <h2 id="_1"></h2>
 <h1 id="observability">Observability</h1>
 <p>Engineers often use observability when referring to building reliable
-systems. <em>Observability</em> is a term derived from control theory, It is a
+systems. <em>Observability</em> is a term derived from control theory, it is a
 measure of how well internal states of a system can be inferred from
 knowledge of its external outputs. Service infrastructures used on a
 daily basis are becoming more and more complex; proactive monitoring
@@ -2177,7 +2177,7 @@ value and take further action.</p>
 Logstash, Kibana), which provides centralized log processing. Beats is a
 collection of lightweight data shippers that can ship logs, audit data,
 network data, and so on over the network. In this use case specifically,
-we are using filebeat as a log shipper. Filebeat watches service log
+we are using Filebeat as a log shipper. Filebeat watches service log
 files and ships the log data to Logstash. Logstash parses these logs and
 transforms the data, preparing it to store on Elasticsearch. Transformed
 log data is stored on Elasticsearch and indexed for fast retrieval.
@@ -2111,13 +2111,13 @@ addition, a number of companies such as
 <a href="https://www.datadoghq.com/">Datadog</a> offer
 monitoring-as-a-service. In this section, we are not covering
 monitoring-as-a-service in depth.</p>
-<p>In recent years, more and more people have access to the internet. Many
+<p>In recent years, more and more people have access to the Internet. Many
 services are offered online to cater to the increasing user base. As a
 result, web pages are becoming larger, with increased client-side
 scripts. Users want these services to be fast and error-free. From the
 service point of view, when the response body is composed, an HTTP 200
 OK response is sent, and everything looks okay. But there might be
-errors during transmission or on the client side. As previously
+errors during transmission or on the client-side. As previously
 mentioned, monitoring services from within the service infrastructure
 give good visibility into service health, but this is not enough. You
 need to monitor user experience, specifically the availability of
@@ -2131,7 +2131,7 @@ service is globally accessible. Other third-party monitoring solutions
 for real user monitoring (RUM) provide performance statistics such as
 service uptime and response time, from different geographical locations.
 This allows you to monitor the user experience from these locations,
-which might have different internet backbones, different operating
+which might have different Internet backbones, different operating
 systems, and different browsers and browser versions. <a href="https://pages.catchpoint.com/overview-video">Catchpoint
 Global Monitoring
 Network</a> is a