Initial commit to metrics and monitoring course
@@ -20,6 +20,7 @@ In this course, we are focusing on building strong foundational skills. The cour
- [NoSQL concepts](https://linkedin.github.io/school-of-sre/databases_nosql/intro/)
- [Big Data](https://linkedin.github.io/school-of-sre/big_data/intro/)
- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)
- [Metrics and Monitoring](metrics_and_monitoring/introduction.md)
- [Security](https://linkedin.github.io/school-of-sre/security/intro/)

We believe continuous learning will help in acquiring deeper knowledge and competencies in order to expand your skill sets; every module has added references which could be a guide for further learning. Our hope is that by going through these modules we should be able to build the essential skills required for a Site Reliability Engineer.
courses/metrics_and_monitoring/alerts.md (new file)
@@ -0,0 +1,28 @@

# Proactive monitoring using alerts

Earlier we discussed different ways to collect key metric data points from a service and its underlying infrastructure. This data gives us a better understanding of how the service is performing. One of the main objectives of monitoring is to detect any service degradation early (reduce Mean Time To Detect, or MTTD) and notify stakeholders so that the issues are either avoided or fixed early, thus reducing Mean Time To Recover (MTTR). For example, if you are notified when resource usage by a service exceeds 90 percent, you can take preventive measures to avoid any service breakdown due to a shortage of resources. On the other hand, when a service goes down due to an issue, early detection and notification of such incidents can help you quickly fix the issue.

<p align="center"> Figure 8: An alert notification received on Slack </p>

Today most of the monitoring services available provide a mechanism to set up alerts on one or a combination of metrics to actively monitor the service health. These alerts have a set of defined rules or conditions, and when a rule is broken, you are notified. These rules can be as simple as notifying when a metric value exceeds n, or as complex as a week-over-week (WoW) comparison of standard deviation over a period of time. Monitoring tools notify you about an active alert, and most of these tools support instant messaging (IM) platforms, SMS, email, or phone calls. Figure 8 shows a sample alert notification received on Slack for memory usage exceeding 90 percent of total RAM space on the host.
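The rule-plus-notification pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular monitoring tool; the Slack webhook URL is a placeholder, and the `psutil` and `requests` packages are assumed to be installed.

```python
import psutil
import requests

# Placeholder webhook URL -- replace with your own Slack incoming webhook.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
MEMORY_THRESHOLD_PERCENT = 90


def check_memory_and_alert():
    """Fire a Slack notification when memory usage crosses the threshold."""
    used_percent = psutil.virtual_memory().percent
    if used_percent > MEMORY_THRESHOLD_PERCENT:
        message = (
            f":warning: Memory usage is at {used_percent:.1f}% "
            f"(threshold: {MEMORY_THRESHOLD_PERCENT}%)"
        )
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)


if __name__ == "__main__":
    check_memory_and_alert()
```

In practice such a check would run periodically (for example, from a scheduler or an alert manager) rather than as a one-off script.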
courses/metrics_and_monitoring/best_practices.md (new file)
@@ -0,0 +1,40 @@

# Best practices for monitoring

When setting up monitoring for a service, keep the following best practices in mind.

- **Use the right metric type** -- Most of the libraries available today offer various metric types. Choose the appropriate metric type for monitoring your system. Following are the types of metrics and their purposes (a short instrumentation sketch follows this list).

    - **Gauge** -- A *gauge* is a constant type of metric. After the metric is initialized, the metric value does not change unless you intentionally update it.

    - **Timer** -- A *timer* measures the time taken to complete a task.

    - **Counter** -- A *counter* counts the number of occurrences of a particular event.

    For more information about these metric types, see [Data Types](https://statsd.readthedocs.io/en/v0.5.0/types.html).

- **Avoid over-monitoring** -- Monitoring can be a significant engineering endeavor. Therefore, be sure not to spend too much time and resources on monitoring services, yet make sure all important metrics are captured.

- **Prevent alert fatigue** -- Set alerts for metrics that are important and actionable. If you receive too many non-critical alerts, you might start ignoring alert notifications over time. As a result, critical alerts might get overlooked.

- **Have a runbook for alerts** -- For every alert, make sure you have a document explaining what actions and checks need to be performed when the alert fires. This enables any engineer on the team to handle the alert and take necessary actions, without any help from others.
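As a hedged illustration of these metric types, the sketch below uses the Python `statsd` client (any StatsD-compatible client looks similar); the metric names, the `payments` prefix, and the localhost aggregator address are assumptions made for the example.

```python
import time

from statsd import StatsClient  # pip install statsd

# Assumes a StatsD-compatible aggregator listening on localhost:8125.
stats = StatsClient(host="localhost", port=8125, prefix="payments")

# Counter: count every occurrence of an event.
stats.incr("requests.processed")

# Timer: measure the time taken to complete a task.
with stats.timer("order.processing_time"):
    time.sleep(0.1)  # stand-in for real work

# Gauge: record a point-in-time value that stays until you update it.
stats.gauge("queue.depth", 42)
```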
courses/metrics_and_monitoring/command-line_tools.md (new file)
@@ -0,0 +1,98 @@

# Command-line tools

Most of the Linux distributions today come with a set of tools that monitor the system's performance. These tools help you measure and understand various subsystem statistics (CPU, memory, network, and so on). Let's look at some of the tools that are predominantly used.

- `ps`/`top` -- The process status command (`ps`) displays information about all the currently running processes in a Linux system. The `top` command is similar to the `ps` command, but it periodically updates the information displayed until the program is terminated. An advanced version of `top`, called `htop`, has a more user-friendly interface and some additional features. These command-line utilities come with options to modify the operation and output of the command. Following are some important options supported by the `ps` command.

    - `-p <pid1, pid2,...>` -- Displays information about processes that match the specified process IDs. Similarly, you can use `-u <uid>` and `-g <gid>` to display information about processes belonging to a specific user or group.

    - `-a` -- Displays information about other users' processes, as well as one's own.

    - `-x` -- When displaying processes matched by other options, includes processes that do not have a controlling terminal.

<p align="center"> Figure 2: Results of top command </p>

- `ss` -- The socket statistics command (`ss`) displays information about network sockets on the system. This tool is the successor of [netstat](https://man7.org/linux/man-pages/man8/netstat.8.html), which is deprecated. Following are some command-line options supported by the `ss` command:

    - `-t` -- Displays TCP sockets. Similarly, `-u` displays UDP sockets, `-x` is for UNIX domain sockets, and so on.

    - `-l` -- Displays only listening sockets.

    - `-n` -- Instructs the command to not resolve service names; displays the port numbers instead.

<p align="center"> Figure 3: List of listening sockets on a system </p>

- `free` -- The `free` command displays memory usage statistics on the host, such as available memory, used memory, and free memory. Most often, this command is used with the `-h` command-line option, which displays the statistics in a human-readable format.

<p align="center"> Figure 4: Memory statistics on a host in human-readable form </p>

- `df` -- The `df` command displays disk space usage statistics. The `-i` command-line option is also often used to display [inode](https://en.wikipedia.org/wiki/Inode) usage statistics. The `-h` command-line option is used for displaying statistics in a human-readable format.

<p align="center"> Figure 5: Disk usage statistics on a system in human-readable form </p>

- `sar` -- The `sar` utility monitors various subsystems, such as CPU and memory, in real time. This data can be stored in a file specified with the `-o` option. This tool helps to identify anomalies.

- `iftop` -- The interface top command (`iftop`) displays bandwidth utilization by a host on an interface. This command is often used to identify bandwidth usage by active connections. The `-i` option specifies which network interface to watch.

<p align="center"> Figure 6: Network bandwidth usage by active connection on the host </p>

- `tcpdump` -- The `tcpdump` command is a network monitoring tool that captures network packets flowing over the network and displays a description of the captured packets. The following options are available:

    - `-i <interface>` -- Interface to listen on

    - `host <IP/hostname>` -- Filters traffic going to or from the specified host

    - `src/dst` -- Displays one-way traffic from the source (src) or to the destination (dst)

    - `port <port number>` -- Filters traffic to or from a particular port

<p align="center"> Figure 7: *tcpdump* of packets on *docker0* interface on a host </p>
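If you need the same host statistics programmatically (for example, to feed a host metrics agent), a library such as `psutil` can report them from Python. This is a hedged sketch, not part of the course's tooling, and it assumes `psutil` is installed.

```python
import psutil

# CPU utilization over a one-second sampling window (similar to what top shows).
cpu_percent = psutil.cpu_percent(interval=1)

# Memory statistics, comparable to the output of `free`.
memory = psutil.virtual_memory()

# Disk usage for the root filesystem, comparable to `df -h /`.
disk = psutil.disk_usage("/")

print(f"CPU usage      : {cpu_percent:.1f}%")
print(f"Memory usage   : {memory.percent:.1f}% of {memory.total // (1024 ** 2)} MiB")
print(f"Disk usage (/) : {disk.percent:.1f}% of {disk.total // (1024 ** 3)} GiB")
```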
courses/metrics_and_monitoring/conclusion.md (new file)
@@ -0,0 +1,50 @@

# Conclusion

A robust monitoring and alerting system is necessary for maintaining and troubleshooting a system. A dashboard with key metrics can give you an overview of service performance, all in one place. Well-defined alerts (with realistic thresholds and notifications) further enable you to quickly identify any anomalies in the service infrastructure and in resource saturation. By taking necessary actions, you can avoid any service degradations and decrease MTTD for service breakdowns.

In addition to in-house monitoring, monitoring real user experience can help you to understand service performance as perceived by the users. Many modules are involved in serving the user, and most of them are out of your control. Therefore, you need to have real-user monitoring in place.

Metrics give very abstract details on service performance. To get a better understanding of the system and for faster recovery during incidents, you might want to implement the other two pillars of observability: logs and tracing. Logs and trace data can help you understand what led to service failure or degradation.

Following are some resources to learn more about monitoring and observability:

- [Google SRE book: Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)

- [Mastering Distributed Tracing by Yuri Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)

- Engineering blogs on [LinkedIn](https://engineering.linkedin.com/blog/topic/monitoring), [Grafana](https://grafana.com/blog/), [Elastic.co](https://www.elastic.co/blog/), [OpenTelemetry](https://medium.com/opentelemetry)

## References

- [Google SRE book: Monitoring Distributed Systems](https://sre.google/sre-book/monitoring-distributed-systems/)

- [Mastering Distributed Tracing, by Yuri Shkuro](https://learning.oreilly.com/library/view/mastering-distributed-tracing/9781788628464/)

- [Monitoring and Observability](https://copyconstruct.medium.com/monitoring-and-observability-8417d1952e1c)

- [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8)
BIN  courses/metrics_and_monitoring/images/image1.jpg   (new file, 39 KiB)
BIN  courses/metrics_and_monitoring/images/image10.png  (new file, 53 KiB)
BIN  courses/metrics_and_monitoring/images/image11.png  (new file, 49 KiB)
BIN  courses/metrics_and_monitoring/images/image12.png  (new file, 126 KiB)
BIN  courses/metrics_and_monitoring/images/image2.png   (new file, 23 KiB)
BIN  courses/metrics_and_monitoring/images/image3.jpg   (new file, 45 KiB)
BIN  courses/metrics_and_monitoring/images/image4.jpg   (new file, 27 KiB)
BIN  courses/metrics_and_monitoring/images/image5.jpg   (new file, 34 KiB)
BIN  courses/metrics_and_monitoring/images/image6.png   (new file, 9.3 KiB)
BIN  courses/metrics_and_monitoring/images/image7.png   (new file, 307 KiB)
BIN  courses/metrics_and_monitoring/images/image8.png   (new file, 22 KiB)
BIN  courses/metrics_and_monitoring/images/image9.png   (new file, 19 KiB)
courses/metrics_and_monitoring/introduction.md (new file)
@@ -0,0 +1,280 @@

# Prerequisites

- [Linux Basics](https://linkedin.github.io/school-of-sre/linux_basics/intro/)

- [Python and the Web](https://linkedin.github.io/school-of-sre/python_web/intro/)

- [Systems Design](https://linkedin.github.io/school-of-sre/systems_design/intro/)

- [Linux Networking Fundamentals](https://linkedin.github.io/school-of-sre/linux_networking/intro/)

## What to expect from this course

Monitoring is an integral part of any system. As an SRE, you need to have a basic understanding of monitoring a service infrastructure. By the end of this course, you will gain a better understanding of the following topics:

- What is monitoring?

- What needs to be measured

- How the metrics gathered can be used to improve business decisions and overall reliability

- Proactive monitoring with alerts

- Log processing and its importance

- What is observability?

    - Distributed tracing

    - Logs

    - Metrics

## What is not covered in this course

- Guide to setting up a monitoring infrastructure

- Deep dive into different monitoring technologies and benchmarking or comparison of any tools

## Course content

- [Introduction](#introduction)

    - [Four golden signals of monitoring](#four-golden-signals-of-monitoring)

    - [Why is monitoring important?](#why-is-monitoring-important)

- [Command-line tools](command-line_tools.md)

- [Third-party monitoring](third-party_monitoring.md)

- [Proactive monitoring using alerts](alerts.md)

- [Best practices for monitoring](best_practices.md)

- [Observability](observability.md)

    - [Logs](observability.md#logs)

    - [Tracing](observability.md#tracing)

- [Conclusion](conclusion.md)

# Introduction

Monitoring is a process of collecting real-time performance metrics from a system, analyzing the data to derive meaningful information, and displaying the data to the users. In simple terms, you measure various metrics regularly to understand the state of the system, including, but not limited to, user requests, latency, and error rate. *What gets measured, gets fixed*---if you can measure something, you can reason about it, understand it, discuss it, and act upon it with confidence.
## Four golden signals of monitoring

When setting up monitoring for a system, you need to decide what to measure. The four golden signals of monitoring provide a good understanding of service performance and lay a foundation for monitoring a system. These four golden signals are

- Traffic

- Latency

- Error

- Saturation

These metrics help you to understand the system performance and bottlenecks, and to create a better end-user experience. As discussed in the [Google SRE book](https://sre.google/sre-book/monitoring-distributed-systems/), if you can measure only four metrics of your service, focus on these four. Let's look at each of the four golden signals.

- **Traffic** -- *Traffic* gives a better understanding of the service demand. Often referred to as *service QPS* (queries per second), traffic is a measure of requests served by the service. This signal helps you to decide when a service needs to be scaled up to handle increasing customer demand and scaled down to be cost-effective.

- **Latency** -- *Latency* is the measure of the time taken by the service to process an incoming request and send the response. Measuring service latency helps in the early detection of slow degradation of the service. Distinguishing between the latency of successful requests and the latency of failed requests is important. For example, an [HTTP 5XX error](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses) triggered due to loss of connection to a database or other critical backend might be served very quickly. However, because an HTTP 500 error indicates a failed request, factoring 500s into overall latency might result in misleading calculations.

- **Error (rate)** -- *Error* is the measure of failed client requests. These failures can be easily identified based on the response codes ([HTTP 5XX errors](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses)). There might be cases where the response is considered erroneous due to wrong result data or due to policy violations. For example, you might get an [HTTP 200](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200) response, but the body has incomplete data, or the response time is breaching the agreed-upon [SLA](https://en.wikipedia.org/wiki/Service-level_agreement)s. Therefore, you need to have other mechanisms (code logic or [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming))) in place to capture errors in addition to the response codes.

- **Saturation** -- *Saturation* is a measure of the resource utilization by a service. This signal tells you the state of service resources and how full they are. These resources include memory, compute, network I/O, and so on. Service performance slowly degrades even before resource utilization is at 100 percent. Therefore, having a utilization target is important. An increase in latency is a good indicator of saturation; measuring the [99th percentile](https://medium.com/@ankur_anand/an-in-depth-introduction-to-99-percentile-for-programmers-22e83a00caf) of latency can help in the early detection of saturation.

Depending on the type of service, you can measure these signals in different ways. For example, you might measure queries per second served for a web server. In contrast, for a database server, transactions performed and database sessions created give you an idea about the traffic handled by the database server. With the help of additional code logic (monitoring libraries and instrumentation), you can measure these signals periodically and store them for future analysis; a minimal instrumentation sketch follows below. Although these metrics give you an idea about the performance at the service end, you also need to ensure that the same user experience is delivered at the client end. Therefore, you might need to monitor the service from outside the service infrastructure, which is discussed under third-party monitoring.
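As a rough illustration of such code logic, the sketch below wraps a request handler to record three of the golden signals (traffic, latency, errors) in memory; in a real service these values would be pushed to a metric aggregator instead. The handler name and the in-memory storage are assumptions made for the example.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a real metrics backend.
metrics = defaultdict(float)


def instrumented(handler):
    """Record traffic, latency, and error count for a request handler."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        metrics["requests_total"] += 1          # traffic
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        except Exception:
            metrics["errors_total"] += 1        # errors
            raise
        finally:
            metrics["latency_seconds_sum"] += time.monotonic() - start  # latency
    return wrapper


@instrumented
def handle_request(user_id):
    # Hypothetical request handler, used only for this example.
    return {"user_id": user_id, "status": "ok"}


if __name__ == "__main__":
    handle_request(42)
    print(dict(metrics))
```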
## Why is monitoring important?

Monitoring plays a key role in the success of a service. As discussed earlier, monitoring provides performance insights for understanding service health. With access to historical data collected over time, you can build intelligent applications to address specific needs. Some of the key use cases follow:

- **Reduction in time to resolve issues** -- With a good monitoring infrastructure in place, you can identify issues quickly and resolve them, which reduces the impact caused by the issues.

- **Business decisions** -- Data collected over a period of time can help you make business decisions such as determining the product release cycle, which features to invest in, and which geographical areas to focus on. Decisions based on long-term data can improve the overall product experience.

- **Resource planning** -- By analyzing historical data, you can forecast service compute-resource demands, and you can allocate resources properly. This allows financially effective decisions, with no compromise in end-user experience.

Before we dive deeper into monitoring, let's understand some basic terminology.

- **Metric** -- A metric is a quantitative measure of a particular system attribute---for example, memory or CPU

- **Node or host** -- A physical server, virtual machine, or container where an application is running

- **QPS** -- *Queries Per Second*, a measure of traffic served by the service per second

- **Latency** -- The time interval between user action and the response from the server---for example, time spent after sending a query to a database before the first response bit is received

- **Error rate** -- Number of errors observed over a particular time period (usually a second)

- **Graph** -- In monitoring, a graph is a representation of one or more values of metrics collected over time

- **Dashboard** -- A dashboard is a collection of graphs that provide an overview of system health

- **Incident** -- An incident is an event that disrupts the normal operations of a system

- **MTTD** -- *Mean Time To Detect* is the time interval between the beginning of a service failure and the detection of such failure

- **MTTR** -- *Mean Time To Resolve* is the time spent to fix a service failure and bring the service back to its normal state

Before we discuss monitoring an application, let us look at the monitoring infrastructure. Following is an illustration of a basic monitoring system.

<p align="center"> Figure 1: Illustration of a monitoring infrastructure </p>

Figure 1 shows a monitoring infrastructure mechanism for aggregating metrics on the system, and collecting and storing the data for display. In addition, a monitoring infrastructure includes alert subsystems for notifying concerned parties during any abnormal behavior. Let's look at each of these infrastructure components (a small example of emitting a metric to an aggregator follows this list):

- **Host metrics agent** -- A *host metrics agent* is a process running on the host that collects performance statistics for host subsystems such as memory, CPU, and network. These metrics are regularly relayed to a metrics collector for storage and visualization. Some examples are [collectd](https://collectd.org/), [telegraf](https://www.influxdata.com/time-series-platform/telegraf/), and [metricbeat](https://www.elastic.co/beats/metricbeat).

- **Metric aggregator** -- A *metric aggregator* is a process running on the host. Applications running on the host collect service metrics using [instrumentation](https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)). Collected metrics are sent either to the aggregator process or directly to the metrics collector over an API, if available. Received metrics are aggregated periodically and relayed to the metrics collector in batches. An example is [StatsD](https://github.com/statsd/statsd).

- **Metrics collector** -- A *metrics collector* process collects all the metrics from the metric aggregators running on multiple hosts. The collector takes care of decoding and stores this data in the database. Metric collection and storage might be taken care of by one single service such as [InfluxDB](https://www.influxdata.com/), which we discuss next. An example is [carbon daemons](https://graphite.readthedocs.io/en/latest/carbon-daemons.html).

- **Storage** -- A time-series database stores all of these metrics. Examples are [OpenTSDB](http://opentsdb.net/), [Whisper](https://graphite.readthedocs.io/en/stable/whisper.html), and [InfluxDB](https://www.influxdata.com/).

- **Metrics server** -- A *metrics server* can be as basic as a web server that graphically renders metric data. In addition, the metrics server provides aggregation functionalities and APIs for fetching metric data programmatically. Some examples are [Grafana](https://github.com/grafana/grafana) and [Graphite-Web](https://github.com/graphite-project/graphite-web).

- **Alert manager** -- The *alert manager* regularly polls the metric data available and, if any anomalies are detected, notifies you. Each alert has a set of rules for identifying such anomalies. Today many metrics servers such as [Grafana](https://github.com/grafana/grafana) support alert management. We discuss alerting [in detail later](#proactive-monitoring-using-alerts). Examples are [Grafana](https://github.com/grafana/grafana) and [Icinga](https://icinga.com/).
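To make the aggregator's role concrete, here is a hedged sketch of an application pushing metrics to a local StatsD-style aggregator over UDP using the plain-text StatsD line format; the metric names and the localhost:8125 address are assumptions made for the example.

```python
import socket

# StatsD-style aggregators typically listen for plain-text lines over UDP on port 8125.
STATSD_ADDRESS = ("127.0.0.1", 8125)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Line format: "<metric name>:<value>|<type>" -- c = counter, ms = timer, g = gauge.
sock.sendto(b"api.requests:1|c", STATSD_ADDRESS)          # count one request
sock.sendto(b"api.response_time:320|ms", STATSD_ADDRESS)  # 320 ms response time
sock.sendto(b"api.active_sessions:12|g", STATSD_ADDRESS)  # current gauge value
sock.close()
```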
courses/metrics_and_monitoring/observability.md (new file)
@@ -0,0 +1,151 @@

# Observability

Engineers often use observability when referring to building reliable systems. *Observability* is a term derived from control theory; it is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. Service infrastructures used on a daily basis are becoming more and more complex; proactive monitoring alone is not sufficient to quickly resolve issues causing application failures. With monitoring, you can keep known past failures from recurring, but with a complex service architecture, many unknown factors can cause potential problems. To address such cases, you can make the service observable. An observable system provides highly granular insights into the implicit failure modes. In addition, an observable system furnishes ample context about its inner workings, which unlocks the ability to uncover deeper systemic issues.

Monitoring enables failure detection; observability helps in gaining a better understanding of the system. Among engineers, there is a common misconception that monitoring and observability are two different things. Actually, observability is the superset of monitoring; that is, monitoring improves service observability. The goal of observability is not only to detect problems, but also to understand where the issue is and what is causing it. In addition to metrics, observability has two more pillars: logs and traces, as shown in Figure 9. Although these three components do not make a system 100 percent observable, they are the most important and powerful components that give a better understanding of the system. Each of these pillars has its flaws, which are described in [Three Pillars with Zero Answers](https://medium.com/lightstephq/three-pillars-with-zero-answers-2a98b36358b8).

<p align="center"> Figure 9: Three pillars of observability </p>

Because we have covered metrics already, let's look at the other two pillars (logs and traces).

#### Logs

Logs (often referred to as *events*) are a record of activities performed by a service during its run time, with a corresponding timestamp. Metrics give abstract information about degradations in a system, and logs give a detailed view of what is causing these degradations. Logs created by the applications and infrastructure components help in effectively understanding the behavior of the system by providing details on application errors, exceptions, and event timelines. Logs help you to go back in time to understand the events that led to a failure. Therefore, examining logs is essential to troubleshooting system failures.
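To make this concrete, here is a hedged sketch of an application emitting timestamped log events with Python's standard `logging` module; the logger name, file name, and messages are invented for the example, and in practice these lines would be written to a file that a log shipper (such as Filebeat, discussed below) watches.

```python
import logging

# Timestamped log lines written to a file that a log shipper could watch.
logging.basicConfig(
    filename="service.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("url_shortener")

logger.info("request received path=/shorten user_id=42")
try:
    raise ConnectionError("database unreachable")  # simulated failure
except ConnectionError:
    logger.exception("failed to persist short URL")
```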
Log processing involves the aggregation of different logs from individual applications and their subsequent shipment to central storage. Moving logs to central storage helps to preserve the logs in case the application instances are inaccessible, or the application crashes due to a failure. After the logs are available in a central place, you can analyze the logs to derive sensible information from them. For audit and compliance purposes, you archive these logs on the central storage for a certain period of time. Log analyzers fetch useful information from log lines, such as request user information, request URL (feature), response headers (such as content length), and response time. This information is grouped based on these attributes and made available to you through a visualization tool for quick understanding.

You might be wondering how this log information helps. This information gives a holistic view of activities performed on all the involved entities. For example, let's say someone is performing a DoS (denial of service) attack on a web application. With the help of log processing, you can quickly look at top client IPs derived from access logs and identify where the attack is coming from.

Similarly, if a feature in an application is causing a high error rate when accessed with a particular request parameter value, the results of log analysis can help you to quickly identify the misbehaving parameter value and take further action.
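As a small, hedged illustration of the DoS example above, the sketch below counts the most frequent client IPs in an access log; the log path and the assumption that the client IP is the first whitespace-separated field (as in the common/combined log formats used by web servers such as NGINX and Apache) are made up for the example.

```python
from collections import Counter

# Hypothetical access log path; assumes the client IP is the first field on each line.
ACCESS_LOG = "/var/log/nginx/access.log"

ip_counts = Counter()
with open(ACCESS_LOG) as log_file:
    for line in log_file:
        fields = line.split()
        if fields:
            ip_counts[fields[0]] += 1

# Top 10 client IPs by request count -- a sudden outlier here may indicate a DoS source.
for ip, count in ip_counts.most_common(10):
    print(f"{ip:15s} {count}")
```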
 <p align="center"> Figure 10: Log processing and analysis using ELK stack </p>

Figure 10 shows a log processing platform using ELK (Elasticsearch, Logstash, Kibana), which provides centralized log processing. Beats is a collection of lightweight data shippers that can ship logs, audit data, network data, and so on over the network. In this use case specifically, we are using Filebeat as a log shipper. Filebeat watches service log files and ships the log data to Logstash. Logstash parses these logs and transforms the data, preparing it to store on Elasticsearch. Transformed log data is stored on Elasticsearch and indexed for fast retrieval. Kibana searches and displays log data stored on Elasticsearch. Kibana also provides a set of visualizations for graphically displaying summaries derived from log data.

Storing logs is expensive. Extensive logging of every event on the server is costly and takes up more storage space. With an increasing number of services, this cost can increase proportionally to the number of services.
#### Tracing

So far, we have covered the importance of metrics and logging. Metrics give an abstract overview of the system, and logging gives a record of events that occurred. Imagine a complex distributed system with multiple microservices, where a user request is processed by multiple microservices in the system. Metrics and logging give you some information about how these requests are being handled by the system, but they fail to provide detailed information across all the microservices and how they affect a particular client request. If a slow downstream microservice is leading to increased response times, you need detailed visibility across all involved microservices to identify such a microservice. The answer to this need is a request tracing mechanism.

A trace is a series of spans, where each span is a record of events performed by different microservices to serve the client's request. In simple terms, a trace is a log of client-request serving derived from various microservices across different physical machines. Each span includes span metadata such as trace ID and span ID, and context, which includes information about transactions performed.

 <p align="center"> Figure 11: Trace and spans for a URL shortener request </p>

Figure 11 is a graphical representation of a trace captured on the [URL shortener](https://linkedin.github.io/school-of-sre/python_web/url-shorten-app/) example we covered earlier while learning Python.

Similar to monitoring, the tracing infrastructure comprises a few modules for collecting traces, storing them, and accessing them. Each microservice runs a tracing library that collects traces in the background, creates in-memory batches, and submits them to the tracing backend. The tracing backend normalizes the received trace data and stores it on persistent storage. Tracing data comes from multiple different microservices; therefore, trace storage is often organized to store data incrementally and is indexed by trace identifier. This organization helps in the reconstruction of trace data and in visualization. Figure 12 illustrates the anatomy of distributed tracing.

 <p align="center"> Figure 12: Anatomy of distributed tracing </p>

Today a set of tools and frameworks are available for building distributed tracing solutions. Following are some of the popular tools:

- [OpenTelemetry](https://opentelemetry.io/): Observability framework for cloud-native software

- [Jaeger](https://www.jaegertracing.io/): Open-source distributed tracing solution

- [Zipkin](https://zipkin.io/): Open-source distributed tracing solution
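To show what instrumenting code for tracing can look like, here is a hedged sketch using recent versions of the OpenTelemetry Python SDK (assuming the `opentelemetry-api` and `opentelemetry-sdk` packages are installed); the span names loosely mirror the URL shortener example but are otherwise made up, and the console exporter stands in for a real tracing backend such as Jaeger or Zipkin.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console; a real deployment would export to a tracing backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Parent span for the incoming request, with a child span for the lookup it triggers.
with tracer.start_as_current_span("shorten-url-request"):
    with tracer.start_as_current_span("database-lookup"):
        pass  # stand-in for the actual work
```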
courses/metrics_and_monitoring/third-party_monitoring.md (new file)
@@ -0,0 +1,37 @@

# Third-party monitoring

Today most cloud providers offer a variety of monitoring solutions. In addition, a number of companies such as [Datadog](https://www.datadoghq.com/) offer monitoring-as-a-service. In this section, we are not covering monitoring-as-a-service in depth.

In recent years, more and more people have access to the internet. Many services are offered online to cater to the increasing user base. As a result, web pages are becoming larger, with increased client-side scripts. Users want these services to be fast and error-free. From the service point of view, when the response body is composed, an HTTP 200 OK response is sent, and everything looks okay. But there might be errors during transmission or on the client side. As previously mentioned, monitoring services from within the service infrastructure gives good visibility into service health, but this is not enough. You need to monitor the user experience, specifically the availability of services for clients. A number of third-party services, such as [Catchpoint](https://www.catchpoint.com/), [Pingdom](https://www.pingdom.com/), and so on, are available for achieving this goal.

Third-party monitoring services can generate synthetic traffic simulating user requests from various parts of the world, to ensure the service is globally accessible. Other third-party monitoring solutions for real user monitoring (RUM) provide performance statistics, such as service uptime and response time, from different geographical locations. This allows you to monitor the user experience from these locations, which might have different internet backbones, different operating systems, and different browsers and browser versions. [Catchpoint Global Monitoring Network](https://pages.catchpoint.com/overview-video) is a comprehensive 3-minute video that explains the importance of monitoring the client experience.
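A very small, hedged sketch of what a synthetic probe does under the hood: request a URL and record availability and response time. The URL is a placeholder, and real synthetic monitoring runs such probes from many geographies, networks, and browsers rather than a single script.

```python
import time

import requests

# Placeholder endpoint; a real probe would run from multiple geographic locations.
TARGET_URL = "https://example.com/health"

start = time.monotonic()
try:
    response = requests.get(TARGET_URL, timeout=5)
    available = response.status_code == 200
except requests.RequestException:
    available = False
elapsed_ms = (time.monotonic() - start) * 1000

print(f"available={available} response_time_ms={elapsed_ms:.0f}")
```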
@@ -57,6 +57,14 @@ nav:
      - Availability: systems_design/availability.md
      - Fault Tolerance: systems_design/fault-tolerance.md
      - Conclusion: systems_design/conclusion.md
  - Metrics and Monitoring:
      - Introduction: metrics_and_monitoring/introduction.md
      - Command-line Tools: metrics_and_monitoring/command-line_tools.md
      - Third-party Monitoring: metrics_and_monitoring/third-party_monitoring.md
      - Proactive Monitoring with Alerts: metrics_and_monitoring/alerts.md
      - Best Practices for Monitoring: metrics_and_monitoring/best_practices.md
      - Observability: metrics_and_monitoring/observability.md
      - Conclusion: metrics_and_monitoring/conclusion.md
  - Security:
      - Introduction: security/intro.md
      - Fundamentals of Security: security/fundamentals.md